Eliminating Cold Starts 2: Shard and Conquer
Posted 3 months ago · Active 3 months ago
blog.cloudflare.com · Tech · story
Sentiment: supportive, positive
Debate: 40/100
Key topics
Cloudflare
Serverless Computing
Performance Optimization
Cloudflare's engineering blog post on eliminating cold starts in their Workers platform sparks discussion on the effectiveness of their approach and the trade-offs of using a service that scales to zero.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 3d after posting
Peak period: 10 comments in the 60-66h window
Avg / period: 4.5
Comment distribution: 18 data points (based on 18 loaded comments)
Key moments
1. Story posted: Sep 26, 2025 at 5:40 PM EDT (3 months ago)
2. First comment: Sep 29, 2025 at 5:54 AM EDT (3d after posting)
3. Peak activity: 10 comments in 60-66h, the hottest window of the conversation
4. Latest activity: Sep 30, 2025 at 12:45 AM EDT (3 months ago)
ID: 45391302 · Type: story · Last synced: 11/20/2025, 2:09:11 PM
It's very much still maturing as an offering. But it does exist!
I thought Lambda@Edge was going in this direction, but it's a slightly faster, more constrained version of Lambda with all the same potential downsides.
This is a curl request from my machine right now to an SSR React app hosted on a CF Worker:

```
DNS lookup:     0.296826s
Connect:        0.320031s
Start transfer: 2.710684s
Total:          2.710969s
```
Second request:

```
DNS lookup:     0.002970s
Connect:        0.015917s
Start transfer: 0.176399s
Total:          0.176621s
```
2.5 seconds difference.
Edit: you can kind of tell this from the connect timings listed above. TLS is faster the second time around, but not by enough to account for much of the overall speedup.
[1] https://blog.cloudflare.com/introducing-0-rtt/
First hit:

```
DNS Lookup:              0.026284s
Connect (TCP):           0.036498s
Time app connect (TLS):  0.059136s
Start Transfer:          1.282819s
Total:                   1.282928s
```
Second hit:

```
DNS Lookup:              0.003575s
Connect (TCP):           0.016697s
Time app connect (TLS):  0.032679s
Start Transfer:          0.242647s
Total:                   0.242733s
```
Metrics description:

- `time_namelookup`: the time, in seconds, from the start until name resolving was completed.
- `time_connect`: the time, in seconds, from the start until the TCP connect to the remote host (or proxy) was completed.
- `time_appconnect`: the time, in seconds, from the start until the SSL/SSH/etc. connect/handshake to the remote host was completed.
- `time_starttransfer`: the time, in seconds, from the start until the first byte was just about to be transferred. This includes `time_pretransfer` and also the time the server needed to calculate the result.
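For reference, timings like the ones quoted above can be reproduced with curl's `-w` write-out option; the URL below is a placeholder, not the commenter's actual endpoint:

```shell
# Print curl's timing breakdown for a single request.
# https://example.com/ is a placeholder; substitute your own endpoint.
curl -s -o /dev/null \
  -w 'DNS lookup:     %{time_namelookup}s
Connect (TCP):  %{time_connect}s
App connect:    %{time_appconnect}s
Start transfer: %{time_starttransfer}s
Total:          %{time_total}s
' https://example.com/
```

Running it twice in a row is the simplest way to compare a cold hit against a warm one.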
2.5 seconds seems way too long to be attributed to the Worker cold start alone.
My point is that a "cold start" is often more than just booting a VM instance.
And I've noticed that not everybody understands this. I used to have conversations in which people argued that there is no difference between deploying a web frontend to Cloudflare and to a stateful solution, because of this confusing advertising.
If you say, "If you are using this in production, 5-10/month is a real cost you need to pay, plus transactions," well, now the cost is about the same as deploying a fly.io shared-CPU container, which doesn't come with cold starts or vendor lock-in and can run as long as you want. Cloudflare knows that, so they don't want to introduce that charge, or even talk about it.