How to escape the Linux networking stack
Mood
thoughtful
Sentiment
positive
Category
tech
Key topics
Linux networking
network optimization
Cloudflare
Cloudflare's blog post explains how they optimized their Linux networking stack, sparking discussion on alternative approaches and the company's technology choices.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 3h after posting
Peak period: 47 comments in Day 1
Avg / period: 27.5
Based on 55 loaded comments
Key moments
- 01 Story posted: 11/17/2025, 3:49:38 PM
- 02 First comment: 11/17/2025, 6:52:49 PM (3h after posting)
- 03 Peak activity: 47 comments in Day 1 (hottest window of the conversation)
- 04 Latest activity: 11/19/2025, 1:06:01 AM
The idea that creating the universe is regarded as a mistake that made many people unhappy is from those books. So is the idea that whenever someone figures out the universe it gets replaced with something stranger, along with the evidence that this has happened repeatedly. The Restaurant at the End of the Universe is referenced in the article.
I’m a bit surprised nothing in the article was mentioned as being “mostly harmless”.
It also makes me wonder, why is tcp/ip special? The kernel should expose a raw network device. I get physical or layer 2 configuration happening in the kernel, but if it is supposed to do IP, then why stop there, why not TLS as well? Why run a complex network protocol stack in the kernel when you can just expose a configured layer 2 device to a user space process? It sounds like "that's just the way it's always been done" type of a scenario.
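For readers wondering what "expose a configured layer 2 device to a user space process" can look like on Linux today, here is a minimal sketch using an `AF_PACKET` raw socket. It is only an illustration of the idea, not what Cloudflare or any user-space stack actually ships. The helper names (`open_l2_socket`, `build_frame`, `parse_frame`, `format_mac`) are made up for this example, and opening the socket is Linux-only and needs `CAP_NET_RAW` (typically root).

```python
import socket
import struct

ETH_P_ALL = 0x0003  # value from <linux/if_ether.h>: receive every protocol
ETH_P_IP = 0x0800   # EtherType for IPv4
ETH_HEADER = struct.Struct("!6s6sH")  # destination MAC, source MAC, EtherType


def open_l2_socket(ifname):
    """Open a raw layer 2 socket bound to one interface.

    Linux only, and requires CAP_NET_RAW. Everything above Ethernet
    (ARP, IP, TCP, ...) is then the caller's job, in user space.
    """
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sock.bind((ifname, 0))
    return sock


def build_frame(dst_mac, src_mac, ethertype, payload):
    """Prepend an Ethernet II header to a payload."""
    return ETH_HEADER.pack(dst_mac, src_mac, ethertype) + payload


def parse_frame(frame):
    """Split a frame into (dst_mac, src_mac, ethertype, payload)."""
    dst_mac, src_mac, ethertype = ETH_HEADER.unpack_from(frame)
    return dst_mac, src_mac, ethertype, frame[ETH_HEADER.size:]


def format_mac(mac):
    """Render 6 raw bytes as the usual colon-separated hex string."""
    return ":".join(f"{b:02x}" for b in mac)
```

With sufficient privileges, `open_l2_socket("eth0").recvfrom(65535)` (where `eth0` is a placeholder interface name) returns whole Ethernet frames that `parse_frame` can split. Each frame still crosses the kernel boundary with a syscall and a copy, which is part of why dedicated kernel-bypass frameworks exist.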
> why is tcp/ip special? The kernel should expose a raw network device. ... Why run a complex network protocol stack in the kernel when you can just expose a configured layer 2 device to a user space process?
Check out the MIT Exokernel project and Solarflare OpenOnload that used this approach. It never really caught on because the old school way is good enough for almost everyone.
> why stop there, why not TLS as well?
kTLS is a thing now (mostly used by Netflix). Back in the day we also had kernel-mode Web servers to save every cycle.
Aren't both unnecessary these days, given all the zero-copy interfaces that are now available?
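As a concrete example of the kind of zero-copy interface meant here, the sketch below uses Python's `socket.sendfile()`, which uses the `sendfile(2)` syscall where the platform supports it and falls back to ordinary `send()` otherwise. The function names `send_file_zero_copy` and `demo` are made up for this illustration, and a local socket pair stands in for a real network connection.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a stream socket and return the bytes sent.

    socket.sendfile() lets the kernel move file data to the socket
    without copying it through a user-space buffer when os.sendfile
    is usable, and silently falls back to send() when it is not.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f)


def demo(payload=b"x" * 4096):
    """Round-trip a payload through a temp file and a local socket pair."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    left, right = socket.socketpair()
    try:
        sent = send_file_zero_copy(path, left)
        left.close()  # signal end-of-stream to the reader
        received = b""
        while chunk := right.recv(65536):
            received += chunk
        return sent, received
    finally:
        left.close()
        right.close()
        os.unlink(path)
```

Note that this only helps when the bytes can go out unmodified. Once the payload has to be encrypted or rewritten in user space, the data has to pass through the application anyway.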
This is very much a newbie way of thinking. How do you know? Did you profile it?
It turns out there is surprisingly little dumb zero-copy potential at CF. Most of the stuff is TLS, so stuff needs to go through userspace anyway (kTLS exists, but I failed to actually use it, and what about QUIC?).
Most of the cpu is burned on dumb things, like application logic. Turns out data copying and encryption and compression are actually pretty fast. I'm not saying these areas aren't ripe for optimization - but the majority of the cost was historically in much more obvious areas.
Does it matter? Fewer syscalls is better. Whatever is being done in kernel mode can be replicated (or improved upon much more) in a user-space stack. It is easier to add and manage APIs in user space than in the kernel. You can debug, patch, etc. a user-space stack much more easily. You can have multiple processes for redundancy and ensure crashes don't take out the whole system. I've had situations where rebooting the system was the only solution to routing or ARP resolution issues (even after clearing caches). Same with netfilter/iptables "being stuck" or exhibiting performance degradation over time. If you're lucky, a module reload can fix it; if it were a process, I could have just killed and restarted it with minimal disruption.
> Most of the cpu is burned on dumb things, like application logic. Turns out data copying and encryption and compression are actually pretty fast. I'm not saying these areas aren't ripe for optimization - but the majority of the cost was historically in much more obvious areas.
I won't disagree with that, but one optimization does not preclude the other. If IP/TCP were in user space, engineers could optimize them better to fit their use cases. The type of load matters too: you can optimize your app well, but one corner case could tie up your app logic in CPU cycles. If that corner case happens to include a syscall, and there is no better way to handle it, those context-switch cycles might start to matter.
In general, I don't think it makes much difference, but I expected companies like CF that are performance and outage sensitive to squeeze every last drop of performance and reliability out of their system.
https://news.ycombinator.com/item?id=28584738
(I don't consider this "the answer" as much as one example.)
AFAIK, they were the first to implement BPF in production-ready code, almost 3 decades ago.
https://en.wikipedia.org/wiki/Berkeley_Packet_Filter
But all this is opinion and anecdote. Just pick a random network feature and compare the Linux and FreeBSD code for yourself.
Exactly.
Why did you take my self-criticism out of context and omit the second part of the line, which shows how you can see this for yourself?
Would anyone be interested if I polished it up and maybe added a refresher on the relevant layer 2 networking needed to reason about it? It's a fair bit of work and it's a niche topic, so I'm trying to poll a bit to see if the juice is worth the squeeze.
I'm not sure why people are replying to my comment with solutioning and trivial suggestions. All I did was encourage the thread OP to publish their notes. FWIW I've already been through a lot of options for solving my issue, and I've settled on one for now.
Because your comment didn’t say you solved it and you asked for notes without any polish as if that would help.
If you want multiple participants, you use bridges, which are roughly analogous to switches.
Yes, if they need to talk, share namespaces.
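To make the "bridges are roughly analogous to switches" remark above concrete, here is a toy model of the MAC-learning behaviour a bridge shares with a hardware switch. It is a simplification for illustration only, not how the Linux bridge is implemented (there is no ageing, VLAN handling, or STP). The class name `LearningBridge` and the port names are made up.

```python
class LearningBridge:
    """Toy learning bridge: remembers which port each source MAC was seen on."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.fdb = {}  # forwarding database: MAC address -> port

    def handle_frame(self, in_port, src_mac, dst_mac):
        """Return the set of ports the frame would be forwarded to."""
        # Learn (or refresh) where the sender lives.
        self.fdb[src_mac] = in_port
        out_port = self.fdb.get(dst_mac)
        if out_port is None:
            # Unknown or broadcast destination: flood to every other port.
            return self.ports - {in_port}
        if out_port == in_port:
            # Destination is on the segment the frame came from: drop it.
            return set()
        return {out_port}
```

The first frame to an unknown destination is flooded to every other port. Once the destination has replied, its port is known and later frames go to that port only.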
It's about time someone wrote a new Linux networking book covering layers 2 and 3.
The existing books, namely Linux Routing and Linux Routers (2nd edition), are already more than two decades old.
https://learn.microsoft.com/en-us/azure/azure-boost/overview...
https://learn.microsoft.com/en-us/azure/virtual-network/acce...