It&#x27;s Always Tcp_nodelay

11 days ago

5 replies

The Nagle algorithm was created back in the day of multi-point networking. Multiple hosts were all tied to the same communications (Ethernet) channel, so they would use CSMA (https://en.wikipedia.org/wiki/Carrier-sense_multiple_access_...) to avoid collisions. CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two hosts per channel. In fact, most modern (Gigabit+) Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES. A hybrid is used on the PHY at each end to subtract what is being transmitted from what is being received.

Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense and should be enabled by default.

Hikikomori

11 days ago

2 replies

Just to add, ethernet uses csma/cd, WiFi uses csma/ca.

11 days ago

1 reply

Thanks for the correction. They're so close to being the same thing that I always call it CSMA/CD. Avoiding a collision is far more preferable than just detecting one.

Yeah, many enterprise switches don't even support 100Base-T or 10Base-T anymore. I've had to daisy chain an old 100Base-T switch onto a modern one a few times myself.

reorder9695

10 days ago

1 reply

Is avoiding a collision always preferable? CSMA/CA has significant overhead (backoff period) for every single frame sent, on a less congested line CSMA/CD has less overhead.

10 days ago

1 reply

But CSMA/CD also has a backoff period.

reorder9695

10 days ago

1 reply

CSMA/CD only requires that you back off if there actually is a collision. CSMA/CA additionally requires that for every frame sent, after sensing the medium as clear, that you wait for a random amount of time before sending it to avoid collisions. If the medium is frequently clear, CA will still have the overhead of this initial wait where CD will not.

10 days ago

Depending upon how it's actually implemented, CSMA/CA may have the same (untended?) behavior of CSMA/CD in the sense that setting TCP_NODELAY will also set the backoff timer to zero. It would be interesting to test.

(My other post in this thread mentions it.) https://news.ycombinator.com/item?id=46360209#46361580

mikestorrent

10 days ago

8 replies

What did you still need to connect with 10mbit half duplex in 2014? I had gigabit to the desktop for a relatively small company in 2007, by 2014 10mb was pretty dead unless you had something Really Interesting connected....

10 days ago

1 reply

There's plenty of use cases for small things which don't need any sorts of speeds, where you might as well have used a 115200 baud serial connection but ethernet is more useful. Designing electronics for 10Mbit/s is infinitely easier and cheaper than designing electronics for 100Mbit/s, so if you don't need 100Mbit/s, why would you spend the extra effort and expense?

throw9023093209

10 days ago

There is also power consumption and reliability. I have part of my home network on 100Mbps. It eats about 60% less energy compared to Gb Ethernet. Less prone to interference from PoE.

linohh

10 days ago

1 reply

Technical debt goes hard, I had a discussion with a facilities guy why they never got around to ditch the last remnants of token ring in an office park. Fortunately in 2020 they had plenty of time to rip that stuff out without disturbing facility operation. Building automation, security and so on often lives way longer than you'd dare planning.

BikiniPrince

10 days ago

Everyone is forgetting the no delay is per application and not a system configuration. Yep, old things will still be old and that’s ok. That new fangled packet farter will need to set no delay which is a default in many scenarios. This article reminds us it is a thing and especially true for home grown applications.

myself248

10 days ago

APC UPS SmartSlot network monitoring cards. Only the very newest support 100Mbps....

kayfox

10 days ago

Things I have found that only do 10mbit:

Old CNC equipment.

Older Zebra label printers.

Some older Motorola radio stuff.

That SGI Indy we keep around for Jurassic Park jokes.

The LaserJet 5 thats still going after 30 years or something.

Some modern embedded stuff that does not have enough chooch to deal with 100mbit.

Hikikomori

10 days ago

Some old DEC devices used to connect console ports of servers. Didn't need it per say but also didn't need to spend $3k on multiple new console routers.

Was an old isp/mobile carrier so could find all kinds of old stuff. Even the first SMSC from the 80s (also DEC, 386 or similar cpu?) was still in it's racks because they didn't need the room as 2 modern racks used up all the power for that room, was also far down in a mountain so was annoying to remove equipment.

hylaride

10 days ago

If you worked in an industrial setting, legacy tech abounds due to the capital costs of replacing the equipment it supports (includes manufacturing, older hospitals, power plants, and etc). Many of these even still use token ring, coax, etc.

One co-op job at a manufacturing plant I worked at ~20 years ago involved replacing the backend core networking equipment with more modern ethernet kit, but we had to setup media converters (in that case token ring to ethernet) as close as possible to the manufacturing equipment (so that token ring only ran between the equipment and the media converter for a few meters at most).

They were "lucky" in that:

1) the networking protocol that was supported by the manufacturing equipment was IPX/SPX, so at least that worked cleanly on ethernet and newer upstream control software running on an OS (HP-UX at the time)

2) there were no lives at stake (eg nuclear safety/hospital), so they had minimal regulatory issues.

consp

10 days ago

There is always some legacy device which does weird/old connections. I distinctly remember the debit card terminals in the late '00 required a 10mbit capable ethernet connection which allowed x25 to be transmitted over the network. It is not a stretch to add 5 to 10 more years to those kind of devices.

electroly

10 days ago

This hasn't mattered in 20 years for me personally, but in 2003 I killed connectivity to a bunch of Siemens 505-CP2572 PLC ethernet cards by switching a hub from 10Mbps to 100Mbps mode. Even back then I assumed there wouldn't be anything requiring 10Mbps any more, but these things are still in use in production manufacturing facilities out there.

11 days ago

2 replies

Nagle is quite sensible when your application isn't taking any care to create sensibly-sized packets, and isn't so sensitive to latency. It avoids creating stupidly small packets unless your network is fast enough to handle them.

silisili

11 days ago

3 replies

At this point, this is an application level problem and not something the kernel should be silently doing for you IMO. An option for legacy systems or known problematic hosts fine, but off by default.

Every modern language has buffers in their stdlib. Anyone writing character at a time to the wire lazily or unintentionally should fix their application.

AnthonyMouse

11 days ago

2 replies

The programs that need it are mostly the ones nobody is maintaining.

TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for.

silisili

10 days ago

2 replies

> The programs that need it are mostly the ones nobody is maintaining

Yes, as I mentioned, it should be kept around for this but off by default. Make it a sysctl param, done.

> TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for

Only because it's on by default for no real reason. I'm saying the default should be off.

10 days ago

2 replies

>> TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for

> Only because it's on by default for no real reason. I'm saying the default should be off.

I'm assuming here that you mean that Nagle's algorithm is on by default, i.e TCP_NODELAY is off by default.

This is wrong. It seems you think the only extra fingerprinting info TCP_NODELAY gives you is the single bit "TCP_NODELAY is on vs off". But it's more than that.

In a world where every application's traffic goes through Nagle's algorithm, lots of applications will just be seen to transmit a packet every 300ms or whatever as their transmissions are buffered up by the kernel to be sent in large packets. In a world where Nagle's algorithm is off by default, those applications could have very different packet sizes and timings.

With something like Telnet or SSH, you might even be able to detect who exactly is typing at the keyboard by analyzing their key press rhythm!

silisili

10 days ago

> I'm assuming here that you mean that Nagle's algorithm is on by default, i.e TCP_NODELAY is off by default.

Correct, I wrote that backwards, good callout.

RE: fingerprinting, I'd concede the point in a sufficiently lazy implementation. I'd fully expect the application layer to handle this, especially in cases where this matters.

BenjiWiebe

10 days ago

As of recently, OpenSSH prevents keystroke timing analysis on it's own.

withinboredom

10 days ago

Nagles algorithm does really well when you're on shitty wifi.

Applications also don't know the MTU (the size of packets) on the interface they're using. Hell, they probably don't even know which interface they're using! This is all abstracted away. So, if you're on a network with a 14xx MTU (such as a VPN), assuming an MTU of 1500 means you'll send one full packet and then a tiny little packet after that.

Nagle's algorithm lets you just send data; no problem. Let the kernel batch up packets. If you control the protocol, just use a design that prevents Delayed ACK from causing the latency. IE, the "OK" from Redis.

FridgeSeal

10 days ago

2 replies

If nobody is maintaining them, do we really need them? In which case, does it really matter?

If we need them, and they’re not being maintained, then maybe that’s the kind of “scream test” wake up we need for them to either be properly deprecated, or updated.

10 days ago

1 reply

How much ongoing development effort do you think goes into, say, something like a gzip encoder?

yxhuvud

10 days ago

1 reply

A gzip encoder has no business deciding whether a socket should wait to fill up packets, however. The list of relevant applications and libraries gets a lot shorter with that restriction.

9 days ago

It was a response to the sentiment expressed in:

> If nobody is maintaining them, do we really need them?

Software can have value even when not maintained.

josefx

10 days ago

> If nobody is maintaining them, do we really need them?

Given how often issues can be traced back to open source projects barely scraping along? Yes and they are probably doing something important. Hell, if you create enough pointless busywork you can probably get a few more "helpfull" hackers into projects like xz.

10 days ago

2 replies

So to be clear, you believe every program that outputs a bulk stream to stdout should be written to check if stdout is a socket and enable Nagle's algorithm if so? That's not just busywork - it's also an abstraction violation. By explicitly turning off Nagle's, you specify that you understand TCP performance and don't need the abstraction, and this is a reasonable way to do things. Imagine if the kernel pinned threads to cores by default and you had to ask to unpin them...

9 days ago

1 reply

> That's not just busywork - it's also an abstraction violation.

You used AI to write this didn't you? Your sentence structure is not just tedious - it's a dead give-away.

9 days ago

No, I did not. Am I now forbidden from using sentence structures that AI has also used? That's not just stupid - it's insane. You know that's not even an em-dash, right?

rocqua

10 days ago

No, the program should take care to enable TCP_NODELAY when creating the socket. If the program gets passed a FD from outside it's on the outside program to ensure this. If somehow the program very often gets outside FDs from an oblivious source that could be a TCP socket, then it might indeed have to manually check if it really wants Nagle's algorithm.

hinkley

10 days ago

I think you’re forgetting how terminals work.

11 days ago

2 replies

If by "latency" you mean a hundred milliseconds or so, that's one thing, but I've seen Nagle delay packets by several seconds, which is just goofy.

Gibbon1

10 days ago

3 replies

Reminds me of trying to do IoT stuff in hospitals before IoT was a thing.

Send exactly one 205 byte packet. How do you really know? I can see it go out on a scope. And the other end receives a packet with bytes 0-56. Then another packet with bytes 142-204. Finally a packet a 200ms later with bytes 57-141.

FfffFFFFffff You!

readmodifywrite

10 days ago

1 reply

If you were using TCP, then this is absolutely normal and expected behavior. It is a stream protocol, not packet/message based.

josefx

10 days ago

1 reply

At the application layer you would not see the reordered bytes. However on the network you have IP beneath both UDP and TCP and network hardware is normally free to slice and reorder those IP packages however it wants.

10 days ago

It's not. Routers are expected to be allowed to slice IPv4 packets above 576 bytes. They can't slice IPv6 and they can't slice TCP.

However, malicious middleboxes insert themselves into your TCP connections, terminating a new connection on each side and therefore completely rewriting TCP segment boundaries.

sundbry

10 days ago

1 reply

If only there was some sort of User Datagram Protocol where you could send specifically tuned packets like this.

10 days ago

1 reply

Those who do not understand TCP are doomed to reimplement it with UDP.

The same is true of those who do understand it.

nomel

10 days ago

That first time you do sure is fun though.

taherm789

10 days ago

Things like these make me cry

littlecranky67

10 days ago

1 reply

It delays one RTT, so if you have seen seconds of delays that means your TCP ACK packages were received seconds later for whatever reason (high load?). Decreasing latency in that situation would WORSEN the situation.

10 days ago

1 reply

Maybe, maybe not, whatever.

I was testing some low-bandwidth voice chat code using two unloaded PCs sitting on the same desk. I nearly jumped out of my skin when "HELLO, HELLO?" came through a few seconds at high volume, after I had already concluded it wasn't working. TCP_NODELAY solved the problem.

littlecranky67

9 days ago

1 reply

Whatever your issue in that setup was, NODELAY had nothing to do with it

5d ago

1 reply

Then why did it stop happening when NODELAY was used?

littlecranky67

4d ago

Hard to say without looking at the complete setup - and probably just a side-effect of the underlying issue. I would even argue that NODELAY for a VoIP solution makes no sense - why are you even using TCP instead of UDP in the first place?

gerdesj

11 days ago

3 replies

I think you are confusing network layers and their functionality.

"CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel."

Ethernet really isn't ptp. You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?

"Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES."

That's full duplex as opposed to half duplex.

Nagle's algo has nothing to do with all that messy layer 1/2 stuff but is at the TCP layer and is an attempt to batch small packets into fewer larger ones for a small gain in efficiency. It is one of many optimisations at the TCP layer, such as Jumbo Frames and mini Jumbo Frames and much more.

https://oxide-and-friends.transistor.fm/episodes/mr-nagles-w...

11 days ago

2 replies

It's P2P as far as the physical layer (L1) is concerned.

Usually, full duplex requires two separate channels. The introduction of a hybrid on each end allows the use of the same channel at the same time.

Some progress has been made in doing the same thing with radio links, but it's harder.

gerdesj

11 days ago

4 replies

Sorry?

Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."

The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.

No idea why you are mentioning radios. That's another medium.

switchbak

11 days ago

2 replies

My understanding is that no one used hubs anymore, so your collision domain goes from a number of machines on a hub to a dedicated channel between the switch and the machine. There obviously won’t be collisions if you’re the only one talking and you’re able to do full duplex communications without issue.

nomel

10 days ago

1 reply

> My understanding is that no one used hubs anymore

this is absolutely hilarious.

switchbak

8 days ago

1 reply

Your comment is absolutely devoid of content.

Admittedly, I’m no networking expert but it was my understanding that most installs now use switches almost exclusively. Are you suggesting otherwise?

A quick search would seem to indicate I’m right. Do you mind elaborating on your snark?

nomel

18h ago

"No one" and "no new installations" are not the same. There are many many many millions of hubs out there in the world. The statement, as written, is just ludicrously naive, entirely disconnected from reality.

hylaride

10 days ago

Hubs still exist(ed), but nobody implemented half-duplex or CSMA from gigabit ethernet on up (I can't remember if it was technically part of the gig-e spec or not)

ahoka

10 days ago

CSMA is last used with 10Mbit ethernet, so that’s why radios are only relevant.

stephen_g

10 days ago

A hybrid is a type of RF transformer - https://en.wikipedia.org/wiki/Hybrid_transformer

Dylan16807

11 days ago

> Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."

Gigabit (and faster) is able to do full duplex without needing separate wires in each direction. That's the distinction they're making.

> The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.

Not in a modern network, where there's no such thing as a wired collision.

> Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?

Switches are not hubs. Switches have a separate receiver for each port.

AnthonyMouse

11 days ago

1 reply

> It's P2P as far as the physical layer (L1) is concerned.

Only in the sense that the "peer" is the switch. As soon as the switch goes to forward the packet, if ports 2 and 3 are both sending to port 1 at 1Gbps and port 1 is a 1Gbps port, 2Gbps won't fit and something's got to give.

mikepurvis

11 days ago

1 reply

Right but the switch has internal buffers and ability to queue those packets or apply backpressure. Resolving at that level is a very different matter from an electrical collision at L1.

AnthonyMouse

11 days ago

Not as far as TCP is concerned it isn't. You sent the network a packet and it had to throw it away because something else sent packets at the same time. It doesn't care whether the reason was an electrical collision or not. A buffer is just a funny looking wire.

saltcured

11 days ago

2 replies

In modern ethernet, there is also flow-control via the PAUSE frame. This is not for collisions at the media level, but you might think of it as preventing collisions at the buffer level. It allows the receiver to inform the sender to slow down, rather than just dropping frames when its buffers are full.

toast0

10 days ago

1 reply

At least in networks I've used, it's better for buffers to overflow than to use PAUSE.

Too many switches will get a PAUSE frame from port X and send it to all the ports that send packets destined for port X. Then those ports stop sending all traffic for a while.

About the only useful thing is if you can see PAUSE counters from your switch, you can tell a host is unhealthy from the switch whereas inbound packet overflows on the host might not be monitored... or whatever is making the host slow to handle packets might also delay monitoring.

saltcured

10 days ago

Sadly, I'm not too surprised to hear that. I wish we had more rapid iteration to improve such capabilities for real world use cases.

Things like back pressure and flow control are very powerful systems concepts, but intrinsically need there to be an identifiable flow to control! Our systems abstractions that multiplex and obfuscate flows are going to be unable to differentiate which application flow is the one that needs back pressure, and paint too-wide brush.

In my view, the fundamental problem is we're all trying to "have our cake and eat it". We expect our network core to be unaware of the edge device and application goals. We expect to be able to saturate an imaginary channel between two edge devices without any prearrangement, as if we're the only network users. We also expect our sparse and async background traffic to somehow get through promptly. We expect fault tolerance and graceful degradation. We expect fairness.

We don't really define or agree what is saturation, what is prompt, what is graceful, or what is fair... I think we often have selfish answers to these questions, and this yields a tragedy of the commons.

At the same time, we have so many layers of abstraction where useful flow information is effectively hidden from the layers beneath. That is even before you consider adversarial situations where the application is trying to confuse the issue.

Hikikomori

10 days ago

Its not really used in normal networks.

stephen_g

10 days ago

> You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?

CSMA/CD is specifically for a shared medium (shared collision domain in Ethernet terminology), putting a switch in it makes every port its own collision domain that are (in practice these days) always point-to-point. Especially for gigabit Ethernet, there was some info in the spec allowing for half-duplex operation with hubs but it was basically abandoned.

As others have said, different mechanisms are used to manage trying to send more data than a switch port can handle but not CSMA (because it's not doing any of it using Carrier Sense, and it's technically not Multiple Access on the individual segment, so CSMA isn't the mechanism being used).

> That's full duplex as opposed to half duplex.

No actually they're talking about something more complex, 100Mbps Ethernet had full duplex with separate transmit and receive pairs, but with 1000Base-T (and 10GBase-T etc.) the four pairs all simultaneously transmit and receive 250 Mbps (to add up to 1Gbps in each direction). Not that it's really relevant to the discussion but it is really cool and much more interesting than just being full duplex.

paulsutter

11 days ago

2 replies

False. It really was just intended to coalesce packets.

I’ll be nice and not attack the feature. But making that the default is one of the biggest mistakes in the history of networking (second only to TCP’s boneheaded congestion control that was designed imagining 56kbit links)

jandrese

10 days ago

1 reply

TCP uses the worst congestion control algorithm for general networks except for all of the others that have been tried. The biggest change I can think of is adjusting the window based on RTT instead of packet loss to avoid bufferbloat (Vegas).

Unless you have some kind of special circumstance you can leverage it's hard to beat TCP. You would not be the first to try.

paulsutter

10 days ago

1 reply

For serving web pages, TCP is only used by legacy servers.

The fundamental congestion control issue is that after you drop to half, the window is increased by /one packet/, which for all sorts of artificial reasons is about 1500 bytes. Which means the performance gets worse and worse the greater the bandwidth-delay product. Not to mention head-of-line blocking etc.

The reason for QUIC's silent success was the brilliant move of sidestepping the political quagmire around TCP congestion control, so they could solve the problems in peace

jandrese

10 days ago

1 reply

TCP Reno fixed that problem. QUIC is more about sending more parts of the page in parallel. It does do its own flow control, but that's not where it gets the majority of the improvement.

paulsutter

10 days ago

1 reply

TCP Reno Vegas etc all addressed congestion control with various ideas, but were all doomed by the academic downward spiral pissing contest.

QUIC is real and works great, and they sidestepped all of that and just built it and tuned it and has basically won. As for QUIC "sending more parts of the page in parallel" yes thats what I referred to re head of line blocking in TCP.

drewg123

10 days ago

1 reply

There is nothing magic about the congestion control in QUIC. It shares a lot with TCP BBR.

Unlike TLS over TCP, QUIC is still not able to be offloaded to NICs. And most stacks are in userspace. So it is horrifically expensive in terms of watts/byte or cycles/byte sent for a CDN workload (something like 8x as a expensive the last time I looked), and its primarily used and advocated for by people who have metrics for latency, but not server side costs.

cyberax

10 days ago

1 reply

> Unlike TLS over TCP, QUIC is still not able to be offloaded to NICs.

That's not quite true. You can offload QUIC connection steering just fine, as long as your NICs can do hardware encryption. It's actually _easier_ because you can never get a QUIC datagram split across multiple physical packets (barring the IP-level fragmentation).

The only real difference from TCP is the encryption for ACKs.

drewg123

9 days ago

From a CDN perspective, whats missing is there is no kernel stack on FreeBSD / Linux, and no support for sendfile/sendpage and no support for segmentation offload entirely in hardware. So you can't just send an entire file (or a large range) and forget about it, like you can with TCP.

Some NICs, like Broadcom's newer ones, support crypto offloads, but this is not enough to be competitive with TCP / TLS. Especially since support for those offloads are not in any mainline kernel in Linux or BSD.

aleph_minus_one

10 days ago

> (second only to TCP’s boneheaded congestion control that was designed imagining 56kbit links)

What would you change here?

fulafel

10 days ago

1 reply

Are you theorizing a CSMA related motivation or benefit in the Nagle algorithm or is this just an anecdote of those times?

throwway120385

10 days ago

1 reply

CSMA further limits the throughput of the network in cases where you're sending lots of small transmissions by making sure that you're always contending for the carrier.

fulafel

10 days ago

Ok so I guess the answer is that CSMA networks get congested more easily and Nagle saves some traffic, so it helps both in cases of local network congestion and path congestion further out.

martingxx

11 days ago

4 replies

I've always thought that Nagle's algorithm is putting policy in the kernel where it doesn't really belong.

If userspace applications want to make latency/throughput tradeoffs they can already do that with full awareness and control using their own buffers, which will also often mean fewer syscalls too.

ghshephard

11 days ago

1 reply

It's kind of in User Space though - right? When an application opens a socket - it decides whether to open it with TCP_NODELAY or not. There isn't any kernel/os setting - it's done on a socket by socket basis, no?

naught00

11 days ago

TCP_NODELAY is implemented within the kernel. A socket can decide whether to use it or not.

klempner

11 days ago

1 reply

The actual algorithm (which is pretty sensible in the absence of delayed ack) is fundamentally a feature of the TCP stack, which in most cases lives in the kernel. To implement the direct equivalent in userspace against the sockets API would require an API to find out about unacked data and would be clumsy at best.

With that said, I'm pretty sure it is a feature of the TCP stack only because the TCP stack is the layer they were trying to solve this problem at, and it isn't clear at all that "unacked data" is particularly better than a timer -- and of course if you actually do want to implement application layer Nagle directly, delayed acks mean that application level acking is a lot less likely to require an extra packet.

j16sdiz

11 days ago

1 reply

If your application need that level of control, you probably want to use UDP and have something like QUIC over it.

BTW, Hardware based TCP offloads engine exists... Don't think they are widely used nowadays though

nly

11 days ago

Hardware TCP offloads usually deal with the happy fast path - no gaps or out of order inbound packets - and fallback to software when shit gets messy.

Widely used in low latency fields like trading

kvemkon

11 days ago

The tradeoff on one program can influence the other program needing perhaps the opposite decision of such tradeoff. Thus we need the arbiter in the kernel to be able to control what is more important for the whole system. So my guess.

PunchyHamster

10 days ago

Technically yes, practically userspace apps are written by mostly people that either don't, or don't want to care about lower levels. There is plenty of badly written userspace code that will stay badly written.

And it would be right choice if it worked. Hell, simple 20ms flush timer would've made it work just fine.

vsgherzi

11 days ago

1 reply

oxide and friends episode on it! It's quite good

TZubiri

10 days ago

Very on brand, oxide's core proposition is to actually invent a new (server) os+hardware, so they question/polish many of the traditional protocols and standards from the golden era.

jonstewart

11 days ago

3 replies

e40

11 days ago

Came here thinking the same thing...

armitron

11 days ago

assuming he shows up, he'll still probably be trying to defend the indefensible...

Disabling Nagle's algorithm should be done as a matter of principle, there's simply no modern network configuration where it's beneficial.

Animats

11 days ago

OK, I suppose I should say something. I've already written on this before, and that was linked above.

You never want TCP_NODELAY off at the sending end, and delayed ACKs on at the receiving end. But there's no way to set that from one end. Hence the problem.

Is TCP_NODELAY off still necessary? Try sending one-byte TCP sends in a tight loop and see what it does to other traffic on the same path, for, say, a cellular link. Today's links may be able to tolerate the 40x extra traffic. It was originally put in as a protection device against badly behaved senders.

A delayed ACK should be thought of as a bet on the behavior of the listening application. If the listening application usually responds fast, within the ACK delay interval, the delayed ACK is coalesced into the reply and you save a packet. If the listening application does not respond immediately, a delayed ACK has to actually be sent, and nothing was gained by delaying it. It would be useful for TCP implementations to tally, for each socket, the number of delayed ACKs actually sent vs. the number coalesced. If many delayed ACKs are being sent, ACK delay should be turned off, rather than repeating a losing bet.

This should have been fixed forty years ago. But I was out of networking by the time this conflict appeared. I worked for an aerospace company, and they wanted to move all networking work from Palo Alto to Colorado Springs, Colorado. Colorado Springs was building a router based on the Zilog Z8000, purely for military applications. That turned out to be a dead end. The other people in networking in Palo Alto went off to form a startup to make a "PC LAN" (a forgotten 1980s concept), and for about six months, they led that industry. I ended up leaving and doing things for Autodesk, which worked out well.

dllthomas

11 days ago

3 replies

Wildly, the Polish word "nagle" (pronounced differently) means "suddenly" or "all at once", which is just astonishingly apropos for what I'm almost certain is pure coincidence.

sleepy_keita

11 days ago

1 reply

Yeah, it's named after the person who wrote the RFC - John Nagle. Wild coincidence! https://datatracker.ietf.org/doc/html/rfc896

pests

10 days ago

1 reply

hes on hn as "Animats"

tptacek

10 days ago

1 reply

Funner trivia is why he's named "Animats".

teach

10 days ago

1 reply

I've been on HN a long time and this comment was the one to finally make me realize that "animats" is "stamina" spelled backward.

tptacek

10 days ago

See, I didn't even realize that --- wasn't what I was referring to! :)

Pikamander2

10 days ago

Nominative determinism strikes again!

https://en.wikipedia.org/wiki/Nominative_determinism

marton78

10 days ago

Strangely, the Polish word seems to encode a superposition of both settings: with NODELAY on, TCP sends messages suddenly, whereas with NODELAY off it sends tiny messages all at once, in one TCP packet.

x2rj

11 days ago

6 replies

I'va always thought a problem with Nagel's algorithm is, that the socket API does not (really) have a function to flush the buffers and send everything out instantly, so you can use that after messages that require a timely answer.

For stuff where no answer is required, Nagel's algorithm works very well for me, but many TCP channels are mixed use these days. They send messages that expect a fast answer and other that are more asynchronous (from a users point of view, not a programmers).

Wouldn't it be nice if all operating systems, (home-)routers, firewalls and programming languages would have high quality implementations of something like SCTP...

loeg

11 days ago

1 reply

TCP_CORK?

x2rj

11 days ago

Sadly linux only (and apparently some BSDs). Would love to have more (and more generalized) tcp socket modes like that.

11 days ago

2 replies

Yeah, I’ve always felt that the stream API is a leaky abstraction for providing access to networking. I understand the attraction of making network I/O look like local file access given the philosophy of UNIX.

The API should have been message oriented from the start. This would avoid having the network stack try to compensate for the behavior of the application layer. Then Nagel’s or something like it would just be a library available for applications that might need it.

The stream API is as annoying on the receiving end especially when wrapping (like TLS) is involved. Basically you have to code your layers as if the underlying network is handing you a byte at a time - and the application has to try to figure out where the message boundaries are - adding a great deal of complexity.

calineczka

10 days ago

1 reply

> message oriented

Very well said. I think there is enormous complexity in many layers because we don't have that building block easily available.

10 days ago

1 reply

It's the main reason why I use websockets for a whole lot of things. I don't wanna build my own message chunking layer on top of TCP every time.

10 days ago

1 reply

WebSocket is full of web-tech silliness; you'd be better off doing your own framing.

10 days ago

1 reply

Well, it also has the advantage of providing pretty decent encryption for free through WSS.

But yeah, where that's unnecessary, it's probably just as easy to have a 4-byte length prefix, since TCP handles the checksum and retransmit and everything for you.

10 days ago

1 reply

It's just a standard TLS layer, works with any TCP protocol, nothing WebSocket-specific in it.

10 days ago

Meh I've worxed enough with OpenSSL's API to know that I'd never ever want to implement SSL myself. Better let the WebSocket library take care of it.

10 days ago

1 reply

the whole point of TCP is that it is a stream of bytes, not of messages.

The problem is that this is not in practice quite what most applications need, but the Internet evolved towards UDP and TCP only.

So you can have message-based if you want, but then you have to do sequencing, gap filling or flow control yourself, or you can have the overkill reliable byte stream with limited control or visibility at the application level.

10 days ago

1 reply

For me, the “whole point” of TCP is to add various delivery guarantees on top of IP. It does not mandate or require a particular API. Of course, you can provide a stream API over TCP which suits many applications but it does not suit all and by forcing this abstraction over TCP you end up making message oriented applications (e.g request /response type protocols) more complex to implement than if you had simply exposed the message oriented reality of TCP via an API.

10 days ago

1 reply

TCP is not message-oriented. Retransmitted bytes can be at arbitrary offsets and do not need to align with the way the original transmission was fragmented or even an earlier retransmission.

10 days ago

2 replies

I don’t understand your point here or maybe our understanding of the admittedly vague term “message oriented” differs.

I’m not suggesting exposing retransmission, fragmentation, etc to the API user.

The sender provides n bytes of data (a message) to the network stack. The receiver API provides the user with the block of n bytes (the message) as part of an atomic operation. Optionally the sender can be provided with notification when the n-bytes have been delivered to the receiver.

mkipper

10 days ago

1 reply

I think you're misunderstanding their point.

Your API is constrained by the actual TCP protocol. Even if the sender uses this message-oriented TCP API, the receiver can't make any guarantees that a packet they receive lines up with a message boundary, contains N messages, etc etc, due to how TCP actually works in the event of dropped packets and retransmissions. The receiver literally doesn't have the information needed to do that, and it's impossible for the receiver to reconstruct the original message sequence from the sender. You could probably re-implement TCP with retransmission behaviour that gives you what you're looking for, but that's not really TCP anymore.

This is part of the motivation for protocols like QUIC. Most people agree that some hybrid of TCP and UDP with stateful connections, guaranteed delivery and discrete messages is very useful. But no matter how much you fiddle with your code, neither TCP or UDP are going to give you this, which is why we end up with new protocols that add TCP-ish behaviour on top of UDP.

https://www.fragmentationneeded.net/2012/01/dispatches-from-...

9 days ago

Fair enough - I didn’t really fully consider the effect of retransmission and segmentation on the receiver view.

jasomill

10 days ago

Is this a TCP API proposal or a protocol proposal?

Because TCP, by design, is a stream-oriented protocol, and the only out-of-band signal I'm aware of that's intended to be exposed to applications is the urgent flag/pointer, but a quick Google search suggests that many firewalls clear these by default, so compatibility would almost certainly be an issue if your API relied on this.

I suppose you could implement a sort of "raw TCP" API to allow application control of segment boundaries, and force retransmission to respect them, but at the very least this would implicitly expose applications to fragmentation issues that would require additional API complexity to address.

josefx

10 days ago

I think you could try to add the flat MSG_MORE to every send command and then do a last send without it to indirectly do a flush.

amluto

11 days ago

The socket API is all kinds of bad. The way streams should work is that, when sending data, you set a bit indicating whether it’s okay to buffer the data locally before sending. So a large send could be done as a series of okay-to-buffer writes and then a flush-immediately write.

TCP_CORK is a rather kludgey alternative.

The same issue exists with file IO. Writing via an in-process buffer (default behavior or stdio and quite a few programming languages) is not interchangeable with unbuffered writes — with a buffer, it’s okay to do many small writes, but you cannot assume that the data will ever actually be written until you flush.

I’m a bit disappointed that Zig’s fancy new IO system pretends that buffered and unbuffered IO are two implementations of the same thing.

spacechild1

10 days ago

> the socket API does not (really) have a function to flush the buffers and send everything out instantly, so you can use that after messages that require a timely answer.

I never thought about that but I think you're absolutely right! In hindsight it's a glaring oversight to offer a stream API without the ability to flush the buffer.

lysace

11 days ago

Agreed. Something like sync(2) for filesystems.

Seems like there's been a disconnect between users and kernel developers here?

foltik

11 days ago

2 replies

Why doesn’t linux just add a kconfig that enables TCP_NODELAY system wide? It could be enabled by default on modern distros.

rini17

10 days ago

Perhaps you can set up iptables rule to add the bit.

indigodaddy

11 days ago

Looks like there is a sysctl option for BSD/MacOS but Linux it must be done at application level?

hsn915

11 days ago

1 reply

Wouldn't distributed systems benefit from using UDP instead of TCP?

0xbadcafebee

11 days ago

1 reply

[delayed]

neomantra

11 days ago

2 replies

This is true for simple UDP, but reliable transports are often built over UDP.

As with anything in computing, there are trade-offs between the approaches. One example is QUIC now widespread in browsers.

MoldUDP64 is used by various exchanges (that's NASDAQ's name, others do something close). It's a simple UDP protocol with sequence numbers; works great on quality networks with well-tuned receivers (or FPGAs). This is an old-school blog article about the earlier MoldUDP:

Another is Aeron.io, which is a high-performance messaging system that includes a reliable unicast/multicast transport. There is so much cool stuff in this project and it is useful to study. I saw this deep-dive into the Aeron reliable multicast protocol live and it is quite good, albeit behind a sign-up.

https://aeron.io/other/handling-data-loss-with-aeron/

wronex

10 days ago

There is also ENet which is used in a lot of games (that is, battle tested for low latency applications.)

https://enet.bespin.org

nospice

10 days ago

Strictly speaking, you can put any protocol on top of UDP, including a copy of TCP...

But I took parent's question as "should I be using UDP sockets instead of TCP sockets". Once you invent your new protocol instead of UDP or on top of it, you can have any features you want.

Veserv

11 days ago

1 reply

The problem is actually that nobody uses the generic solution to these classes of problems and then everybody complains that the special-case for one set of parameters works poorly for a different set of parameters.

Nagle’s algorithm is just a special case solution of the generic problem of choosing when and how long to batch. We want to batch because batching usually allows for more efficient batched algorithms, locality, less overhead etc. You do not want to batch because that increases latency, both when collecting enough data to batch and because you need to process the whole batch.

One class of solution is “Work or Time”. You batch up to a certain amount of work or up to a certain amount of time, whichever comes first. You choose your amount of time as your desired worst case latency. You choose your amount of work as your efficient batch size (it should be less than max throughput * latency, otherwise you will always hit your timer first).

Nagle’s algorithm is “Work” being one packet (~1.5 KB) with “Time” being the fallback timer of 500 ms when delayed ack is on. It should be obvious that is a terrible set of parameters for modern connections. The problem is that Nagle’s algorithm only deals with the “Work” component, but punts on the “Time” component allowing for nonsense like delayed ack helpfully “configuring” your effective “Time” component to a eternity.

What should be done is use the generic solutions that are parameterized by your system and channel properties which holistically solve these problems which would take too long to describe in depth here.

delifue

11 days ago

If the default time threshold change to 1ms will it be much better in common use cases?

harikb

10 days ago

1 reply

Somewhat related, from 3 years ago. Unfortunately, original blog is gone.

"Golang disables Nagle's Algorithm by default"

1. https://news.ycombinator.com/item?id=34179426

withinboredom

10 days ago

Yeah. A disk failed, and I had to recreate the blog from whatever was still available via other means.

wyldfire

10 days ago

1 reply

Nagle's algorithm is just a special case of TCP worst case latency. Packet loss and congestion also cause significant latency.

If you care about latency, you should consider something datagram oriented like UDP or SCTP.

10 days ago

What if occasional latency is fine, and latency on terrible networks with high packet loss is fine, but you want the happy case to have little latency? Both many (non-competitive) games and SSH falls into this: reliability is more important than achieving the absolute lowest latency possible, but lower latency is still better than higher latency.

rwmj

10 days ago

1 reply

I'm surprised the article didn't also mention MSG_MORE. On Linux it hints to the kernel that "more is to follow" (when sending data on a socket) so it shouldn't send it just yet. Maybe you need to send a header followed by some data. You could copy them into one buffer and use a single sendmsg call, but it's easier to send the header with MSG_MORE and the data in separate calls.

(io_uring and preallocated kernel buffers is another method that helps a lot here.)