In Defence of Swap: Common Misconceptions (2018)
Source: chrisdown.name · Hacker News story (high profile)
Key topics: Linux, Swap, Memory Management
The article defends the use of swap in Linux, challenging common misconceptions, and the discussion revolves around the usefulness and configuration of swap in various scenarios.
Snapshot generated from the HN discussion
Discussion activity: very active. First comment 3 hours after posting; peak of 119 comments on Day 1; roughly 30 comments per period on average, based on 150 loaded comments.
Key moments
- 01Story posted
Sep 20, 2025 at 8:06 PM EDT
4 months ago
Step 01 - 02First comment
Sep 20, 2025 at 11:34 PM EDT
3h after posting
Step 02 - 03Peak activity
119 comments in Day 1
Hottest window of the conversation
Step 03 - 04Latest activity
Oct 2, 2025 at 2:00 AM EDT
3 months ago
Step 04
ID: 45318798 · Type: story · Last synced: 11/20/2025, 7:55:16 PM
Configure it to fire at like 5% and forget it.
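The tool isn't named in the comment, but assuming a userspace killer such as earlyoom, the "fire at 5%" setup is a one-liner (both thresholds default to 10%):

  earlyoom -m 5 -s 5   # act when available RAM and free swap both drop below 5%
  # many distros pass these flags to the systemd unit via /etc/default/earlyoom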
I've never seen the OOM do its dang job with or without swap.
Sometimes I think if backing store and swap were more clearly delineated we might have got to decent algorithms sooner. Having a huge amount of swap pre-emptively claimed was making it look like starvation, when it was just a runtime planning strategy. It's also confusing how top and vmstat report things.
Also, as a mainly-BSD person, I think the differences stand out. I haven't noticed an OOM-killer approach on BSD.
Ancient model: twice as much swap as memory
Old model: same amount of swap as memory
New model: amount of swap your experience tells you this job mix demands to manage memory pressure fairly, which is a bit of a tall ask sometimes, but basically pick a number up to memory size.
For example, one of my database servers has 128GB of RAM and 8GB of swap. It tends to stabilize around 108GB of RAM and 5GB of swap usage under normal load, so I know that a 4GB swap would have been less than optimal. A larger swap would have been a waste as well.
Another rule of thumb is that performance degradation due to the active working set spilling into swap is exponential: 0.1% excess causes roughly 2x degradation, 1% causes 10x, and 10% causes 100x (assuming a 10^3 difference in latency between RAM and SSD).
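A rough way to see where those numbers come from, assuming a simple average-access-time model: if a fraction f of accesses miss RAM and hit a device that is R times slower, the average access cost scales as roughly 1 + f*(R - 1). With R = 1000, f = 0.001 gives about 2x, f = 0.01 about 11x, and f = 0.1 about 101x, matching the rule of thumb above.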
It'd be cool if Zram could apply to the RAM itself (like macOS) rather than needing a fake swap device.
https://docs.kernel.org/admin-guide/mm/zswap.html
The cgroup accounting also now works in zswap.
zram-based swap isn't free. Its efficiency depends on the compression ratio (and cost).
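For reference, a minimal sketch of setting up zram-backed swap by hand (as root; most distros ship something like zram-generator that does this for you, and which compressors are available depends on the kernel build):

  modprobe zram
  echo lz4 > /sys/block/zram0/comp_algorithm   # must be chosen before setting the size
  echo 4G > /sys/block/zram0/disksize
  mkswap /dev/zram0
  swapon -p 100 /dev/zram0                     # higher priority than any disk-backed swap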
https://github.com/CachyOS/CachyOS-Settings/blob/master/usr/...
Resulting in https://i.postimg.cc/hP37vvpJ/screenieshottie.png
Good enough...
BSD allocators simply return errors if no more memory is available; for backwards compatibility reasons Linux is stuck with a fatally flawed API that doesn't.
If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later.
My understanding (which could very well be wrong) is Linux overcommit will continue to allocate address space when asked regardless of memory pressure; but FreeBSD overcommit will refuse allocations when there's too much memory pressure.
I'm pretty sure I've seen FreeBSD's OOM killer, but it needs a specific pattern of memory use; it's much more likely for an application to get a failed allocation and exit, freeing memory, than for all the applications to have unused allocations that they then use.
All that said, I prefer to run with a small swap, somewhere around 0.5-2GB. Memory pressure is hard to measure (although recent Linux has a measure that I haven't used), but swap % and swap I/O are easy to measure. If your swap grows quickly, you might not have time to do any operations to fix it, but your stats should tell the tale. If your swap grows slowly enough, you can set thresholds and analyze the situation. If you have a lot of swap I/O, that provides a measure of urgency.
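The measure alluded to is presumably pressure-stall information (PSI), available since kernel 4.20 where it is built in and enabled; a quick way to read it (the numbers shown are illustrative):

  cat /proc/pressure/memory
  # some avg10=0.00 avg60=0.12 avg300=0.05 total=123456
  # full avg10=0.00 avg60=0.03 avg300=0.01 total=45678

The "some" line tracks time when at least one task was stalled on memory; "full" tracks time when all non-idle tasks were stalled.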
It depends, but generally speaking I'd disagree with that.
The only time you actually want to see the allocation failures is if you're writing high reliability software where you've gone to the trouble to guarantee some sort of meaningful forward progress when memory is exhausted. That is VERY VERY hard, and quickly becomes impossible when you have non-trivial library dependencies.
If all you do is raise std::bad_alloc or call abort(), handling a NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux; only root can mmap() the lowest page.
Admittedly I'm anal, and I write the explicit code to check for it and call abort(), but I know very experienced programmers I respect who don't.
If you don't care to handle the error, which is a totally reasonable position, there's not a whole lot of difference between the allocator returning a pointer that will make you crash on use because it's zero, and a pointer that will make you crash on use because there are no pages available. There is some difference because if you get the allocation while there are no pages available, the fallible allocator has returned a permanently dead pointer and the unfailing allocator has returned a pointer that can work in the future.
But if you do want to respond to errors, it is easier to respond to a NULL return rather than to a failed page fault. I certainly agree it's not easy to do much other than abort in most cases, but I'd rather have the opportunity to try.
It's just inherently incompatible with overcommit, isn't it? Like you can mmap() directly and use MAP_POPULATE|MAP_LOCKED to get what you want*, but that defeats overcommit entirely.
I guess I can imagine a syscall that takes a pointer and says "fault this page please but return an error instead of killing me if you can't", but there's an unavoidable TOCTOU problem in that it could be paged out again before you actually touch it.
A zany idea is to write a custom malloc() that uses userfaultfd to allow overcommit in userspace with it disabled in the kernel. The benefit being that userspace gets to decide what to do if a fault can't be satisfied instead of getting killed. But that would be pretty complex, and I don't know what the performance would look like.
* EDIT: Actually the manpage implies some ambiguity about whether MAP_LOCKED|MAP_POPULATE is guaranteed to avoid the first major fault, it might need mmap()+mlock(), I'd have to look more carefully...
It's true that if overcommit is enabled, you can't guarantee you won't end up with a page fault that can't be satisfied.
But my experience on FreeBSD, which has overcommit enabled by default and returns NULL when asked for allocations that can't be (currently) satisfied is that most of the time you get a NULL allocation rather than an unsatisfied page fault.
What typically happens is a program grows to use beyond available memory (and swap), and it does so by allocating large but manageable chunks, using them, and then repeating. At a certain point the OS struggles but is typically still able to find a page for each fault; however, the next large allocation looks too big, so the allocation fails and the program aborts.
But sometimes a program changes its usage pattern and starts using allocations that had been unused. In that case, you can still trigger the fatal page faults, because overcommit let you allocate more than is there.
If you don't want to have both scenarios, you can choose to eliminate the possibility of NULL by strictly allowing all allocations (although you could run out of address space and get a NULL at that point) or you can choose to eliminate the possibility of an unsatisfied page fault by strictly disallowing overcommit. I prefer having NULL when possible, and unsatisfied page faults when not.
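Choosing between those behaviors on Linux comes down to the overcommit sysctls; a minimal sketch of the relevant knobs (see the kernel's overcommit-accounting documentation):

  sysctl -w vm.overcommit_memory=0   # heuristic (default): refuse only obviously absurd requests
  sysctl -w vm.overcommit_memory=2   # strict: commit limit = swap + RAM * vm.overcommit_ratio / 100
  sysctl -w vm.overcommit_ratio=80
  grep -E 'CommitLimit|Committed_AS' /proc/meminfo   # compare the limit with what is committed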
This is the most important reason I try to avoid having a large swap. The duration of pathological behavior at near-OOM is proportional to the amount of swap you have. The sooner your program is killed, the sooner your monitoring system can detect it ("Connection refused" is much more clear cut than random latency spikes) and reboot/reprovision the faulty server. We no longer live in a world where we need to keep a particular server online at all cost. When you have an army of servers, a dead server is preferable to a misbehaving server.
OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention. This does not match my experience. It takes ages even to log in to a machine that is thrashing hard, let alone run any serious commands on it. The sooner you just let it crash, the sooner you can restore the system to a working state and inspect the logs in a more comfortable environment.
Like something is going very wrong if the system is in that state, so I want everything to die immediately.
https://unix.stackexchange.com/q/797835/1027 https://unix.stackexchange.com/q/797841/1027
The kernel oom killer is concerned with kernel survival, not user space performance.
The real danger in all of this, swap or no, is the shitty OOMKiller in Linux.
I didn't get that impression. My read was that OP was arguing for user-space process killers so the system doesn't get to the point where the system becomes unresponsive due to thrashing.
> With swap: ... We have more visibility into the instigators of memory pressure and can act on them more reasonably, and can perform a controlled intervention.
But of course if you're doing this kind of monitoring, you can probably just check your processes' memory usage and curb them long before they touch swap.
A machine that is responding just enough to keep a circuit breaker closed is the scourge of distributed systems.
So they invested in additional swap space, let the processes slowly grow, swap out leaked stuff and restart them all over the weekend...
on some workloads this may represent a non-trivial drop in performance due to stale, anonymous pages taking space away from more important use
WTF?
So even if you never run into OOM situations, adding a couple gigabytes of swap lets you free up that many gigabytes of RAM for file caching, and suddenly your application is on average 5x faster - but takes 3 seconds longer to service that one obscure API call that needs to dig all those pages back up. YMMV if you prefer consistently poor performance over inconsistent but usually much better performance.
Though any halfway competent Java developer following modern best practices will know to build systems that don't have these characteristics.
I'll let you know if I ever meet any. Until then, another terabyte of RAM for tomcat.
Eventually I wrote a small script that does the equivalent of "sudo swapoff -a && sudo swapon -a" to eagerly flush everything to RAM, but I was surprised by how many people seemed to think there's no legitimate reason to ever want to do so.
The outage ain't resolved until things are back to operating normally.
If things aren't back to 100% healthy, could be I didn't truly find the root cause of the problem - in which case I'll probably be woken up again in 30 minutes when the problem comes back.
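A minimal sketch of the "swapoff -a && swapon -a" flush mentioned above, with the obvious safety check that what is currently in swap actually fits in free RAM (run as root; the script is illustrative):

  #!/bin/sh
  swap_used_kb=$(awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  if [ "$swap_used_kb" -lt "$avail_kb" ]; then
      swapoff -a && swapon -a      # forces everything in swap back into RAM
  else
      echo "not enough free RAM to unswap safely" >&2
      exit 1
  fi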
I use vmtouch all the time to preload or even lock certain data/code into RAM.
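For anyone unfamiliar with vmtouch, a few representative invocations (file paths are hypothetical):

  vmtouch -v /var/lib/myapp/index.db    # report how much of the file is resident in the page cache
  vmtouch -t /var/lib/myapp/index.db    # touch every page to pull the file into the cache
  vmtouch -dl /usr/lib/libhotpath.so    # lock it into RAM and stay resident as a daemon (needs privileges)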
Sounds like it's as legitimate as running the sync command - ie. ideally you should never need to do it, but in practice you sometimes do.
About a decade ago, removable devices definitely were mounted with the "sync" option on some distros. It really tanked write performance though, so perhaps that's why they changed it. Certainly Plasma (and probably most DEs; I only use plasma) will tell you when the device is fully unmounted when you use the udisks integration.
If I wasn't aware of what was happening here I likely would have just force shut down the computer after a minute of waiting. And I suspect if I had done that and checked the drive it would have appeared like the file was there, while actually missing part of the data.
1. Arch uses the "flush" mount option by default when using udisks (which is how removable devices are mounted interactively from a DE).
2. Manjaro has a package called "udev-usb-sync" that matches USB devices in udev and limits the write-buffer size. However, it appears to calculate the buffer size from the USB transfer speed by default (you can instead specify a constant value), and given that I have some USB 3.1 devices that cannot maintain 1 MB/s of throughput while others can maintain over 200 MB/s, and both report the same transfer speed to Linux, I don't know how effective it is.
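A related generic mechanism is the kernel's dirty write-back limits, which bound how much un-flushed data can pile up before writers are throttled; a sketch of the global knobs (values illustrative, and per-device tuning also exists under /sys/class/bdi/):

  sysctl -w vm.dirty_background_bytes=16777216   # start background write-back once 16 MiB is dirty
  sysctl -w vm.dirty_bytes=67108864              # throttle writers once 64 MiB is dirty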
Worst case: I left 5% of my SSD unused, which will actually be used for garbage collection and other stuff. That's OK.
What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to swap, so even if you don't eat memory, your swap will be filled with tens or hundreds of MB of memory, and that's a great thing. Modern kernels just keep swap usage at 0 until memory is exhausted.
That's a couple terabyte of swap on servers these days, and even on laptops I wouldn't want to deal with 300-ish GB swap.
The article has the answer.
* http://jdebp.uk./FGA/dont-throw-those-paging-files-away.html
The erroneous folk wisdom is widespread. It often seems to lack any mention of the concepts of a resident set and a working set, and is always mixed in with a wishful thinking idea that somehow "new" computers obviate this, when the basic principles of demand paging are the same as they were four decades ago, Parkinson's Law can still be observed operating in the world of computers, and the "new" computers all of those years ago didn't manage to obviate paging files either.
My experience with swap shows that it only makes things worse. When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such a case the whole system practically stops working; even the mouse cursor can't move. If I am lucky, the OOM killer will eventually kill my buggy program, but after that it's not over: almost all used memory is now in swap and the whole system works snail-slow, presumably because the kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.
In a hypothetical case without swap, this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program and all other programs just continue working as before.
I think that overall reliance on swap is nowadays just a legacy of old times when main memory was scarce; back then it may have been useful to have swap. OS kernels should be redesigned to work without swap; this would make system behavior smoother, and the kernel code could be simpler (all this swapping code could be removed) and thus faster.
Ideally yes, but is that something you keep in mind when you write software? Do you ever consider freeing memory just because it hasn't been used in a while? How do you decide when to free it? This is all handled automatically when you have swap enabled, and at a much finer granularity than you could practically implement manually.
Programs that allocate large amounts of memory without strict necessity are just a consequence of the existence of swap. "Thanks" to swap, they were never properly tested in low-memory conditions, and so the necessary optimizations were never done.
Anyway, it's easy to discuss best practices, but getting people to actually follow them is the real issue. If you disable swap and the software you're running isn't optimized to minimize idle memory usage, then your system will be forced to keep all of that data in RAM.
This. The ability to allocate large amounts of memory is due to memory overcommit, not the "swap existence". If you disable swap, you can still allocate memory with almost no restrictions.
> This is all handled automatically when you have swap enabled
And this. This statement doesn't make any sense. If you disable swap, kernel memory management doesn't change; you only lose the ability to reclaim anon pages.
Who told you this? It's not remotely true.
Here's an article about this subject that you might want to read:
https://chrisdown.name/2018/01/02/in-defence-of-swap.html
Unless you have an almost pathological attention to detail, that is not true at all. And even if you do precisely scope your destructors, the underlying allocator won't return the memory to the OS (what matters here) immediately.
That's not how it works in practice. What happens is that program pages (and read-only data pages) get gradually evicted from memory and the system still slows to a crawl (to the point where it becomes practically unresponsive) because every access to program text outside the current 4KB page now potentially involves a swap-in. Sure, eventually, the memory-hungry task will either complete successfully or the OOM killer will be called, but that doesn't help you if you care about responsiveness first and foremost (and in practice, desktop users do care about that - especially when they're trying to terminate that memory hog).
Have a look at Chrome. Then have a look at all the Electron "desktop" apps, which all ship with a different Chrome version and different versions of shared libraries, which all can't share memory pages, because they're subtly different. You find similar patterns across many, many other workloads.
Because the code is never required in its entirety – only «currently» active code paths need to be resident in memory, the rest can be discarded when inactive (or never even gets loaded into memory to start off with) and paged back into memory on demand. Since code pages are read only, the inactive code pages can be just dropped without any detriment to the application whilst reducing the app's memory footprint.
> […] typical executable is usually several megabytes
Executable size != the size of the actually running code.
In modern operating systems with advanced virtual memory management systems, the actual resident code size can go as low as several kilobytes (or, rather, a handful of pages). This, of course, depends on whether the hot paths in the code have a close affinity to each other in the linked executable.
Think long-term recording applications, such as audio or studio situations where you want to "fire and forget" reliable recording systems of large amounts of data consistently from multiple streams for extended durations, for example.
Well-written, long term recording software doesn’t quit or crash. It records what it needs to record, and - by using swap - gives itself plenty of time to flush the buffers using whatever techniques are necessary for safety.
Disclaimer: I’ve written this software, both with and without swap available in various embedded contexts, in real products. The answer to the question is that having swap means higher data rates can be attained before needing to sync.
Power outages, hardware failures, and OS bugs happen to the finest application software.
I believe you, from your experience, that it can be useful to have recorded buffers swap out before flushing them to durable storage. But I do find it a bit surprising: since the swap system has to do the storage flush anyway, you are already paying for the I/O, so why not do it durably?
The fine article argued that you can save engineer cycles by having the OS manage optimizing out-of-working set memory for you, but that isn’t what you’re arguing here.
I’m interested in understanding your point.
Then, when the time is right, flush it all to disk.
The VMM is pretty good at being tight and productive - so use it as the big fat buffer it is, and spawn worker threads to flush things at appropriate times.
If you don't have swap, you have to flush more often ...
Reliably recording massive amounts of data for extended periods of time in a studio setting is the most obvious use case for a fixed-size buffer that gets flushed to durable storage at short and predictable time intervals. You wouldn't want a segfault wiping out the entire day's worth of work, would you?
Having swap/more memory available just means you have more buffers before needing to commit and in certain circumstances this can be very beneficial, such as when processing of larger amounts of logged data is needed prior to committing, etc.
There is certainly a case for both having and using swap, and disabling it entirely, depending on the data load and realtime needs of the application. Processing data and saving data have different requirements, and the point is really that there is no black and white on this. Use swap if it’s appropriate to the application - don’t use it, if it isn’t.
Instead of storing data (let's call them samples) to durable storage to begin with, you're letting the OS write them to swap which incurs the same cost, but then you need to read them from swap and write them to a different partition again (~triple the original cost).
Yes, sometimes, it's perfectly acceptable to flush to disk because you're getting low on RAM. But, on systems with, say .. 4x more swap than physical RAM .. you don't have to do a flush that often - if at all. This is great, for example with high quality audio loads that have to be captured safely over long periods.
A system with low RAM and high swap is also a bit more economical, at scale, when building actual hardware in large numbers. So exploiting swap in that circumstance can also affect the BOM costs.
The problem is not the existence of swap, but that people are unaware that the disk cache is equally important for performance.
20-30 years ago, heavy paging often crippled consumer Intel-based PCs[0] because paging went to slow mechanical hard disks on PATA/IDE, a parallel device bus (until circa 2005), which had little parallelism and initially no native command queuing; SCSI drives did offer features such as tagged command queuing and efficient scatter-gather, but were uncommon on desktops, let alone laptops. Today the bottlenecks are largely gone – abundant RAM, switched interconnects such as PCIe, SATA with NCQ/AHCI, and solid-state storage, especially NVMe, provide low-latency, highly parallel I/O – so paging still signals memory pressure yet is far less punishing on modern laptops and desktops.
Swap space today has a quieter benefit: lower energy use. On systems with LPDDR4/LPDDR5, the memory controller can place inactive banks into low-power or deep power-down states; by compressing memory and paging out cold, dirty pages to swap, the OS reduces the number of banks that must stay active, cutting DRAM refresh and background power. macOS on Apple Silicon is notably aggressive with memory compression and swap and works closely with the SoC power manager, which can contribute to the strong battery life of Apple laptops compared with competitors, albeit this is only one factor amongst several.
[0] RISC workstations and servers have had switched interconnects since day 1.
Having more RAM is always better performance, but swap allows you to skimp out on RAM in certain cases for almost identical performance but lower cost (of buying more RAM), if you run programs that allocate a lot of memory that it subsequently doesn't use. I hear Java is notoriously bad at this, so if you run a lot of heavy enterprise Java software, swap can get you the same performance with half the RAM.
(It is also a "GC strategy", or stopgap for memory leaks. Rather than managing memory, you "could" just never free memory, and allocate a fat blob of swap and let the kernel swap it out.)
I had one of those cases a few years ago when a program I was working on was leaking 12 MP raw image buffers in a drawing routine. I set it off running and went browsing HN/chatting with friends. A few minutes later I was like "this process is definitely taking too long" and when I went to check on it, it was using up 200+ GB of RAM (on a 16 GB machine) which had all gone to swap.
I hadn't noticed a thing! Modern SSDs are truly a marvel... (this was also on macOS rather than Linux, which may have a better swap implementation for desktop purposes)
You can limit resource usage per process, so your buggy application could be killed long before the system comes to a crawl. See your shell's entry on its limit/ulimit built-in, or use
prlimit(1) - get and set process resource limits
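A minimal sketch, with illustrative values and a hypothetical PID:

  ulimit -v 4194304                      # bash built-in: cap the address space at 4 GiB (value in KiB) for child processes
  prlimit --as=4294967296 --pid 12345    # same cap in bytes, applied to an already-running process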
The old rule of thumb of 1-2x your RAM is way too much for most systems. The solution isn't to turn it off, but to have a sensible limit. Try with half a gig of swap and see how that does. It may give you time to notice the system is degraded and pick something to kill yourself, and maybe even debug the memory issue if needed. You're not likely to have lasting performance issues from too many things swapped out after you or the OOM killer end the memory pressure, because not much of your memory will fit in swap.
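For completeness, a minimal sketch of adding a small swap file of that size (as root; dd avoids sparse-file pitfalls, and Btrfs needs extra handling such as a NOCOW file):

  dd if=/dev/zero of=/swapfile bs=1M count=512
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile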
Yes, indeed, the world would be a better place if we had just stopped writing Java 20 years ago.
> And how many memory such daemons can consume? A couple of hundred megabytes total?
Consider the average Java or .net enterprise programmer, who spends his entire career gluing together third-party dependencies without ever understanding what he's doing: Your executable is a couple hundred megabytes already, then you recursively initialize all the AbstractFactorySingletonFactorySingletonFactories with all their dependencies monkey patched with something worse for compliance reasons, and soon your program spends 90 seconds simply booting up and sits at two or three dozen gigabytes of memory consumption before it has served its first request.
> Is it really that much on modern systems?
If each of your Java/.net business app VMs needs 50 or so gigabytes to run smoothly, you can only squeeze ten of them in an 1U pizza box with a mere half terabyte RAM; while modern servers allow you to cram in multiple terabytes, do you really want to spend several tens of thousands of dollars on extra RAM, when swap storage is basically free?
Cloud providers do the same math, and if you look at e.g. AWS, swap on EBS costs as much per month as the same amount of RAM costs per hour. That's almost three orders of magnitude cheaper.
> When I program, my application may sometimes allocate a lot of memory due to some silly bug.
Yeah, that's on you. Many, many mechanisms let you limit per-process memory consumption.
But as TFA tries to explain, dealing with this situation is not the purpose of swap, and never has been. This is a pathological edge case.
> almost all used memory is now in swap and the whole system works snail-slow, presumably because kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.
This requires multiple conditions to be met
- the broken program is allocating a lot of RAM, but not quickly enough to trigger the OOM killer before everything has been swapped out
- you have a lot of swap (do you follow the 1990s recommendation of having 1-2x the RAM amount as swap?)
- the broken program sits in the same cgroup as all the programs you want to keep working even in an OOM situation
Condition 1 can't really be controlled, since it's a bug anyway.
Condition 2 doesn't have to be met unless you explicitly want it to. Why do you?
Condition 3 is realistic on desktop environments: despite years of messing around with flatpaks and snaps and all that nonsense, they're still not making it easy for users to isolate programs they run that haven't been pre-containerized.
But simply reducing swap to a more realistic size (try 4GB, see how far it gets you) will make this problem much less dramatic, as only parts of the RAM have to get flushed back.
> In a hypothetical case without swap this case isn't so painful. When main system memory is almost fully consumed, OOM killer kills the most memory hungry program and all other programs just continue working as before.
And now you're wasting RAM that could be used for caching file I/O. Have you benchmarked how much time you're wasting through that?
> I think that overall reliance on swap is noways just a legacy of old times when main memory was scarce and back than it maybe was useful to have swap.
No, you just still don't understand the purpose of swap.
Also, "old times"? You mean today? Because we still have embedded environments, we have containers, we have VMs, almost all software not running on a desktop is running in strict memory constraints.
> and kernel code may be simpler (all this swapping code may be removed)
So you want to remove all code for file caching? Bold strategy.
On the other side, a Raspberry Pi froze unexpectedly (not due to low memory) until a very small swap file was enabled. It was almost never used, but the freezes stopped. Fun swap stories.
That's the long-standing defect that needs to be corrected then, there should be no dependence on swap existing whatsoever as long as you have more than enough memory for the entire workload.
The warm up computation does take like 1/4 the time if it can live entirely in RAM, but using NVMe as “discount RAM” reduces the United States dollar cost of the system by 97% compared to RAM-only.
The AWS control plane will detect an ailing SSD backing up the EBS and will proactively evacuate the data before the physical storage goes pear shaped.
If it is an EC2 instance with an instance attached NVMe, the control plane will issue an alert that can be automatically acted upon, and the instance can be bounced with a new EC2 instance allocated from a pool of the same instance type and get a new NVMe. Provided, of course, the design and implementation of the running system are stateless and can rebuild the working set upon a restart.
Regardless, AWS takes care of the hardware cycling / migration in either case.
> Total bytes written calculated assuming drive is 100% full (user capacity) with workload of 100% random aligned 4KB writes.
[1]: page 6/17, https://assets.micron.com/adobe/assets/urn:aaid:aem:d133a40b...
Only if you use a consumer grade flash with a non-consumer grade usage.
For anything with DWPD >= 1 it's not an issue, e.g.:
https://news.ycombinator.com/item?id=45273937
At that price point, either we use swap and let the kernel engineers move data from RAM to disk and back, or we disable swap and need user space code to move the same data to disk and back. We’d need to price out writing & maintaining the user space implementation (mmap perhaps?) for it to be fair price comparison.
To avoid SSD wear and tear, we could spend $29 million a year more to put the data in RAM only. Not worth!
(We rent EC2 instances from AWS, so SSD wear is baked into the pricing)
No, this is just plain wrong. There are very specific problems which happen when there is not enough memory.
1. File-backed page reads causing more disk reads, eventually ending with "programs being executed from disk" (shared libraries are also mmapped), which feels like a system lockup. This does not need any "egalitarian reclamation" abstraction or swap, and swap does not solve it. It can be solved simply by reserving some minimal amount of memory for buf/cache, with which the system stays responsive.
2. Eventual failure to allocate more memory for some process. Any solution like "page reclamation" that pushes unused pages out to swap can only increase the maximum amount of memory that can be used before it happens, from one finite value to a bigger finite value. When there is no memory to free without losing data, some process must be killed. Swap does not solve this. The least bad solution would be to warn the user in advance and let them choose processes to kill.
See also https://github.com/hakavlad/prelockd
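For context on "reserving some minimal amount of memory for buf/cache": mainline has no knob that protects the page cache outright (prelockd above is one attempt to fill that gap); the closest generic tunables just keep a larger reserve of free pages so reclaim has room to work (values illustrative):

  sysctl -w vm.min_free_kbytes=262144        # keep roughly 256 MiB free
  sysctl -w vm.watermark_scale_factor=200    # wake kswapd earlier, reclaim more proactively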
On the other hand, these days the latest SSDs are way faster than memory compression, even with LUKS encryption on and even when the compression uses LZ4. Plus, modern SSDs do not suffer from frequent writes the way they used to, so on my laptop I disabled memory compression, and then all the reasoning from the article applies again.
Then, on a development laptop running compilations/containers/VMs/a browser, vm.swappiness does not seem to matter that much if one has enough memory. So I no longer tune it to 100 or more and leave it at the default of 60.
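For reference, the knob being discussed (a sketch; 60 is the default, and the file name is illustrative):

  sysctl -w vm.swappiness=60
  echo 'vm.swappiness=60' > /etc/sysctl.d/99-swappiness.conf   # persist across reboots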
That's a really provocative claim. Any benchmarks to support this?
Read test via copying to RAM from LUKS-encrypted Btrfs, compared against /tmp, which is a standard RAM disk, plus a write test; then a RAM disk with zram compression was prepared to emulate zram, and the same write and read tests were run against lz4-compressed zram. Result: the SSD with LUKS is 1.5 times faster than zram for reads and 5 times faster than zram for writes. Note that without LUKS, with native SSD encryption instead, the SSD speed will be at least 2 times faster. Also, using a recent kernel is important so that LUKS uses CPU instructions for AES encryption; without that, an SSD under LUKS will be several times slower.
On the SSD I am in fact measuring LUKS performance, as the I/O is much faster than LUKS encryption even using specialized CPU instructions. As I wrote, without LUKS the numbers are at least twice as fast, even with random access.
The point is that in 2025, with the latest SSDs, there is no point in using compressed memory. Even with LUKS encryption it will be faster than even a highly tuned swap setup.
In 2022-23, when LUKS was not optimized, it was different, so I used hardware encryption on the SSD after realizing that even lz4 compression was significantly slower than the SSD.
Basic memory bandwidth test:
Read test:
$ dd if=random.bin of=/tmp/random.bin conv=fdatasync bs=10M iflag=direct oflag=direct
2500+0 records in
2500+0 records out
26214400000 bytes (26 GB, 24 GiB) copied, 9.09728 s, 2.9 GB/s
Write test:
Not sure why that disk write test was suddenly so bad.
Write test to lz4-compressed ZRAM:
Read test:
Also, on your SSD do you have logical 4K sectors or 512-byte sectors? If the latter, Linux distros default to 512-byte LUKS sectors on them, resulting in much slower performance, especially with writes.
I always ensure that LUKS sectors are 4K, even if the SSD reports 512 bytes and does not allow changing that to 4K, like the Samsung 9* series.
With 4K LUKS sectors the write speed is too low for a modern SSD. Check that LUKS uses a fast implementation. I have:
... aes-xts 512b 2708.8 MiB/s 2986.0 MiB/s
and my LUKS setup uses aes-xts 512b.
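Two cryptsetup commands relevant here, assuming LUKS2 (the second is destructive: it reformats the device, and the device path is hypothetical):

  cryptsetup benchmark                                      # shows which cipher implementations are fast on this CPU
  cryptsetup luksFormat --sector-size 4096 /dev/nvme0n1p2   # format with 4K sectors up front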
Edit: You keep repeating that all over the thread - I started my week an hour ago and I had two of these memory stall events during the weekend. Again, it's the machines with big binaries running and no swap. Maybe you could provide a better explanation than "this doesn't happen"?
I doubt that disabling swap reduces your stalls, but I'd like to see the numbers for that.
I misread your comment, so I didn't check all the files, but here's how /proc/pressure/ looks now after managing to close Chrome:
Or to look at it from another perspective... it lets you reclaim unprovable memory leaks.
The Linux kernel still has this global lock for swappiness.
The downside is no smooth video playback.
In the past, they recommended against that for deadlock reasons.
On the workloads I care about (desktop, and servers that avoid mmap), the anonymous dirty page part of the kernel heap is < 10% of RAM, so swap is mostly just there to waste space and slow down the oomkiller.