In Defence of Swap: Common Misconceptions (2018)
Source: chrisdown.name · Hacker News story (high profile)
Key topics: Linux, Swap, Memory Management
The article defends the use of swap in Linux, challenging common misconceptions, and the discussion revolves around the usefulness and configuration of swap in various scenarios.
Snapshot generated from the HN discussion
Discussion activity: very active. First comment 3 hours after posting; peak of 119 comments on Day 1; roughly 30 comments per period on average, based on 150 loaded comments.
Key moments
- 01Story posted
Sep 20, 2025 at 8:06 PM EDT
4 months ago
Step 01 - 02First comment
Sep 20, 2025 at 11:34 PM EDT
3h after posting
Step 02 - 03Peak activity
119 comments in Day 1
Hottest window of the conversation
Step 03 - 04Latest activity
Oct 2, 2025 at 2:00 AM EDT
3 months ago
Step 04
ID: 45318798 · Type: story · Last synced: 11/20/2025, 7:55:16 PM
Configure it to fire at like 5% and forget it.
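The tool isn't named in the comment, but assuming a userspace killer such as earlyoom, the "fire at 5%" setup is a one-liner (both thresholds default to 10%):

  earlyoom -m 5 -s 5   # act when available RAM and free swap both drop below 5%
  # many distros pass these flags to the systemd unit via /etc/default/earlyoom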
I've never seen the OOM do its dang job with or without swap.
Sometimes I think if backing store and swap were more clearly delineated we might have got to decent algorithms sooner. Having a huge amount of swap pre-emptively claimed was making it look like starvation, when it was just a runtime planning strategy. It's also confusing how top and vmstat report things.
Also, as a mainly-BSD person, I think the differences stand out. I haven't noticed an OOM-killer approach on BSD.
Ancient model: twice as much swap as memory
Old model: same amount of swap as memory
New model: amount of swap your experience tells you this job mix demands to manage memory pressure fairly, which is a bit of a tall ask sometimes, but basically pick a number up to memory size.
For example, one of my database servers has 128GB of RAM and 8GB of swap. It tends to stabilize around 108GB of RAM and 5GB of swap usage under normal load, so I know that a 4GB swap would have been less than optimal. A larger swap would have been a waste as well.
Another rule of thumb is that performance degradation due to the active working set spilling into swap is exponential: 0.1% excess causes roughly 2x degradation, 1% causes 10x, and 10% causes 100x (assuming a 10^3 difference in latency between RAM and SSD).
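A rough way to see where those numbers come from, assuming a simple average-access-time model: if a fraction f of accesses miss RAM and hit a device that is R times slower, the average access cost scales as roughly 1 + f*(R - 1). With R = 1000, f = 0.001 gives about 2x, f = 0.01 about 11x, and f = 0.1 about 101x, matching the rule of thumb above.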
It'd be cool if Zram could apply to the RAM itself (like macOS) rather than needing a fake swap device.
https://docs.kernel.org/admin-guide/mm/zswap.html
The cgroup accounting also now works in zswap.
zram-based swap isn't free. Its efficiency depends on the compression ratio (and cost).
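For reference, a minimal sketch of setting up zram-backed swap by hand (as root; most distros ship something like zram-generator that does this for you, and which compressors are available depends on the kernel build):

  modprobe zram
  echo lz4 > /sys/block/zram0/comp_algorithm   # must be chosen before setting the size
  echo 4G > /sys/block/zram0/disksize
  mkswap /dev/zram0
  swapon -p 100 /dev/zram0                     # higher priority than any disk-backed swap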
https://github.com/CachyOS/CachyOS-Settings/blob/master/usr/...
Resulting in https://i.postimg.cc/hP37vvpJ/screenieshottie.png
Good enough...
BSD allocators simply return errors if no more memory is available; for backwards compatibility reasons Linux is stuck with a fatally flawed API that doesn't.
If you allocate and touch everything, and then try to allocate more, it's better to get an allocation error than an unsatisfiable page fault later.
My understanding (which could very well be wrong) is Linux overcommit will continue to allocate address space when asked regardless of memory pressure; but FreeBSD overcommit will refuse allocations when there's too much memory pressure.
I'm pretty sure I've seen FreeBSD's OOM killer, but it needs a specific pattern of memory use; it's much more likely for an application to get a failed allocation and exit, freeing memory, than for all the applications to have unused allocations that they then use.
All that said, I prefer to run with a small swap, somewhere around 0.5-2GB. Memory pressure is hard to measure (although recent Linux has a measure that I haven't used), but swap % and swap I/O are easy to measure. If your swap grows quickly, you might not have time to do any operations to fix it, but your stats should tell the tale. If your swap grows slowly enough, you can set thresholds and analyze the situation. If you have a lot of swap I/O, that provides a measure of urgency.
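The measure alluded to is presumably pressure-stall information (PSI), available since kernel 4.20 where it is built in and enabled; a quick way to read it (the numbers shown are illustrative):

  cat /proc/pressure/memory
  # some avg10=0.00 avg60=0.12 avg300=0.05 total=123456
  # full avg10=0.00 avg60=0.03 avg300=0.01 total=45678

The "some" line tracks time when at least one task was stalled on memory; "full" tracks time when all non-idle tasks were stalled.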
It depends, but generally speaking I'd disagree with that.
The only time you actually want to see the allocation failures is if you're writing high reliability software where you've gone to the trouble to guarantee some sort of meaningful forward progress when memory is exhausted. That is VERY VERY hard, and quickly becomes impossible when you have non-trivial library dependencies.
If all you do is raise std::bad_alloc or call abort(), handling a NULL return from malloc() is arguably a waste of icache: just let it crash. Dereferencing NULL is guaranteed to crash on Linux; only root can mmap() the lowest page.
Admittedly I'm anal, and I write the explicit code to check for it and call abort(), but I know very experienced programmers I respect who don't.
If you don't care to handle the error, which is a totally reasonable position, there's not a whole lot of difference between the allocator returning a pointer that will make you crash on use because it's zero, and a pointer that will make you crash on use because there are no pages available. There is some difference because if you get the allocation while there are no pages available, the fallible allocator has returned a permanently dead pointer and the unfailing allocator has returned a pointer that can work in the future.
But if you do want to respond to errors, it is easier to respond to a NULL return rather than to a failed page fault. I certainly agree it's not easy to do much other than abort in most cases, but I'd rather have the opportunity to try.
It's just inherently incompatible with overcommit, isn't it? Like you can mmap() directly and use MAP_POPULATE|MAP_LOCKED to get what you want*, but that defeats overcommit entirely.
I guess I can imagine a syscall that takes a pointer and says "fault this page please but return an error instead of killing me if you can't", but there's an unavoidable TOCTOU problem in that it could be paged out again before you actually touch it.
A zany idea is to write a custom malloc() that uses userfaultfd to allow overcommit in userspace with it disabled in the kernel. The benefit being that userspace gets to decide what to do if a fault can't be satisfied instead of getting killed. But that would be pretty complex, and I don't know what the performance would look like.
* EDIT: Actually the manpage implies some ambiguity about whether MAP_LOCKED|MAP_POPULATE is guaranteed to avoid the first major fault, it might need mmap()+mlock(), I'd have to look more carefully...
It's true that if overcommit is enabled, you can't guarantee you won't end up with a page fault that can't be satisfied.
But my experience on FreeBSD, which has overcommit enabled by default and returns NULL when asked for allocations that can't be (currently) satisfied is that most of the time you get a NULL allocation rather than an unsatisfied page fault.
What typically happens is a program grows to use beyond available memory (and swap), and it does so by allocating large but manageable chunks, using them, and then repeating. At a certain point the OS struggles but is typically still able to find a page for each fault; however, the next large allocation looks too big, so the allocation fails and the program aborts.
But sometimes a program changes its usage pattern and starts using allocations that had been unused. In that case, you can still trigger the fatal page faults, because overcommit let you allocate more than is there.
If you don't want to have both scenarios, you can choose to eliminate the possibility of NULL by strictly allowing all allocations (although you could run out of address space and get a NULL at that point) or you can choose to eliminate the possibility of an unsatisfied page fault by strictly disallowing overcommit. I prefer having NULL when possible, and unsatisfied page faults when not.
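Choosing between those behaviors on Linux comes down to the overcommit sysctls; a minimal sketch of the relevant knobs (see the kernel's overcommit-accounting documentation):

  sysctl -w vm.overcommit_memory=0   # heuristic (default): refuse only obviously absurd requests
  sysctl -w vm.overcommit_memory=2   # strict: commit limit = swap + RAM * vm.overcommit_ratio / 100
  sysctl -w vm.overcommit_ratio=80
  grep -E 'CommitLimit|Committed_AS' /proc/meminfo   # compare the limit with what is committed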
This is the most important reason I try to avoid having a large swap. The duration of pathological behavior at near-OOM is proportional to the amount of swap you have. The sooner your program is killed, the sooner your monitoring system can detect it ("Connection refused" is much more clear cut than random latency spikes) and reboot/reprovision the faulty server. We no longer live in a world where we need to keep a particular server online at all cost. When you have an army of servers, a dead server is preferable to a misbehaving server.
OP tries to argue that a long period of thrashing will give you an opportunity for more visibility and controlled intervention. This does not match my experience. It takes ages even to log in to a machine that is thrashing hard, let alone run any serious commands on it. The sooner you just let it crash, the sooner you can restore the system to a working state and inspect the logs in a more comfortable environment.
Like something is going very wrong if the system is in that state, so I want everything to die immediately.
https://unix.stackexchange.com/q/797835/1027 https://unix.stackexchange.com/q/797841/1027
The kernel oom killer is concerned with kernel survival, not user space performance.
The real danger in all of this, swap or no, is the shitty OOMKiller in Linux.
I didn't get that impression. My read was that OP was arguing for user-space process killers so the system doesn't get to the point where the system becomes unresponsive due to thrashing.
> With swap: ... We have more visibility into the instigators of memory pressure and can act on them more reasonably, and can perform a controlled intervention.
But of course if you're doing this kind of monitoring, you can probably just check your processes' memory usage and curb them long before they touch swap.
A machine that is responding just enough to keep a circuit breaker closed is the scourge of distributed systems.
So they invested in additional swap space, let the processes slowly grow, swap out leaked stuff and restart them all over the weekend...
on some workloads this may represent a non-trivial drop in performance due to stale, anonymous pages taking space away from more important use
WTF?
So even if you never run into OOM situations, adding a couple gigabytes of swap lets you free up that many gigabytes of RAM for file caching, and suddenly your application is on average 5x faster - but takes 3 seconds longer to service that one obscure API call that needs to dig all those pages back up. YMMV if you prefer consistently poor performance over inconsistent but usually much better performance.
Though any halfway competent Java developer following modern best practices will know to build systems that don't have these characteristics.
I'll let you know if I ever meet any. Until then, another terabyte of RAM for tomcat.
Eventually I wrote a small script that does the equivalent of "sudo swapoff -a && sudo swapon -a" to eagerly flush everything to RAM, but I was surprised by how many people seemed to think there's no legitimate reason to ever want to do so.
The outage ain't resolved until things are back to operating normally.
If things aren't back to 100% healthy, could be I didn't truly find the root cause of the problem - in which case I'll probably be woken up again in 30 minutes when the problem comes back.
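A minimal sketch of the "swapoff -a && swapon -a" flush mentioned above, with the obvious safety check that what is currently in swap actually fits in free RAM (run as root; the script is illustrative):

  #!/bin/sh
  swap_used_kb=$(awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {print t-f}' /proc/meminfo)
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  if [ "$swap_used_kb" -lt "$avail_kb" ]; then
      swapoff -a && swapon -a      # forces everything in swap back into RAM
  else
      echo "not enough free RAM to unswap safely" >&2
      exit 1
  fi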
I use vmtouch all the time to preload or even lock certain data/code into RAM.
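For anyone unfamiliar with vmtouch, a few representative invocations (file paths are hypothetical):

  vmtouch -v /var/lib/myapp/index.db    # report how much of the file is resident in the page cache
  vmtouch -t /var/lib/myapp/index.db    # touch every page to pull the file into the cache
  vmtouch -dl /usr/lib/libhotpath.so    # lock it into RAM and stay resident as a daemon (needs privileges)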
Sounds like it's as legitimate as running the sync command - ie. ideally you should never need to do it, but in practice you sometimes do.
About a decade ago, removable devices definitely were mounted with the "sync" option on some distros. It really tanked write performance though, so perhaps that's why they changed it. Certainly Plasma (and probably most DEs; I only use plasma) will tell you when the device is fully unmounted when you use the udisks integration.
If I wasn't aware of what was happening here I likely would have just force shut down the computer after a minute of waiting. And I suspect if I had done that and checked the drive it would have appeared like the file was there, while actually missing part of the data.
1. Arch uses the "flush" mount option by default when using udisks (which is how removable devices are mounted interactively from a DE).
2. Manjaro has a package called "udev-usb-sync" that matches USB devices in udev and limits the write-buffer size. However, it appears to calculate the buffer size from the USB transfer speed by default (you can instead specify a constant value), and given that I have some USB 3.1 devices that cannot maintain 1 MB/s of throughput while others can maintain over 200 MB/s, and both report the same transfer speed to Linux, I don't know how effective it is.
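A related generic mechanism is the kernel's dirty write-back limits, which bound how much un-flushed data can pile up before writers are throttled; a sketch of the global knobs (values illustrative, and per-device tuning also exists under /sys/class/bdi/):

  sysctl -w vm.dirty_background_bytes=16777216   # start background write-back once 16 MiB is dirty
  sysctl -w vm.dirty_bytes=67108864              # throttle writers once 64 MiB is dirty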
Worst case: I left 5% of my SSD unused, which will actually be used for garbage collection and other stuff. That's OK.
What I don't understand is why modern Linux is so shy of touching swap. With old kernels, Linux happily pushed unused pages to swap, so even if you don't eat memory, your swap will be filled with tens or hundreds of MB of memory, and that's a great thing. Modern kernels just keep swap usage at 0 until memory is exhausted.
That's a couple terabyte of swap on servers these days, and even on laptops I wouldn't want to deal with 300-ish GB swap.
The article has the answer.
* http://jdebp.uk./FGA/dont-throw-those-paging-files-away.html
The erroneous folk wisdom is widespread. It often seems to lack any mention of the concepts of a resident set and a working set, and is always mixed in with a wishful thinking idea that somehow "new" computers obviate this, when the basic principles of demand paging are the same as they were four decades ago, Parkinson's Law can still be observed operating in the world of computers, and the "new" computers all of those years ago didn't manage to obviate paging files either.
My experience with swap shows that it only makes things worse. When I program, my application may sometimes allocate a lot of memory due to some silly bug. In such a case the whole system practically stops working; even the mouse cursor can't move. If I am lucky, the OOM killer will eventually kill my buggy program, but after that it's not over: almost all used memory is now in swap and the whole system works snail-slow, presumably because the kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.
In a hypothetical case without swap, this isn't so painful. When main system memory is almost fully consumed, the OOM killer kills the most memory-hungry program and all other programs just continue working as before.
I think that overall reliance on swap is nowadays just a legacy of old times when main memory was scarce; back then it may have been useful to have swap. OS kernels should be redesigned to work without swap; this would make system behavior smoother, and the kernel code could be simpler (all this swapping code could be removed) and thus faster.
Ideally yes, but is that something you keep in mind when you write software? Do you ever consider freeing memory just because it hasn't been used in a while? How do you decide when to free it? This is all handled automatically when you have swap enabled, and at a much finer granularity than you could practically implement manually.
Programs that allocate large amounts of memory without strict necessity are just a consequence of the existence of swap. "Thanks" to swap, they were never properly tested in low-memory conditions, and so the necessary optimizations were never done.
Anyway, it's easy to discuss best practices, but getting people to actually follow them is the real issue. If you disable swap and the software you're running isn't optimized to minimize idle memory usage, then your system will be forced to keep all of that data in RAM.
This. The ability to allocate large amounts of memory is due to memory overcommit, not the "swap existence". If you disable swap, you can still allocate memory with almost no restrictions.
> This is all handled automatically when you have swap enabled
And this. This statement doesn't make any sense. If you disable swap, kernel memory management doesn't change; you only lose the ability to reclaim anon pages.
Who told you this? It's not remotely true.
Here's an article about this subject that you might want to read:
https://chrisdown.name/2018/01/02/in-defence-of-swap.html
Unless you have an almost pathological attention to detail, that is not true at all. And even if you do precisely scope your destructors, the underlying allocator won't return the memory to the OS (what matters here) immediately.
That's not how it works in practice. What happens is that program pages (and read-only data pages) get gradually evicted from memory and the system still slows to a crawl (to the point where it becomes practically unresponsive) because every access to program text outside the current 4KB page now potentially involves a swap-in. Sure, eventually, the memory-hungry task will either complete successfully or the OOM killer will be called, but that doesn't help you if you care about responsiveness first and foremost (and in practice, desktop users do care about that - especially when they're trying to terminate that memory hog).
Have a look at Chrome. Then have a look at all the Electron "desktop" apps, which all ship with a different Chrome version and different versions of shared libraries, which all can't share memory pages, because they're subtly different. You find similar patterns across many, many other workloads.
Because the code is never required in its entirety – only «currently» active code paths need to be resident in memory, the rest can be discarded when inactive (or never even gets loaded into memory to start off with) and paged back into memory on demand. Since code pages are read only, the inactive code pages can be just dropped without any detriment to the application whilst reducing the app's memory footprint.
> […] typical executable is usually several megabytes
Executable size != the size of the actually running code.
In modern operating systems with advanced virtual memory management systems, the actual resident code size can go as low as several kilobytes (or, rather, a handful of pages). This, of course, depends on whether the hot paths in the code have a close affinity to each other in the linked executable.
Think long-term recording applications, such as audio or studio situations where you want to "fire and forget" reliable recording systems of large amounts of data consistently from multiple streams for extended durations, for example.
Well-written, long term recording software doesn’t quit or crash. It records what it needs to record, and - by using swap - gives itself plenty of time to flush the buffers using whatever techniques are necessary for safety.
Disclaimer: I’ve written this software, both with and without swap available in various embedded contexts, in real products. The answer to the question is that having swap means higher data rates can be attained before needing to sync.
Power outages, hardware failures, and OS bugs happen to the finest application software.
I believe you, from your experience, that it can be useful to have recorded buffers swap out before flushing them to durable storage. But I do find it a bit surprising: since the swap system has to do the storage flush anyway, you are already paying for the I/O, so why not do it durably?
The fine article argued that you can save engineer cycles by having the OS manage optimizing out-of-working set memory for you, but that isn’t what you’re arguing here.
I’m interested in understanding your point.
Then, when the time is right, flush it all to disk.
The VMM is pretty good at being tight and productive - so use it as the big fat buffer it is, and spawn worker threads to flush things at appropriate times.
If you don't have swap, you have to flush more often ...
Reliably recording massive amounts of data for extended periods of time in a studio setting is the most obvious use case for a fixed-size buffer that gets flushed to durable storage at short and predictable time intervals. You wouldn't want a segfault wiping out the entire day's worth of work, would you?
Having swap/more memory available just means you have more buffers before needing to commit and in certain circumstances this can be very beneficial, such as when processing of larger amounts of logged data is needed prior to committing, etc.
There is certainly a case for both having and using swap, and disabling it entirely, depending on the data load and realtime needs of the application. Processing data and saving data have different requirements, and the point is really that there is no black and white on this. Use swap if it’s appropriate to the application - don’t use it, if it isn’t.
Instead of storing data (let's call them samples) to durable storage to begin with, you're letting the OS write them to swap which incurs the same cost, but then you need to read them from swap and write them to a different partition again (~triple the original cost).
Yes, sometimes, it's perfectly acceptable to flush to disk because you're getting low on RAM. But, on systems with, say .. 4x more swap than physical RAM .. you don't have to do a flush that often - if at all. This is great, for example with high quality audio loads that have to be captured safely over long periods.
A system with low RAM and high swap is also a bit more economical, at scale, when building actual hardware in large numbers. So exploiting swap in that circumstance can also affect the BOM costs.
The problem is not the existence of swap, but that people are unaware that the disk cache is equally important for performance.
20-30 years ago, heavy paging often crippled consumer Intel-based PCs[0] because paging went to slow mechanical hard disks on PATA/IDE, a parallel device bus (until circa 2005), which had little parallelism and initially no native command queuing; SCSI drives did offer features such as tagged command queuing and efficient scatter-gather, but were uncommon on desktops, let alone laptops. Today the bottlenecks are largely gone – abundant RAM, switched interconnects such as PCIe, SATA with NCQ/AHCI, and solid-state storage, especially NVMe, provide low-latency, highly parallel I/O – so paging still signals memory pressure yet is far less punishing on modern laptops and desktops.
Swap space today has a quieter benefit: lower energy use. On systems with LPDDR4/LPDDR5, the memory controller can place inactive banks into low-power or deep power-down states; by compressing memory and paging out cold, dirty pages to swap, the OS reduces the number of banks that must stay active, cutting DRAM refresh and background power. macOS on Apple Silicon is notably aggressive with memory compression and swap and works closely with the SoC power manager, which can contribute to the strong battery life of Apple laptops compared with competitors, albeit this is only one factor amongst several.
[0] RISC workstations and servers have had switched interconnects since day 1.
Having more RAM is always better performance, but swap allows you to skimp out on RAM in certain cases for almost identical performance but lower cost (of buying more RAM), if you run programs that allocate a lot of memory that it subsequently doesn't use. I hear Java is notoriously bad at this, so if you run a lot of heavy enterprise Java software, swap can get you the same performance with half the RAM.
(It is also a "GC strategy", or stopgap for memory leaks. Rather than managing memory, you "could" just never free memory, and allocate a fat blob of swap and let the kernel swap it out.)
I had one of those cases a few years ago when a program I was working on was leaking 12 MP raw image buffers in a drawing routine. I set it off running and went browsing HN/chatting with friends. A few minutes later I was like "this process is definitely taking too long" and when I went to check on it, it was using up 200+ GB of RAM (on a 16 GB machine) which had all gone to swap.
I hadn't noticed a thing! Modern SSDs are truly a marvel... (this was also on macOS rather than Linux, which may have a better swap implementation for desktop purposes)
You can limit resource usage per process, so your buggy application could be killed long before the system comes to a crawl. See your shell's entry on its limit/ulimit built-in, or use
prlimit(1) - get and set process resource limits
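A minimal sketch, with illustrative values and a hypothetical PID:

  ulimit -v 4194304                      # bash built-in: cap the address space at 4 GiB (value in KiB) for child processes
  prlimit --as=4294967296 --pid 12345    # same cap in bytes, applied to an already-running process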
The old rule of thumb of 1-2x your RAM is way too much for most systems. The solution isn't to turn it off, but to have a sensible limit. Try with half a gig of swap and see how that does. It may give you time to notice the system is degraded and pick something to kill yourself, and maybe even debug the memory issue if needed. You're not likely to have lasting performance issues from too many things swapped out after you or the OOM killer end the memory pressure, because not much of your memory will fit in swap.
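For completeness, a minimal sketch of adding a small swap file of that size (as root; dd avoids sparse-file pitfalls, and Btrfs needs extra handling such as a NOCOW file):

  dd if=/dev/zero of=/swapfile bs=1M count=512
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile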
Yes, indeed, the world would be a better place if we had just stopped writing Java 20 years ago.
> And how many memory such daemons can consume? A couple of hundred megabytes total?
Consider the average Java or .net enterprise programmer, who spends his entire career gluing together third-party dependencies without ever understanding what he's doing: Your executable is a couple hundred megabytes already, then you recursively initialize all the AbstractFactorySingletonFactorySingletonFactories with all their dependencies monkey patched with something worse for compliance reasons, and soon your program spends 90 seconds simply booting up and sits at two or three dozen gigabytes of memory consumption before it has served its first request.
> Is it really that much on modern systems?
If each of your Java/.net business app VMs needs 50 or so gigabytes to run smoothly, you can only squeeze ten of them in an 1U pizza box with a mere half terabyte RAM; while modern servers allow you to cram in multiple terabytes, do you really want to spend several tens of thousands of dollars on extra RAM, when swap storage is basically free?
Cloud providers do the same math, and if you look at e.g. AWS, swap on EBS costs as much per month as the same amount of RAM costs per hour. That's almost three orders of magnitude cheaper.
> When I program, my application may sometimes allocate a lot of memory due to some silly bug.
Yeah, that's on you. Many, many mechanisms let you limit per-process memory consumption.
But as TFA tries to explain, dealing with this situation is not the purpose of swap, and never has been. This is a pathological edge case.
> almost all used memory is now in swap and the whole system works snail-slow, presumably because kernel doesn't think it should really unswap previously swapped memory and does this only on demand and only page by page.
This requires multiple conditions to be met
- the broken program is allocating a lot of RAM, but not quickly enough to trigger the OOM killer before everything has been swapped out
- you have a lot of swap (do you follow the 1990s recommendation of having 1-2x the RAM amount as swap?)
- the broken program sits in the same cgroup as all the programs you want to keep working even in an OOM situation
Condition 1 can't really be controlled, since it's a bug anyway.
Condition 2 doesn't have to be met unless you explicitly want it to. Why do you?
Condition 3 is realistic on desktop environments: despite years of messing around with flatpaks and snaps and all that nonsense, they're still not making it easy for users to isolate programs they run that haven't been pre-containerized.
But simply reducing swap to a more realistic size (try 4GB, see how far it gets you) will make this problem much less dramatic, as only parts of the RAM have to get flushed back.
> In a hypothetical case without swap this case isn't so painful. When main system memory is almost fully consumed, OOM killer kills the most memory hungry program and all other programs just continue working as before.
And now you're wasting RAM that could be used for caching file I/O. Have you benchmarked how much time you're wasting through that?
> I think that overall reliance on swap is noways just a legacy of old times when main memory was scarce and back than it maybe was useful to have swap.
No, you just still don't understand the purpose of swap.
Also, "old times"? You mean today? Because we still have embedded environments, we have containers, we have VMs, almost all software not running on a desktop is running in strict memory constraints.
> and kernel code may be simpler (all this swapping code may be removed)
So you want to remove all code for file caching? Bold strategy.
On the other side, a Raspberry Pi froze unexpectedly (not due to low memory) until a very small swap file was enabled. It was almost never used, but the freezes stopped. Fun swap stories.
That's the long-standing defect that needs to be corrected then, there should be no dependence on swap existing whatsoever as long as you have more than enough memory for the entire workload.
The warm up computation does take like 1/4 the time if it can live entirely in RAM, but using NVMe as “discount RAM” reduces the United States dollar cost of the system by 97% compared to RAM-only.
The AWS control plane will detect an ailing SSD backing up the EBS and will proactively evacuate the data before the physical storage goes pear shaped.
If it is an EC2 instance with an instance attached NVMe, the control plane will issue an alert that can be automatically acted upon, and the instance can be bounced with a new EC2 instance allocated from a pool of the same instance type and get a new NVMe. Provided, of course, the design and implementation of the running system are stateless and can rebuild the working set upon a restart.
Regardless, AWS takes care of the hardware cycling / migration in either case.
> Total bytes written calculated assuming drive is 100% full (user capacity) with workload of 100% random aligned 4KB writes.
[1]: page 6/17, https://assets.micron.com/adobe/assets/urn:aaid:aem:d133a40b...
Only if you use a consumer grade flash with a non-consumer grade usage.
For anything with DWPD >= 1 it's not an issue, e.g.:
https://news.ycombinator.com/item?id=45273937
At that price point, either we use swap and let the kernel engineers move data from RAM to disk and back, or we disable swap and need user space code to move the same data to disk and back. We’d need to price out writing & maintaining the user space implementation (mmap perhaps?) for it to be fair price comparison.
To avoid SSD wear and tear, we could spend $29 million a year more to put the data in RAM only. Not worth!
(We rent EC2 instances from AWS, so SSD wear is baked into the pricing)
No, this is just plain wrong. There are very specific problems which happen when there is not enough memory.
1. File-backed page reads causing more disk reads, eventually ending with "programs being executed from disk" (shared libraries are also mmapped), which feels like a system lockup. This does not need any "egalitarian reclamation" abstraction or swap, and swap does not solve it. It can be solved simply by reserving some minimal amount of memory for buf/cache, with which the system stays responsive.
2. Eventual failure to allocate more memory for some process. Any solution like "page reclamation" that pushes unused pages out to swap can only increase the maximum amount of memory that can be used before it happens, from one finite value to a bigger finite value. When there is no memory to free without losing data, some process must be killed. Swap does not solve this. The least bad solution would be to warn the user in advance and let them choose processes to kill.
See also https://github.com/hakavlad/prelockd
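For context on "reserving some minimal amount of memory for buf/cache": mainline has no knob that protects the page cache outright (prelockd above is one attempt to fill that gap); the closest generic tunables just keep a larger reserve of free pages so reclaim has room to work (values illustrative):

  sysctl -w vm.min_free_kbytes=262144        # keep roughly 256 MiB free
  sysctl -w vm.watermark_scale_factor=200    # wake kswapd earlier, reclaim more proactively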
On the other hand, these days the latest SSDs are way faster than memory compression, even with LUKS encryption on and even when the compression uses LZ4. Plus, modern SSDs do not suffer from frequent writes the way they used to, so on my laptop I disabled memory compression, and then all the reasoning from the article applies again.
Then, on a development laptop running compilations/containers/VMs/a browser, vm.swappiness does not seem to matter that much if one has enough memory. So I no longer tune it to 100 or more and leave it at the default of 60.
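For reference, the knob being discussed (a sketch; 60 is the default, and the file name is illustrative):

  sysctl -w vm.swappiness=60
  echo 'vm.swappiness=60' > /etc/sysctl.d/99-swappiness.conf   # persist across reboots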
That's a really provocative claim. Any benchmarks to support this?
Read test via copying to RAM from LUKS-encrypted Btrfs, compared against /tmp, which is a standard RAM disk, plus a write test; then a RAM disk with zram compression was prepared to emulate zram, and the same write and read tests were run against lz4-compressed zram. Result: the SSD with LUKS is 1.5 times faster than zram for reads and 5 times faster than zram for writes. Note that without LUKS, with native SSD encryption instead, the SSD speed will be at least 2 times faster. Also, using a recent kernel is important so that LUKS uses CPU instructions for AES encryption; without that, an SSD under LUKS will be several times slower.
On the SSD I am in fact measuring LUKS performance, as the I/O is much faster than LUKS encryption even using specialized CPU instructions. As I wrote, without LUKS the numbers are at least twice as fast, even with random access.
The point is that in 2025, with the latest SSDs, there is no point in using compressed memory. Even with LUKS encryption it will be faster than even a highly tuned swap setup.
In 2022-23, when LUKS was not optimized, it was different, so I used hardware encryption on the SSD after realizing that even lz4 compression was significantly slower than the SSD.
Basic memory bandwidth test:
Read test:
$ dd if=random.bin of=/tmp/random.bin conv=fdatasync bs=10M iflag=direct oflag=direct
2500+0 records in
2500+0 records out
26214400000 bytes (26 GB, 24 GiB) copied, 9.09728 s, 2.9 GB/s
Write test:
Not sure why that disk write test was suddenly so bad.
Write test to lz4-compressed ZRAM:
Read test:
Also, on your SSD do you have logical 4K sectors or 512-byte sectors? If the latter, Linux distros default to 512-byte LUKS sectors on them, resulting in much slower performance, especially with writes.
I always ensure that LUKS sectors are 4K, even if the SSD reports 512 bytes and does not allow changing that to 4K, like the Samsung 9* series.
With 4K LUKS sectors the write speed is too low for a modern SSD. Check that LUKS uses a fast implementation. I have:
... aes-xts 512b 2708.8 MiB/s 2986.0 MiB/s
and my LUKS setup uses aes-xts 512b.
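Two cryptsetup commands relevant here, assuming LUKS2 (the second is destructive: it reformats the device, and the device path is hypothetical):

  cryptsetup benchmark                                      # shows which cipher implementations are fast on this CPU
  cryptsetup luksFormat --sector-size 4096 /dev/nvme0n1p2   # format with 4K sectors up front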
Edit: You keep repeating that all over the thread - I started my week an hour ago and I had two of these memory stall events during the weekend. Again, it's the machines with big binaries running and no swap. Maybe you could provide a better explanation than "this doesn't happen"?
I doubt that disabling swap reduces your stalls, but I'd like to see the numbers for that.
I misread your comment, so I didn't check all the files, but here's how /proc/pressure/ looks now after managing to close Chrome:
Or to look at it from another perspective... it lets you reclaim unprovable memory leaks.
The Linux kernel still has this global lock for swappiness.
The downside is no smooth video playback.
In the past, they recommended against that for deadlock reasons.
On the workloads I care about (desktop, and servers that avoid mmap), the anonymous dirty page part of the kernel heap is < 10% of RAM, so swap is mostly just there to waste space and slow down the oomkiller.