vm.overcommit_memory=2 Is the Right Setting for Servers
Key topics
Diving into the world of Linux memory management, a heated debate erupts over the optimal setting for `vm.overcommit_memory`. At the center of the discussion is the value "2", which represents a "never overcommit" mode, and commenters are weighing in on its implications. Some, like LordGrey, are educating others on the nuances of Linux's overcommit handling modes, while others, like dbdr, are probing the reasoning behind the default 50% threshold for physical RAM allocation, sparking insightful responses from vin10 and crote on the historical context and kernel memory usage. As the conversation unfolds, it becomes clear that there's no one-size-fits-all solution, with various settings and trade-offs to consider, leaving some to wonder, like sidewndr46, whether any setting can actually result in `malloc` returning NULL.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 24m after posting
- Peak period: 59 comments in 60-72h
- Avg / period: 17.3
Based on 138 loaded comments
Key moments
- 01 Story posted: Dec 17, 2025 at 5:45 AM EST (17 days ago)
- 02 First comment: Dec 17, 2025 at 6:09 AM EST (24m after posting)
- 03 Peak activity: 59 comments in 60-72h (hottest window of the conversation)
- 04 Latest activity: Dec 24, 2025 at 1:13 PM EST (10 days ago)
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
The Linux kernel supports the following overcommit handling modes
0 - Heuristic overcommit handling. Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.
1 - Always overcommit. Appropriate for some scientific applications. Classic example is code using sparse arrays and just relying on the virtual memory consisting almost entirely of zero pages.
2 - Don't overcommit. The total address space commit for the system is not permitted to exceed swap + a configurable amount (default is 50%) of physical RAM. Depending on the amount you use, in most situations this means a process will not be killed while accessing pages but will receive errors on memory allocation as appropriate.
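To make the three modes concrete, here is a minimal probe, a sketch rather than anything from the thread: it reserves (but never touches) a caller-chosen amount of anonymous memory on a 64-bit box and reports whether the kernel accepts the commitment. Switch modes with `sysctl -w vm.overcommit_memory=<N>` and experiment with the size.

    /* Under mode 1 almost any size is accepted; under mode 0 only a
     * "seriously wild" request (roughly bigger than RAM + swap) is refused;
     * under mode 2 anything that would push Committed_AS past CommitLimit
     * fails immediately with ENOMEM. */
    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(int argc, char **argv) {
        size_t gib = argc > 1 ? strtoull(argv[1], NULL, 10) : 8;
        void *p = mmap(NULL, gib << 30, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            printf("reserving %zu GiB refused: %s\n", gib, strerror(errno));
            return 1;
        }
        printf("reserved %zu GiB without touching a single page\n", gib);
        munmap(p, gib << 30);
        return 0;
    }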
Naive question: why is this default 50%, and more generally why is this not the entire RAM, what happens to the rest?
If you know a system is going to run (e.g.) a Postgres database, then tweaking the vm.* sysctl values is part of the tuning process.
I can understand leaving some amount free in case the kernel needs to allocate additional memory in the future, but anything near half seems like a lot!
If successful, calloc(), malloc(), realloc(), reallocf(), valloc(), and aligned_alloc() functions return a pointer to allocated memory. If there is an error, they return a NULL pointer and set errno to ENOMEM.
In practice, I find a lot of code that does not check for NULL, which is rather distressing.
As an aside: to me, checking malloc() for NULL is easier than checking the pointer returned by malloc() on first use, yet the latter is what you're expected to do in the presence of overcommit.
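For reference, the "check at the call site" pattern is just a thin wrapper; `xmalloc` here is a hypothetical name, and with overcommit enabled this only catches commit-limit or address-space failures, not physical exhaustion at fault time.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical helper: fail fast on allocation failure instead of
     * dereferencing a NULL pointer somewhere far from the call site. */
    static void *xmalloc(size_t n) {
        void *p = malloc(n);
        if (p == NULL) {
            fprintf(stderr, "out of memory allocating %zu bytes\n", n);
            abort();
        }
        return p;
    }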
Usefully handling allocation errors is very hard to do well, since it infects literally every error handling path in your codebase. Any error handling that calls a function that might return an indirect allocation error needs to not allocate itself. Even if you have a codepath that speculatively allocates and can fallback, the process is likely so close to ruin that some other function that allocates will fail soon.
It’s almost universally more effective (not to mention easier) to keep track of your large/variable allocations proactively, and then maintain a buffer for little “normal” allocations that should have an approximate constant bound.
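One rough sketch of that advice (the names and the 64 KiB figure are made up for illustration): keep a small reserve that is released only when an allocation fails, so cleanup or load-shedding paths still have something to work with.

    #include <stdlib.h>

    static void *emergency_reserve;

    void reserve_init(void) { emergency_reserve = malloc(64 * 1024); }

    void *careful_malloc(size_t n) {
        void *p = malloc(n);
        if (p == NULL && emergency_reserve != NULL) {
            /* Give the reserve back to the allocator and retry once; the
             * caller should start shedding load or shutting down cleanly. */
            free(emergency_reserve);
            emergency_reserve = NULL;
            p = malloc(n);
        }
        return p;
    }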
This is just a Linux ecosystem thing. Other full size operating systems do memory accounting differently, and are able to correctly communicate when more memory is not available.
Solaris (and FreeBSD?) have overcommitting disabled by default.
That is the conservative design used by several traditional UNIX systems for anonymous memory and MAP_PRIVATE mappings: the kernel accounts for, and may reserve, enough swap to back the potential private pages up front. Tools and docs in the Solaris and BSD family talk explicitly in those terms. An easy way to test it out in a BSD would be disabling the swap partition and trying to launch a large process – it will get killed at startup, and it is not possible to modify this behaviour.
Linux’s default policy is the opposite end of that spectrum: optimistic memory allocation, where allocations and private mappings can succeed without guaranteeing backing store (i.e. swap), with failure deferred to fault time and handled by the OOM killer – that is what Linux calls overcommit.
malloc(-1) should always return NULL. Malloc returns NULL if the virtual address space for a given process is exhausted.
It will not return NULL when the system is out of memory (depending on the overcommit settings)
Source:
* https://www.kernel.org/doc/Documentation/vm/overcommit-accou...
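That case is easy to reproduce, since a request for SIZE_MAX bytes can never fit in the address space regardless of the overcommit mode (sketch):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void) {
        errno = 0;
        void *p = malloc((size_t)-1);                    /* i.e. malloc(-1) */
        printf("p=%p, errno=%s\n", p, strerror(errno));  /* NULL, ENOMEM */
        free(p);                                         /* free(NULL) is a no-op */
        return 0;
    }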
Memory allocation is a highly non-deterministic process which depends heavily on the code path, and it is generally impossible to predict how the child will handle its own memory space – it can be little or it can be a lot (relative to the parent), and it is usually somewhere in the middle. Most daemons, for example, consume next to zero extra memory after forking.
The Ruby «mark-and-sweep» garbage collector (in old versions of Ruby, 1.8 and 1.9) and Python reference counting (the Instagram case) bugs are prime examples of pathological cases where a child would walk over its data pages, making each one dirty and causing a system collapse, but those bugs have been fixed or workarounds have been applied. An honourable mention goes to Redis in a situation where THP (transparent huge pages) are enabled.
No heuristics exist out there that would turn memory allocation into a deterministic process.
The Redis engineers KNOW that fork-to-save will at most result in a few tens of MBs of extra memory used in the vast majority of cases, with the benefit of seamless saving. Like, there is a theoretical case where it uses double the memory, but that would require all the data in the database to be replaced during the short interval the snapshot is being saved, and that's just unrealistic.
I'm not disabling overcommit for now, but maybe I should.
I recommend everyone enable Linux's new multi-generational LRU, which can be configured to trigger the OOM killer when the working set of the last N deciseconds doesn't fit in memory. And <https://github.com/hakavlad/nohang> has some more suggestions.
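A sketch of what that configuration might look like, assuming the MGLRU sysfs interface (CONFIG_LRU_GEN; the knobs below are how I read the kernel docs, and normally you would just echo the values as root):

    #include <stdio.h>

    static int write_knob(const char *path, const char *value) {
        FILE *f = fopen(path, "w");
        if (f == NULL) { perror(path); return -1; }
        fputs(value, f);
        return fclose(f);
    }

    int main(void) {
        write_knob("/sys/kernel/mm/lru_gen/enabled", "y");
        /* min_ttl_ms: protect the working set of the most recent interval;
         * when it cannot be kept resident, fall back to the OOM killer. */
        write_knob("/sys/kernel/mm/lru_gen/min_ttl_ms", "1000");
        return 0;
    }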
You can and maybe even should disable overcommit this way when running postgres on the server (and only a minimum of what you would these days call sidecar processes (monitoring and backup agents, etc.) on the same host/kernel), but once you have a typical zoo of stuff using dynamic languages living there, you WILL blow someone's leg off.
Have you checked what your `vm.overcommit_ratio` is? If it's < 100, then you will get OOM kills even if plenty of RAM is free, since the default is 50, i.e. 50% of RAM can be COMMITTED and no more.
curious what kind of failures you are alluding to.
Thanks for the tip about vm.overcommit_ratio though, I think it's set to the default.
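For reference, the limit in mode 2 is published as CommitLimit in /proc/meminfo and, ignoring hugepages (and `vm.overcommit_kbytes`, which replaces the ratio when set), works out roughly as below; the 64 GiB RAM / 8 GiB swap figures are just an example:

    CommitLimit = MemTotal * overcommit_ratio/100 + SwapTotal
                = 64 GiB  * 50/100                + 8 GiB
                = 40 GiB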
For cleanly surfacing errors, overcommit=2 is a bad choice. For most servers, it's much better to leave overcommit on, but make the OOM killer always target your primary service/container, using oom-score-adj, and/or memory.oom.group to take out the whole cgroup. This way, you get to cleanly combine your OOM condition handling with the general failure case and can restart everything from a known foundation, instead of trying to soldier on while possibly lacking some piece of support infrastructure that is necessary but usually invisible.
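A minimal sketch of that approach: the value 1000 makes the process the OOM killer's preferred victim. A service manager would usually set this for you (e.g. systemd's OOMScoreAdjust=), and cgroup v2's memory.oom.group extends the kill to the whole group.

    #include <stdio.h>

    int main(void) {
        /* oom_score_adj ranges from -1000 (never kill) to 1000 (kill first). */
        FILE *f = fopen("/proc/self/oom_score_adj", "w");
        if (f == NULL) { perror("oom_score_adj"); return 1; }
        fputs("1000", f);
        fclose(f);
        /* ... exec or run the actual service here ... */
        return 0;
    }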
Taken from a SO post:
In fairness, I don't know what else general-purpose software is supposed to do here other than die. It's not like there is a graceful way to handle having insufficient memory to run the program.
But in practice, I agree with you. This is just not worth it. It's so much work to handle it properly everywhere, and it is really difficult to test every malloc failure.
So that's where an OOM killer might have a better strategy than just letting whichever program happened to allocate memory last be the one to fail.
which is elegant and completely legitimate
A forked process would assume memory is already allocated, but I guess it would fail when writing to it as if vm.overcommit is set to 0 or 1.
On the other hand, if you have a pretty good idea that the child will finish persisting and exit before the cache is fully rewritten, double is too much. There's not really a mechanism for that though. Even if you could set an optimistic multiplier for multiple mapped CoW pages, you're back to demand paging failures --- although maybe it's still worthwhile.
It's wrong 99.99999% of the time. Because the alternative is either "make it take double and waste half the RAM" or "write the in-memory data in a way that allows for snapshotting, throwing a bunch of performance into the trash"
the author says it's bad design, but has entirely missed WHY it wants overcommit
The difference is that you'll fail allocations, where there's a reasonable interface for errors, rather than failing at demand paging when writing to previously unused pages where there's not a good interface.
Of course, there are many software patterns where excessive allocations are made without any intent of touching most of the pages; that's fine with overcommit, but it will lead to allocation failures when you disable overcommit.
Remember, with overcommit=2 the limit is artificial and defined by the user, via overcommit_ratio and user_reserve_kbytes. Using overcommit=2 necessarily wastes RAM.
The RAM is not unusable, it will be used. Some portion of ram may be unallocatable, but that doesn't mean it's wasted.
There's a tradeoff. With overcommit disabled, you will get allocation failure rather than OOM killer. But you'll likely get allocation failures at memory pressure below that needed to trigger the OOM killer. And if you're running a wide variety of software, you'll run into problems because overcommit is the mainstream default for Linux, so many things are only widely tested with it enabled.
I think that's a meaningless distinction: if userspace can't allocate it, it is functionally wasted.
I completely agree with your second paragraph, but again, some portion of RAM obtainable with overcommit=0 will be unobtainable with overcommit=2.
Maybe a better way to say it is that a system with overcommit=2 will fail at a lower memory pressure than one with overcommit=0. Additional RAM would have to be added to the former system to successfully run the same workload. That RAM is waste.
You can have a simple web server that took less than 100MB of RAM take gigabytes, just because it spawned a few COW-ed threads.
Doesn't really change the point. The RAM might not be completely wasted, but given that nearly every app will over-allocate and just use the pooled memory, you will waste memory that could otherwise be used to run more stuff.
And it can be quite significant: it's pretty common for server apps to start a big process and then have a COWed thread per connection in a pool, so your apache2 eating maybe 60MB per thread in the pool is now in the gigabytes range at very small pool sizes.
The blog is essentially a call to "let's make apps like we did in the DOS era", which is ridiculous.
And even if it's not COW, there's nothing wrong or inefficient about opportunistically allocating pages ahead of time to avoid syscall latency. Or mmapping files and deciding halfway through you don't need the whole thing.
There are plenty of reasons overcommit is the default.
That "waste" (that overcommit turns into "not a waste") means you can do far less allocations, with overcommit you can just allocate a bunch of memory and use it gradually, instead of having to do malloc() every time you need a bit of memory and free() every time you get rid of it.
You'd also increase memory fragmentation that way possibly hitting performance.
It's also pretty much required for GCed languages to work sensibly
- Apache2 runs.
- Apache2 takes 50MB.
- Apache2 spawns 32 threads.
- Apache2 takes 50MB + (per-thread vars * 32)

With overcommit 2:

- Apache2 runs.
- Apache2 takes 50MB.
- Apache2 spawns 32 threads.
- Apache2 takes 50MB + (50MB * 32) + (per-thread vars * 32)

Boom. Now your simple Apache server serving some static files can't fit on a 512MB VM and needs in excess of 1.7GB of memory just to be allowed to allocate.
Two reasons why overcommit is a good idea:
- It lets you reserve memory and use the dirtying of that memory to be the thing that commits it. Some algorithms and data structures rely on this strongly (i.e. you would have to use a significantly different algorithm, which is demonstrably slower or more memory intensive, if you couldn't rely on overcommit).
- Many applications have no story for out-of-memory other than halting. You can scream and yell at them to do better, but that won't help, because the apps that find themselves in that supposedly-bad situation ended up there for complex and well-considered reasons. My favorite: having complex OOM error handling paths is the worst kind of attack surface, since it's hard to get test coverage for them. So it's better to just have the program killed instead, because that nixes the untested code path. For those programs, there's zero value in having the memory allocator be able to report OOM conditions other than by asserting in prod that mmap/madvise always succeed, which then means that the value of not overcommitting is much smaller.
Are there server apps where the value of gracefully handling out of memory errors outweighs the perf benefits of overcommit and the attack surface mitigation of halting on OOM? Yeah! But I bet that not all server apps fall into that bucket
I turn off swap, but do use zram, and while doing memory intensive tasks in the background, it's usually something interactive like the browser that ends up being killed. I would turn on swap if this happened more often. I might also turn off overcommit if apps handled OOM situations gracefully, but as you say, the complexity might not be worth it.
> In mode 2 the MAP_NORESERVE flag is ignored.
https://www.kernel.org/doc/Documentation/vm/overcommit-accou...
Secondly, memory is a global resource, so you don't get local failures when it's exhausted: whoever allocates first after memory has been exhausted will get an error, and they might be the application responsible for the exhaustion or they might not be. They might crash on the error, or they might "handle it", keep going, and render the system completely unusable.
No, exact accounting is not a solution. Ulimits and configuring the OOM killer are solutions.
For modern (post-x86_64) memory allocators a common strategy is to allocate hundreds of gigabytes of virtual memory and let the kernel deal with actually swapping in physical memory pages upon use.
This way you can partition the virtual memory space into arenas as you like. This works really well.
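A rough sketch of that pattern (the sizes are arbitrary; note that, per the MAP_NORESERVE quote earlier, mode 2 ignores the flag and charges the whole reservation against the commit limit):

    #include <stdio.h>
    #include <sys/mman.h>

    #define RESERVE (64ULL << 30)   /* 64 GiB of address space */
    #define ARENA   (1ULL << 30)    /* carved into 1 GiB arenas */

    int main(void) {
        unsigned char *base = mmap(NULL, RESERVE, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE,
                                   -1, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        unsigned char *arena3 = base + 3 * ARENA;  /* arena i = base + i*ARENA */
        arena3[0] = 42;            /* first write faults in a single page */
        printf("reserved %llu GiB; resident so far: roughly one page\n",
               (unsigned long long)(RESERVE >> 30));
        munmap(base, RESERVE);
        return 0;
    }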
Obviously you don't want to have to octuple your physical memory for the same workload, especially these days, so the typical way around that is to allocate a lot of swap. Then the allocations that aren't actually used can be backed by swap instead of DRAM.
Except then you've essentially reimplemented overcommit. Allocations report success because you have plenty of swap but if you go to use them the system grinds to a halt and starts timing out because you have hundreds of GB of active allocations but only 64GB of physical memory.
Then your memory requirements always were potentially 512GB. It may just happen to be even with that amount of allocation you may only need 64GB of actual physical storage; however, there is clearly a path for your application to suddenly require 512GB of storage. Perhaps when it's under an attack or facing misconfigured clients.
If your failure strategy is "just let the server fall over under pressure" then this might be fine for you.
If an allocator unconditionally maps in 512GB at once to minimize expensive reallocations, that doesn't inherently have any relationship to the maximum that could actually be used in the program.
Or suppose a generic library uses buffers that are ten times bigger than the maximum message supported by your application. Your program would deterministically never access 90% of the memory pages the library allocated.
> If your failure strategy is "just let the server fall over under pressure" then this might be fine for you.
The question is, what do you intend to happen when there is memory pressure?
If you start denying allocations, even if your program is designed to deal with that, so many others aren't that your system is likely to crash, or worse, take a trip down rarely-exercised code paths into the land of unpredictable bugs.
fork(2) has been copy-on-write for decades and does not copy the entire process address space. Thread stacks are a non sequitur here as well: a stack uses data pages, the thread stacks are rather small in most scenarios, and hence the thread stacks are also subject to copy-on-write.
The only overhead that the use of fork(2) incurs is an extra copy of the process's memory descriptor pages, which is a drop in the ocean for modern systems with large amounts of RAM.
Thread stacks come up because reserving them completely ahead of time would incur large amounts of memory usage. Typically they start small and grow when you touch the guard pages. This is a form of overcommit. Even Windows dynamically grows stacks like this.
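A rough way to see that gap on Linux (a sketch; compile with -pthread, and the exact figures depend on glibc's stack handling): 64 idle threads with 8 MiB stacks add roughly 512 MiB of virtual address space while the resident set barely moves.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static void *idle_thread(void *arg) { pause(); return arg; }

    int main(void) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);  /* 8 MiB reserved */
        for (int i = 0; i < 64; i++) {
            pthread_t t;
            pthread_create(&t, &attr, idle_thread, NULL);
        }
        /* Compare VmSize and VmRSS: the stacks are reserved, not resident. */
        printf("inspect /proc/%d/status while this sleeps\n", (int)getpid());
        pause();
        return 0;
    }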
This is largely not true for most processes. For a child process to start writing into its own data pages en masse, there has to exist a specific code path that causes such behaviour. Processes do not randomly modify their own data space – it is either a bug or a peculiar workload that causes it.
You would have a stronger case if you mentioned, e.g., the JVM, which has a high complexity garbage collector (rather, multiple types of garbage collectors – each with its own behaviour), but the JVM ameliorates the problem by attempting to lock in the entire heap size at startup or bailing if it fails to do so.
In most scenarios, forking a process has a negligible effect on the overall memory consumption in the system.
> In most scenarios, forking a process has a negligible effect on the overall memory consumption in the system.
Yes, that’s what they’re getting at. It’s good overcommitment. It’s still overcommitment, because the OS has no way of knowing whether the process has the kind of rare path you’re talking about for the purposes of memory accounting.
This is a crucial distinction and I agree when the problem is framed this way.
The original statement by another GP, however, was that fork(2) is wasteful (it is not).
In fact, I have mentioned in a sister thread that the OS does not have a way to know what kind of behaviour the parent or the child will exhibit after forking[0].
Generally speaking, this is in line with the foundational ethos of the UNIX philosophy, where UNIX gives its users a wide array of tools tantamount to shotguns that shoot both forward and backward simultaneously, and the responsibility for the resulting deaths and permanent maimings ultimately lies with its users. In comparison, memory management in the operating systems that run mainframes is substantially more complex and sophisticated.
[0] In a separate thread, somebody else has mentioned a valid reverse scenario where the child idles after forking and it is the parent that makes its data pages dirty, causing the physical memory consumption to balloon.
> Thread stacks come up because reserving them completely ahead of time would incur large amounts of memory usage. Typically they start small and grow when you touch the guards. This is a form of overcommit.
Ahead-of-time memory reservation entails a page entry being allocated in the process's page catalogue («logical» allocation), and the page «sits» dormant until it is accessed and causes a memory access fault – that is the moment when the physical allocation takes place. So copying reserved but not-yet-accessed pages has zero effect on the physical memory consumption of the process.
What actually happens to the thread stacks depends on the actual number of active threads. In modern designs, threads are consumed from thread pools that implement some sort of a run queue where the threads sit idle until they get assigned a unit of work. So if a thread is idle, it does not use its own thread stack and, consequently, there is no side effect on the child's COW address space.
Granted, if the child was copied with a large number of active threads, the impact will be very different.
> Even windows dynamically grows stacks like this
Windows employs a distinct process/thread design, making the UNIX concept of a process foreign. Threads are the primary construct in Windows and the kernel is highly optimised for thread management rather than processes. Cygwin has outlined significant challenges in supporting fork(2) semantics on Windows and has extensively documented the associated difficulties. However, I am veering off-topic.
Idle threads do increase the amount of committed stack. Once their stack grows it stays grown, it's not common to unmap the end and shrink the stacks. In a system without overcommit these stacks will contribute to total reserved phys/swap in the child, though ofc the pages will be cow.
> Windows employs a distinct process/thread design, making the UNIX concept of a process foreign. Threads are the primary construct in Windows and the kernel is highly optimised for thread management rather than processes. Cygwin has outlined significant challenges in supporting fork(2) semantics on Windows and has extensively documented the associated difficulties. However, I am veering off-topic.
The nt kernel actually works similarly to Linux w.r.t. processes and threads. Internally they are the same thing. The userspace is what makes process creation slow. Actually thread creation is also much slower than on Linux, but it's better than processes. Defender also contributes to the problems here.
Windows can do cow mappings, fork might even be implementable with undocumented APIs. Exec is essentially impossible though. You can't change the identity of a process like that without changing the PID and handles.
Fun fact: the clone syscall will let you create a new task that both shares VM and keeps the same stack as the parent. Chaos results, but it is fun. You used to be able to share your PID with the parent too, which also caused much destruction.
I am not clear on why the stack of an idling thread would continue to grow. If a previously processed unit of work resulted in a large number of memory pages backing the thread stack getting committed, then yes, it is not common to unmap the no-longer-required pages. It is a deliberate trade-off: automatic stack shrinking is difficult to do safely and cheaply.
Idle does not actually make stacks grow, put simply.
> The nt kernel actually works similarly to Linux w.r.t. processes and threads.
Respectfully, this is slightly more than entirely incorrect.
Since Linux uses a single kernel abstraction («task_struct») for both processes and threads, it has one schedulable kernel object – «task_struct» – for both what user space calls a process and what user space calls a thread. «Process» is essentially a thread group leader plus a bundle of shared resources. Linux underwent the consolidation of abstractions in a quest to support POSIX threads at the kernel level decades ago.
Since fork(2) is, in fact, clone(2) with a particular set of flags, what you get depends on those flags: whether the VM, open files, FS context, and signal handlers are shared, and whether the new task joins the same thread group (CLONE_THREAD). Without those flags you get a new thread group with its own memory management (populated using copy-on-write), a separate signal disposition context, etc.
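A toy illustration of that point (a sketch, not production code): with CLONE_VM the child shares the parent's address space, thread-style; drop the flag and it gets a copy-on-write copy, fork-style.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>

    static int counter = 41;

    static int child_fn(void *arg) {
        (void)arg;
        counter += 1;            /* visible to the parent only under CLONE_VM */
        return 0;
    }

    int main(void) {
        const size_t stack_size = 1024 * 1024;
        char *stack = malloc(stack_size);
        if (stack == NULL) return 1;
        pid_t pid = clone(child_fn, stack + stack_size,   /* stack grows down */
                          CLONE_VM | SIGCHLD, NULL);
        if (pid < 0) { perror("clone"); return 1; }
        waitpid(pid, NULL, 0);
        printf("parent sees counter=%d\n", counter);  /* 42; 41 without CLONE_VM */
        free(stack);
        return 0;
    }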
Windows has two different kernel objects: a process (EPROCESS) and a thread (ETHREAD/KTHREAD). Threads are the schedulable entities; a process is the container for address space, handle table, security token, job membership, accounting, etc. They are tightly coupled, but not «the same thing».
On Windows, «CreateProcess» is heavier than Linux fork for structural reasons: it builds a new process object, maps an image section, creates the initial thread, sets up the PEB/TEB, initialises the loader path, environment, mitigations, etc. A chunk of that work is kernel-side and a chunk is user-mode (notably the loader and, for Win32, subsystem involvement). Blaming only «userspace» is wrong.
Defender (and third-party AV/EDR) can measurably slow process creation because it tends to inspect images, scripts, and memory patterns around process start, not because of deficiencies in the kernel or the system call design.
Instead, you want to kill the process that's hogging all the memory.
The OOM killer heuristic is not perfect, but it will generally avoid killing critical processes and is fairly good at identifying memory hogs.
And if you agree that using the OOM killer is better than returning failure to a random unlucky process, then there's no reason not to use overcommit.
Besides, overcommit is useful. Virtual-memory-based copy-on-write, allocate-on-write, sparse arrays, etc. are all useful and widely-used.
And I do say "often" because it does sometimes work.
I have set all my Firefox processes to near-maximum kill priority for the OOM killer, but it didn't help.
Also don't forget about memory compression: only meaningful with overcommit.
I don't care, I disable it anyway. Have been doing so for decades. Never caused a problem.
Even 0.3 DWPD drives have years before wearing out and in your modern server you really should have something with >= 1 DWPD.
POSIX not so much.
https://lwn.net/Articles/104185/
Because if my Bitcoin price checker built on Electron starts allocating all the memory on the machine, then some arbitrary process (e.g. systemd) can get a malloc error. But it's not systemd's fault the memory got eaten, so why is it being punished for low-memory conditions?
It's like choosing a random person to be ejected from the plane.
I want to agree with the locality-of-errors argument, and while in simple cases, yes, it holds true, it isn't necessarily true. If we don't overcommit, the allocation that kills us is simply the one that fails. Whether this allocation is the problematic one is a different question: if we have a slow leak that allocates and leaks on every 10k-th allocation, we're probably (9999/10k, assuming spherical allocations) going to fail on one that isn't the problem. We get about as much info as the OOM killer would have, anyway: this program is allocating too much.
- https://github.com/torvalds/linux/blob/master/mm/util.c#L753
One would be where you have frequent large mallocs which get freed fast. Another would be where you have written a garbage collected language in C/C++.
When calling free or delete, or letting your GC do that for you, the memory isn't actually given back immediately; glibc has malloc_trim(0) for that, which tries its best to give back as much unused memory to the OS as possible.
Then you can retry your call to malloc and see if it fails and then just let your supervisor restart your service/host/whatever or not.
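A sketch of that retry, assuming glibc (malloc_trim() is a glibc extension declared in <malloc.h>; the wrapper name is made up):

    #include <malloc.h>
    #include <stdlib.h>

    void *retrying_malloc(size_t n) {
        void *p = malloc(n);
        if (p == NULL) {
            malloc_trim(0);      /* hand unused heap pages back to the kernel */
            p = malloc(n);       /* one retry; if this also fails, let the
                                    supervisor restart the service */
        }
        return p;
    }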
Fact is that by the time small allocations are failing you are almost no better off handling the null than you would be handling segfaults and the sigterms from the killer.
Often for servers performance will fall off a cliff long before the oom killer is needed, too.
Advertising for 2 with "but apps should handle it" is utter ignorance
The allocation site is not necessarily what is leaking memory. What you actually want in either case is a memory dump where you can tell what is leaking or using the memory.