How Memory Maps (mmap) Deliver Faster File Access in Go

3 months ago

Multics did default to this behavior, but Unix was written on the PDP-7 and later the PDP-11, neither of which supported virtual memory or paging, so the Unix system call interface necessarily used read() and write() calls instead.

This permitted the use of the same system calls on files, on the teletype, on the paper tape reader and punch, on the magtape, on the line printer, and eventually on pipes. Even before pipes, the ability to "record" a program's terminal output in a file or "play back" simulated user input from a file made Unix especially convenient.

But pipes, in turn, permitted entire programs on Unix to be used as the building blocks of a new, much more powerful programming language, one where you manipulated not just numbers or strings but potentially endless flows of data, and which could easily orchestrate computations that no single program in the PDP-11's 16-bit address space could manage.

And that was how Unix users in the 01970s had an operating system with the entire printed manual available in the comprehensive online help system, a way to produce publication-quality documents on the phototypesetter, incremental recompilation, software version control, full-text search, WYSIWYG screen editors that could immediately jump to the definition of a function, networked email, interactive source-level debugging, a relational database, etc.—all on a 16-bit computer that struggled to run half a million instructions per second, which at most companies might have been relegated to controlling some motors and heaters in a chemical plant or something.

It turns out that often what you can do matters even more than how fast you can do it.

3 months ago

The point is that invoking the OS has a cost. Using mmap, for those situations where it makes sense, lets you avoid that cost.

toast0

3 months ago

> Why wouldn’t the OS itself default to this behavior? Could it fall apart under load, or is it just not important enough to replace the legacy code relying on it?

Mmap and read/write syscalls are both ways to interact with files, but they have different behaviors. You can't exactly swap one for the other without knowledge of the caller. What you likely do see is that OS utilities likely use mmap when it makes sense and a difference.

You also have a lot of things that can work on files or pipes/etc and having a common interface is more useful than having more potential performance (sometimes the performance is enough to warrant writing it twice).

nteon

3 months ago

2 replies

the downside is that the go runtime doesn't expect memory reads to page fault, so you may end up with stalls/latency/under-utilization if part of your dataset is paged out (like if you have a large cdb file w/ random access patterns). Using file IO, the Go runtime could be running a different goroutine if there is a disk read, but with mmap that thread is descheduled but holding an m & p. I'm also not sure if there would be increased stop the world pauses, or if the async preemption stuff would "just work".

Section 3.2 of this paper has more details: https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf

3 months ago

4 replies

To me this indicates a limitation of the API. Cause you do want to maintain that the kernel can page out that memory under pressure while userspace accesses that memory asynchronously while allowing the thread to do other asynchronous things. There’s no good programming model/OS api that can accomplish this today.

im3w1l

3 months ago

1 reply

There are apis that sort of let you do it: mincore, madvise, userfaultfd.

3 months ago

1 reply

None of those APIs are cheap enough to call in a fast path.

3 months ago

2 replies

no syscall will be cheap to call in a fast path. You would need an hardware instruction that tells you if a load or store would fault.

zozbot234

3 months ago

1 reply

> You would need an hardware instruction that tells you if a load or store would fault.

You have MADV_FREE pages/ranges. They get cleared when purged, so reading zeros tells you that the load would have faulted and needs to be populated from storage.

3 months ago

2 replies

MADV_FREE is insufficient - userspace doesn’t get a signal from the OS to know when there’s system wide memory pressure and having userspace try to respond to such a signal would be counter productive and slow in a kernel operation that needs to be a fast path. It’s more that you want MADV (page cache) a memory range and then have some way to have a shared data structure where you are told if it’s still resident and can lock it from being paged out.

3 months ago

1 reply

MADV_FREE is also extremely expensive. CPU vendors have finally simplified TLB shootdown in recent CPUs with both AMD and Intel now having instructions to broadcast TLB flushes in hardware, which gets rid of one of the worst sources of performance degradation in threaded multicore applications (oh the pain of IPIs mixed with TLB flushing!). However, it's still very expensive to walk page tables and free pages.

Hardware reference counting of memory allocations would be very interesting. It would be shockingly simple to implement compared to many other features hardware already has to tackle.

zozbot234

3 months ago

> MADV_FREE is also extremely expensive.

It's quite expensive to free pages under memory pressure (though it's not clear that there's any other choice to be made), but if the pages are never freed it should be cheap, AIUI.

zozbot234

3 months ago

> userspace doesn’t get a signal from the OS to know when there’s system wide memory pressure

Memory pressure indicators exist, https://docs.kernel.org/accounting/psi.html

> have some way to have a shared data structure where you are told if it’s still resident and can lock it from being paged out.

What's more efficient than fetching data and comparing it with zero? Any write within the range will then cancel the MADV_FREE property on the written-to page thus "locking" it again, and this is also very efficient.

3 months ago

1 reply

Rather than a direct syscall, you could imagine something like rseq where you have a shared userspace / kernel data structure where the userspace code gets aborted and restarted if the page was evicted while being processed. But making this work correctly and actually not have a perf overhead and also be an ergonomic API is super hard. In practice people who care probably are satisfied by direct I/O within io_uring with a custom page cache and a truly optimal implementation where the OS can still manage file pages and evict them but the application still new when it happened isn’t worth it.

3 months ago

Unfortunately, a lot of the shared state with userland became much more difficult to implement securely when the Meltdown and Spectre (and others) exploits became concerns that had to be mitigated. They makes the OS's job a heck of a lot harder.

Sometimes I feel modern technology is basically a delicately balanced house of cards that falls over when breathed upon or looked at incorrectly.

avianlyric

3 months ago

3 replies

There is no sensible OS API that could support this, because fundamentally memory access is a hardware API. The OS isn’t involved in normal memory reads, because that would be ludicrously inefficient, effectively requiring a syscall for every memory operation, which effectively means a syscall for any operation involving data I.e. all operations.

Memory operations are always synchronous because they’re performed directly as a consequence of CPU instructions. Reading memory that’s been paged out results in the CPU itself detecting that the virtual address isn’t in RAM, and performing a hardware level interrupt. Literally abandoning a CPU instruction mid execution to start executing an entirely separate set of instructions which will hopefully sort out the page fault that just occurred, then kindly ask the CPU to go back and repeat the operation that caused the page fault.

OS is only involved only because it’s the thing that provided the handling instructions for the CPU to execute in the event of a page fault. But it’s not in anyway actually capable of changing how the CPU initially handles the page fault.

Also the current model does allow other threads to continue executing other work while the page fault is handled. The fault is completely localised to individual thread that triggered the fault. The CPU has no concept of the idea that multiple threads running on different physical core are in anyway related to each other. It also wouldn’t make sense to allow the interrupted thread to someone kick off a separate asynchronous operation, because where is it going to execute? The CPU core where the page fault happened is needed to handle the actual page fault, and copy in the needed memory. So even if you could kick off an async operation, there wouldn’t be any available CPU cycles to carry out the operation.

Fundamentally there aren’t any sensible ways to improve on this problem, because the problem only exists due to us pretending that our machines have vastly more memory than they actually do. Which comes with tradeoffs, such as having to pause the CPU and steal CPU time to maintain the illusion.

If people don’t like those tradeoffs, there’s a very simple solution. Put enough memory in your machine to keep your entire working set in memory all the time. Then page faults can never happen.

im3w1l

3 months ago

3 replies

I think you have a misunderstanding of how disk IO happens. The CPU core sends a command to the disk "I want some this and that data", then the CPU core can go do something else while the disk services that request. From what I read the disk actually puts the data directly into memory by using DMA, without needing to involve the CPU.

So far so good, but then the question is to ensure that the CPU core has something more productive to do then just check "did the data arrive yet?" over and over and coordinating that is where good apis come in.

dapperdrake

3 months ago

2 replies

(Not the person you are replying to.)

There is nothing in the sense of Python async or JS async that the OS thread or OS process in question could usefully do on the CPU until the memory is paged into physical RAM. DMA or no DMA.

The OS process scheduler can run another process or thread. But your program instance will have to wait. That’s the point. It doesn’t matter whether waiting is handled by a busy loop a.k.a. polling or by a second interrupt that wakes the OS thread up again.

That is why Linux calls it uninterruptible sleep.

EDIT: io_uring would of course change your thread from blocking syscalls to non-blocking syscalls. Page faults are not a syscall, as GP pointed out. They are, however, a context-switch to an OS interrupt handler. That is why you have an OS. It provides the software drivers for your CPU, MMU, and disks/storage. Here this is the interrupt handler for a page fault.

hyghjiyhu

3 months ago

1 reply

(I am the person you are replying to)

It could work like this. "Hey OS I would like to process these pages* are they good to go? If not could you fetch and lock them for me" and then if they are ready you process them knowing it won't fault, and if they are not you do something else and try again later.

It's a sort of hybrid of the mmap and fread paradigms in that there are both explicit read requests but the kernel can also get you data on its own initiative if there are spare resources for it.

* to amortize syscall overhead.

avianlyric

3 months ago

What advantages does that provide over using more OS threads. Ultimately this model is based on the idea that we want our programming runtimes to become increasingly responsible for low level scheduling concerns that have traditionally been handled by the OS scheduler.

I can broadly understand why there may be a desire to go down that path. But I’m not convinced that it would produce meaningful better performance than the current abstractions. Especially if you take a step back as ask the question: is mmap is the right tool to be using in these situations, rather using other tools like io_uring?

To be clear I don’t know the answer to this question. But the complexity of the solutions being suggested to potentially improve the mmap API really make me question if they’re capable of producing meaningful improvements.

3 months ago

What everyone forgets is just how expensive context switches are on modern x86 CPUs. Those 512 bit vector registers fill up a lot of cache lines. That's why async tends to win over processes / threads for many workloads.

ori_b

3 months ago

I think you have a misunderstanding of how the OS is signaled about disk I/O being necessary. Most of the post above was discussing that aspect of it, before the OS even sends the command to the disk.

lmz

3 months ago

It's hard to say on one hand "I use mmap because I don't want fancy APis for every read" and on the other "I want to do something useful on page fault" because you don't want to make every memory read a possible interruption point.

3 months ago

1 reply

> There is no sensible OS API that could support this, because fundamentally memory access is a hardware API.

Not only is there a sensible OS API that could support this, Linux already implements it; it's the SIGSEGV signal. The default way to respond to a SIGSEGV is by exiting the process with an error, but Linux provides the signal handler with enough information to do something sensible with it. For example, it could map a page into the page frame that was requested, enqueue an asynchronous I/O to fill it, put the current green thread to sleep until the I/O completes, and context-switch to a different green thread.

Invoking a signal handler only has about the same inherent overhead as a system call. But then the signal handler needs another couple of system calls. So on Linux this is over a microsecond in all. That's probably acceptable, but it's slower than just calling pread() and having the kernel switch threads.

Some garbage-collected runtimes do use SIGSEGV handlers on Linux, but I don't know of anything using this technique for user-level virtual memory. It's not a very popular technique in part because, like inotify and epoll, it's nonportable; POSIX doesn't specify that the signal handler gets the arguments it would need, so running on other operating systems requires extra work.

im3w1l also mentions userfaultfd, which is a different nonportable Linux-only interface that can solve the same thing but is, I think, more efficient.

maxdamantus

3 months ago

3 replies

Just to clarify, I think the parent posts are talking about non-failing page faults, ie where the kernel just needs to update the mapping in the MMU after finding the existing page already in memory (minor page fault), or possibly reading it from filesystem/swap (major page fault).

SIGSEGV isn't raised during a typical page fault, only ones that are deemed to be due to invalid reads/writes.

When one of the parents talks about "no good programming model/OS api", they basically mean an async option that gives the power of threads; threading allows concurrency of page faults, so the kernel is able to perform concurrent reads against the underlying storage media.

Off the top of my head, a model I can think of for supporting concurrent mmap reads might involve a function:

  bool hint_read(void *data, size_t length);

When the caller is going to read various parts of an mmapped region, it can call `hint_read` multiple times beforehand to add regions into a queue. When the next page fault happens, instead of only reading the currently accessed page from disk, it can drain the `hint_read` queue for other pages concurrently. The `bool` return indicates whether the queue was full, so the caller stops making useless `hint_read` calls.

I'm not familiar with userfaultfd, so don't know if it relates to this functionality. The mechanism I came up with is still a bit clunky and probably sub-optimal compared to using io_uring or even `readv`, but these are alternatives to mmap.

3 months ago

1 reply

If you want accessing a particular page to cause a SIGSEGV so your custom fault handler gets invoked, you can just munmap it, converting that access from a "non-failing page fault" into one "deemed to be invalid". Then the mechanism I described would "allow[] concurrency of page faults, so the [userspace threading library] is able to perform concurrent reads against the underlying storage media". As long as you were aggressive enough about unmapping pages that none of your still-mapped pages got swapped out by the kernel. (Or you could use mlock(), maybe.)

I tried implementing your "hint_read" years ago in userspace in a search engine I wrote, by having a "readahead thread" read from pages before the main thread got to them. It made it slower, and I didn't know enough about the kernel to figure out why. I think I could probably make it work now, and Linux's mmap implementation has improved enormously since then, so maybe it would just work right away.

maxdamantus

3 months ago

2 replies

The point about inducing segmentation faults is interesting and sounds like it could work to implement the `hint_read` mechanism. I guess it would mostly be a question of how performant userfaultfd or SIGSEGV handling is. In any case it will be sub-optimal to having it in the kernel's own fault handler, since each userfaultfd read or SIGSEGV callback is already a user-kernel-user switch, and it still needs to perform another system call to do the actual reads, and even more system calls to mmap the bits of memory again.

Presumably having fine-grained mmaps will be another source of overhead. Not to mention that each mmap requires another system call. Instead of a single fault or a single call to `readv`, you're doing many `mmap` calls.

> I tried implementing your "hint_read" years ago in userspace in a search engine I wrote, by having a "readahead thread" read from pages before the main thread got to them.

Yeah, doing it in another thread will also have quite a bit of overhead. You need some sort of synchronisation with the other thread, and ultimately the "readahead" thread will need to induce the disk reads through something other than a page fault to achieve concurrent reads, since within the readahead thread, the page faults are still synchronous, and they don't know what the future page faults will be.

It might help to do `readv` into dummy buffers to force the kernel to load the pages from disk to memory, so the subsequent page faults are minor instead of major. You're still not reducing the number of page faults though, and the total number of mode switches is increased.

Anyway, all of these workarounds are very complicated and will certainly be a lot more overhead than vectored IO, so I would recommend just doing that. The overall point is that using mmap isn't friendly to concurrent reads from disk like io_uring or `readv` is.

Major page faults are basically the same as synchronous read calls, but Golang read calls are asynchronous, so the OS thread can continue doing computation from other Goroutines.

Fundamentally, the benchmarks in this repository are broken because in the mmap case they never read any of the data [0], so there are basically no page faults anyway. With a well-written program, there shouldn't be a reason that mmap would be faster than IO, and vectored IO can obviously be faster in various cases.

[0] Eg, see here where the byte slice is assigned to `_` instead of being used: https://github.com/perbu/mmaps-in-go/blob/7e24f1542f28ef172b...

3 months ago

1 reply

munmap + signal handling is terrible not least of which that you don’t want to be fucking with the page table in that way as an unmap involves a cross cpu TLB shoot down which is slooow in a “make the entire machine slow” kind of way.

3 months ago

That is correct, although my laptop is only four cores.

immibis

3 months ago

1 reply

Inducing segmentation faults is literally how the kernel implements memory mapping, and virtual memory in general, by the way. From the CPU's perspective, that page is unmapped. The kernel gets its equivalent of a SIGSEGV signal (which is a "page fault"=SIGSEGV "interrupt"=signal), checks its own private tables, decides the page is currently on disk, schedules it to be read from disk, does other stuff in the meantime, and when the page has finished being read from disk, it returns from the interrupt.

(It does get even deeper than that: from the CPU's perspective, the interrupt is very brief, just long enough to take note that it happened and avoid switching back to the thread that page-faulted. The rest of the stuff I mentioned, although logically an "interrupt" from the application's perspective, happens with the CPU's "am I handling an interrupt?" flag set to false. This is equivalent to writing a signal handler that sets a flag saying the thread is blocked, edits its own return address so it will return to the scheduler instead of the interrupted code, then calls sigreturn to exit the signal handler.)

3 months ago

There are some differences, including the cross-CPU TLB shootdowns vlovich mentioned.

3 months ago

1 reply

Are you reinventing madvise?

maxdamantus

3 months ago

1 reply

I think the model I described is more precise than madvise. I think madvise would usually be called on large sequences of pages, which is why it has `MADV_RANDOM`, `MADV_SEQUENTIAL` etc. You're not specifying which memory/pages are about to be accessed, but the likely access pattern.

If you're just using mmap to read a file from start to finish, then the `hint_read` mechanism is indeed pointless, since multiple `hint_read` calls would do the same thing as a single `madvise(..., MADV_SEQUENTIAL)` call.

The point of `hint_read`, and indeed io_uring or `readv` is the program knows exactly what parts of the file it wants to read first, so it would be best if those are read concurrently, and preferably using a single system call or page fault (ie, one switch to kernel space).

I would expect the `hint_read` function to push to a queue in thread-local storage, so it shouldn't need a switch to kernel space. User/kernel space switches are slow, in the order of a couple of 10s of millions per second. This is why the vDSO exists, and why the libc buffers writes through `fwrite`/`println`/etc, because function calls within userspace can happen at rates of billions per second.

3 months ago

1 reply

you can do fine grained madvise via io_uring, which indeed uses a queue. But at that point why use mmap at all, just do async reads via io_uring.

3 months ago

1 reply

The entire point I was trying to make at the beginning of the thread is that mmap gives you memory pages in the page cache that the OS can drop on memory pressure. Io_uring is close on the performance and fine-grained access patterns front. It’s not so good on the system-wide cooperative behavior with memory front and has a higher cost as either you’re still copying it from the page cache into a user buffer (non trivial performance impact vs the read itself) + trashing your CPU caches or you’re doing direct I/O and having to implement a page cache manually (and risks duplicating page data inefficiently in userspace if the same file is accessed by multiple processes.

3 months ago

Right, so zero copy IO but still having the ability to share the pagecache across process and allow the kernel to drop caches on high mempressure. One issue is that when under pressure, a process might not really be able to successfully read a page and keep retyring and failing (with an LRU replacement policy it is unlikely and probably self-limiting, but still...).

3 months ago

You’ve actually understood my suggestion - thank you. Unfortunately I think hint_read inherently can’t work because it’s a race condition between the read and how long you access the page. And this race is inherent in any attempted solution that needs to be solved. Signals are also the wrong abstraction mechanism (and are slow and have all sorts of other problems).

You need something more complicated I think, like rseq and futex you have some shared data structure that both understand how to mutate atomically. You could literally use rseq to abort if the page isn’t in memory and then submit an io_uring task to get signaled when it gets paged in again but rseq is a bit too coarse (it’ll trigger on any preemption).

There’s a race condition starvation danger here (it gets evicted between when you get the signal and the sequence completes) but something like this conceptually could maybe be closer to working.

But yes it’s inherently difficult which is why it doesn’t exist but it is higher performance. And yes, this only makes sense for mmap not all allocations so SIGSEGV is irrelevant if looking at today’s kernels.

blibble

3 months ago

1 reply

> There is no sensible OS API that could support this, because fundamentally memory access is a hardware API.

there's nothing magic about demand paging, faulting is one way it can be handled

another could be that the OS could expose the present bit on the PTE to userland, and it has to check it itself, and linux already has asynchronous "please back this virtual address" APIs

> Memory operations are always synchronous because they’re performed directly as a consequence of CPU instructions.

although most CPU instructions may look synchronous they really aren't, the memory controller is quite sophisticated

> Fundamentally there aren’t any sensible ways to improve on this problem, because the problem only exists due to us pretending that our machines have vastly more memory than they actually do. Which comes with tradeoffs, such as having to pause the CPU and steal CPU time to maintain the illusion.

modern demand paging is one possible model that happens to be near universal amongst operating system today

there are many, many other architectures that are possible...

avianlyric

3 months ago

> although most CPU instructions may look synchronous they really aren't, the memory controller is quite sophisticated

I was eliding at lot of details. But my broader point is that from the perspective of the thread being interpreted, the paging process is completely synchronous. Sure advanced x86 CPU maybe be tracking data dependencies between instructions and actively reordering instructions to reduce the impact of the pipeline stalling caused by the page fault. But that’s all low level optimisation that are (or should be) completely invisible to the executing thread.

> there are many, many other architectures that are possible...

I would be curious to see any examples of those alternatives. Demand paging provides a powerful abstraction, and it’s not clear to me how you can sensibly move page management into applications. At a very minimum that would suggest that every programming language would need a memory management runtime capable to predicting possible memory reads ahead of time in a sensible fashion, and triggering its own paging logic.

wmf

3 months ago

3 replies

If C had exceptions a page fault could safely unwind the stack up to the main loop which could work on something else until the page arrives. This has the advantage that there's no cost for the common case of accessing resident pages. Exceptions seem to have fallen out of favor so this may trade one problem for another.

pjmlp

3 months ago

Windows C has exceptions, and no one has ever thought about doing something like this.

They are only used for the same purpose as UNIX signals, without their flaws.

In any case, page faults are OS specific, how to standardise such behaviour, with the added performance loss switching between both userspace and kernel.

3 months ago

C++ has exceptions and having seen the vast majority of code and the way it’s written and the understanding of people writing it, exception safety is a foreign concept. Doing it in C without RAII seems particularly masochistic and doomed to fail.

And unwinding the stack isn’t what you want to do because you’re basically signaling you want to cancel the operation and you’re throwing all the state when you precisely don’t want to do that - you just want to pause the current task and do other I/O in the meantime.

https://dl.acm.org/doi/10.1145/121132.121151

3 months ago

you can longjmp, swapcontext or whatever from a signal handler into another lightweight fiber. The problem is that there is no "until the page arrive" notification. You would have to poll mincore which is awful.

You could of course imagine an ansychronous "mmap complete notification" syscal, but at that point why not just use io_uring, it will be simpler and it has the benefit of actually existing.

twic

3 months ago

1 reply

There isn't today, but there was in 1991, scheduler activations:

The rough idea is that if the kernel blocks a thread on something like a page cache miss, then it notifies the program through something a bit like a signal handler; if the program is doing user-level scheduling, it can then take account of that thread being blocked. The actual mechanism in the paper is more refined than that.

scottlamb

3 months ago

Nice find. That going nowhere seems like classic consequence of the cyclical nature of these things: user-managed concurrency was cool, then it wasn't, then Go (and others) brought it back.

I think the more recent UMCG [1] (kind of a hybrid approach, with threads visible by the kernel but mostly scheduled by userspace) handles this well. Assuming it ever actually lands in upstream, it seems reasonable to guess Go would adopt it, given that both originate within Google.

It's worth pointing out that the slow major page fault problem is not unique to programs using mmap(..., fd, ...). The program binary is implicitly mmaped, and if swap is enabled, even anonymous memory can be paged out. I prefer to lock ~everything [2] into RAM to avoid this, but most programs don't do this, and default ulimits prevent programs running within login shells from locking much if anything.

[1] https://lwn.net/Articles/879398/

[2] particularly on (mostly non-Go) programs with many threads, it's good to avoid locking into RAM the guard pages or stack beyond what is likely to be used, so better not to just use mlockall(MCL_CURRENT | MCL_FUTURE) unfortunately.

3 months ago

This is amazingly good feedback. I hadn't thought of that at all. It is so much harder to reason about the Go runtime as opposed to a threaded application.

habibur

3 months ago

3 replies

Is mmap still faster than fread? That might have been true in the 90s but I was wondering about current improvements.

If you have enough free memory, the file will be cached in memory anyway instead of residing on disk. Therefore both will be reading from memory, albeit through different API.

Looking for recent benchmark or view from OS developers.

do_not_redeem

3 months ago

1 reply

Even if the file is cached, fread has to do a memcpy. mmap doesn't.

3 months ago

1 reply

fread is (usually) buffered io, so it actually does two additional mem copies (kernel to FILE buffer then to user buffer)

assbuttbuttass

3 months ago

1 reply

Not in Go

3 months ago

oh, right, this is Go (https://pkg.go.dev/github.com/odeke-em/go-utils/fread#sectio...). Do the strings it return share memory with the internal buffer?

loeg

3 months ago

read, or fread? fread is the buffered version that does an extra copy for no reason that would benefit this use case.

stingraycharles

3 months ago

In our experience building a high performance database server: absolutely. If your line of thinking is “if you have enough free memory”, then these types of optimizations aren’t for you. one of the main benefits is eliminating an extra copy.

additionally, mmap is heavily optimized for random access, so if that’s what you’re doing, then you’ll have a much better time with it than fread.

(I hope a plug is not frowned upon here: if you like this kind of stuff, we’re a fully remote company and hiring C++ devs: https://apply.workable.com/quasar/j/436B0BEE43/ )

buybackoff

3 months ago

6 replies

It looks suspicious at 25x. Even 2.5x would be suspicious unless reading very small records.

I assume both cases have the file cached in RAM already fully, with a tiny size of 100MB. But the file read based version actually copies the data into a given buffer, which involves cache misses to get data from RAM to L1 for copying. The mmap version just returns the slice and it's discarded immediately, the actual data is not touched at all. Each record is 2 cache lines and with random indices is not prefetched. For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM to L1 should take ~100 nanos.

The benchmark compares actually getting data vs getting data location. Single digit nanos is the scale of good hash tables lookups with data in CPU caches, not actual IO. For fairness, both should use/touch the data, eg copy it.

3 months ago

1 reply

> For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM to L1 should take ~100 nanos.

I think this is the wrong order of magnitude. One core of my Ryzen 5 3500U seems to be able to run memcpy() at 10 gigabytes per second (0.1 nanoseconds per byte) and memset() at 31 gigabytes per second (0.03 nanoseconds per byte). I'd expect a sequential read of 100 bytes to take about 3 nanoseconds, not 100 nanoseconds.

However, I think random accesses do take close to 100 nanoseconds to transmit the starting row and column address and open the row. I haven't measured this on this hardware because I don't have a test I'm confident in.

3 months ago

1 reply

100 nanoseconds from RAM is correct. Latency != bandwidth. 3 nanoseconds would be from cache or so on a Ryzen. You ain't gonna get the benefits of prefetching on the first 100 bytes.

3 months ago

1 reply

Yes, my comment clearly specified that I was talking about sequential reads, which do get the benefits of prefetching, and said, "I think random accesses do take close to 100 nanoseconds".

3 months ago

1 reply

If you're doing large amounts of sequential reads from a filesystem, it's probably not in cache. You only get latency that low if you're doing nothing else that stresses the memory subsystem, which is rather unlikely. Real applications have overhead, which is why microbenchmarks like this are useless. Microbenchmarks are not the best first order estimate for programmers to think of.

3 months ago

1 reply

Yes, I went into more detail on those issues in https://news.ycombinator.com/item?id=45689464, but overhead is irrelevant to the issue we were discussing, which is about how long it takes to read 100 bytes from memory. Microbenchmarks are generally exactly the right way to answer that question.

Memory subsystem bottlenecks are real, but even in real applications, it's common for the memory subsystem to not be the bottleneck. For example, in this case we're discussing system call overhead, which tends to move the system bottleneck inside the CPU (even though a significant part of that effect is due to L1I cache evictions).

Moreover, even if the memory subsystem is the bottleneck, on the system I was measuring, it will not push the sequential memory access time anywhere close to 1 nanosecond per byte. I just don't have enough cores to oversubscribe the memory bus 30×. (1.5×, I think.) Having such a large ratio of processor speed to RAM interconnect bandwidth is in fact very unusual, because it tends to perform very poorly in some workloads.

If microbenchmarks don't give you a pretty good first-order performance estimate, either you're doing the wrong microbenchmarks or you're completely mistaken about what your application's major bottlenecks are (plural, because in a sequential program you can have multiple "bottlenecks", colloquially, unlike in concurrent systens where you almost always havr exactly one bottleneck.) Both of these problems do happen often, but the good news is that they're fixable. But giving up on microbenchmarking will not fix them.

3 months ago

1 reply

If you're bottlenecked on a 100 byte read, the app is probably doing something really stupid, like not using syscalls the way they're supposed to. Buffered I/O has existed from fairly early on in Unix history, and it exists because it is needed to deal with the mismatch between what stupid applications want to do versus the guarantees the kernel has to provide for file I/O.

The main benefit from the mmap approach is that the fast path then avoids all the code the kernel has to execute, the data structures the kernel has to touch, and everything needed to ensure the correctness of the system. In modern systems that means all kinds of synchronization and serialization of the CPU needed to deal with $randomCPUdataleakoftheweek (pipeline flushes ftw!).

However, real applications need to deal with correctness. For example, a real database is not just going to just do 100 byte reads of records. It's going to have to take measures (locks) to ensure the data isn't being written to by another thread.

Rarely is it just a sequential read of the next 100 bytes from a file.

I'm firmly in the camp that focusing on microbenchmarks like this is frequently a waste of time in the general case. You have to look at the application as a whole first. I've implemented optimizations that looked great in a microbenchmark, but showed absolutely no difference whatsoever at the application level.

Moreover, my main hatred for mmap() as a file I/O mechanism is that it moves the context switches when the data is not present in RAM from somewhere obvious (doing a read() or pread() system call) to somewhere implicit (reading 100 bytes from memory that happens to be mmap()ed and was passed as a pointer to a function written by some other poor unknowing programmer). Additionally, read ahead performance for mmap()s when bringing data into RAM is quite a bit slower than on read()s in large part because it means that the application is not providing a hint (the size argument to the read() syscall) to the kernel for how much data to bring in (and if everything is sequential as you claim, your code really should know that ahead of time).

So, sure, your 100 byte read in the ideal case when everything is cached is faster, but warming up the cache is now significantly slower. Is shifting costs that way always the right thing to do? Rarely in my experience.

And if you don't think about it (as there's no obvious pread() syscall anymore), those microseconds and sometimes milliseconds to fault in the page for that 100 byte read will hurt you. It impacts your main event loop, the size of your pool of processes / threads, etc. The programmer needs to think about these things, and the article mentioned none of this. This makes me think that the author is actually quite naive and merely proud in thinking that he discovered the magic Go Faster button without having been burned by the downsides that arise in the Real World from possible overuse of mmap().

3 months ago

1 reply

Perhaps surprisingly, I agree with your entire comment from beginning to end.

Sometimes mmap can be a real win, though. The poster child for this is probably LMDB. Varnish also does pretty well with mmap, though see my caveat on that in my linked comment.

3 months ago

1 reply

Varish was very well done. It's disappointing that with HTTPS-first nowadays there is very little oppourtunity to make good use of local web caches of web content across browsers / clients. Caches would have been a godsend back in the 1990s when we had to use shared dialup to connect to the internet while using NetScape in a classroom full of computers.

https://github.com/LMDB/dbbench/blob/1281588b7fdf119bcba65ce...

3 months ago

Yeah. But it sees a lot of reverse proxy use, and projects like IPFS are exploring the possibilities of securely-locally-cacheable data.

a-dub

3 months ago

1 reply

doing these sorts of benchmarks is actually quite tricky. you must clear the page cache by allocating >1x physical ram before each attempt.

moreover, mmap by default will load lazy, where mmap with MAP_POPULATE will prefetch. in the former case, reporting average operation times is not valid because the access time distributions are not gaussian (they have a one time big hit at first touch). with MAP_POPULATE (linux only), there is long loading delay when mmap is first called, but then the average access times will be very low. when pages are released will be determined by the operating system page cache eviction policy.

the data structure on top is best chosen based on desired runtime characteristics. if it's all going in ram, go ahead and use a standard randomized hash table. if it's too big to fit in ram, designing a structure that is aware of lru style page eviction semantics may make sense (ie, a hash table or other layout that preserves locality for things that are expected to be accessed in a temporally local fashion.)

codedokode

3 months ago

1 reply

> you must clear the page cache

In Linux there is a /proc/sys/vm/drop_caches pseudo file that does this. Look how great Linux is compared to other OSes.

a-dub

3 months ago

1 reply

that's super cool! live and learn. even better would be the capability to drop caches from a supplied point in the filesystem hierarchy.

ahoka

3 months ago

1 reply

People would run it from cron to "free memory", believe it or not.

DoctorOW

3 months ago

Hence, https://www.linuxatemyram.com/

hyc_symas

3 months ago

That's such an obvious error in their benchmark code. In my benchmark code I make sure to touch the data so at least the 1st page is actually paged in from disk.

checker659

3 months ago

Latency Numbers Every Programmer Should Know (originally by Jeff Dean / Peter Norvig)

https://gist.github.com/jboner/2841832

Scaevolus

3 months ago

Yeah, 3.3ns is about 12 CPU cycles. You can indeed create a pointer to a memory location that fast!

Tuna-Fish

3 months ago

> For the CPU AMD Ryzen 7 9800X3D mentioned in the repo, just reading 100 bytes from RAM to L1 should take ~100 nanos.

It's important to note that throughput is not just an inverse of latency, because modern OoO cpus with modern memory subsystems can have hundreds of requests in flight. If your code doesn't serialize accesses, latency numbers are irrelevant to throughput.

MayCXC

3 months ago

1 reply

wowie. mmap also dramatically improved perf for LLaMA: https://justine.lol/mmap/

kristjansson

3 months ago

uh. there was a bit more to the story than 'yup totally unalloyed free lunch'

mholt

3 months ago

1 reply

Just this month, I've learned the hard way that some file systems do not play well with mmap: https://github.com/mattn/go-sqlite3/issues/1355

In my case, it seems that Mac's ExFAT driver is incompatible with sqlite's WAL mode because the driver returned a memory address that is misaligned on ARM64. Most bizarre error I've encountered in years.

So, uh, mind your file systems, kids!

3 months ago

1 reply

I would be very careful about that conclusion. Reading that thread it sounds like you’re relying on Claude to make this conclusion but you haven’t actually verified what the address being returned actually is.

The reason I’m skeptical is three fold. The first is that it’s generally impossible for a filesystem to mmap return a pointer that’s not page boundary aligned. The second is that unaligned accesses are still fine on modern ARM is not a SIGBUS. The third is that Claude’s reasoning that the pointer must be 8-byte aligned and that indicates a misaligned read is flawed - how do you know that SQLite isn’t doing a 2-byte read at that address?

If you really think it’s a bad alignment it should be trivial to reproduce - mmap the file explicitly and print the address or modify the SQLite source to print the mmap location it gets.

mholt

3 months ago

1 reply

I'd love to be wrong, but the address it's referring to is the correct address from the error / stack trace.

I honestly don't know anything about this. There's no search results for my error. ChatGPT and Claude and Grok all agreed one way or another, with various prompts.

Would be happy to have some help verifying any of this. I just know that disabling WAL mode, and not using Mac's ExFAT driver, both fixed the error reliably.

achierius

3 months ago

1 reply

But is that the address being returned by mmap? Furthermore, what instruction is this crashing on? You should be able to look up the specific alignment requirements of that instruction to verify.

> ChatGPT and Claude and Grok all agreed one way or another, with various prompts.

This means less than you'd think: they're all trained on a similar corpus, and Grok in particular is probably at least partially distilled from Claude. So they tend to come to similar conclusions given similar data.

mholt

3 months ago

1 reply

I believe it's being returned by the FS driver, not mmap() necessarily. I think I knew what instruction it was when I was debugging it but don't remember right now. (I could probably dig through my LLM history and get it though.)

And yeah, I knew AI is useless, I try to avoid it, but when I'm way over my head it's better than nothing (it did lead me to the workaround that I mentioned in my previous comment).

3 months ago

If it was in the FS driver (w which runs in kernel / different process?) why would your process be dying?

commandersaki

3 months ago

1 reply

This is a good article but I'm wondering what is the relationship between this website/company and varnish-cache.org, since in the article they make claims of releasing Varnish Cache, and the article wasn't written by Poul-Henning Kamp.

wmf

3 months ago

1 reply

Varnish hasn't been a solo project for many years. Also PHK's version is now called Vinyl Cache while the corporate fork is called Varnish.

commandersaki

3 months ago

1 reply

The article says "when we launched Varnish Cache back in 2006". Who is we? My memory was that around that time PHK released it to the world and was the sole developer at the time.

3 months ago

1 reply

I was wondering about this too. He apparently worked at the company for a while? Did he found it?

3 months ago

1 reply

Yes. When Varnish Cache launched, in 2006, I worked in a rather small OSS consultancy, which did the Linux port of Varnish Cache and provided maintenance and funding for the project.

3 months ago

1 reply

You say, "Yes. When Varnish Cache launched, in 2006, I worked in a rather small OSS consultancy, which did the Linux port of Varnish Cache and provided maintenance and funding for the project."

But eventually phk left, and you came into conflict with him over the name, which was resolved by him choosing a different name for his version of Varnish?

3 months ago

2 replies

Not really.

We've been funding phks work on Varnish and Vinyl cache for 20 years. Do you think phk can write, maintain and release something on his own? Vinyl Cache cannot be a one-man-show, be real.

3 months ago

1 reply

(I do, in fact, think phk can write, maintain, and release something on his own.)

3 months ago

He knows a lot of things and is amongst the best software developers I've worked with, but on a project like this you need a lot more breath than any single developer can bring.

3 months ago

I see. Thank you for explaining!

gustavpaul

3 months ago

3 replies

The MmapReader is not copying the requested byte range into the buf argument, so if ever the underlying file descriptor is closed (or the file truncated out of band) any subsequent slice access will throw SIGBUS, which is really unpleasant.

It also means the latency due to pagefaults is shifted from inside mmapReader.ReadRecord() (where it would be expected) to wherever in the application the bytes are first accessed, leading to spooky unpreditactable latency spikes in what are otherwise pure functions. That inevitably leads to wild arguments about how bad GC stalls are :-)

An apples to apples comparison should be copying the bytes from the mmap buffer and returning the resulting slice.

dahfizz

3 months ago

Being able to avoid an extra copy is actually a huge performance gain when you can safely do it. You shouldn't discount how useful mmap is just because its not useful in every scenario.

You shouldn't replace every single file access with mmap. But when it makes sense, mmap is a big performance win.

dapperdrake

3 months ago

It’s not accessible until it is in user space. (Virtual memory addresses mapped to physical RAM holding the data.)

Good point.

loeg

3 months ago

> so if ever the underlying file descriptor is closed

Nit: Mmap mapping lifetimes are not attached to the underlying fd. The file truncation and latency concerns are valid, though.

3 months ago

2 replies

The simple answer to "How do memory maps (mmap) deliver faster file access?" is "sometimes", but the blog post does give some more details.

I was suspicious of the 25× speedup claim, but it's a lot more plausible than I thought.

On this Ryzen 5 3500U running mostly at 3.667GHz (poorly controlled), reading data from an already-memory-mapped page is as fast as memcpy (about 10 gigabytes per second when not cached on one core of my laptop, which works out to 0.1 nanoseconds per byte, plus about 20 nanoseconds of overhead) while lseek+read is two system calls (590ns each) plus copying bytes into userspace (26–30ps per byte for small calls, 120ps per byte for a few megabytes). Small memcpy (from, as it happens, an mmapped page) also costs about 25ps per byte, plus about 2800ps per loop iteration, probably much of which is incrementing the loop counter and passing arguments to the memcpy function (GCC is emitting an actual call to memcpy, via the PLT).

So mmap will always be faster than lseek+read on this machine, at least if it doesn't have a page fault, but the point at which memcpy from mmap would be 25× faster than lseek+read would be where 2×590 + .028n = 25×(2.8 + .025n) = 70 + .625n. Which is to say 1110 = .597n ∴ n = 1110/.597 = 1859 bytes. At that point, memcpy from mmap should be 49ns and lseek+read should be 1232ns, which is 25× as big. You can cut that size more than in half if you use pread() instead of lseek+read, and presumably io_uring would cut it even more. If we assume that we're also taking cache misses to bring in the data from main memory in both cases, we have 2×590 + .1n = 25×(2.8 + .1n) = 70 + 2.5n, so 1110 = 2.4n ∴ n = 1110/2.4 = 462 bytes.

On the other hand, mmap will be slow if it's hitting a page fault, which sort of corresponds to the case where you could have cached the result of lseek+read in private RAM, which you could do on a smaller-than-pagesize granularity, which potentially means you could hit the slow path much less often for a given working set. And lseek+read has several possible ways to do make the I/O asynchronous, while the only way to make mmap page faults asynchronous is to hit the page faults in different threads, which is a pretty heavyweight mechanism.

On the other hand, lseek+read with a software cache is sort of using twice as much memory (one copy is in the kernel's buffer cache and another copy is in the application's software cache) so mmap could still win. And, if there are other processes writing to the data being queried, you need some way to invalidate the software cache, which can be expensive.

(On the gripping hand, if you're reading from shared memory while other processes are updating it, you're probably going to need some kind of locking or lock-free synchronization with those other processes.)

So I think a reasonably architected lseek+read (or pread) approach to the problem might be a little faster or a little slower than the mmap approach, but the gap definitely won't be 25×. But very simple applications or libraries, or libraries where many processes might be simultaneously accessing the same data, could indeed get 25× or even 256× performance improvements by letting the kernel manage the cache instead of trying to do it themselves.

Someone at a large user of Varnish told me they've mostly removed mmap from their Varnish fork for performance.

loeg

3 months ago

1 reply

> lseek+read is two system calls

You'd never do that, though -- you'd use pread.

3 months ago

The article I'm commenting on said its author used seek and read, so I don't know if maybe for some reason they did do that instead of pread(), which it also mentioned. I didn't want to optimistically assume otherwise. Is pread() available in the Golang standard library?

https://github.com/golang/go/issues/19563 is someone using os.File.ReadAt, which is a method name that makes me even more uncertain. But there's also syscall.Pread apparently, so it should be fine?

If you are making only one system call, the 25× crossover point is 800-some bytes by my measurements.

3 months ago

It's worth reading bcrl's comment at https://news.ycombinator.com/item?id=45690006 for more depth on some of these issues.

Animats

3 months ago

1 reply

I never knew that Linux memory mapped files were copy-on-write. I'd assumed they let you alter the page and wrote out dirty pages later.

pengaru

3 months ago

MAP_PRIVATE vs. MAP_SHARED

charlietap

3 months ago

3 replies

This article is nonsensical. If you're reading this please don't start mmap'ing files just to read from them. It proposes an incredibly unrealistic scenario where the program is making thousands of random incredibly small unbuffered reads from a file. In reality 99 percent of programs will sequentially reading bytes into a buffer which makes orders of magnitude less syscalls.

Mmap is useful in niche scenarios, it's not magic.

karel-3d

3 months ago

That is not unrealistic if you are using the file to save binary data on given positions and don't need to read all data. For example if you have a big matrix of fixed size structs and you need to read only some of them.

[0] https://www.computerenhance.com/p/memory-mapped-files

3 months ago

This is a niche scenario. The scenario outlined is reading CDB databases.

icedchai

3 months ago

At a previous company, we had a custom "database" (I use that term very loosely) built on memory mapped files. All startup, all pages were read to ensure the data was hot, unlikely to be any page faults. It worked well for the application, but obviously because the whole thing fit in memory and was preloaded. We also had our own custom write-ahead-log. Today, I'd probably use sqlite.

philippta

3 months ago

2 replies

At computerenhance.com[0] Casey Muratori shows that memory mapped files actually perform worse at sequential reads, which is the common case for file access.

That’s because the CPU won’t prefetch data as effectively and has to rely on page faults to know what to read next. With regular, sequential file reads, the CPU can be much smarter and prefetch the next page while the program is consuming the previous one.

3 months ago

Io_uring should be outperforming both - you can configure the read ahead optimally, there’s no page faults, and there’s no copies as there is with buffered I/O:

[1] https://archive.is/vkdCo

atombender

3 months ago

Does madvise(..., MADV_SEQUENTIAL) not help here?

karel-3d

3 months ago

1 reply

mmap is fine when you know the file fits in memory, and you need random file reads/writes of only some parts of the file. It's not magic.

It's also quite hard to debug in go, because mmaped files are not visible in pprof; whe you run out of memory, mmap starts behaving really suboptimally. And it's hard to see which file takes how much memory (again it doesn't show in pprof).

3 months ago

2 replies

random reads are ok. writes through a mmap are a disaster.

vlowther

3 months ago

Only if you are doing in-place updates. If append-only datastores are your jam, writes via mmap are Just Fine:

  $ go test -v
  === RUN   TestChunkOps
      chunk_test.go:26: Checking basic persistence and Store expansion.
      chunk_test.go:74: Checking close and reopen read-only
      chunk_test.go:106: Checking that readonly blocks write ops
      chunk_test.go:116: Checking Clear
      chunk_test.go:175: Checking interrupted write
  --- PASS: TestChunkOps (0.06s)
  === RUN   TestEncWriteSpeed
      chunk_test.go:246: Wrote 1443 MB/s
      chunk_test.go:264: Read 5525.418751 MB/s
  --- PASS: TestEncWriteSpeed (1.42s)
  === RUN   TestPlaintextWriteSpeed
      chunk_test.go:301: Wrote 1693 MB/s
      chunk_test.go:319: Read 10528.744206 MB/s
  --- PASS: TestPlaintextWriteSpeed (1.36s)
  PASS

karel-3d

3 months ago

Interesting. We use it to mmap a big file that just-enough fits in memory and we mostly read randomly (around 1/1000 of the file) and sometimes sparringly write and it works great. I haven't tested how would it work more fast with Seek/Read/Write, but mmap is very ergonomic for this; it just acts as a slice and it's basically invisible; and you can then take the slice of bytes and unsafe.cast it as slice of something else. Very easy.

benjiro

3 months ago

2 replies

People are so focused on the mmap part, and the latency, that the usage is overlooked.

> The last couple of weeks I've been working on an HTTP-backed filesystem.

It feels like this is micro optimizations, that are going to get blocked anyway by the whole HTTP cycle anyway.

There is also the benchmark issue:

The enhanced CDB format seems to be focused on a read only benefits, as writes introduced a lot of latency, and issue with mmap. In other words, there is a need to freeze for the mmap, then unfreeze, write for updates, freeze for mmap ...

This cycle introduces overhead, does it not? Has this been benchmarked? Because from what i am seeing, the benefits are mostly in the frozen state (aka read only).

If the data is changed infrequently, why not just use json? No matter how slow it is, if your just going to do http requests for the directory listing, your overhead is not the actual file format.

If this enhanced file format was used as file storage, and you want to be able to fast read files, that is a different matter. Then there are ways around it with keeping "part" files where files 1 ... 1000 are in file.01, 2 ... 2000 in file.02 (thus reducing overhead from the file system). And those are memory mapped for fast reading. And where updates are invalidated files/rewrites (as i do not see any delete/vacume ability in the file format).

So, the actual benefits just for a file directory listing db escapes me.

3 months ago

1 reply

We need to support over 10M files in each folder. JSON wouldn't fare well as the lack of indices makes random access problematic. Composing a JSON file with many objects is, at least with the current JSON implementation, not feasible.

CDB is only a transport medium. The data originates in PostgreSQL and upon request, stored in CDB and transferred. Writing/freezing to CDB is faster than encoding JSON.

CDB also makes it possible to access it directly, with ranged HTTP requests. It isn't something I've implemented, but having the option to do so is nice.

benjiro

3 months ago

1 reply

> CDB is only a transport medium. The data originates in PostgreSQL and upon request, stored in CDB and transferred. Writing/freezing to CDB is faster than encoding JSON.

Might have been interesting to actually include this in the article, do you not think so? ;-)

The way the article is written, made it seen that you used cdb on edge nodes to store metadata. With no information as to what your storing / access, how, why ... This is part of the reason we have these discussions here.