Zpdf
github.com
Key Features
Tech Stack
~17k pages/sec peak throughput.
Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.
~5,000 lines, no dependencies, compiles in <2s.
Why it's fast:
- Memory-mapped file I/O (no read syscalls)
- Zero-copy parsing where possible
- SIMD-accelerated string search for finding PDF structures
- Parallel extraction across pages using Zig's thread pool
- Streaming output (no intermediate allocations for extracted text)
What it handles:
- XRef tables and streams (PDF 1.5+)
- Incremental PDF updates (/Prev chain)
- FlateDecode, ASCII85, LZW, RunLength decompression
- Font encodings: WinAnsi, MacRoman, ToUnicode CMap
- CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
From https://github.com/Lulzx/zpdf/blob/main/src/main.zig it looks like the help text cites an unimplemented "-j" option to enable multiple threads.
There is a "--parallel" option, but that is only implemented for the "bench" command.
SIMD is always compiled in via Zig's @Vector; Zig lowers vectors to the best instructions available at compile time.
I can't talk about the code, but the readme and commit messages are most likely LLM-generated.
And when you take into account that the first commit happened just three hours ago, it feels like the entire project has been vibe coded.
2. Be not good at, or a fan of, git when committing
Not sure what the disconnect is.
Now if it were vibe-coded, I wouldn't be surprised. But benefit of the doubt.
I don't particularly care, though, and I'm more positive about LLMs than negative even if I don't (yet?) use them very much. I think it's hilarious that a few people asked for Python bindings and then bam, done, and one person is like "..wha?" Yes, LLMs can do that sort of grunt work now! How cool, if kind of pointless. Couldn't the cycles have just been spent on trying to make muPDF better? Though I see they're in C and AGPL, I suppose either is motivation enough to do a rewrite instead.
If the intent of "benefit of the doubt" is to reduce people having a freak out over anyone who dares use these tools, I get that.
I'll try my best to make it a really good one!
You still have no basis for claiming copyright protection, hence you cannot set a license on that code.
Instead of the WTFPL, you should just write a disclaimer that, being machine-generated and devoid of creative work, the code is not protected by copyright and is free to use without any license.
You avoid an unnecessary copy. A normal read system call gets the data from the disk hardware into the kernel page cache and then copies it into the buffer you provide in your process memory. With mmap, the page cache is mapped directly into your process memory: no copy.
All running processes share the mapped copy of the file.
There are a lot of downsides to mmap: you lose explicit error handling and fine-grained control of when exactly I/O happens. Consult the classic article on why sophisticated systems like DBMSs do not use mmap: https://db.cs.cmu.edu/mmap-cidr2022/
I now wonder which use cases mmap would suit better - if any...
> All running processes share the mapped copy of the file.
So something like building linkers, which deal with read-only shared libraries, "plugins", etc.?
* You want your program to crash on any I/O error because you wouldn't handle them anyway
* You value the programming convenience of being able to treat a file on disk as if the entire thing exists in memory
* The performance is good enough for your use. As the article showed, sequential scan performance is as good as direct I/O until the page cache fills up *from a single SSD*, and random access performance is as good as direct I/O until the page cache fills up *if you use MADV_RANDOM*. If your data doesn't fit in memory, or is across multiple storage devices, or you don't correctly advise the OS about your access patterns, mmap will probably be much slower
To be clear, normal I/O still benefits from the OS's shared page cache: files that other processes have loaded will probably still be in memory, avoiding a wait on the storage device. But each normal-I/O process incurs the space and time cost of a copy into its private memory, unlike mmap.
I've never had to use mmap, but this has always been the issue in my head. If you're treating I/O as memory pages, what happens when you read a page and it needs to "fault" by reading the backing storage, but the storage fails to deliver? What can be said at that point, or does the program crash?
SQLite does use mmap (or can, optionally). How come?
Is SQLite with mmap less reliable or anything?
... it will if two programs open the same SQLite database, one with mmap and another without (https://www.sqlite.org/mmap.html), at least "in some operating systems" (no mention of which ones)
https://www.sqlite.org/mmap.html
> The operating system must have a unified buffer cache in order for the memory-mapped I/O extension to work correctly, especially in situations where two processes are accessing the same database file and one process is using memory-mapped I/O while the other is not. Not all operating systems have a unified buffer cache. In some operating systems that claim to have a unified buffer cache, the implementation is buggy and can lead to corrupt databases.
SQLite is otherwise rock solid and won't lose data as easily.
SQLite can use mmap(). That is a tested and supported capability. But we don't advocate it because of the inability to precisely identify I/O errors and report them back up into the application.
https://www.sqlite.org/mmap.html
> The operating system must have a unified buffer cache in order for the memory-mapped I/O extension to work correctly, especially in situations where two processes are accessing the same database file and one process is using memory-mapped I/O while the other is not. Not all operating systems have a unified buffer cache. In some operating systems that claim to have a unified buffer cache, the implementation is buggy and can lead to corrupt databases.
What are those OSes with buggy unified buffer caches? More importantly, is there a list of platforms where the use of mmap in sqlite can lead to data loss?
You didn't. Claude did. Like it did write this comment.
And you didn't even bother testing it before submitting.
A better speed comparison would be either multi-process pdfium (pdfium was forked from Foxit before multi-thread support, so you can't thread it), multi-threaded Foxit, or something like Syncfusion (which is quite fast and supports multiple threads).
These were always the fastest/best options. I can (and do) achieve 41k pages/sec or better on these options.
The other thing you don't appear to mention is whether you put the words in appearance order (i.e. how they appear on the page), or only stream order (which varies in its relation to appearance order).
If it's only stream order, sure, that's really fast to do. But also not anywhere near as helpful as appearance order, which is what other text-extraction engines do.
Contrasted with python, which is interpreted, has a clunky runtime, minimal optimizations, and all sorts of choices that result in slow, redundant, and also slow, performance.
The price for performance is safety checks, redundancy, how badly wrong things can go, and so on.
A good compromise is luajit - you get some of the same aggressive optimizations, but in an interpreted language, with better-than-c performance but interpreted language convenience, access to low level things that can explode just as spectacularly as with zig or c, but also a beautiful language.
It may be easier to write code that runs faster in Zig than in C under similar build optimization levels, because writing high performance C code looks a lot like writing idiomatic Zig code. The Zig standard library offers a lot of structures like hash maps, SIMD primitives, and allocators with different performance characteristics to better fit a given use-case. C application code often skips on these things simply because it is a lot more friction to do in C than in Zig.
machine code, not https://en.wikipedia.org/wiki/Bytecode
> The price for performance is safety checks
In Zig, non-ReleaseFast build modes have significant safety checks.
> luajit ... with better-than-c performance
No.
Don't disagree, but in this specific case, per the author, the project was made via Claude Code. Though it could just as well be that Zig is a better LLM target; I've noticed many new vibe-coded projects pick Zig.
the licensing is a huge blocker for using mupdf in non-OSS tools, so it's very nice to see this is MIT
python bindings would be good too
also, added python bindings.
As others have commented, while this is a nice portfolio piece, I would worry about its longevity as a vibe-coded project.
The author has created 30 new projects on GitHub over the past month alone, and he also happens to have an LLM-generated blog. I think it's fair to say it's not “legitimately useful” except as a way for the author to fill his resume as he's looking for a job.
This kind of behavior is toxic.
I actually don’t mind LLM-generated code when it’s been manually reviewed, but this and a quick look through other submissions make me realise the author is simply trying to pad their resume with OSS projects. Respect the hustle, but it shows a lack of respect for others’ time to then submit it to Show HN.
I'm not convinced that projects vibe coded over the evening deserve the HN front page…
If it's really better than what we had before, what does it matter how it was made? It's literally hacked together with the tools of the day (LLMs) isn't that the very hacker ethos? Patching stuff together that works in a new and useful way.
5x speed improvements on pdf text extraction might be great for some applications I'm not aware of, I wouldn't just dismiss it out of hand because the author used $robot to write the code.
Presumably the thought to make the thing in the first place and decide what features to add and not add was more important than how the code is generated?
That's a very big if. The whole point is that what we had before was made slowly, while this was made quickly. Slowness in itself isn't better, but it typically means hours and hours of testing: going through painful problems that highlight idiosyncrasies of the problem space, things that are really weird and specific to whatever the tool is trying to address.
In such cases we can expect that, in very little time, very few things were tested, and tested properly (one comment mentioned how the tests were also generated). "We", the audience of potentially interested users, then have to do that work (as plenty did, commenting on that post).
IMHO what you bring forward is precisely this:
- can the new "solution" actually pass ALL the tests the previous one did? More?
This should be brought to the top, and the actual compromises can then be understood; "we" can then decide if it's "better" for our context. In some cases faster-with-lossy-output is actually better, in others absolutely not. The difference between the new and the old solutions isn't binary, and having no visibility into it is what makes such a process nothing more than yet another showcase that LLMs can indeed produce "something", while being absolutely boring and consuming a TON of resources, including our own attention.
TL;DR: there should be a test "harness", made by third parties (or taken from the well-known software the new code is closest to), that an LLM-generated piece of code should pass before actually being compared.
~/c/t/s/zpdf (main)> zig version
0.15.2
Sky is blue, water is wet, slop does not work.
Using AI lazily is a problem, though. Writing code has never been the most important part of software development; making sure that the code does what the user needs is what takes most of the time. But from the GitHub issues, and the comments here from the few who have tested the tool, it looks like the author didn't even test the AI output on real PDFs.
If you use AI to build in 3 months something that would have taken a year without it, then cool. But here we're talking about someone who's spending 2-3 hours every other day building a new fake software project to pad his resume. This isn't something anyone should endorse.
01F9020101FC020401F9020301FB02070205020800030209020701FF01F90203020901F9012D020A0201020101FF01FB01FE0208
0200012E0219021802160218013202120222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
020301FF02000205020101FC020901F90003020001F9020701F9020E020802000205020A
01FC028C0213021B022002230221021800030200012E021902180216021201320221021A012E00030209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
0200020D02030208020901F90203020901FF0203020502080003012B020001F9012B020001F901FA0205020A01FD01FE0208
020201300132012E012F021A012F0210021B013202200221012E0222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C
This is for the entire book. Mutool extracts the text just fine.
ΑΛΕΞΑΝΔΡΟΣ ΤΡΙΑΝΤΑΦΥΛΛΙΔΗΣ Καθηγητής Τμήματος Βιολογίας, ΑΠΘ
ΝΙΚΟΛΕΤΑ ΚΑΡΑΪΣΚΟΥ
Επίκουρη Καθηγήτρια Τμήματος Βιολογίας, ΑΠΘ
ΚΩΝΣΤΑΝΤΙΝΟΣ ΓΚΑΓΚΑΒΟΥΖΗΣ
Μεταδιδάκτορας Τμήματος Βιολογίας, ΑΠΘ
Γονιδιώματα
Δομή, Λειτουργία και Εφαρμογές
I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.
Would be interesting to see benchmarks on different PDF types - academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.
zpdf extract texbook.pdf | grep -m1 Stanford
DONALD E. KNUTHStanford UniversityIllustrations by
fpdf
jpdf
cpdf
cpppdf
bfpdf
ppdf
...
opdf