Peeking Inside Gigantic Zips with Only Kilobytes
Posted 3 months ago · Active 2 months ago
Source: ritiksahni.com · Tech story
Key topics
Zip File Format
Data Compression
File System Optimization
The article discusses a technique to efficiently inspect large zip files without downloading the entire file, sparking discussion on its applications and potential improvements.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: N/A
Peak period: 21 comments (Day 5)
Avg / period: 4.1
Comment distribution: 29 data points (based on 29 loaded comments)
Key moments
1. Story posted: Oct 12, 2025 at 5:57 AM EDT (3 months ago)
2. First comment: Oct 12, 2025 at 5:57 AM EDT (0s after posting)
3. Peak activity: 21 comments in Day 5, the hottest window of the conversation
4. Latest activity: Oct 27, 2025 at 7:03 AM EDT (2 months ago)
ID: 45556904 · Type: story · Last synced: 11/20/2025, 12:38:35 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
1) The format has limited and archaic support for file metadata - e.g. file modification times are stored as an MS-DOS timestamp with a 2-second (!) resolution, and there's no standard system for representing other metadata.
2) The single-level central directory can be awkward to work with for archives containing a very large number of members.
3) Support for 64-bit file sizes exists but is a messy hack.
4) Compression operates on each file as a separate stream, reducing its effectiveness for archives containing many small files. The format does support pluggable compression methods, but there's no straightforward way to support "solid" compression.
5) There is technically no way to reliably identify a ZIP file, as the end of central directory record can appear at any location near the end of the file, and the file can contain arbitrary data at its start. Most tools recognize ZIP files by the presence of a local file header at the start ("PK\x03\x04"), but that's not reliable.
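To make point 5 concrete, here is a minimal Python sketch (my own illustration, not code from the article or the comments) of locating the end of central directory record by scanning the last 64 KB + 22 bytes of a file; a robust parser would also validate the record and handle ZIP64:

```python
import struct

EOCD_SIG = b"PK\x05\x06"    # end of central directory signature
MAX_TAIL = 22 + 65535       # fixed EOCD size plus the maximum comment length


def find_central_directory(path):
    """Scan the file tail for the EOCD record; return (offset, size, entry count)."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        file_size = f.tell()
        tail_len = min(file_size, MAX_TAIL)
        f.seek(file_size - tail_len)
        tail = f.read(tail_len)
    pos = tail.rfind(EOCD_SIG)          # a real parser also validates the record
    if pos == -1:
        raise ValueError("no end of central directory record found")
    # EOCD fields at +10: total entries (u16), directory size (u32), directory offset (u32)
    entries, cd_size, cd_offset = struct.unpack("<HII", tail[pos + 10:pos + 20])
    return cd_offset, cd_size, entries
```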
I do it by ignoring ZIP's native compression entirely, using store-only ZIP files and then compressing the whole thing at the filesystem level instead.
Here's an example comparison of the same WWW site rip in a DEFLATE ZIP, in a store-only ZIP with zstd filesystem compression, in a tar with the same zstd filesystem compression (identical size, but less useful for seeking than ZIP due to the lack of a trailing directory), and finally the raw size pre-zipping:
This probably wouldn't help GP with their need for HTTP seeking, since their HTTP server would incur a decompress+recompress at the filesystem boundary.

The last example in my list of four file sizes is them in a folder. Filesystem compression works at the file level, so you have to turn many almost-identical files into one file in order to benefit from it. ZFS does have block-level deduplication, but that's its own can of worms that shouldn't be turned on flippantly, due to the resource requirements and `recordsize` tuning needed to really benefit from it.
although zfs dedup is probably better in 2025
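As a rough illustration of the store-only approach described a few comments up (a sketch using Python's zipfile, not the commenter's actual workflow), the archive is written with ZIP_STORED and compression is left to the filesystem:

```python
import os
import zipfile


def make_store_only_zip(src_dir: str, dest_zip: str) -> None:
    """Archive a directory without compressing members; a compressing
    filesystem (e.g. zstd on ZFS or Btrfs) then compresses the single file."""
    with zipfile.ZipFile(dest_zip, "w", compression=zipfile.ZIP_STORED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                path = os.path.join(root, name)
                zf.write(path, arcname=os.path.relpath(path, src_dir))


# make_store_only_zip("site-rip/", "site-rip.zip")
```

The archive keeps its trailing central directory, so it stays cheap to seek into, while the filesystem's zstd compression handles the size.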
As far as the last (file type detection) goes, the generally agreed upon standard is that file formats should be "sniffable" by looking for a signature in the file's header - ideally within the first few bytes of the file. Having to search through 64 KB of the file's end for a signature is a major departure from that pattern.
I think the general pattern - using the range header + prior knowledge of a file format to only download the parts of a file that are relevant - is still really underutilized.
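A minimal sketch of that pattern applied to ZIP (my own illustration; the class name HttpRangeFile is made up, and it assumes the server honours Range requests and reports Content-Length): wrap the URL in a seekable file-like object whose reads become Range requests, then let Python's zipfile parse only the end of central directory record and the central directory itself:

```python
import io
import urllib.request
import zipfile


class HttpRangeFile(io.RawIOBase):
    """Seekable read-only view of a remote file, fetched piecewise via Range requests."""

    def __init__(self, url):
        self.url = url
        self.pos = 0
        # Ask the server for the total size up front (assumes Content-Length is sent).
        head = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(head) as resp:
            self.size = int(resp.headers["Content-Length"])

    def seekable(self):
        return True

    def readable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        elif whence == io.SEEK_END:
            self.pos = self.size + offset
        return self.pos

    def read(self, n=-1):
        if n is None or n < 0:
            n = self.size - self.pos
        if n == 0 or self.pos >= self.size:
            return b""
        end = min(self.pos + n, self.size) - 1
        req = urllib.request.Request(
            self.url, headers={"Range": f"bytes={self.pos}-{end}"}
        )
        with urllib.request.urlopen(req) as resp:
            data = resp.read()  # only the requested byte range travels over the wire
        self.pos += len(data)
        return data


# Usage: list members of a large remote archive while transferring only kilobytes.
# zf = zipfile.ZipFile(HttpRangeFile("https://example.com/huge.zip"))
# print(zf.namelist())
```

Listing the members of a multi-gigabyte archive this way transfers only the file's tail plus the central directory, typically kilobytes rather than the whole file.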
One small problem I see is that a server that does not support range requests would just try to send you the entire file in the first request, I think.
So maybe doing a preflight HEAD request first to see if the server sends back Accept-Ranges could be useful.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Guides/Ran...
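A quick sketch of that preflight (again my own illustration): check the Accept-Ranges header with a HEAD request. As a second guard, a server that ignores Range will answer a ranged GET with 200 and the full body rather than 206 Partial Content.

```python
import urllib.request


def supports_byte_ranges(url: str) -> bool:
    """Preflight HEAD request: True if the server advertises byte-range support."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Accept-Ranges", "").lower() == "bytes"
```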
For static files served by CDNs or "established" HTTP servers, I think support is pretty much a given (though e.g. Python's FastAPI only got support in 2020 [1]), but for anything dynamic, I doubt many devs would go through the trouble of implementing support if it wasn't strictly necessary for their use case.
E.g. the URL may point to a service endpoint that loads the file contents from a database or blob storage instead of the file system. Then the service would have to implement range support itself and translate ranges into the necessary storage/database calls (if those even exist), etc. That's some effort you have to put in.
Even for static files, there may be reverse proxies in front that (unintentionally) remove the support again. E.g. [2]
[1] https://github.com/Kludex/starlette/issues/950
[2] https://caddy.community/t/cannot-seek-further-in-videos-usin...
https://blog.nella.org/2016/01/17/seeking-http/
(Originally written for Advent of Go.)
[0] https://github.com/saulpw/unzip-http/
I'll dig up a link.
Tangential, but any Free Software that uses `shared-mime-info` to identify files (any of your GNOMEs, KDEs, etc.) is unable to correctly identify ZIP files by their EOCD, due to the lack of an accepted syntax for defining search patterns based on negative file offsets. Please show your support on this issue if you would also like to see it resolved: https://gitlab.freedesktop.org/xdg/shared-mime-info/-/issues... (linking to my own comment, so no, this is not brigading)
Anything using `file(1)` does not have this problem: https://github.com/file/file/blob/280e121/magic/Magdir/zip#L...