My Zip Isn't Your Zip: Identifying and Exploiting Semantic Gaps Between Parsers
Source: usenix.org (Tech, story)
Key topics
Security
Parser Differentials
Zip File Format
Vulnerability Research
A research paper identifies and exploits semantic gaps between different ZIP file parsers, highlighting potential security vulnerabilities, and sparking discussion on the implications and potential mitigations.
Snapshot generated from the HN discussion
Discussion activity (based on 33 loaded comments)
- Story posted: Aug 21, 2025 at 4:58 AM EDT
- First comment: Aug 24, 2025 at 6:59 PM EDT (4d after posting)
- Peak activity: 20 comments in the 84-96h window (average 8.3 per period)
- Latest activity: Aug 26, 2025 at 10:39 AM EDT
ID: 44970583 | Type: story | Last synced: 11/20/2025, 2:43:43 PM
Both parsers could be buggy, but when they have different kinds of bugs, you get a zero-click, undetectable exploit.
[1]: https://libzip.org/documentation/zip_open.html#DESCRIPTION
zip is the container around it
The one legit-practical attack I see is the one where they trick the VS Code Extension marketplace into serving extensions with trusted publishers, but even there I'm struck by the fact that the security model for verifying extensions would depend on ZIP metadata.
I do not at all mean to talk this work down; this is my favorite species of vulnerability research, and I can see why it did well at Usenix Security.
1. Authenticode signatures have unauthenticated sections.
2. ZIP files don't require headers.
So you can shove a ZIP file (i.e. JAR, DOCM, APK, etc.) into a signed Windows executable without breaking its signature, and then depending on the extension it will do any number of things when clicked.
(The extent to which this works has changed a lot in the intervening years, but prior to a patch in 2013 it was especially bad, and the patches never made their way into the spec, so custom Authenticode validators like Wine's or, say, the one in Palo Alto Networks gear, were still vulnerable the last time I checked.)
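The "ZIP files don't require headers" point can be sketched in a few lines. This is only an illustration of the ZIP half of the trick, not of Authenticode: the 64-byte `MZ` stub below is a made-up placeholder rather than a real signed executable, and `sniff_by_magic` is a hypothetical stand-in for a tool that classifies files by their leading bytes. Python's `zipfile` locates the archive from the end-of-central-directory record near the end of the file, so it should still open the archive despite the prepended data.

```python
import io
import zipfile


def build_polyglot() -> bytes:
    """Return bytes that start like an executable but end with a valid ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("payload.txt", b"hidden in plain sight")
    # Placeholder stub (assumption): "MZ" magic plus padding, NOT a real PE file.
    fake_exe_stub = b"MZ" + b"\x00" * 62
    return fake_exe_stub + buf.getvalue()


def sniff_by_magic(raw: bytes) -> str:
    """Hypothetical classifier that only looks at the first bytes of the file."""
    if raw.startswith(b"MZ"):
        return "executable"
    if raw.startswith(b"PK\x03\x04"):
        return "zip"
    return "unknown"


def list_zip_entries(raw: bytes) -> list:
    """Open the same bytes as a ZIP; zipfile tolerates data before the archive."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.namelist()


if __name__ == "__main__":
    blob = build_polyglot()
    print(sniff_by_magic(blob))    # classification by leading bytes
    print(list_zip_entries(blob))  # entries found by parsing from the end
```

The same bytes are an "executable" to one consumer and an archive to another, which is the property the comment relies on.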
Anyway, at the same time:
1. Cybersecurity products lean on Authenticode to keep false positives down for specific publishers.
2. Those same products cache everything by hash without regard for file type.
Put all of this together and you could, as of 2020 at least, not only execute whatever you wanted, you could also have it misreported by CrowdStrike or whoever as a signed Windows component.
Fun stuff, but I agree that it's kind of marginal.
> security scanners are a simple example, but Linux distros, Homebrew, etc. all also process Python package distributions in ways that mostly just assume a ZIP container, without additionally trying to exactly match how Python's `zipfile` behaves
<https://news.ycombinator.com/item?id=44829881>
This doesn't necessarily unlock any new capabilities, but in light of the xz exploit (whereby you have a repo over there that ostensibly corresponds to the package published right here, but with the latter actually comprising a different payload of runnable code), it's not inconceivable that an attacker would take advantage of the behavior between different implementations to level up the obfuscation/misdirection and evade detection for longer.
(FWIW, I regarded at the time (and still regard) the hoopla around the PyPI/Astral blog posts as a tad overblown, with the purported threat vague at best, especially where the claims about the ambiguity of the ZIP format that are at the crux of the issue are already dubious. On the latter point, it's nice that the authors of the USENIX paper contrast implementations that use the "standard" method with those that do otherwise.)
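For a concrete feel of the scanner-versus-`zipfile` mismatch quoted above, here is a minimal sketch using duplicate entry names. It assumes a hypothetical scanner that trusts the first entry with a given name; `read_first_wins` plays that role, while `read_last_wins` is just `ZipFile.read` by name, which in CPython resolves a duplicated name to the entry listed last. The file name and payloads are invented for illustration.

```python
import io
import warnings
import zipfile


def build_duplicate_zip() -> bytes:
    """Create an archive containing two entries with the same name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        # zipfile warns about the duplicate name but still writes it.
        warnings.simplefilter("ignore", UserWarning)
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("setup.py", b"print('benign')")
            zf.writestr("setup.py", b"print('malicious')")
    return buf.getvalue()


def read_last_wins(raw: bytes, name: str) -> bytes:
    """Lookup by name: CPython's zipfile returns the last matching entry."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.read(name)


def read_first_wins(raw: bytes, name: str) -> bytes:
    """Hypothetical scanner behavior: take the first matching entry."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        for info in zf.infolist():
            if info.filename == name:
                return zf.read(info)
    raise KeyError(name)


if __name__ == "__main__":
    raw = build_duplicate_zip()
    print(read_first_wins(raw, "setup.py"))  # b"print('benign')"
    print(read_last_wins(raw, "setup.py"))   # b"print('malicious')"
```

Neither policy is obviously wrong on its own; the problem only appears when the tool that inspects the archive and the tool that installs from it pick different policies.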
https://github.com/ronomon/pure
2. It's one of the libraries that the authors of the paper cited and subjected to testing. It's column/row 31—the one that is the source of the prominent vertical/horizontal bands in Table 4 (on p. 450 aka p. 21)
2. I see, thanks.
(HN obscures the end of the URL; I assumed it was Ronomon's ZIP library. The 2 in my comment also applies to that library.)
> We summarize our findings as 14 distinct parsing ambiguity types in three categories with detailed analysis, systematizing current knowledge and uncovering 10 types of new parsing ambiguities.
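As a rough illustration of what a parsing ambiguity can look like (not necessarily one of the paper's 14 types as the authors define them), the sketch below makes the file name in the local file header disagree with the name in the central directory. The names `safe.txt` and `evil.txt` are invented, and `first_local_header_name` is a hypothetical stand-in for a streaming parser that reads local headers front to back instead of consulting the central directory.

```python
import io
import struct
import zipfile

LOCAL_HEADER_SIG = 0x04034B50  # "PK\x03\x04" read as a little-endian uint32


def build_mismatched_zip() -> bytes:
    """Write a normal archive, then patch only the local-header file name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("safe.txt", b"hello")
    raw = bytearray(buf.getvalue())
    # The name-length field sits at offset 26 of the local header; the name
    # itself starts at offset 30, right after the fixed-size fields.
    (name_len,) = struct.unpack_from("<H", raw, 26)
    assert name_len == len(b"evil.txt")  # same length keeps all offsets valid
    raw[30:30 + name_len] = b"evil.txt"
    return bytes(raw)


def central_directory_names(raw: bytes) -> list:
    """Names as seen by a central-directory-based parser (Python's zipfile)."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        return zf.namelist()


def first_local_header_name(raw: bytes) -> str:
    """Name as seen by a hypothetical parser that reads the first local header."""
    (sig,) = struct.unpack_from("<I", raw, 0)
    if sig != LOCAL_HEADER_SIG:
        raise ValueError("no local file header at offset 0")
    (name_len,) = struct.unpack_from("<H", raw, 26)
    return raw[30:30 + name_len].decode("ascii")


if __name__ == "__main__":
    raw = build_mismatched_zip()
    print(central_directory_names(raw))   # ['safe.txt']
    print(first_local_header_name(raw))   # evil.txt
```

Two parsers that each follow a defensible reading strategy report different names for the same bytes, which is the kind of gap a differential attack needs.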
https://www.saurik.com/masterkey1.html
https://www.saurik.com/masterkey2.html
https://www.saurik.com/masterkey3.html
There used to be a .png image that displayed totally different content in Safari, Firefox, and IE.
Also, as the lead author's name is spelled the same as an English pronoun, we can anticipate natural language parsing ambiguities from writing about this research in English prose! For example, "You discovered that there are many opportunities for parser differentials due to the underspecified nature of the ZIP format" or "You described a practical method of bypassing plagiarism detectors and several other kinds of file content scanners".
Actually, I'm tempted to propose that for the April Fool's Did You Know? on Wikipedia next year. "Did you know ... that You won a Usenix Security award for finding ways to construct ambiguous texts?"