My Zip Isn't Your Zip: Identifying and Exploiting Semantic Gaps Between Parsers

Posted 5 months ago · Active 4 months ago

layer8

67 points

33 comments

usenix.org · Tech · story

Key topics

Security

Parser Differentials

Zip File Format

Vulnerability Research

https://www.usenix.org/system/files/usenixsecurity25-you.pdf

A research paper identifies and exploits semantic gaps between different ZIP file parsers, highlighting potential security vulnerabilities, and sparking discussion on the implications and potential mitigations.

Snapshot generated from the HN discussion

Discussion Activity

Active discussion. Peak period: 84-96h. Avg / period: 8.3. Comment distribution: 33 data points, based on 33 loaded comments.

Key moments

01 Story posted: Aug 21, 2025 at 4:58 AM EDT
02 First comment: Aug 24, 2025 at 6:59 PM EDT (4d after posting)
03 Peak activity: 20 comments in 84-96h (hottest window of the conversation)
04 Latest activity: Aug 26, 2025 at 10:39 AM EDT

Discussion (33 comments)

hinkley

4 months ago

3 replies

Maybe an argument to use zlib consistently.

aaviator42

4 months ago

2 replies

An argument for a better defined file format specification perhaps, but I don't think it's necessarily a good thing for everyone to use or have to use the same implementation.

Muromec

4 months ago

2 replies

If everyone has the same parser, whole classes of bugs just stop being exploitable. The classic one is a parser at the edge that validates something, while another further down the line sees a different result, one it expects to have been rejected during validation.

Both parsers could be buggy, but when they have different kinds of bugs, you get a zero-click undetectable exploit.

woodruffw

4 months ago

1 reply

I don’t think it’s this simple: you can still produce observable differentials with a single parser by using different options within that parser in different places. The ZIP format itself affords ample opportunities for that.
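
A minimal sketch of the validator/consumer split discussed above, using a single library (Python's `zipfile`) in two different ways. The entry name `payload.txt` and the `validator_view` / `consumer_view` helpers are hypothetical, chosen only for illustration; duplicate entry names are just one of many possible ambiguities and are not necessarily one the paper highlights.

```python
import io
import warnings
import zipfile


def build_ambiguous_zip() -> bytes:
    """Return a ZIP that holds two entries with the same name."""
    buf = io.BytesIO()
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # zipfile warns about duplicate names
        with zipfile.ZipFile(buf, "w") as zf:
            zf.writestr("payload.txt", "benign")
            zf.writestr("payload.txt", "malicious")
    return buf.getvalue()


def validator_view(data: bytes) -> bytes:
    """Hypothetical edge check: inspects the first entry with a matching name."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        first = next(i for i in zf.infolist() if i.filename == "payload.txt")
        return zf.read(first)


def consumer_view(data: bytes) -> bytes:
    """Hypothetical downstream code: looks the entry up by name."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.read("payload.txt")


archive = build_ambiguous_zip()
print("validator sees:", validator_view(archive))
print("consumer sees: ", consumer_view(archive))
```

The two helpers disagree because a by-name lookup in `zipfile` resolves to the last entry recorded under that name, while iterating the entry list reaches the first one. No second parser is needed to produce the differential.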

hinkley

4 months ago

1 reply

The settings are at encode time. For two readers the results should be unambiguous.

woodruffw

4 months ago

1 reply

There are plenty of decode-time knobs, even within a single ZIP parser. Here are just a few you could set while using libzip[1].

[1]: https://libzip.org/documentation/zip_open.html#DESCRIPTION

hinkley

4 months ago

1 reply

That’s not a lot of settings, and that’s libzip, which is not zlib.

woodruffw

4 months ago

Differentials are oracular; you only need one bit. And I’m not claiming it’s in zlib, since zlib isn’t a ZIP library. TFA here is about ZIP differentials, not differentials in DEFLATE stream parsers.

aaviator42

4 months ago

It significantly increases the attack surfaces of bugs that do exist in the parser if the same implementation is used everywhere.

socalgal2

4 months ago

1 reply

As someone who works on specs that are shared across different organizations' implementations, you can write all the specs you want but no conformance tests = no conformance.

aaviator42

4 months ago

A good point! Conformance tests seem like a great idea to me to go along with specs.

blibble

4 months ago

1 reply

zlib (deflate) is just the compression type usually (not always) used in zips

zip is the container around it
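
To make the container/codec distinction concrete, here is a small sketch that writes a ZIP with Python's `zipfile`, locates one entry's compressed bytes by hand, and decodes them with `zlib` alone. The file name `note.txt` and the sample text are assumptions for the example.

```python
import io
import struct
import zipfile
import zlib

original = b"zip is the container, deflate is the codec " * 20

# Build a one-entry archive compressed with DEFLATE.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("note.txt", original)
data = buf.getvalue()

# Use the central directory only to learn where the entry lives.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    info = zf.getinfo("note.txt")

# Local file header: 30 fixed bytes, then the file name, then the extra field.
# The two 16-bit length fields sit at offsets 26 and 28 of the header.
name_len, extra_len = struct.unpack_from("<HH", data, info.header_offset + 26)
start = info.header_offset + 30 + name_len + extra_len
raw_deflate = data[start:start + info.compress_size]

# Negative wbits tells zlib to expect a raw DEFLATE stream (no zlib header).
recovered = zlib.decompress(raw_deflate, -15)
print(recovered == original)
```

Everything outside `raw_deflate` (headers, names, offsets, the central directory) is ZIP container metadata that zlib's DEFLATE decoder never sees, which is where the ambiguities under discussion live.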

pdw

4 months ago

zlib comes with a basic ZIP implementation (libminizip).

woodruffw

4 months ago

Unless, of course, the differential occurs between versions of zlib. I think the bigger problem here is that ZIP is just not a very well defined format.

actionfromafar

4 months ago

1 reply

Tampering with signed binaries sounds pretty serious

tptacek

4 months ago

1 reply

It depends on how they're signed. A signature format that works on individual objects inside of an archive, rather than on a whole signed archive, seems crazy. In this case, it's a JAR file loader; doesn't seem like that big a deal?

layer8 (Author)

4 months ago

If you want to have the archive contain the signature, you can’t sign the whole archive. Signed documents (docx, odf) work that way.
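
A rough sketch of why an embedded signature has to cover the entries rather than the archive bytes: the signature entry must be excluded from what is hashed. This is not how JAR, docx, or ODF signing actually works; the entry name `META-INF/DEMO.SIG`, the `KEY`, and the use of HMAC as a stand-in for a real signature are all assumptions made for illustration.

```python
import hashlib
import hmac
import io
import zipfile

SIG_NAME = "META-INF/DEMO.SIG"  # hypothetical name for the embedded signature entry
KEY = b"demo-key"               # stand-in for a real signing key


def content_digest(zf: zipfile.ZipFile) -> bytes:
    """Hash every entry's name and bytes, skipping the signature entry itself."""
    h = hashlib.sha256()
    for info in sorted(zf.infolist(), key=lambda i: i.filename):
        if info.filename == SIG_NAME:
            continue
        h.update(info.filename.encode("utf-8") + b"\0")
        h.update(hashlib.sha256(zf.read(info)).digest())
    return h.digest()


def sign(data: bytes) -> bytes:
    """Append a signature entry computed over the other entries."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        tag = hmac.new(KEY, content_digest(zf), hashlib.sha256).hexdigest()
    out = io.BytesIO(data)
    with zipfile.ZipFile(out, "a") as zf:
        zf.writestr(SIG_NAME, tag)
    return out.getvalue()


def verify(data: bytes) -> bool:
    """Recompute the digest and compare it with the embedded signature entry."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        expected = hmac.new(KEY, content_digest(zf), hashlib.sha256).hexdigest()
        return hmac.compare_digest(zf.read(SIG_NAME).decode("ascii"), expected)


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("doc.xml", "<doc/>")
signed = sign(buf.getvalue())
print(verify(signed))
```

Note that `content_digest` hashes the entries as this particular parser enumerates them. A loader that enumerates entries differently could accept the signature while using other content, which is the kind of gap the thread is about.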

tptacek

4 months ago

2 replies

This is a really good paper that reaches a bunch of fun conclusions, but to my eyes the practical findings are kind of marginal --- you can defeat an AV scanner, but you could already defeat AV scanners; you can defeat plagiarism-detectors, but you could already defeat plagiarism-detectors; you can package a malicious Java class in a benign-looking JAR, but that attack presumes you're convincing a target to load a JAR file you control.

The one legit-practical attack I see is the one where they trick the VS Code Extension marketplace into serving extensions with trusted publishers, but even there I'm struck by the fact that the security model for verifying extensions would depend on ZIP metadata.

I do not at all mean to talk this work down; this is my favorite species of vulnerability research, and I can see why it did well at Usenix Security.

layer8 (Author)

4 months ago

The attack vector for publishing extensions existed for Firefox (and was fixed): https://bugzilla.mozilla.org/show_bug.cgi?id=1534483

FreakLegion

4 months ago

It's a decent systematic look at something people have been doing ad hoc for a long time. In 2010 or so I realized:

1. Authenticode signatures have unauthenticated sections.

2. ZIP files don't require headers.

So you can shove a ZIP file (i.e. JAR, DOCM, APK, etc.) into a signed Windows executable without breaking its signature, and then depending on the extension it will do any number of things when clicked.

(The extent to which this works has changed a lot in the intervening years, but prior to a patch in 2013 it was especially bad, and the patches never made their way into the spec, so custom Authenticode validators like Wine's or, say, the one in Palo Alto Networks gear, were still vulnerable the last time I checked.)
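
The "ZIP files don't require headers" point can be sketched with Python's `zipfile`, which locates the archive from its end-of-central-directory record rather than from the start of the file. The `host` bytes below are a placeholder, not a real or signed executable, and `hello.txt` is a made-up entry name.

```python
import io
import zipfile

# Build a small, ordinary archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hi from the archive")
zip_bytes = buf.getvalue()

# Stand-in for a host file; these bytes are NOT a real signed executable.
host = b"MZ" + b"\x00" * 510
polyglot = host + zip_bytes

# The combined blob no longer starts with a ZIP signature, yet it still opens.
with zipfile.ZipFile(io.BytesIO(polyglot)) as zf:
    names = zf.namelist()
    content = zf.read("hello.txt")

print(polyglot[:2], names, content)
```

Whether a given tool treats such a blob as an executable, an archive, or both depends on the tool, which is what makes the combination with hash-based caching described below interesting.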

Anyway, at the same time:

1. Cybersecurity products lean on Authenticode to keep false positives down for specific publishers.

2. Those same products cache everything by hash without regard for file type.

Put all of this together and you could, as of 2020 at least, not only execute whatever you wanted, you could also have it misreported by CrowdStrike or whoever as a signed Windows component.

Fun stuff, but I agree that it's kind of marginal.

captn3m0

4 months ago

1 reply

Also related to ZIP parsing differentials, recently reported and fixed at PyPI: https://blog.pypi.org/posts/2025-08-07-wheel-archive-confusi...

tptacek

4 months ago

1 reply

It's good to see stuff like this getting found and fixed, but let me ask: given how the Python packaging ecosystem works, what is the practical scenario in which this would be exploitable?

cxr

4 months ago

1 reply

woodruffw writes in the corresponding HN thread:

> security scanners are a simple example, but Linux distros, Homebrew, etc. all also process Python package distributions in ways that mostly just assume a ZIP container, without additionally trying to exactly match how Python's `zipfile` behaves

<https://news.ycombinator.com/item?id=44829881>

This doesn't necessarily unlock any new capabilities, but in light of the xz exploit (whereby you have a repo over there that ostensibly corresponds to the package published right here, but with the latter actually comprising a different payload of runnable code), it's not inconceivable that an attacker would take advantage of the behavior between different implementations to level up the obfuscation/misdirection and evade detection for longer.

(FWIW I regarded at the time (and still regard) the hoopla around the PyPI/Astral blog posts a tad overblown, with the purported threat vague at best—especially where the claims about the ambiguity of the ZIP format that are at the crux of the issue are already dubious. On the latter point, it's nice that the authors of the USENIX paper contrast between implementations that use the "standard" method versus otherwise.)

tptacek

4 months ago

I actually talked to 'woodruffw just before writing that comment. :)

pabs3

4 months ago

1 reply

A linter for zip files that can probably detect some of these:

https://github.com/ronomon/pure

cxr

4 months ago

1 reply

1. Describing this as a "linter for zip files" is kind of weird—this library is a full-on ZIP implementation that is meant to be used for the kinds of things people use any sort of ZIP library for.

2. It's one of the libraries that the authors of the paper cited and subjected to testing. It's column/row 31—the one that is the source of the prominent vertical/horizontal bands in Table 4 (on p. 450 aka p. 21)

pabs3

4 months ago

1 reply

1. I think you are thinking of https://github.com/ronomon/zip? The description for pure says it is a static analysis tool for zip files. That makes it a linter in my book.

2. I see, thanks.

cxr

4 months ago

Yes, I was wrong.

(HN obscures the end of the URL; I assumed it was Ronomon's ZIP library. The 2 in my comment also applies to that library.)

o11c

4 months ago

Key line from the abstract, since zip parser differences in general are old news:

> We summarize our findings as 14 distinct parsing ambiguity types in three categories with detailed analysis, systematizing current knowledge and uncovering 10 types of new parsing ambiguities.

saurik

4 months ago

I'm cited on the first page of this paper (reference 20) for my work on the Android Master Key vulnerability (which I didn't find, to be clear, but I did most of the exploitation people saw), and, while this paper looks AWESOME (and I'm very excited to read it in detail), if you are interested in this concept but feel you need something a bit more concrete--maybe with diagrams and some hand-holding--to understand what is going on, I will recommend my series of articles on Master Key as an introduction.

https://www.saurik.com/masterkey1.html

https://www.saurik.com/masterkey2.html

https://www.saurik.com/masterkey3.html

pixl97

4 months ago

Zip is a fun minefield across different OSes, libraries, and ages of system. Zip64 is a fun one: I've seen companies forget to test it and end up with data loss on archives with over 65535 files when interacting with more modern systems. There are so many things you need to test that going with some other compression format without these pitfalls is your best choice, if possible.
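
One plausible mechanism behind the 65535-file problem, sketched with Python's `zipfile`: the classic end-of-central-directory record stores the entry count in a 16-bit field, so a reader that ignores the Zip64 records sees a capped count. The entry names are arbitrary, and the script creates 65,536 empty entries in memory, so it takes a moment to run.

```python
import io
import struct
import zipfile

ENTRY_COUNT = 65536  # one more than a 16-bit field can hold

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:  # allowZip64 defaults to True
    for i in range(ENTRY_COUNT):
        zf.writestr(f"f{i}", b"")
data = buf.getvalue()

# The archive has no comment, so the classic end-of-central-directory
# record is exactly the final 22 bytes of the file.
(sig, _disk, _cd_disk, _entries_on_disk,
 total_entries, _cd_size, _cd_offset, _comment_len) = struct.unpack(
    "<4s4H2LH", data[-22:]
)

# A Zip64-aware reader consults the Zip64 records and finds every entry.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    zip64_aware_count = len(zf.namelist())

print("16-bit field says:", total_entries)
print("Zip64-aware reader says:", zip64_aware_count)
```

A tool that trusts only the 16-bit field and a tool that follows the Zip64 records will disagree about how many files the archive contains, so a round trip through the older tool can silently drop entries.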

est

4 months ago

IIRC similar attacks exist on DEFLATE

There used to be a .png picture that displayed totally different content on Safari/Firefox/IE.

schoen

4 months ago

This is great. It feels like a central example of the phenomenon of parser differentials (and nice use of tools to find them more efficiently).

Also, as the lead author's name is spelled the same as an English pronoun, we can anticipate natural language parsing ambiguities from writing about this research in English prose! For example, "You discovered that there are many opportunities for parser differentials due to the underspecified nature of the ZIP format" or "You described a practical method of bypassing plagiarism detectors and several other kinds of file content scanners".

Actually, I'm tempted to propose that for the April Fool's Did You Know? on Wikipedia next year. "Did you know ... that You won a Usenix Security award for finding ways to construct ambiguous texts?"

ID: 44970583 · Type: story · Last synced: 11/20/2025, 2:43:43 PM
