Understanding Deflate

Posted4 months agoActive4 months ago

ingve

83 points

7 comments

jjrscott.comTechstory

calmpositive

Debate

20/100

Compression AlgorithmsDeflateHuffman Coding

Key topics

Compression Algorithms

Deflate

Huffman Coding

The article 'Understanding Deflate' breaks down the DEFLATE compression algorithm, with commenters providing additional insights and context on its implementation and related techniques.

Snapshot generated from the HN discussion

Discussion Activity

Light discussion

First comment

Peak period

78-84h

Avg / period

2.3

Key moments

01Story posted
Sep 14, 2025 at 3:12 PM EDT
4 months ago
Step 01
02First comment
Sep 17, 2025 at 9:07 PM EDT
3d after posting
Step 02
03Peak activity
5 comments in 78-84h
Hottest window of the conversation
Step 03
04Latest activity
Sep 18, 2025 at 4:38 AM EDT
4 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (7 comments)

Showing 7 comments

Dylan16807

4 months ago

2 replies

> That said, even in this simple case, decoding by hand was a pain.

Well doing it by hand at this level is mostly decoding quasi-ASCII by hand, and the compression isn't making it significantly more painful.

If you reorder the static huffman table and byte align things, the encoded data could look more like:

  T O B E O R N O T T 0x85 0x09 T 0x88 0x0F 0xFF

So short lengths become 0x80 + length and short distances encode as themselves.

This version is pretty easy to decode by hand, and still 16 bytes. I'm arguably cheating by removing the 3 bit header, but even 17 bytes would be good.

Though the encoder deciding to repeat those Ts doesn't make much sense, shouldn't this be compressed to literal("TOBEORNOT") copy(6,9) copy(9,15)?

ack_complete

4 months ago

1 reply

Many Deflate encoders, including zlib/gzip, are greedy encoders that work forwards, either for speed or to support streaming compression. They encode runs as they are found scanning forward, with limited lookahead to try to allow longer matches to preempt immediately preceding shorter matches. There is an "optimal parse" strategy that maximizes the runs found by processing the entire file backwards.

If you repack the plaintext using zopfli, it does encode the text as you suggest.

Dylan16807

4 months ago

1 reply

Being greedy is fine here, though. It's the exact same match but somehow cut one byte short.

Also I just tested something. If you stick a random extra letter onto the start of the string, the mistake goes away and the output shrinks by a byte. Is it possibly an issue with finding matches that start at byte 0...?

More testing: Removing the Ts that fail to match to get TOBEORNOTOBEOROBEORNOT shrinks the output by 2 bytes. Changing the first character to ZOBEORNOTOBEOROBEORNOT stays at the same shrunken size. Then removing that Z to get OBEORNOTOBEOROBEORNOT makes it balloon back up as now it fails to match the O twice.

If I take the original string and start prepending unique letters, the first one shrinks and fixes the matching, and then each subsequent letter adds one byte. So it's not that matching needs some particular alignment, it's only failing to match the very first byte.

A new test: XPPPPPPPP appears to encode as X, P, then a copy command. And PPPPPPPPP encodes as P, P, then a copy. Super wrong.

ack_complete

4 months ago

Interesting, I can also reproduce this. I wonder if it's an artifact of zlib's sliding window setup. The odd part is that if I try various libs with advzip, both the libdeflate and 7z modes show the same artifact, only the zopfli mode is able to avoid it. Doesn't seem to be a format violation as gzip -t doesn't complain about a copy at position 0.

zmodem

4 months ago

> Though the encoder deciding to repeat those Ts doesn't make much sense, shouldn't this be compressed to literal("TOBEORNOT") copy(6,9) copy(9,15)?

IIRC zlib won't consider matches at offset 0: https://github.com/madler/zlib/blob/5a82f71ed1dfc0bec044d970...

lifthrasiir

4 months ago

Not illustrated here, but the vast majority of deflated data uses a dynamic Huffman code which compresses the Huffman tree itself with another (fixed) Huffman tree. This sort of nested compression is in fact moderately popular; JPEG XL for example uses the same technique.

userbinator

4 months ago

but I think the bits must be the opposite "endian" here, ie 2 instead of one

I remember asking the zlib developers many years ago why DEFLATE's bit order was unusual in that it made table-based decoding slightly more complex than necessary, and was told that only Phil Katz knew.

Also of note is that LHA/LZH uses an extremely similar data format where Huffman is also used to encode both literals and lengths in one "alphabet": http://justsolve.archiveteam.org/wiki/LHA Given that LHA very slightly predated ZIP, I suspect the latter was developed based on the former's algorithms.

View full discussion on Hacker News

ID: 45242434Type: storyLast synced: 11/20/2025, 3:32:02 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN