The Burrows-Wheeler Transform
Posted 3 months ago · Active 3 months ago
sandbox.bio · Tech story
Key topics
Algorithms
Data Compression
Bioinformatics
The Burrows-Wheeler Transform is a fascinating algorithm for data compression and search, with an engaging interactive explanation that sparked a supportive and informative discussion among HN users.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1h after posting
Peak period: 26 comments in 0-12h
Avg / period: 6.7
Comment distribution: 40 data points (based on 40 loaded comments)
Key moments
1. Story posted: Oct 9, 2025 at 4:00 PM EDT (3 months ago)
2. First comment: Oct 9, 2025 at 5:29 PM EDT (1h after posting)
3. Peak activity: 26 comments in 0-12h, the hottest window of the conversation
4. Latest activity: Oct 14, 2025 at 9:09 AM EDT (3 months ago)
ID: 45532352 · Type: story · Last synced: 11/20/2025, 2:52:47 PM
Once upon a time I found one that did the entire round trip, but I neglected to bookmark it. So when I eventually forgot how it worked, or wanted to show others, I was stuck. I never did find it again.
1. Put the BWT string in the right-most empty column
2. Sort the rows of the matrix so that the rows, read from the top down, are in lexicographical order????
3. Repeat steps 1 and 2 until the matrix is full
4. Extract the row of the matrix that has the end-delimiter in the final column
It's the "sort matrix" step that seems under-explained to me.
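For what it's worth, the four steps can be sketched in a few lines (a toy illustration, assuming a unique "$" end marker; this uses the equivalent prepend-then-sort formulation, where each pass adds the BWT string as a new left-most column):

```python
def inverse_bwt_naive(bwt, eos="$"):
    # One pass per column: glue the BWT string onto the left of the
    # current rows, then re-sort the rows lexicographically.
    rows = [""] * len(bwt)
    for _ in range(len(bwt)):
        rows = sorted(c + row for c, row in zip(bwt, rows))
    # The original string is the row ending with the end marker.
    return next(row for row in rows if row.endswith(eos))
```

For example, `inverse_bwt_naive("annb$aa")` recovers `"banana$"` (where `"annb$aa"` is the BWT of `"banana$"`). Quadratic-plus in time and space, of course, but it makes the sort step explicit.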
1. We sorted the first column to get the BWT column. Thereby we created the structure where the BWT column comes before the sorted column.
2. Therefore if we insert the BWT column before a sorted column, the order of row elements is preserved
3. If we now sort again, the order of characters across individual rows is again preserved
4. Going to step 2 again preserves row order
5. Once all columns are populated, it follows that every row holds its characters in the correct original (cyclic) order. And thanks to the end marker, we can recover the original string from any row.
The seemingly magical part is that it seems like you "lose" information during the transformation. After all, if I give you the sorted string, it's not recoverable. But the key is that the first and last columns of the BWT matrix end up being the lookup table from any character to the preceding character. And of course, given the BWT encoded string, you can get back the sorted string, which is the first column. And with this info of which character precedes which character, it shouldn't be too magical that you can work backwards to reconstruct the original string.
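A sketch of that backward walk, the usual "LF mapping" inversion (my own illustrative code, not from the page, assuming the end marker sorts before every other character):

```python
from collections import Counter

def inverse_bwt_lf(last):
    n = len(last)
    # c_table[c]: number of characters in the string strictly smaller than c.
    counts, c_table, total = Counter(last), {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # lf[i]: row of the sorted matrix whose first character is this exact
    # occurrence of last[i] (same rank among equal characters).
    seen, lf = Counter(), []
    for ch in last:
        lf.append(c_table[ch] + seen[ch])
        seen[ch] += 1
    # Row 0 starts with the end marker; each LF step jumps to the row of the
    # preceding character, so this emits the text backwards.
    out, i = [], 0
    for _ in range(n):
        out.append(last[i])
        i = lf[i]
    text = "".join(reversed(out))  # the marker ends up at the front
    return text[1:] + text[0]      # rotate it back to the end
```

This runs in linear time after the two counting passes, which is roughly the shape of the inversion in the original paper.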
Still, the really clever part is that the BWT encoded string embeds the decoding key directly into the character and positional information. And it just so happens that the way it does this captures character pair dependencies in a way that makes it a good preprocessor for "symbol at a time" encoders.
https://news.ycombinator.com/item?id=35738839 has a nice, pithy way of phrasing it:
>It sorts letters in a text by their [surrounding] context, so that letters with similar context appear close to each other. In a strange vaguely DFT-like way, it switches long-distance and short-distance patterns around. The result is, in a typical text, long runs of the same letter, which can be easily compressed.
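That run-producing effect is easy to see with a naive forward transform (demonstration only; practical implementations build a suffix array rather than materializing every rotation):

```python
def bwt_naive(s, eos="$"):
    # Append a unique end marker, sort all rotations of the string,
    # and read off the last column.
    s += eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt_naive("abracadabra"))  # ard$rcaaaabb: equal letters cluster into runs
```

Letters that share a following context ("b" and "d" before "ra", the "a"s before "b"/"c"/"d") land next to each other, which is exactly what a run-length or move-to-front stage wants.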
The author later wrote many now-famous articles, including A Trip Through the Graphics Pipeline.
By chance I once sat near him and the rest of Farbrausch at a demoparty, but I was too shy to say hi.
Now I'm wondering: in the era of software 2.0, everything is figured out by AI. What are the chances AI would discover this algorithm at all?
This doesn't follow the original paper very closely, although I think the device of putting an end-of-string marker into the data instead of including the start index as a separate datum is an improvement over the original algorithm. My own "explorable explanation" of the algorithm, at http://canonical.org/~kragen/sw/bwt, is visually uglier, and is very similar to this page in many ways, but it may still be of some interest: it follows the original paper more closely, which is likely to be useful if you're trying to understand the paper. Also, I used an inverse-transform algorithm (from the original paper) that was efficient enough that you could imagine actually using it in practice, while still being easy to explain—but, like the page linked, I didn't use an efficient suffix-array-construction algorithm to do the forward transform. I did link to the three that have been discovered.
I forget when I did this but apparently sometime around 02010: https://web.archive.org/web/20100616075622/http://canonical....
I think my JS code in that page is a lot more readable than the minified app.js served up with this page.