Removing Newlines in Fasta File Increases Zstd Compression Ratio by 10x
Posted 4 months ago · Active 4 months ago · log.bede.im
Key topics: Data Compression, Bioinformatics, File Formats
The article discusses how removing newlines from FASTA files significantly improves ZSTD compression ratios, sparking a discussion on data compression techniques, file formats, and bioinformatics.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion: 113 comments loaded. First comment after 42 minutes; peak period of 63 comments in the 60-72h window; average of 12.6 comments per period.
Key moments
- 01Story posted
Sep 12, 2025 at 12:25 PM EDT
4 months ago
Step 01 - 02First comment
Sep 12, 2025 at 1:07 PM EDT
42m after posting
Step 02 - 03Peak activity
63 comments in 60-72h
Hottest window of the conversation
Step 03 - 04Latest activity
Sep 18, 2025 at 6:16 AM EDT
4 months ago
Step 04
Anyway, BWT-based compressors like Bzip2 do a good job on "repetition, but with random differences". Better than LZ-based compressors. However, they are not competitive on speed, and it's gotten relatively worse as computers got faster since the Burrows-Wheeler transform can't be parallelized very well and is inherently cache-unfriendly.
I really don't see how better supporting this "second order repetition" feature in the encoding would cause such a big problem. LZ variants already track repeating strings.
Depending on what you need to represent, you can get a 4x reduction in data size without compression at all, just by representing each base (G, A, T, C) with 2 bits rather than 8.
Compression on top of that "should" result in the same compressed size as the original text (after all, the "information" being compressed is the same), except that compression isn't perfect.
Newlines are an example of something that's "information" in the text format that isn't relevant, yet the compression scheme didn't know that.
[1] Many of the context models in a typical PPM compressor will be byte-by-byte, so even that isn't fully clear-cut.
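As a minimal sketch of that 2-bit packing (assuming only unambiguous A/C/G/T bases; real data with N or IUPAC ambiguity codes would need an escape mechanism, and the helper names here are made up for illustration):

```python
# Pack an A/C/G/T string into 2 bits per base, 4 bases per byte.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq: str) -> bytes:
    out = bytearray((len(seq) + 3) // 4)
    for i, ch in enumerate(seq):
        out[i // 4] |= CODE[ch] << (2 * (i % 4))
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(length))

assert unpack(pack("GATTACA"), 7) == "GATTACA"  # round-trips at 2 bits/base
```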
> Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses
Have any of you tested it?
Or another way to look at this is that the total bytes saved by the dictionary will plateau. So your dictionary may save 50% of the first MB, 10% of the next MB and 5% of the rest of the first 10MB. It matters a lot if you are compressing 2MB of data (7% savings!) but not so much if you are compressing 1GB (<1%).
PS: what the author implicitly suggests cannot be replaced with zstd tweaks. It'll be interesting to look at the file in ImHex, especially if I can find an existing pattern file.
You always need resource limits when dealing with untrusted data, and RAM is one of the obvious ones. They could introduce a memory limit parameter; require passing --long with a value equal to or greater than what the stream requires to successfully decompress; require seeking support for the input stream so they can look back that way (TMTO); fall back to using temp files; or interactively prompt the user if there's a terminal attached. Lots of options, each with pros and cons of course, that would all allow the required information for the decoder to be stored in the compressed data file.
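As one concrete illustration of the memory-limit option, the python-zstandard bindings already expose a cap on the window size a frame may demand; a sketch (the 128 MiB limit is an arbitrary choice for this example):

```python
import zstandard as zstd

MAX_WINDOW = 1 << 27  # 128 MiB cap, chosen arbitrarily for illustration

def cautious_decompress(path: str) -> bytes:
    # Frames declaring a window larger than MAX_WINDOW are rejected instead of
    # allocating whatever the untrusted stream asks for.
    dctx = zstd.ZstdDecompressor(max_window_size=MAX_WINDOW)
    with open(path, "rb") as fh, dctx.stream_reader(fh) as reader:
        return reader.read()  # raises zstd.ZstdError if the limit is exceeded
```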
Seeking back in the input might theoretically work, but I feel like that could easily get very bad (aka exponential runtime); never mind needing actual seeking.
[1] https://datatracker.ietf.org/doc/html/rfc8878#name-window-de...
Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
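Roughly the kind of conversion described above, sketched with pyarrow (the column names and the choice of zstd as the Parquet codec are assumptions for illustration, not any standard):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def records_to_parquet(records, parquet_path):
    """records: iterable of (header, sequence) pairs already split out of FASTA."""
    ids, seqs = [], []
    for header, seq in records:
        ids.append(header)
        seqs.append(seq)
    # Headers and sequences live in separate columns, so the sequence column is
    # newline-free and column-level compression sees contiguous bases.
    table = pa.table({"id": ids, "sequence": seqs})
    pq.write_table(table, parquet_path, compression="zstd")
```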
It feels like benzene in some ways. Use it correctly and, goddamn. Just don't huff it; I mean, use it for your enterprise backend, and it's worth it.
> Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I even confused myself about this while writing :-)
For me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data); in the article it's a fixed value every 60 bytes.
Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issues.
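That "store the difference" check is easy to prototype; a sketch for the timestamp case described above (the data layout is hypothetical):

```python
def delta_encode(values):
    # Mostly-consecutive timestamps become long runs of small numbers
    # (1, 1, 1, ..., gap, 1, ...), which generic compressors handle much
    # better than the raw absolute values.
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out

def delta_decode(deltas):
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

ts = [1700000000, 1700000001, 1700000002, 1700000060, 1700000061]
assert delta_decode(delta_encode(ts)) == ts
```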
Bzip2 performed much better for me but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but decompressing is also an exercise in patience, so I've moved to Zstandard as the standard thing to use.
But Bzip2 is also a pretty bad BWT-based compressor. Not only does it use block sizes from a time when 8 MB of memory was a lot, it also does silly things that don't help compression at all.
> "intentionally or not, bioinformatics found a way to survive: obfuscation. By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques"
1. https://madhadron.com/science/farewell_to_bioinformatics.htm...
- The original index format could not handle large chromosomes, so now there are two index formats: .bai and .csi
- For BAM, the CIGAR (alignment description) operation count is limited to 16 bits, which means that very long alignments cannot be represented. One workaround I've seen (but thankfully not used) is saving the CIGAR as a string in a tag
- SAM cannot unambiguously represent sequences with only a single base (e.g. after trimming), since a '*' in the quality column can be interpreted either as a single Phred score (9) or as a special value meaning "no qualities". BAM can represent such sequences unambiguously, but most tools output SAM
A parser to stream FASTA can be written in like 30 lines [0], much easier than say CSV where the edge cases can get hairy.
If you need something like fast random reads, use the FAIDX format [1], or even better just store it in an LMDB or SQLite embedded db.
People forget FASTA dates from 1985, and it sticks around because (1) it's easy to parse and write, and (2) we have mountains of sequences in that format going back four decades.
[0] https://gist.github.com/jszym/9860a2671dabb45424f2673a49e4b5...
[1] https://seqan.readthedocs.io/en/main/Tutorial/InputOutput/In...
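For illustration, a stream parser along those lines (a sketch, not the linked gist; header parsing and error handling are kept minimal):

```python
def read_fasta(handle):
    """Lazily yield (header, sequence) tuples from a FASTA file-like object."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# with open("genome.fa") as fh:
#     for name, seq in read_fasta(fh):
#         print(name, len(seq))
```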
A binary format with a tool that renders it to text works the same as a text format; if the rendering is lossless, you could even consume the text format rather than the binary.
A "text" format is built to be understandable, but that's not a requirement; you could write a text format that isn't descriptive, and you'd have just as much trouble understanding what 'A' means as you would understanding what 'C0' means for a binary format.
Undocumented formats are a pain, whether they're in text or binary.
It's a lack of tooling problem. Because if you're a bioinformatics researcher, you want to devote your time, money and energy towards bioinformatics. You don't want to spend weeks getting tooling written to handle an arcane file format, nor pay for that tooling, nor hire a "brilliant" programmer. That tooling needs to be written, packaged and maintained for perhaps dozens of programming languages.
Instead, you want to use the format that can be read and written by a rank novice with a single programming course under their belt, because that's what makes the field approachable by dewy-eyed undergrads eager to get their feet wet. Giving those folks an easy on-ramp is how you grow the field.
And then you want to compress that format with bog-standard compression algorithms, and you might get side-tracked investigating how to improve that process without exploding your bioinformatics-focused codebase. Which is an interesting show of curiosity, not a reason to insult a class of scientists.
There's also a distribution problem. When people break history, by introducing new file formats, or updating old file formats, that impedes archival and replication. And once you've got a handful of file formats, now you've got the classic n+1 problem where no single format is optimal in all ways so people are always inventing new formats to see what sticks. And now an archivist needs to maintain tooling with an ever-increasing overhead. Here we see a clash of wisdom versus intellect, and if you're trying to foster a healthy field of research, wisdom wins the long game.
This is good, but you're implicitly saying there's a tradeoff made to achieve this, because otherwise we wouldn't be talking about alternatives. And an on-ramp for novices is good, but I don't see a step where the standards of the field move on from what novices can handle.
When FASTA was invented, Sanger sequencing reads would be around a thousand bases in length. Even back then, disk space wasn't so precious that you couldn't spend several kilobytes on the results of your experiment. Plus, being able to view your results with `more` is a useful feature when you're working with data of that size.
And, despite its simplicity, it has worked for forty years.
The simplicity of FASTA seems like a dream compared to the GenBank flat file format used before then. And around the year 2000, less computationally-inclined scientists were storing sequence in Microsoft Word binary .doc files.
A lot of file formats (including bioinformatics formats!) have come and gone in that time period. I don't think many would design it this way today, but it has a lot of nice features like the ones you point out that led to its longevity.
On those grounds, the lack of pre-tokenization in html/css/js ranks, at this point, as a planet-killing level of poor choices.
This is not really a fair statement. Literally all of software bears the weight of some early poor choice that then keeps moving forward via weight of momentum. FASTA and FASTQ formats are exceptionally dumb though.
It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
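This is easy to check directly with the zstandard bindings; a sketch (the 60-column wrapping, compression level, and file name are assumptions, and real FASTA also carries headers):

```python
import zstandard as zstd

def compressed_sizes(seq: str, width: int = 60, level: int = 19):
    """Compare zstd output size for line-wrapped vs. newline-free sequence."""
    wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    cctx = zstd.ZstdCompressor(level=level)
    return len(cctx.compress(wrapped.encode())), len(cctx.compress(seq.encode()))

# with open("genome.fa") as fh:
#     seq = "".join(line.strip() for line in fh if not line.startswith(">"))
#     print(compressed_sizes(seq))
```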
Guessing which PNG filters to use can make a huge difference to compression with only a tiny change to write speed. Or (like Adobe 20+ years ago) you can screw it up and get worse compression and slower speeds. These days brutal "try everything" modes exist which can squeeze out those last few bytes by trying even the unlikeliest combinations.
I can imagine a filter layer which says: this textual data comes in 78-character blocks punctuated with \n, so we're going to strip those out, then compress; in the opposite direction we decompress, then put back the newlines.
For FASTA we can just unconditionally choose to remove the extra newlines but that may not be true for most inputs, so the filters would help there.
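A sketch of that filter idea, assuming a fixed line width recorded alongside the payload (a real filter would also need to handle ragged final lines, headers, and mixed widths):

```python
import zstandard as zstd

def compress_wrapped_text(text: str) -> bytes:
    # Filter step: strip the periodic newlines so the compressor sees one
    # contiguous run of data, then compress.
    return zstd.ZstdCompressor().compress(text.replace("\n", "").encode())

def decompress_and_rewrap(blob: bytes, width: int) -> str:
    flat = zstd.ZstdDecompressor().decompress(blob).decode()
    # Inverse filter: reinsert a newline every `width` characters.
    return "\n".join(flat[i:i + width] for i in range(0, len(flat), width))
```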
I'm still pretty amazed that periodic newlines hurt compression ratios so much, given the compressor can use both Huffman coding and a lookback dictionary.
The best rule in sequence data storage is to store as little of it as possible.
First author on a paper like this indicates the most significant contribution: https://academic.oup.com/bioinformatics/article/25/14/1754/2...
Or, for an even simpler example: a long repetitive sequence becomes, on disk, the same sequence interrupted by a newline every N characters, which is hard to compress, while the newline-free version is just the repetition itself; and then, if you want, you can reflow the text when it's time to render to the screen.
Surprisingly, it gave an answer along the lines of the parent comment.
However, it seems it didn't figure this out by itself but it says:
> It’s widely known in bioinformatics that “one-line Fasta” files compress much better with LZ-based algorithms, and this is discussed in forums, papers, and practical guides on storing genomic data.
e.g. https://github.com/ArcInstitute/binseq
(I currently store a lot of data as FASTQ, and smaller file sizes could save us a bunch of money. But FASTQ + zstd is very good.)
At the loosest end, a format can leave lots of space for new symbols, and you can just use those to represent something new. But then not everyone agrees on what a new symbol means, and worse, multiple groups can use the same symbol to mean different things.
On the other end of the spectrum, you can be strict about the format, and not leave space for new symbols. Then to represent new things you need a new standard, and people to agree on it.
It's mostly a question of how well code can be updated and agreed upon, how strict you can require your tooling to be w.r.t. formats.
[1] See 'release.v16' in the fasta2 release at https://fasta.bioch.virginia.edu/wrpearson/fasta/fasta_versi...
[2] https://iupac.qmul.ac.uk/misc/naabb.html
CRAM compresses unmapped fastq pretty well, and can do even better with reference-based compression. If your institution is okay with it, you can see additional savings by quantizing quality scores (modern Illumina sequencers already do this for you). If you're aligning your data anyways, probably retaining just the compressed CRAM file with unmapped reads included is your best bet.
There are also other fasta/fastq specific tools like fqzcomp or MZPAQ. Last I checked, both of these could about halve the size of our fastq.gz files.
When I was shown BioPerl I was tempted to write a better, C++ version, but was overwhelmed by other university stuff and let it go.
Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data.
[0] https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zs...
(I am one of the maintainers of Zstd.)
I absolutely adore ZSTD, it has worked so well for me compressing json metadata for a knowledge engine.
The first stage of Zstd does LZ77 matching, which transforms the input into "sequences", a series of instructions each of which describes some literals and one match. The literals component of the instruction says "the next L bytes of the message are these L bytes". The match component says "the next M bytes of the input are the M bytes N bytes ago".
If you want to construct a match between two strings that differ by one character, rather than saying "the next N bytes are the N bytes M bytes ago except for this one byte here which is X instead", Zstd just breaks it up into two sequences, the first part of the match, and then a single literal byte describing the changed byte, and then the rest of the match, which is described as being at offset 0. The encoding rules for Zstd define offset 0 to mean "the previously used match offset". This isn't as powerful as a Levenshtein edit, but it's a reasonable approximation.
The big advantage of this approach is that it doesn't require much additional machinery on the encoder or decoder, and thus remains very fast. Whereas implementing a whole edit description state machine would (I think) slow down decompression and especially compression enormously.
[0] https://datatracker.ietf.org/doc/html/rfc8878#name-repeat-of...
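A purely conceptual replay of that instruction stream (this is not the real Zstd wire format; offset 0 here just stands in for "repeat the previous offset" as described above):

```python
def replay(sequences):
    """Each instruction is a (literal_bytes, match_offset, match_length) tuple."""
    out, last_offset = bytearray(), 1
    for literals, offset, length in sequences:
        out += literals                  # copy the literal bytes verbatim
        if offset == 0:
            offset = last_offset         # reuse the previous match offset
        for _ in range(length):          # byte-wise copy so overlapping matches work
            out.append(out[-offset])
        last_offset = offset
    return bytes(out)

# "abcdefgh" followed by "abcXefgh": match 3 bytes at offset 8, emit the single
# differing literal "X", then continue the match at the repeated offset.
assert replay([(b"abcdefgh", 8, 3), (b"X", 0, 4)]) == b"abcdefghabcXefgh"
```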
What lessons can we take from this?
Endogenous retroviruses [1] are interesting bits of genetics that help link together related species. A virus will inject a bit of its genetic material into the host, which can effectively permanently scar the host's DNA and all their offspring's DNA.
[1] https://en.wikipedia.org/wiki/Endogenous_retrovirus
I thought I'd raise yesterday's HN discussion on 'The unreasonable effectiveness of modern sort algorithms' https://news.ycombinator.com/item?id=45208828
That blog post isn't about DNA per se, but it is about sorting data when you know there are only 4 distinct values. I guess DNA has 5 (A, T, G, C, and N, the unknown base), but there's a huge space of DNA-specific compression research that outperforms ZSTD.
If the linefeeds were treated as semantic characters and not allowed to break the hash size, you would get similar results without pre-filtering and post-filtering. It occurs to me that this strategy is so obvious that there must be some reason it won't work.