The Two Versions of Parquet
Key topics
The article discusses the two versions of the Parquet file format, highlighting the challenges of adopting Version 2 due to limited compatibility in major engines and tools, sparking a discussion on the trade-offs between compatibility and performance.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 3d after posting
- Peak period: 27 comments in the 84-96h window
- Avg per period: 8.3 comments
- Based on 50 loaded comments
Key moments
- Story posted: Aug 21, 2025 at 5:34 AM EDT (4 months ago)
- First comment: Aug 24, 2025 at 4:31 PM EDT (3d after posting)
- Peak activity: 27 comments in the 84-96h window (hottest stretch of the conversation)
- Latest activity: Aug 26, 2025 at 9:33 PM EDT (4 months ago)
The author was able to roll back his changes, but in some industries an unplanned enterprise-wide data unavailability event means the end of your career at that firm if you don’t have a CYA email from the vendor confirming you were good to go. That CYA email, and the throat to choke, is why Oracle does 7- and 8-figure licensing deals with enterprises selling inferior software solutions versus open source options.
It seems that Linux, through Linus’ leadership, has been able to solve this risk issue and fully displace commercial UNIX operating systems. I hope many other projects up and down the stack can have the same success.
Oracle is not immune to software issues. In fact, this year I lost two weekends because of a buggy upgrade on the cloud that left my production cluster in a failed state.
It’s a mammoth task for them to migrate
billed
Most times I prefer to wade through the knowledge base until I find a solution.
When the author talks about rolling back his changes, he's not referring to a database, but to a version of his library. If someone had tried to use his new version, I assume the only thing that would have gone wrong is that their code wouldn't work because Pandas didn't support the format.
This article is about how a new version of the Parquet format hasn't been widely adopted, and so now the Parquet community is in a split state where different forces are pulling the format in two directions, and this happens to be caused by two different areas of focus that don't need to be tightly coupled together.
I don't see how the problems the article discusses relate to the reliability of software.
If a (major) software update causes you an outage, you shouldn’t blame the software, but insufficient testing and validation. Large companies (I’ve worked for many) are slow to adopt new technologies precisely because they are extremely cautious and want to make sure everything has been properly tested before they roll it out. That’s also why they still use Oracle and SQL Server (and HP-UX, and IBM i) - these products work and have been working for generations of employees. The grass needs to be significantly greener for them to consider the move to the other side of their fence.
#callMeOptimist
Never seen any v1 in the wild.
- https://voltrondata.com/codex/a-new-frontier (links out to others) - https://wesmckinney.com/blog/looking-back-15-years/
in short you can think of a DB as at least 3 decoupled subsystems: UI, compute (query engine), storage. DuckDB has a query engine and storage format, and several UIs (SQL, Python, etc.). Trino is only a query engine (and UIs, everything has UIs). Polars has a query engine. DataFusion is a query engine (and other things). Spark is a query engine. pandas has a query engine
typically query engines are tightly coupled with the overall “product”, but increasingly compute, data (and even more recently via DuckLake metadata), and UI are decoupled allowing you to mix and match parts for a “decomposed database” architecture
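To make the decoupling concrete, here is a minimal sketch (my own illustration, not from the thread) where one Parquet file acts as the shared storage layer and two independent query engines read it. It assumes pandas (with pyarrow), duckdb, and polars are installed; "events.parquet" is a made-up file name.

```python
import duckdb
import pandas as pd
import polars as pl

# Storage layer: one Parquet file written once (here via pandas + pyarrow).
pd.DataFrame({"uid": ["a", "b", "a"], "clicks": [3, 5, 7]}).to_parquet("events.parquet")

# Compute layer, option 1: DuckDB's query engine reads the same file.
totals_duckdb = duckdb.sql(
    "SELECT uid, SUM(clicks) AS total FROM 'events.parquet' GROUP BY uid"
).df()

# Compute layer, option 2: Polars' query engine reads the same file.
totals_polars = (
    pl.scan_parquet("events.parquet")
    .group_by("uid")
    .agg(pl.col("clicks").sum())
    .collect()
)

print(totals_duckdb)
print(totals_polars)
```

Neither engine owns the data: the storage format is the contract, and the UI/compute layers can be swapped out independently.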
quick disclaimer: I worked at Voltron Data but it’s a dead company walking, not trying to advertise for them by any means but the doc I linked is very well written with good information IMO
Take the RLE encoding which switches between run-length encoding and bit-packing. The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length.
I just can't believe this is optimal except in maybe very specific CPU-only cases (e.g. Parquet-Java running on a giant cluster somewhere).
If it were just bit-packing I could easily offload a whole data page to a GPU and not care about having per-bitwidth optimised implementations, but having to switch encodings at random intervals just makes this a headache.
It would be really nice if actual design documents existed that specified why this is a good idea based on real-world data patterns.
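For context on what "switching encodings at random intervals" means in practice: in the RLE/bit-packed hybrid, each run starts with a ULEB128 varint header whose low bit says whether the next values are a bit-packed group or an RLE run. A rough Python sketch of a decoder for that layout, based on my reading of the parquet-format spec (no error handling, and any length prefix is assumed to already be stripped):

```python
def decode_rle_bp_hybrid(buf: bytes, bit_width: int, num_values: int) -> list[int]:
    out, pos = [], 0

    def read_uvarint() -> int:
        # ULEB128 varint used for the run headers.
        nonlocal pos
        result = shift = 0
        while True:
            byte = buf[pos]; pos += 1
            result |= (byte & 0x7F) << shift
            if not byte & 0x80:
                return result
            shift += 7

    mask = (1 << bit_width) - 1
    byte_width = (bit_width + 7) // 8
    while len(out) < num_values:
        header = read_uvarint()
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values, packed LSB-first.
            count = (header >> 1) * 8
            nbytes = count * bit_width // 8
            chunk = int.from_bytes(buf[pos:pos + nbytes], "little"); pos += nbytes
            for i in range(count):
                out.append((chunk >> (i * bit_width)) & mask)
        else:
            # RLE run: (header >> 1) copies of one little-endian fixed-width value.
            count = header >> 1
            value = int.from_bytes(buf[pos:pos + byte_width], "little"); pos += byte_width
            out.extend([value] * count)
    return out[:num_values]
```

The encoder gets to choose run boundaries, which is exactly why a reader can't know ahead of time which branch it will be in for any given stretch of values.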
I haven't thought much about it, but I believe the ideal reference implementation would be a highly optimized, service-like process that you run alongside your engine, using Arrow to share zero-copy buffers between the engine and the Parquet service. Parquet predates Arrow by quite a few years and Java was (unfortunately) the standard for big data stuff back then, so they simply stuck with it.
> The way bit-packing has been implemented is to generate 74,000 lines of Java to read/write every combination of bitwidth, endianness and value-length
I think they did this to avoid dynamic dispatch in Java. In C++ or Rust something very similar would happen, but at the compiler level, which is a much saner way of doing this kind of thing.
I've just had so many issues with the total lack of clarity in this format. They tell you a total_compressed_size for a page, then it turns out the _uncompressed_ page header is included in this - but the documentation barely gives any clues about the layout [1].
The reality:
Each column chunk includes a list of pages written back-to-back, with an optional dictionary page first. Each of these, including the dictionary page, is prepended with an uncompressed PageHeader in Thrift format.
It wasn't too hard to write a paragraph about it. It was quite hard looking for magic compression bytes in hex dumps.
Maybe there should be a "minimum workable reference implementation" or something that is slow but easy to understand.
[1] https://parquet.apache.org/docs/file-format/data-pages/colum...
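For what it's worth, the offsets and sizes behind that layout can be inspected without hex dumps through pyarrow's metadata accessors; a small sketch, assuming pyarrow is installed and using "example.parquet" as a placeholder file name:

```python
import pyarrow.parquet as pq

# Dump per-column-chunk layout info: where the optional dictionary page and
# the first data page start, plus the compressed/uncompressed sizes.
meta = pq.ParquetFile("example.parquet").metadata
for rg in range(meta.num_row_groups):
    for col in range(meta.num_columns):
        cc = meta.row_group(rg).column(col)
        print(
            cc.path_in_schema,
            "dictionary_page_offset:", cc.dictionary_page_offset,  # None if no dictionary page
            "data_page_offset:", cc.data_page_offset,
            "total_compressed_size:", cc.total_compressed_size,    # per the comment above, includes the uncompressed Thrift page headers
            "total_uncompressed_size:", cc.total_uncompressed_size,
        )
```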
Jesus Christ this isn't 2005 anymore and people need to learn to use the real power of the JVM. It's stuff like this that sets it apart
The first paragraph under that heading has a markdown error.
"Simple Binary Encoding" v2 has been stuck at release candidate stage for over 5 years and has flaws that mean it'll probably never be worth adopting.
(slams gavel)
Parquet Court.
Parquet is amazinggggggg
Yes, it is a critique of Parquet (or at least of its user community). It's a critique that's 100% justified, too.
Have we all been so conditioned by corporate training that we've lost the ability to say "hey, this sucks" when it _does_ in fact suck?
We all lose when people communicate unclearly. Here, the people holding back evolution of the format do need to be critiqued, and named, and shamed, and the author shouldn't have been so shy about doing it.
It is a command-line wrapper to generate a Pandas DataFrame and save it as CSV (or the other way around).