F3: Open-Source Data File Format for the Future [pdf]
Posted 3 months ago · Active 3 months ago
db.cs.cmu.edu · Tech story · High profile
calm / mixed · Debate: 80/100
Key topics: Data Storage, File Formats, WebAssembly
The F3 file format embeds WebAssembly modules for decoding data, sparking discussion on its potential, security, and performance implications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion · First comment: 11h after posting · Peak period: 48 comments in the 12-18h window · Average per period: 12.5 · Based on 125 loaded comments
Key moments
- Story posted: Oct 1, 2025 at 9:52 AM EDT (3 months ago)
- First comment: Oct 1, 2025 at 9:08 PM EDT (11h after posting)
- Peak activity: 48 comments in the 12-18h window (hottest window of the conversation)
- Latest activity: Oct 5, 2025 at 4:34 AM EDT (3 months ago)
ID: 45437759 · Type: story · Last synced: 11/20/2025, 6:39:46 PM
So, prior art! :)
[1] https://www.cs.tufts.edu/comp/150FP/archive/alan-kay/smallta... (page 4)
The proliferation of open-source file formats (e.g., Parquet, ORC) allows seamless data sharing across disparate platforms. However, these formats were created over a decade ago for hardware and workload environments that are much different from today's.
Each self-describing F3 file includes both the data and meta-data, as well as WebAssembly (Wasm) binaries to decode the data. Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable.
And if you're really working on an obscure platform, implementing a decoder for a file format is probably easier than implementing a full-blown wasm runtime for that platform.
As I see it, the point is that the exact details of how the bits are encoded are not really interesting from the perspective of the program reading the data.
Consider a program that reads CSV files and processes the data in them. The first column contains a timestamp, the second a filename, and the third a size.
As long as there's a well-defined interface that the program can use to extract rows from a file, where each row contains one or more columns of data values and those data values have the correct data type, then the program doesn't really care about this coming from a CSV file. It could just as easily be a 7zip-compressed JSON file, or something else entirely.
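A minimal sketch of that idea in Rust (all names here are illustrative, not part of F3 or any real library): the program codes against a typed record interface and never learns whether the bytes came from CSV, compressed JSON, or something else entirely.

```rust
use std::time::SystemTime;

// One logical record: timestamp, filename, size.
struct Record {
    timestamp: SystemTime,
    filename: String,
    size: u64,
}

// Any physical format (CSV, 7zip-compressed JSON, F3, ...) can sit behind
// this interface; the consumer never touches the encoding details.
trait RecordSource {
    fn next_record(&mut self) -> Option<Record>;
}

// The processing logic depends only on the interface, not the file format.
fn total_size(source: &mut dyn RecordSource) -> u64 {
    let mut total = 0;
    while let Some(rec) = source.next_record() {
        total += rec.size;
    }
    total
}
```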
Now, granted, this file format isn't well-suited as a generic file format. After all, the decoding API they specify is returning data as Apache Arrow arrays. Probably not well-suited for all uses.
How many different storage format implementations will there realistically be?
Apparently an infinite number, if we go with the approach in the paper /s
But I'm not totally clear what the relationship between F3 and Vortex is. It says their prototype uses the encoding implementation in Vortex, but does not use the Vortex type system?
→ Meta's Nimble: https://github.com/facebookincubator/nimble
→ CWI's FastLanes: https://github.com/cwida/FastLanes
→ SpiralDB's Vortex: https://vortex.dev
→ CMU + Tsinghua F3: https://github.com/future-file-format/f3
On the research side, we (CMU + Tsinghua) weren't interested in developing new encoders and instead wanted to focus on the WASM embedding part. The original idea came as a suggestion from Hannes@DuckDB to Wes McKinney (a co-author with us). We just used Vortex's implementations since they were in Rust and with some tweaks we could get most of them to compile to WASM. Vortex is orthogonal to the F3 project and has the engineering energy necessary to support it. F3 is an academic prototype right now.
I note that the Germans also released their own file format this year that also uses WASM. But they WASM-ify the entire file and not individual column groups:
→ Germans: https://github.com/AnyBlox
Will one of the new formats absorb the others' features? Will there be a format war a la iceberg vs delta lake vs hudi? Will there be a new consortium now that everyone's formats are out in the wild?
I would love to bring these benefits to the multidimensional array world, via integration with the Zarr/Icechunk formats somehow (which I work on). But this fragmentation of formats makes it very hard to know where to start.
Also, back on topic - is your file format encryptable via that WASM embedding?
> We first discuss the implementation considerations of the input to the Wasm-side Init() API call. The isolated linear memory space of Wasm instance is referred to as guest, while the program’s address space running the Wasm instance is referred to as host. The input to a Wasm instance consists of the contiguous bytes of an EncUnit copied from the host’s memory into the guest’s memory, plus any additional runtime options.
> Although research has shown the importance of minimizing the number of memory copies in analytical workloads, we consider the memory copy while passing input to Wasm decoders hard to avoid for several reasons. First, the sandboxed linear memory restricts the guest to accessing only its own memory. Prior work has modified Wasm runtimes to allow access to host memory for reduced copying, but such changes compromise Wasm’s security guarantees
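For concreteness, here is a hedged sketch of that host-to-guest copy using the wasmtime crate (an illustrative runtime choice, not necessarily what F3 uses); the export names `memory`, `alloc`, and `init` are assumptions for illustration, not F3's actual ABI.

```rust
use anyhow::anyhow;
use wasmtime::{Engine, Instance, Module, Store};

// Sketch only: copy an EncUnit's bytes from the host into the guest's
// linear memory, then hand the guest a (ptr, len) pair into its own memory.
fn init_decoder(wasm_bytes: &[u8], enc_unit: &[u8]) -> anyhow::Result<()> {
    let engine = Engine::default();
    let module = Module::new(&engine, wasm_bytes)?;
    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;

    // Assumed exports: the guest's linear memory, an allocator, and an init hook.
    let memory = instance
        .get_memory(&mut store, "memory")
        .ok_or_else(|| anyhow!("guest must export its linear memory"))?;
    let alloc = instance.get_typed_func::<u32, u32>(&mut store, "alloc")?;
    let init = instance.get_typed_func::<(u32, u32), u32>(&mut store, "init")?;

    // The unavoidable copy: host bytes -> guest linear memory.
    let guest_ptr = alloc.call(&mut store, enc_unit.len() as u32)?;
    memory.write(&mut store, guest_ptr as usize, enc_unit)?;

    // The sandboxed guest only ever sees offsets into its own memory.
    init.call(&mut store, (guest_ptr, enc_unit.len() as u32))?;
    Ok(())
}
```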
> Security concerns. Despite the sandbox design of Wasm, it still has vulnerabilities, especially with the discovery of new attack techniques. [...] We believe there are opportunities for future work to improve Wasm security. One approach is for creators of Wasm decoding kernels to register their Wasm modules in a central repository to get the Wasm modules verified and tamper-resistant.
I see why you're doing it, but it also opens up a whole avenue of new types of bugs. Now the dataset itself could have consistency issues if, say, the unpacker produces different data based on the order it's called, and there will be a concept of bugfixes to the unpackers - how will we roll those out?
Fascinating.
https://cds.cern.ch/record/2296399/files/zebra.pdf
Is there any reason to believe that a major new browser tech, WASM, will ever have support dropped for its early versions?
Even if the old versions are not as loved as the new ones (i.e., engines optimized for them and immediately ready to execute), emulation methods work wonders and could easily be downloaded on demand by browsers needing to run "old WASM".
I'm quite optimistic for the forwards-compatibility proposed here.
<blink> is gone. In C, gets is gone. It may take only one similar minor change to make huge amounts of data unreadable by newer WASM versions.
However I don't think it matters too much. Here WASM is not targeting the browser, and there are many more runtimes for WASM, in many diverse languages, and they outnumber browsers significantly. It won't die easily.
Depends how obscure. You can play Z-machine games on pretty much anything now.
I get the "step change/foundational building block" vibe from the paper - and the project name itself implies more than a little "je ne sais quoi" - but unfortunately I only understand a few sentences per page. The pictures are pretty though, and the colour choices are tasteful yet bold. Two thumbs up from the easily swayed.
No, this is a compatibility layer for future encoding changes.
For example, ORCv2 has never shipped because we tried to bundle all the new features into a new format version, ship all the writers with the features disabled, then ship all the readers with support and then finally flip the writers to write the new format.
Specifically, there was a new flipped bit version of float encoding which sent the exponent, mantissa and sign as integers for maximum compression - this would've been so much easier to ship if I could ship a wasm shim with the new file and skip the year+ wait for all readers to support it.
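Roughly, the shape of that encoding (an editorial sketch of the idea, not the actual ORC change): pull each IEEE-754 float apart into separate sign, exponent, and mantissa streams so each stream compresses on its own.

```rust
// Split f32 values into sign / exponent / mantissa streams.
fn split_floats(values: &[f32]) -> (Vec<u8>, Vec<u8>, Vec<u32>) {
    let mut signs = Vec::with_capacity(values.len());
    let mut exponents = Vec::with_capacity(values.len());
    let mut mantissas = Vec::with_capacity(values.len());
    for v in values {
        let bits = v.to_bits();
        signs.push((bits >> 31) as u8);              // 1 sign bit
        exponents.push(((bits >> 23) & 0xFF) as u8); // 8 exponent bits
        mantissas.push(bits & 0x7F_FFFF);            // 23 mantissa bits
    }
    (signs, exponents, mantissas)
}

// Lossless round trip back to the original floats.
fn join_floats(signs: &[u8], exponents: &[u8], mantissas: &[u32]) -> Vec<f32> {
    signs
        .iter()
        .zip(exponents)
        .zip(mantissas)
        .map(|((s, e), m)| f32::from_bits(((*s as u32) << 31) | ((*e as u32) << 23) | *m))
        .collect()
}
```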
We'd have made progress with the format, but we'd also be able to deprecate a reader impl in code without losing compatibility if the older files carried their own information.
Today, something like Spark's variant type would benefit from this - the sub-columnarization that does would be so much easier to ship as bytecode instead of as an interpreter that contains support for all possible recombinations from split up columns.
PS: having spent a lot of nights tweaking tpc-h with ORC and fixing OOMs in the writer, it warms my heart to see it sort of hold up those bits in the benchmark
Sometimes the best option is to do the hard political work, improve the standard, and get everyone moving with it. People have pushed Parquet and Arrow. They are absolutely great technologies that I use regularly, but 8 years after someone asked how to write Parquet in Java, the best answer is still to use DuckDB: https://stackoverflow.com/questions/47355038/how-to-generate...
Not having a good Parquet writer for Java shows a poor attempt at pushing forward a standard. Similarly, Arrow has problems in Java land. If they can't be bothered to consider how to actually implement and roll out standards to a top-5 language, I'm not sure throwing WASM into the mix will fix it.
I wouldn’t say “very quick”. They only need to read and look at the data for columns a and b, whereas, with a row-oriented approach, with storage being block-based, you will read additional data, often the entire dataset.
That’s faster, but for large datasets you need an index to make things “very quick”. This format supports that, but whether to have that is orthogonal to being row/column oriented.
This is in contrast to row-based storage: you have to scan the full row to get a column. Think of how you'd usually read a CSV (read line, parse line).
IMO it's easier to explain in terms of workload:
- OLTP (T = transactional workloads): row-based, for operating on rows
- OLAP (A = analytical workloads): column-based, for operating on columns (sum/min/max/...)
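A toy sketch of the difference, assuming an analytical query that only needs column `b`:

```rust
// Summing column `b` over a row layout touches every row (including the
// string column), while the columnar layout reads only the contiguous
// `b` vector.
struct RowTable {
    rows: Vec<(i64, i64, String)>, // (a, b, c) per row
}

struct ColumnTable {
    a: Vec<i64>,
    b: Vec<i64>,
    c: Vec<String>,
}

fn sum_b_rowwise(t: &RowTable) -> i64 {
    t.rows.iter().map(|(_, b, _)| *b).sum()
}

fn sum_b_columnar(t: &ColumnTable) -> i64 {
    t.b.iter().sum()
}
```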
One big win here is that it's possible to get Apache Arrow buffers directly from the data pages - either by using the provided WASM or bringing a native decoder.
In Parquet this is currently very complicated. Parquet uses the Dremel encoding which stores primitive values alongside two streams of integers (repetition and definition levels) that drive a state machine constructed from the schema to reconstruct records. Even getting those integer streams is hard - Parquet has settled on “RLE” which is a mixture of bit-packing and run-length encoding and the reference implementation uses 74,000 lines of generated code just for the bit-packing part.
So to get Arrow buffers from Parquet is a significant amount of work. F3 should make this much easier and future-proof.
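To illustrate the contrast (a hypothetical sketch using the arrow-rs crate, not F3's actual decode API): once a columnar page decodes to plain values, wrapping them as an Arrow array is trivial, with no repetition/definition-level state machine in between.

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int64Array};

// Hypothetical: `decoded_values` stands in for the output of a columnar
// page decoder (native or WASM). No record reassembly is required.
fn page_to_arrow(decoded_values: Vec<i64>) -> ArrayRef {
    Arc::new(Int64Array::from(decoded_values))
}
```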
One of the suggested wins here is random access to metadata. When using GeoParquet I index the metadata in SQLite, otherwise it would take about 10 minutes as opposed to a few milliseconds to run a spatial query on e.g. Overture Maps - I'd need to parse the footer of ~500 files meaning ~150MB of Thrift would need to be parsed and queried.
However the choice of Google's Flatbuffers is an odd one. Memory safety with FlatBuffers is a known problem [1]. I seriously doubt that generated code that must mitigate these threats will show any real-world performance benefits. Actually - why not just embed a SQLite database?
[1] https://rustsec.org/advisories/RUSTSEC-2021-0122.html
> The decoding performance slowdown of Wasm is minimal (10–30%) compared to a native implementation.
so... you take 10%-30% performance hit _right away_, and you perpetually give up any opportunities to improve the decoder in the future. And you also give up any advanced decoding functions other than "decode whole block and store into memory".
I have no idea why would anyone do this. If you care about speed, then wasm is not going to cut it. If you don't care about speed, you don't need super-fancy encoding algorithms, just use any of the well-known ones.
The WASM is meant as a backup. If you have the native decoder installed (e.g., as a crate), then a system will prefer to use that; otherwise, it falls back to WASM. A 10-30% performance hit is worth it over not being able to read a file at all.
"Embedding the decoders in each file requires minimal storage (kilobytes) and ensures compatibility on any platform in case native decoders are unavailable."
The idea that software I write today can decode a data file written ten years from now using new encodings is quite appealing.
And so is the idea that new software written to make use of the new encodings doesn't have to carry the burden of implementing the whole history of encoders for backwards compatibility.
Sounds very much like the security pain from macros in Excel and Microsoft Word that could do anything.
This is why most PDF readers will ignore any javascript embedded inside PDF files.
"In case users prefer native decoding speed over Wasm, F3 plans to offer an option to associate a URL with each Wasm binary, pointing to source code or a precompiled library."
In fact, wasm was explicitly designed for me to run unverified wasm blobs from random sources safely on my computer.
The situation you describe is kind of already the case with various approaches to compression. For example, perhaps we decide to bitpack instead of use the generic compressor. Or change compressors entirely.
This sort of thing exists without WASM, and it means you have to "transcode" i.e. rewrite the file after updating your software with the new techniques.
With WASM, it's the same. You just rewrite the file.
I do agree that this pushes the costs of iteration up the stack in a vastly less efficient way. Overall this seems way more expensive, very unclear that future proofing is worth it. I've worked with exabyte-scale systems and re-encoding swaths of data regularly would not be good.
Sold
For those who don't know, Wes McKinney is the creator of Pandas, the go-to tabular analysis library for Python. That gives his format widespread buy-in from the outset, as well as a couple of decades of Caring About The Problem, which makes his insights unusually valuable.
I just hope that people will not just execute that code in an unconfined environment.
My apologies to the Chinese coauthors who I'm not familiar with.
Bookmarked for thorough reading!
[1] covers the first 40 years of databases. [2] fills in the gap of the last 20 years and gives their thoughts on the future.
My apologies, the first one was actually Stonebraker & Hellerstein and didn't involve Pavlo. They're both excellent papers though for anyone working with data.
Stonebraker, for those who don't know, is the creator of Postgres, a database you might have heard of.
1: Stonebraker & Hellerstein, "What Goes Around Comes Around", 2005, https://people.csail.mit.edu/tdanford/6830papers/stonebraker...
2: Stonebraker & Pavlo, "What Goes Around Comes Around... And Around...", 2024, https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec20...
That said, now let me see what F3 is actually about (and yes, your comment is what actually made me want to click through to the link) ...
- Parquet metadata is Thrift, but with comments saying "if this field exists, this other field must exist", and no code actually verifying the fact, so I'm pretty sure you could feed it bogus Thrift metadata and crash the reader.
- Parquet metadata must be parsed out, meaning you have to: allocate a buffer, read the metadata bytes, and then dynamically keep allocating a whole bunch of stuff as you parse the metadata bytes, since you don't know the size of the materialized metadata! Too many heap allocations! This file format's Flatbuffers approach seems to solve this as you can interpret Flatbuffer bytes directly.
- The encodings are much more powerful. I think a lot of people in the database community have been saying that we need composable/recursive lightweight encodings for a long time (a toy sketch follows this list). BtrBlocks was the first such format that was open in my memory, and then FastLanes followed up. Both of these were much better than Parquet by itself, so I'm glad ideas from those two formats are being taken up.
- Parquet did the Dremel record-shredding thing which just made my brain explode and I'm glad they got rid of it. It seemed to needlessly complicate the format with no real benefit.
- Parquet datapages might contain different numbers of rows, so you have to scan the whole ColumnChunk to find the row you want. Here it seems like you can just jump to the DataPage (IOUnit) you want.
- They got rid of the heavyweight compression and just stuck with the Delta/Dictionary/RLE stuff. Heavyweight compression never did anything anyway, and was super annoying to implement, and basically required you to pull in 20 dependencies.
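As promised above, a toy sketch of what "composable lightweight encodings" means: dictionary-encode a string column, then run-length encode the resulting code stream. Real formats (BtrBlocks, FastLanes, Vortex) choose such compositions per column; none of these names or functions come from F3 itself.

```rust
use std::collections::HashMap;

// Dictionary encoding: unique values plus a stream of integer codes.
fn dictionary_encode<'a>(values: &[&'a str]) -> (Vec<&'a str>, Vec<u32>) {
    let mut dict: Vec<&str> = Vec::new();
    let mut lookup: HashMap<&str, u32> = HashMap::new();
    let codes: Vec<u32> = values
        .iter()
        .map(|&v| {
            *lookup.entry(v).or_insert_with(|| {
                dict.push(v);
                (dict.len() - 1) as u32
            })
        })
        .collect();
    (dict, codes)
}

// Run-length encoding over the code stream: (code, run_length) pairs.
fn rle_encode(codes: &[u32]) -> Vec<(u32, u32)> {
    let mut runs: Vec<(u32, u32)> = Vec::new();
    for &c in codes {
        match runs.last_mut() {
            Some((v, n)) if *v == c => *n += 1,
            _ => runs.push((c, 1)),
        }
    }
    runs
}

// The composition: one encoding's output feeds the next.
fn encode_column<'a>(values: &[&'a str]) -> (Vec<&'a str>, Vec<(u32, u32)>) {
    let (dict, codes) = dictionary_encode(values);
    (dict, rle_encode(&codes))
}
```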
Overall great improvement, I'm looking forward to this taking over the data analytics space.
A big problem for some people is that Java support is hard, as it isn't portable, so e.g. making a Java web server compress its responses with it isn't so easy.
It's not like it's 1999 and there is still some Sun dogma against doing this.
"Heavyweight compression" as in zstd and brotli? That's very useful for columns of non-repeated strings. I get compression ratios in the order of 1% on some of those columns, because they are mostly ASCII and have lots of common substrings.
Build teams, weep in fear...
Parquet is surprisingly arcane. There are a lot of unpleasant and poorly documented details one has to be aware of in order to use it efficiently.
I think that more expensive compression may have made more of a difference 15 years ago when cpu was more plentiful compared to network or disk bandwidth.
Plugins for DuckDB, for example, to read and write it. Heck, if you got Iceberg to use it instead of Parquet, that could be a win.
Sometimes tech doesn’t win on the merits because of how entrenched existing stuff is and the high cost of switching.
Yes! Path dependence: https://en.wikipedia.org/wiki/Path_dependence
If you need an optimized format, you build it yourself; if you need a standard, you use whatever everyone else is using, unless it is too bad.
These new formats seem like they would only be useful as research to build on when creating a specific tool or database.
Is the runtime really that large? I know WASM 2.0, with garbage collection and exceptions, is a bit of a beast, but WASM 1.0? What's needed (I'm speaking from a place of ignorance here, I haven't implemented a WASM runtime)? Some contiguous memory, a stack machine, IEEE float math and some UTF-8 operations. I think you can add some reasonable limitations like only a single module and a handful of available imports relevant to the domain.
I know that feature creep would almost inevitably follow, but if someone cares about minimizing complexity it seems possible.
That relies on some rather far-fetched assumptions about what the attacker might reasonably be able to control undetected, and what goals they might reasonably be able to achieve through such low-level data corruption.
Maybe information leakage? Tweak some low-order float bits in the decoded results with high-order bits from data the decoder recognizes as “interesting”?
Far-fetched, indeed.
Why is there a chess move mentioned towards the end of the paper? (page 24)
CMU-DB vs. MIT-DB: #1 pe4
Clever.
[1](https://chesspathways.com/chess-openings/kings-pawn-opening/)
14 votes & no comments, 3 mo ago, for AnyBlox: A Framework for Self-Decoding Datasets [pdf] https://gienieczko.com/anyblox-paper https://news.ycombinator.com/item?id=44501743
:)
If I had a dollar for all the file formats that got deprecated or never took-off over the years...
This would allow them to use a decoder optimized for the current hardware, e.g. multi-threaded.
https://dl.acm.org/doi/10.1145/3749163
They also have to specify the WASM version. In 6 years, we already have 3 versions there (https://webassembly.org/specs/). Are those 100% backwards compatible?