Beyond Indexes: How Open Table Formats Optimize Query Performance

Posted 3 months ago | Active 3 months ago

jandrewrogers

95 points

3 comments

jack-vanlightly.com | Tech | story

Key topics

Data Storage

Query Performance

Open Table Formats

The article discusses how open table formats optimize query performance beyond traditional indexing methods, sparking discussion on their application and migration strategies for large databases.

Snapshot generated from the HN discussion

Discussion Activity

Light discussion

Peak period: 90-96h

Avg / period: 1.5

Key moments

01 Story posted: Oct 8, 2025 at 11:01 AM EDT (3 months ago)
02 First comment: Oct 11, 2025 at 10:37 PM EDT (3d after posting)
03 Peak activity: 2 comments in 90-96h (hottest window of the conversation)
04 Latest activity: Oct 12, 2025 at 6:39 AM EDT (3 months ago)

Discussion (3 comments)

eisa01

3 months ago

1 reply

A lot of upvotes, but no discussion :)

How would you approach migrating a ~5 TB OLTP database that fundamentally contains analytical time-series data? I’d think something like Apache Iceberg could be a better data store and would make writing much easier (almost just dumping a Parquet file in there).

It’s exposed to the outside world via APIs.

marklit

3 months ago

That 5 TB of data will probably be 300-400 GB in Parquet. Try to denormalise the data into a few datasets, or just one dataset if you can.

DuckDB querying the data should be able to return results in milliseconds if only the smaller columns are being used, and better still if the row-group stats can be used to answer queries.
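As a rough illustration of why row-group statistics help, here is a minimal sketch of min/max pruning in plain Python. It does not use DuckDB or read a real Parquet file; `RowGroupStats`, `prune_row_groups`, and the timestamp values are hypothetical names and numbers made up for this example. The idea is only that a reader can skip any row group whose min/max range cannot overlap the query's filter.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics for one column in one row group (hypothetical model)."""
    index: int
    min_value: int
    max_value: int


def prune_row_groups(stats, lo, hi):
    """Keep only row groups whose [min, max] range overlaps the query range [lo, hi]."""
    return [rg for rg in stats if rg.max_value >= lo and rg.min_value <= hi]


# Four row groups over a sorted timestamp column (illustrative values only).
stats = [
    RowGroupStats(0, 0, 999),
    RowGroupStats(1, 1000, 1999),
    RowGroupStats(2, 2000, 2999),
    RowGroupStats(3, 3000, 3999),
]

# A filter on timestamps between 1500 and 2100 only needs row groups 1 and 2.
selected = prune_row_groups(stats, 1500, 2100)
print([rg.index for rg in selected])  # prints [1, 2]
```

Pruning works best when the data is sorted or clustered on the filtered column, since overlapping min/max ranges would force most row groups to be read anyway.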

You can host those Parquet files on a local disk or S3. A local disk might be cheaper if this is exposed to the outside world as well as giving you a price ceiling on hosting.

If you have a Parquet file with billions of records and row groups numbering in the thousands, then hosting on something like Cloudflare, where there is a per-request charge, could get a bit expensive if this is a popular dataset. At a minimum, DuckDB will look at the stats for each row group for any column involved in a query. It might be cheaper just to pay for 400 GB of storage with your hosting provider.
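The per-request cost concern can be put into back-of-the-envelope form. The sketch below is only arithmetic under stated assumptions: `estimate_monthly_request_cost` is a hypothetical helper, and the query volume, requests per query, and price are placeholders, not any provider's real pricing.

```python
def estimate_monthly_request_cost(queries_per_month, requests_per_query, price_per_million_requests):
    """Rough monthly cost of per-request charges for remote Parquet reads."""
    total_requests = queries_per_month * requests_per_query
    return total_requests / 1_000_000 * price_per_million_requests


# Placeholder inputs: 1M queries a month, each touching ~2,000 row groups,
# at a made-up price of 0.50 per million requests.
cost = estimate_monthly_request_cost(1_000_000, 2_000, 0.50)
print(cost)  # prints 1000.0
```

Plugging in a provider's actual request price and a measured requests-per-query figure gives a number to compare against a flat storage fee.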

There is a project to convert OSM to Parquet every week, and we had to look into some of those issues: https://github.com/osmus/layercake/issues/22

victor106

3 months ago

Great article.

Any other resources that provide a comprehensive treatment of open table formats?

ID: 45516964 | Type: story | Last synced: 11/20/2025, 1:45:02 PM