Optimizing Writes to OLAP Using Buffers (ClickHouse, Redpanda, MooseStack)
Key topics
The article discusses optimizing writes to OLAP systems using buffers, focusing on ClickHouse, Redpanda, and MooseStack. The discussion centers on the trade-offs and design considerations of such optimizations.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
- First comment: 12m after posting
- Peak period: 8 comments in the 132-144h window
- Avg / period: 3.2
Based on 16 loaded comments
Key moments
- Story posted: Oct 14, 2025 at 2:21 PM EDT (3 months ago)
- First comment: Oct 14, 2025 at 2:33 PM EDT (12m after posting)
- Peak activity: 8 comments in the 132-144h window (hottest window of the conversation)
- Latest activity: Oct 21, 2025 at 3:47 PM EDT (2 months ago)
I'm curious to know how much of this article is OLAP-specific vs. just generic good practice for tuning batch insert chunk size. The whole "batch your writes, use 100k rows or 1s worth of data" thing applies equally to pretty much any database. They're just ignoring the availability of built-in bulk-load methods so they can argue that INSERTs are slow, so they can fix it by adding Kafka, for reasons? Maybe I'm missing something.
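To make the generic advice concrete, here is a minimal client-side batching sketch, assuming the official @clickhouse/client package for Node/TypeScript; the events table and row shape are made up for illustration:

    import { createClient } from '@clickhouse/client';

    const client = createClient({ url: 'http://localhost:8123' }); // assumed local ClickHouse

    type Row = { ts: string; user_id: string; value: number };     // hypothetical schema

    const MAX_ROWS = 100_000;  // flush at ~100k rows...
    const MAX_AGE_MS = 1_000;  // ...or after ~1s of buffering, whichever comes first
    let buffer: Row[] = [];
    let timer: NodeJS.Timeout | null = null;

    async function flush(): Promise<void> {
      if (timer) { clearTimeout(timer); timer = null; }
      if (buffer.length === 0) return;
      const rows = buffer;
      buffer = [];
      // One bulk INSERT per batch instead of one INSERT per row.
      await client.insert({ table: 'events', values: rows, format: 'JSONEachRow' });
    }

    export async function write(row: Row): Promise<void> {
      buffer.push(row);
      if (buffer.length >= MAX_ROWS) return flush();
      if (!timer) timer = setTimeout(() => void flush(), MAX_AGE_MS);
    }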
If you want to directly insert data into ClickHouse with MooseStack, we have a direct insert method that allows you to use ClickHouse's bulkload methods.
Here's the implementation: https://github.com/514-labs/moosestack/blob/43a2576de2e22743...
Documentation is here: https://docs.fiveonefour.com/moose/olap/insert-data#performa...
Would love to hear your thoughts on our direct insert implementation!
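For anyone skimming, the direct insert path looks roughly like the sketch below, based on the linked docs; treat the package name, OlapTable, and insert() as assumptions and defer to the linked implementation and docs if the real API differs:

    import { Key, OlapTable } from '@514labs/moose-lib'; // import names assumed from the Moose docs

    // Hypothetical schema; real projects define their own.
    interface Event {
      id: Key<string>;
      ts: Date;
      value: number;
    }

    const events = new OlapTable<Event>('events');

    // Direct insert: hands the rows straight to ClickHouse's bulk-load path
    // instead of routing them through a Redpanda/Kafka buffer first.
    await events.insert([
      { id: 'a1', ts: new Date(), value: 42 },
      { id: 'a2', ts: new Date(), value: 7 },
    ]);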
It's a tradeoff. Analytics databases are often filled with periodic dumps of transactional databases and this feels so dirty that it's easy to accidentally forget that it isn't just a hack, it's actually a structural workaround for the poor random-write performance of analytics DBs:
OLTP = more read amplification on analytics workloads, less write amplification on random inserts
OLAP = less read amplification on analytics workloads, more write amplification on random inserts
If that's too theoretical: the other day I saw 1-row updates of about 10 KB of data lead to roughly 1 GB of writes in Redshift: 1 MB block size times 300 columns times a log+shuffle factor of about 3. That's a write amplification factor of about 100,000. Crazy stuff.
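The back-of-envelope behind that factor, using only the numbers quoted in the comment:

    // All inputs come from the Redshift anecdote above.
    const blockSizeBytes = 1 * 1024 * 1024;  // 1 MB block rewritten per touched column
    const columns = 300;                     // wide analytics table
    const logShuffleFactor = 3;              // log + shuffle overhead, approximate
    const bytesWritten = blockSizeBytes * columns * logShuffleFactor; // ~900 MB, call it 1 GB
    const updateBytes = 10 * 1024;           // ~10 KB logical row update
    const writeAmplification = bytesWritten / updateBytes;            // ~92,000, on the order of 10^5
    console.log({ bytesWritten, writeAmplification });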
ClickHouse's async_insert covers this: it's recommended in the docs over the Buffer table, and is pretty much invisible to the end user.
At ClickHouse Inc itself, this scaled far beyond millions of rows per second: https://clickhouse.com/blog/building-a-logging-platform-with...
If you need to ensure super durable writes, you can consider a dedicated buffer, but I really think it's not something you need to reach for at first glance.
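For reference, async_insert is just a setting you pass along with the write; a minimal sketch with the official TypeScript client, assuming an existing events table:

    import { createClient } from '@clickhouse/client';

    const client = createClient({ url: 'http://localhost:8123' });

    // With async_insert the server buffers and merges small inserts itself,
    // so clients can send small writes without hand-rolled batching.
    await client.insert({
      table: 'events', // hypothetical table
      values: [{ ts: new Date().toISOString(), user_id: 'u1', value: 1 }],
      format: 'JSONEachRow',
      clickhouse_settings: {
        async_insert: 1,
        wait_for_async_insert: 1, // ack only after the server-side buffer is flushed (more durable, higher latency)
      },
    });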
We went with that infrastructure from the get-go for multiple reasons:
* Having a durable buffer in front means big spikes get absorbed by the buffer rather than by the OLAP store, which you want to keep responsive when it is powering your online dashboards (a rough sketch of such a consumer follows below). ClickHouse Cloud now has compute/compute separation that addresses this, but open source users don't have it.
* When we first shipped this, ClickHouse did not have async buffering in place, so not doing some kind of buffered insert was frowned upon.
* As oatsandsugar mentioned, since then we have also shipped direct insert, so you don't need a Kafka buffer if you don't want one.
* From an architecture standpoint, that setup lets you have multiple consumers.
* Finally, having Kafka lets you write streaming functions in your favorite language instead of SQL. The performance-to-task ratio will definitely be lower, but depending on the task it might be faster to set up, and you can even do things you couldn't do directly in the database.
Disclaimer: I am the CTO at Fiveonefour.
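To make the durable-buffer side concrete, here is a rough sketch of a consumer that drains Redpanda (which speaks the Kafka protocol) into ClickHouse in batches; it assumes kafkajs and @clickhouse/client plus an events topic and table, and it is not MooseStack's actual implementation:

    import { Kafka } from 'kafkajs';
    import { createClient } from '@clickhouse/client';

    const kafka = new Kafka({ clientId: 'olap-writer', brokers: ['localhost:9092'] }); // Redpanda broker
    const consumer = kafka.consumer({ groupId: 'clickhouse-sink' });
    const clickhouse = createClient({ url: 'http://localhost:8123' });

    await consumer.connect();
    await consumer.subscribe({ topic: 'events', fromBeginning: false });

    await consumer.run({
      // eachBatch hands over whatever accumulated in the topic since the last poll,
      // so write spikes queue up in Redpanda instead of hammering ClickHouse.
      eachBatch: async ({ batch, heartbeat }) => {
        const rows = batch.messages
          .filter((m) => m.value !== null)
          .map((m) => JSON.parse(m.value!.toString()));
        if (rows.length > 0) {
          await clickhouse.insert({ table: 'events', values: rows, format: 'JSONEachRow' });
        }
        await heartbeat();
      },
    });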
Altinity is addressing this with Project Antalya builds. We have extended open source ClickHouse with stateless swarm clusters to scale queries on shared Iceberg tables.
Disclaimer: CEO of Altinity
It's mainly focused on the insert batching, which is why I was drawing attention to async_insert.
I think it's worth highlighting the incremental transformation that CH can do via materialised views too. That can often replace the need for a full-blown streaming transformation pipeline.
IMO you can get a surprising distance with "just" a ClickHouse instance these days. I'd definitely be interested in articles that talk about where that threshold is no longer met!
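As an illustration of that incremental transformation, a sketch that creates a materialized view through the TypeScript client; the events_raw source, the events_by_minute target table (assumed to already exist), and the columns are all made up:

    import { createClient } from '@clickhouse/client';

    const client = createClient({ url: 'http://localhost:8123' });

    // The view transforms and aggregates rows incrementally as they are inserted
    // into events_raw, so no separate streaming pipeline has to be operated.
    await client.command({
      query: `
        CREATE MATERIALIZED VIEW IF NOT EXISTS events_by_minute_mv
        TO events_by_minute
        AS
        SELECT
          toStartOfMinute(ts) AS minute,
          user_id,
          count()    AS events,
          sum(value) AS total_value
        FROM events_raw
        GROUP BY minute, user_id
      `,
    });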
https://news.ycombinator.com/item?id=45651098
One of the reasons we do streaming ingest is that we often modify the schema of the data in-stream, usually to conform with ClickHouse best practices that aren't adhered to in the source data (restrictive types, denormalization, non-nullable by default, etc.).
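A small example of that kind of in-stream schema conformance, with purely illustrative field names:

    import { randomUUID } from 'node:crypto';

    // Source rows as they arrive upstream: loosely typed, nullable everywhere.
    type SourceEvent = { id?: string; ts?: string; amount?: string | number; user?: { id?: string } };

    // Shape ClickHouse prefers: restrictive types, denormalized, nothing nullable by default.
    type CleanEvent = { event_id: string; ts: string; amount_cents: number; user_id: string };

    export function conform(src: SourceEvent): CleanEvent {
      return {
        event_id: src.id ?? randomUUID(),                         // fill a default rather than insert NULL
        ts: src.ts ?? new Date().toISOString(),                   // default instead of a Nullable(DateTime)
        amount_cents: Math.round(Number(src.amount ?? 0) * 100),  // narrow string | number to a fixed integer unit
        user_id: src.user?.id ?? 'unknown',                       // denormalized: flattened from the nested object
      };
    }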