Data Modeling Guide for Real-Time Analytics with ClickHouse
Posted 4 months ago · Active 4 months ago
ssp.sh · Tech · story
Key topics
ClickHouse
Real-Time Analytics
Data Modeling
Database Performance
The article provides a guide to data modeling for real-time analytics with ClickHouse, sparking discussion on the challenges and limitations of using ClickHouse for complex data transformations and aggregations.
Snapshot generated from the HN discussion
Discussion Activity
Status: active discussion
First comment: 5h after posting
Peak period: 19 comments in 0-12h
Avg / period: 5.5
Comment distribution: 22 data points
Key moments
- 01 Story posted: Sep 5, 2025 at 8:51 AM EDT (4 months ago)
- 02 First comment: Sep 5, 2025 at 1:40 PM EDT (5h after posting)
- 03 Peak activity: 19 comments in 0-12h (hottest window of the conversation)
- 04 Latest activity: Sep 11, 2025 at 6:34 AM EDT (4 months ago)
ID: 45137927 · Type: story · Last synced: 11/20/2025, 2:38:27 PM
Want the full context? Read the primary article or dive into the live Hacker News thread.
We might stick with ClickHouse, but after working with it for a year and a half, I'm curious to see whether a system that handles joins more effectively would be a better fit. To that end, my next R&D project is to set up a vertical slice of our analytics on Apache Doris to see how well it handles a similar workload.
We recently built a Snowflake-to-ClickHouse pipeline for a customer where aggregates are built hourly in Snowflake, then pushed into a ClickHouse table to power their user-facing dashboards.
By offloading dashboard queries to ClickHouse, they slashed their Snowflake bill by ~25%, which was worth millions to them.
(Admittedly, running aggregations elsewhere—for example, in Spark—could further reduce costs, but you would then need Iceberg to make the tables queryable in Snowflake.)
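A minimal sketch of what that hourly hand-off could look like, with made-up table, stage, and bucket names (the real pipeline may use a different transport entirely): the aggregate is unloaded from Snowflake to object storage, then pulled into ClickHouse with the s3 table function.

    -- Hypothetical sketch; table, stage, and bucket names are made up.
    -- Snowflake: build the hourly aggregate and unload it to object storage.
    COPY INTO @analytics_stage/dashboard_agg/
    FROM (
        SELECT
            date_trunc('hour', event_ts) AS event_hour,
            customer_id,
            count(*)                     AS events,
            sum(revenue)                 AS revenue
        FROM raw_events
        WHERE event_ts >= dateadd('hour', -1, current_timestamp())
        GROUP BY 1, 2
    )
    FILE_FORMAT = (TYPE = PARQUET)
    OVERWRITE = TRUE;

    -- ClickHouse: ingest the freshly unloaded files into the serving table.
    INSERT INTO dashboard_agg
    SELECT event_hour, customer_id, events, revenue
    FROM s3(
        'https://example-bucket.s3.amazonaws.com/dashboard_agg/*.parquet',
        'Parquet'
    );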
Which version are you working with? I recommend trying the latest version. For the last eight months, we have spent considerable time improving our JOINs.
For example, we just merged https://github.com/ClickHouse/ClickHouse/pull/80848, which will help a lot with performance in the near future.
I also share nrjames's curiosity about whether the spill-to-disk situation has improved. Not having to even think about whether a join fits in memory would be a game changer.
When I played a little with ClickHouse, join handling was one of the biggest things that jumped out at me.
It wouldn't change the join order to try to ensure the smaller side was in RAM. It wouldn't propagate WHERE conditions to the other side of a join, so "WHERE a.customer_id=3 AND b.customer_id=a.customer_id" wouldn't automatically filter the scan of b to blocks where b.customer_id=3. Projections could in theory be used to speed up joins on the projection's ORDER BY columns, but that didn't appear to happen. And you'd sometimes need to set the join algorithm manually for best performance, when I'd rather the optimizer choose it, maybe adaptively, based on whether doing the join in RAM seemed to make sense.
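To make that concrete, here is a hypothetical sketch of doing the optimizer's job by hand: the filter is repeated on both sides of the join and the algorithm is chosen explicitly. The table names and the parallel_hash choice are assumptions, and newer versions propagate the predicate automatically, as discussed below.

    -- Hypothetical illustration of acting as the query optimizer yourself
    -- (older behavior; table names are made up).
    SELECT a.order_id, b.event_time
    FROM orders AS a
    INNER JOIN events AS b ON b.customer_id = a.customer_id
    WHERE a.customer_id = 3
      AND b.customer_id = 3                      -- predicate repeated by hand so b's scan is pruned
    SETTINGS join_algorithm = 'parallel_hash';   -- algorithm picked manually instead of by the optimizer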
ClickHouse was also great at certain things, like scanning a ton of rows very efficiently and interfacing with lots of systems (its table functions can pull from many sources). It was easy to pick up and start playing with, and it seems to have found use in a lot of organizations, at substantial scale. There are also a lot of features that look potentially quite powerful (incremental materialized views, aggregating/deduping merge tree variants, etc.). The best way I could sum up what I saw while experimenting was that they'd focused on providing a really efficient low-level execution engine, but sometimes you needed to act as the query optimizer yourself.
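A hedged sketch of the two features mentioned, with made-up table and column names: a table function querying external data in place, and an incremental materialized view that pre-aggregates rows as they are inserted.

    -- Hypothetical examples; table and column names are made up.

    -- Table function: query external data in place, no import step.
    SELECT count()
    FROM url('https://example.com/exports/pageviews.csv', 'CSVWithNames');

    -- Incremental materialized view: pre-aggregate rows as they arrive.
    CREATE TABLE events_raw
    (
        ts   DateTime,
        page String
    )
    ENGINE = MergeTree
    ORDER BY ts;

    CREATE TABLE events_by_hour
    (
        hour  DateTime,
        page  String,
        views AggregateFunction(count)
    )
    ENGINE = AggregatingMergeTree
    ORDER BY (hour, page);

    CREATE MATERIALIZED VIEW events_by_hour_mv TO events_by_hour AS
    SELECT
        toStartOfHour(ts) AS hour,
        page,
        countState()      AS views
    FROM events_raw
    GROUP BY hour, page;

    -- Reading back: SELECT hour, page, countMerge(views) FROM events_by_hour GROUP BY hour, page;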
In general: given a huge hash join, you can shrink it (reorder, etc., so fewer rows need to be hashed), sort it (take advantage of an existing sort order for an efficient join), and/or spill it to disk. ClickHouse's improvements from 2024.12 forward have made it much better at the first of the three; it could get much further with the other two.
1) Shrinking -- "Can we make it a small join?" They've done a lot to solve this one! They'll swap table order, propagate WHERE conditions, and end up using projections, so quite often if the join can be small, it will be, without special effort on the query writer's part. To be clear, this is a huge win for many use cases, and I'm not writing more because all I can say is "great."
2) Sorting -- "Can we take advantage of an existing sort order?" Huge joins can be efficient and not run out of RAM if the data is already ordered by the join key and all you have to do is walk through part of it in order. ClickHouse doesn't automatically do this; it defaults to hashing even when the tables' natural row ordering would make sort-based joins fast. Docs suggest users consider full_sorting_merge when tables are already ordered by the join key. Perhaps a future setting could make the optimizer consider it.
Joins with coinciding ORDER BYs seem like they could occur naturally in "might-have" situations, e.g. joining pageviews to conversions on a shared key.
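A minimal sketch of point 2 under assumed schemas: both tables share the join key as their ORDER BY, so a sort-merge join can stream through them without building a big hash table, but today the query author has to request it explicitly.

    -- Both tables already ordered by the join key (schemas are made up).
    CREATE TABLE pageviews
    (
        session_id UInt64,
        url        String,
        ts         DateTime
    )
    ENGINE = MergeTree
    ORDER BY session_id;

    CREATE TABLE conversions
    (
        session_id UInt64,
        amount     Decimal(10, 2)
    )
    ENGINE = MergeTree
    ORDER BY session_id;

    -- The optimizer won't pick this on its own today; the query author opts in.
    SELECT p.session_id, p.url, c.amount
    FROM pageviews AS p
    LEFT JOIN conversions AS c ON c.session_id = p.session_id
    SETTINGS join_algorithm = 'full_sorting_merge';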
2a) Sorting with projections -- "Can we use sort orders if we include projections?" If rows aren't already on disk in the correct order to make a sort join fast, the ClickHouse administrator could add a projection with a different ORDER BY to one or both tables to fix that! Right now ClickHouse will pick projections to aid in filtering, but won't look for projections that allow a fast sort join.
Combining 2 and 2a, ClickHouse admins would gain a pretty big hammer to use on problem joins: create a projection or two allowing a fast sort-join, and now the former problem query is efficient with no effort on query authors' part. This is valuable when the admins are willing to understand query and execution patterns precisely, but not everyone (or everything, when some queries are generated) can write queries with optimization in mind.
Query authors can force use of a projection with the mergeTreeProjection function, and force join_algorithm = 'full_sorting_merge', but that's the user acting as query optimizer, not the desired end state.
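A hypothetical sketch of that hammer, with assumed schemas; the mergeTreeProjection call and its exact argument form are as I recall from recent docs and may differ by version. The admin adds a projection ordered by the join key, and the query author forces both the projection and the sort join by hand.

    -- Hypothetical schemas: user_events is naturally ordered by time, users by user_id.
    CREATE TABLE user_events
    (
        user_id UInt64,
        ts      DateTime,
        action  String
    )
    ENGINE = MergeTree
    ORDER BY ts;

    CREATE TABLE users
    (
        user_id UInt64,
        plan    String
    )
    ENGINE = MergeTree
    ORDER BY user_id;

    -- Admin-side fix: add (and materialize) a projection ordered by the join key.
    ALTER TABLE user_events ADD PROJECTION events_by_user (SELECT * ORDER BY user_id);
    ALTER TABLE user_events MATERIALIZE PROJECTION events_by_user;

    -- Query author acting as the optimizer: force the projection and the sort join.
    SELECT e.user_id, u.plan
    FROM mergeTreeProjection(currentDatabase(), user_events, events_by_user) AS e
    INNER JOIN users AS u ON u.user_id = e.user_id
    SETTINGS join_algorithm = 'full_sorting_merge';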
3) Spilling -- "Will we automatically spill to disk instead of having an out-of-memory error?" Looks like this is still a "no". Docs show options to stop joins at a certain size, but I don't see a way to e.g. trigger a retry with a more spill-friendly join_algorithm when a join gets too big (or if the planner expects it will be big). There is a very interesting patch in https://github.com/ClickHouse/ClickHouse/pull/72728 which introduces the idea of query-wide RAM accounting that can trigger spill to disk, but it's only used in grace_hash as of that PR.
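For reference, a hedged sketch of the guardrails that do exist today, using setting names as I understand them (max_bytes_in_join, join_overflow_mode, grace_hash_join_initial_buckets) and made-up tables: you can cap the join and fail fast, or manually opt into grace_hash, but neither fallback happens automatically.

    -- Hypothetical guardrails available today; table names are made up.
    CREATE TABLE big_left  (id UInt64, v String) ENGINE = MergeTree ORDER BY id;
    CREATE TABLE big_right (id UInt64, w String) ENGINE = MergeTree ORDER BY id;

    -- 1) Fail fast instead of eating all RAM on an unexpectedly large hash join.
    SELECT count()
    FROM big_left AS l
    INNER JOIN big_right AS r ON r.id = l.id
    SETTINGS max_bytes_in_join = 10000000000,   -- cap the in-memory join at ~10 GB
             join_overflow_mode = 'throw';      -- error out rather than keep going

    -- 2) Manual fallback: grace_hash partitions the join into disk-backed buckets,
    --    trading memory for extra passes over the data.
    SELECT count()
    FROM big_left AS l
    INNER JOIN big_right AS r ON r.id = l.id
    SETTINGS join_algorithm = 'grace_hash',
             grace_hash_join_initial_buckets = 16;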
--
I haven't really looked at RAM use for DISTINCT or aggregation, or whether they take advantage of sorts. And maybe the better path is to make transparent spill-to-disk something you can rely on, then reduce how often you need to spill as an optimization. Either way, ClickHouse could get from its current state--where joins that fit in RAM are efficient but larger ones need some hand-holding--to a state where you can throw all sorts of queries at it and reliably get results without having to guide it.