Bloom Filters Are Good for Search That Does Not Scale
Posted about 2 months ago · Active about 2 months ago
notpeerreviewed.com · Tech · story
calm · positive
Debate: 60/100
Key topics
Bloom Filters
Search Algorithms
Data Structures
The article discusses the limitations of Bloom filters in search applications, sparking a thoughtful discussion on their use cases, alternatives, and related data structures.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion · First comment: 1h after posting
Peak period: 11 comments (6-9h after posting)
Avg / period: 3.7 comments
Comment distribution: 41 data points (based on 41 loaded comments)
Key moments
- Story posted: Nov 4, 2025 at 4:25 AM EST (about 2 months ago)
- First comment: Nov 4, 2025 at 5:26 AM EST (1h after posting)
- Peak activity: 11 comments in the 6-9h window, the hottest stretch of the conversation
- Latest activity: Nov 5, 2025 at 3:45 PM EST (about 2 months ago)
I've seen them save startups real money in caching layers - checking "did we already process this event" before hitting the database. False positives are fine because you just check the database anyway, but true negatives save you thousands of queries.
The trick is recognizing when false positives don't hurt you. Most engineers learn about them in theory but never find that practical use case where they're actually the right tool. Same with skip lists and a bunch of other algorithms - brilliant for 2% of problems, overkill for the rest.
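A minimal sketch of that dedup pattern, assuming a hand-rolled filter and a stand-in datastore (the BloomFilter class, check_database, and mark_in_database below are illustrative, not anything from the comment):

```python
import hashlib

# Stand-in for the real datastore (illustrative only).
_db = set()

def check_database(event_id: str) -> bool:
    return event_id in _db

def mark_in_database(event_id: str) -> None:
    _db.add(event_id)

class BloomFilter:
    """Minimal Bloom filter: k probe positions derived from one digest (double hashing)."""

    def __init__(self, m_bits: int, k: int):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8 + 1)

    def _positions(self, item: str):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

seen = BloomFilter(m_bits=8 * 1024 * 1024, k=4)

def already_processed(event_id: str) -> bool:
    if not seen.might_contain(event_id):   # true negative: definitely new, no DB query
        return False
    return check_database(event_id)        # "maybe": possible false positive, verify for real

def record_processed(event_id: str) -> None:
    seen.add(event_id)
    mark_in_database(event_id)
```

Only the "maybe" path ever pays for a database round trip; a definite "no" from the filter never lies.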
100% agree when it works in your favor. We use it for exactly that situation where a non-zero false positive rate is fine and you can choose how much memory to devote to getting it closer to zero.
There have been a couple of times, though, where we've needed a "keep everything not on this list" operation, and unfortunately bloom filters don't work well for that situation. There are alternatives, but none as elegantly compact as bloom filters.
This asymmetry works great for I/O-bound workloads (skip-indexes) but fails for TFA's approach where every document needs its own filter.
In practice, you combine both: inverted index for the dictionary (amortizes across documents), then bloom filters per chunk of the index (amortizes across chunks). This two-level approach handles scale much better than TFA's one-filter-per-document design. It's bloom filters as an optimization layer, not a replacement.
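A rough sketch of that two-level layout, assuming "bloom filters per chunk of the index" means each index segment carries a small filter over its term dictionary so queries can skip whole segments (the names and sizes below are made up for illustration, not TFA's or the commenter's code):

```python
import hashlib

def bloom_positions(term: str, m: int, k: int = 4):
    d = hashlib.sha256(term.encode()).digest()
    h1 = int.from_bytes(d[:8], "big")
    h2 = int.from_bytes(d[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]

class IndexChunk:
    """One segment of the inverted index plus a Bloom filter over its term dictionary."""
    M = 1 << 16  # bits per chunk filter (made-up size)

    def __init__(self):
        self.postings = {}                 # term -> list of doc ids in this chunk
        self.bits = bytearray(self.M // 8)

    def add(self, doc_id, terms):
        for t in terms:
            self.postings.setdefault(t, []).append(doc_id)
            for p in bloom_positions(t, self.M):
                self.bits[p // 8] |= 1 << (p % 8)

    def maybe_has(self, term):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in bloom_positions(term, self.M))

def search(chunks, term):
    hits = []
    for chunk in chunks:
        if not chunk.maybe_has(term):              # definite miss: skip the whole chunk
            continue
        hits.extend(chunk.postings.get(term, []))  # rare false positive: just a cheap dict miss
    return hits
```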
I invented a very fast string search algorithm based on bloom filters. Our paper [1] was accepted to the Symposium on Experimental Algorithms (SEA) 2024 [2]. Code can be found here [3].
[1] https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.S...
[2] https://sea2024.univie.ac.at/accepted-papers/
[3] https://github.com/nishihatapalmer/HashChain
If you want guaranteed linear performance in the worst case too, then LinearHashchain is the one to use. It is slightly more complex to implement as it builds in a KMP style verifier to make it linear (so almost two search algorithms are needed). It is actually about as fast as HashChain generally in the average case, so you don't lose out.
The others are either experimental or quite niche and not suitable for most purposes. SentinelHashchain is actually the fastest, but relies on being able to add a copy of the pattern at the end of the search text. Mostly this won't be possible in most search contexts, unless you control all memory allocations.
So I'd start with HashChain, and maybe play with the linear version later - most of it is the same; it just needs a bit more added.
However, the performance gain is not surprising at all since Bloom filters allow the querying engine to skip large blocks of data with certainty when the blocks do not contain the target data. False negatives are impossible. False positives occur but the rate of false positives can be made very small with well-chosen parameters and trade-offs.
With 4 hash functions (k = 4), 10007 bits per bloom filter (m = 10007), and a new bloom filter for every 1000 records (n = 1000), we achieved a theoretical false-positive rate of only 1.18% ((1 - e^(-k·n/m))^k ≈ 0.0118). In practice, over a period of 5 years, we found that the actual false-positive rate varied between 1.13% and 1.29%.
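Plugging those parameters into the standard approximation reproduces the quoted figures (a quick sanity check, not code from the project):

```python
import math

k, m, n = 4, 10007, 1000   # hash functions, bits per filter, records per filter

fp = (1 - math.exp(-k * n / m)) ** k                          # standard Bloom false-positive approximation
print(f"theoretical false-positive rate: {fp:.4%}")           # ~1.18%
print(f"filter overhead per block: {m / 8 / 1000:.2f} kB")    # ~1.25 kB per 1000-record block
```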
The only downside of a false positive is that it makes the query engine read a data block unnecessarily to verify whether the target data is present. This affects performance but not correctness, much like how CPU branch misprediction affects performance but not correctness.
A 30-fold increase in querying speed with just 1.25 kB of overhead per data block of 1000 records (each block roughly 1 MB to 2 MB in size) was, in my view, an excellent trade-off. It made a lot of difference to the customer experience, turning what used to be a 2 minute wait for query results into a wait of just about 5 seconds, or in larger queries, reducing a 30 minute wait to about 1 minute.
This was a fairly large project, with roughly 3 million lines of C and C++ code which had numerous constants with special values defined throughout the code. So instead of using a less interesting number like 10000, I chose 10007 so that if we ever came across the value 10007 (decimal) or 0x2717 (hexadecimal) while inspecting a core dump in a debugger, we would immediately recognise it as the constant defining the number of bits per Bloom filter.
Rather than one hash per file, I made a file for each 2-letter combination, like aa.raw, ab.raw, etc., where each bit in the file represents a record (bit 0 is record 0, and so on). You could of course do 3 or 4 letters too.
A query is split into 2-letter combinations (1st + 2nd, 2nd + 3rd, etc.), the corresponding files are loaded, and a bitwise AND is done on them.
A search for "bloom" would only load bl.raw, lo.raw, oo.raw, om.raw.
The index is really hard to update but adding new records is easy. New records are first added as false positives until we have enough bits to push a byte to the end of each raw file.
I then got lost pondering which creative letter combinations would yield the best results. Things like xx.raw and an.raw are pretty useless. Words could be treated as if they contained only unique characters.
Characters (or other bytes) could be combined, like s=z, k=c, x=y, or i=e.
Calculating which combination is best for the static data set was too hard for me. One could look at the ratio of set bits to see whether a given letter combination is worth having.
But it works, and loading a handful of files or doing an AND is amazingly fast.
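For illustration, here is one way the scheme could look in memory, with Python integers standing in for the .raw bit files (the verification step and helper names are assumptions, not the original implementation):

```python
from collections import defaultdict

def bigrams(text: str):
    t = text.lower()
    return {t[i:i + 2] for i in range(len(t) - 1)}

class BigramBitmapIndex:
    """One bit vector per 2-letter combination; bit i is set if record i contains that bigram.
    (The comment stores each vector as a .raw file on disk; this sketch keeps them in memory.)"""

    def __init__(self):
        self.records = []
        self.bitmaps = defaultdict(int)   # Python ints used as arbitrary-width bit vectors

    def add(self, record: str) -> int:
        idx = len(self.records)
        self.records.append(record)
        for bg in bigrams(record):
            self.bitmaps[bg] |= 1 << idx
        return idx

    def search(self, query: str):
        needed = bigrams(query)                 # "bloom" -> {"bl", "lo", "oo", "om"}
        hits = ~0                               # start with all bits set
        for bg in needed:
            hits &= self.bitmaps.get(bg, 0)     # bitwise AND, as in the comment
        # Candidates may be false positives (right bigrams, wrong order), so verify.
        return [r for i, r in enumerate(self.records)
                if hits >> i & 1 and query.lower() in r.lower()]

index = BigramBitmapIndex()
for doc in ["bloom filters are neat", "inverted index", "blooming gardens"]:
    index.add(doc)
print(index.search("bloom"))   # ['bloom filters are neat', 'blooming gardens']
```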
My boss was generally pretty good with obscure data structures but neither of us had encountered Bloom filters. This was about three years after Google published their paper on how they were using Bloom filters, but that company would be bankrupt before I figured it out.
Several of us were working hard to move everything into Prometheus that made any sense to be in Prometheus instead of Splunk.
Notably, any time we had a production issue where it was unclear which team was responsible, Splunk became the bottleneck because we started exceeding quotas immediately.
I'm not sure what you mean by exceeding quotas, because Splunk is normally licensed on GB ingested per day. This can lead to bitter fights between teams over how this is allocated.
The good thing about this license model is that you can use as much hardware as you want for no extra license cost.
That sounds like self-hosting, which is not the only product they offer. But you still have hardware that can only run so many queries at once and then starts queuing any additional requests, yeah? Once you have a dozen people on a call it goes to shit. We only occasionally ran into problems like this with Graphite, but you need a lot of people looking at a very large dashboard to start feeling refresh delays.
Bloom filter indexing seems like a great fit if you ever need to do substring searches in a context like that, and for log searching in general. I haven't dug into what all packages have it, but it looks like at least ClickHouse does: https://clickhouse.com/docs/optimize/skipping-indexes#bloom-...
You need to know up front whether you need to be able to dynamically add entries to the filter or if your application can tolerate rebuilding the filter entirely whenever the underlying data changes. In the latter case you have more freedom to choose data structures; many of the modern "better than Bloom" filters are more compact but don't support dynamic updates.
[1] https://en.wikipedia.org/wiki/Cuckoo_filter
[2] https://engineering.fb.com/2021/07/09/core-infra/ribbon-filt...
Cuckoo filters claim 70% of the size of bloom for the same error rate, and the space per record grows logarithmically with the inverse of the error rate. That looks like about 6.6 bits per record versus 9.56 bits for bloom at 1%. But at a .5% error rate a cuckoo filter is 7.6 bpr. In fact, you can get to about a .13% error rate with a cuckoo filter only a hair larger than the equivalent bloom filter (2^9.567 ≈ 758.5, i.e. roughly a 1-in-758 false-positive rate).
My fairly niche use case for these kinds of data structures was hardware firewalls running mostly on SRAM, which needed a sub one-in-a-billion false positive rate.
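For reference, the arithmetic behind those sizes: an optimal Bloom filter needs about 1.44·log2(1/ε) bits per key (the 9.56 quoted above uses the rounded 1.44 constant), while the cuckoo figures quoted track the fingerprint-size term log2(1/ε); real cuckoo filters add a few bits per key for bucket overhead and load factor. A quick check, purely illustrative:

```python
import math

def bloom_bits_per_key(eps: float) -> float:
    # optimal Bloom filter: m/n = -ln(eps) / (ln 2)^2  ~= 1.44 * log2(1/eps)
    return -math.log(eps) / math.log(2) ** 2

for eps in (0.01, 0.005, 0.0013):
    print(f"eps={eps:.4f}  bloom ~ {bloom_bits_per_key(eps):5.2f} bits/key   "
          f"log2(1/eps) = {math.log2(1 / eps):5.2f} bits")
# eps=0.0100  bloom ~  9.59 bits/key   log2(1/eps) =  6.64 bits
# eps=0.0050  bloom ~ 11.03 bits/key   log2(1/eps) =  7.64 bits
# eps=0.0013  bloom ~ 13.83 bits/key   log2(1/eps) =  9.59 bits   (2**9.59 ~ 770 ~ 1/eps)
```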
I needed to filter items by tags. Bloom filter per item seemed clever - quick membership checks. But with thousands of items sharing dozens of tags, each filter re-encodes the same vocabulary. Pure waste.
Switched to an inverted index (tag → item list) with bloom filters per chunk of the index. Now the tag vocabulary is shared, and bloom filters just speed up chunk-skipping when the index grows large.
TFA's mistake is using bloom filters instead of an inverted index, rather than on top of one. The amortization patterns stack, they don't compete.
https://github.com/FastFilter
It all started when Stavros's blog was circulated on Hacker News! The way we approached the search part was by using "Spectral Bloom Filters" - https://theory.stanford.edu/~matias/papers/sbf-sigmod-03.pdf - which is based on a paper by Saar Cohen and Yossi Matias from the early 2000s - it's basically an iteration on counting bloom filters. We used the minimum selection and minimal increase algorithms from the paper for insertion and ranking of results.
I wrote a blog on it too - https://pncnmnp.github.io/blogs/spectral-bloom-filters.html
Some slides - https://pncnmnp.github.io/blogs/sthir-talk-2020.pdf
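For readers unfamiliar with the paper, a compact sketch of those two heuristics on a counting filter (this is only the general idea from Cohen & Matias, not the code linked above):

```python
import hashlib

class SpectralBloomFilter:
    """Counting Bloom filter variant sketching the Cohen & Matias heuristics:
    - minimal increase: on insert, only bump the counters holding the current minimum
    - minimum selection: on query, report the smallest counter as the frequency estimate"""

    def __init__(self, m: int = 1 << 16, k: int = 4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, term: str):
        d = hashlib.sha256(term.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, term: str) -> None:
        pos = self._positions(term)
        low = min(self.counters[p] for p in pos)
        for p in pos:
            if self.counters[p] == low:     # minimal increase: leave larger (inflated) counters alone
                self.counters[p] += 1

    def estimate(self, term: str) -> int:
        # minimum selection: the smallest counter is the least-inflated estimate
        return min(self.counters[p] for p in self._positions(term))

sbf = SpectralBloomFilter()
for word in ["bloom", "bloom", "filter", "search", "bloom"]:
    sbf.insert(word)
print(sbf.estimate("bloom"), sbf.estimate("filter"), sbf.estimate("absent"))  # 3 1 0 (barring collisions)
```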
It is not hard to see how you could start asking the front desk to track every obscure attribute, and how you could expect it to fall over for various reasons.
[1] https://github.com/danthegoodman1/bloomsearch
https://dl.acm.org/doi/pdf/10.1145/3077136.3080789
https://bitfunnel.org/strangeloop/
It's a pretty neat algorithm from a 2019 paper for the application "to index k-mers of DNA samples or q-grams from text documents". You can take a collection of bloom filters built for individual documents and combine them into a single structure that tells you which docs a query maps to. It's like an inverted index meets a bloom filter.
I'm using it in a totally different domain for an upcoming release in InfluxDB (time series database).
There's also code online here: https://github.com/bingmann/cobs
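The core trick, as I understand it, is storing the per-document Bloom filters column-wise so a query only reads the rows for its hash positions and ANDs them to get candidate documents. A toy in-memory version might look like this (COBS itself handles variable document sizes and an on-disk layout, and its real API differs):

```python
import hashlib

class BitSlicedBloomIndex:
    """Each document gets a Bloom filter over its terms; filters are stored as rows
    (one bit vector per hash position across all documents). A term query ANDs the
    rows at its hash positions, yielding a bit set of candidate documents."""

    def __init__(self, m: int = 4096, k: int = 3):
        self.m, self.k = m, k
        self.rows = [0] * m        # rows[pos] has bit d set if doc d hashes a term to pos
        self.doc_names = []

    def _positions(self, term: str):
        d = hashlib.sha256(term.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add_document(self, name: str, terms):
        doc_bit = 1 << len(self.doc_names)
        self.doc_names.append(name)
        for term in terms:
            for pos in self._positions(term):
                self.rows[pos] |= doc_bit

    def query(self, term: str):
        candidates = ~0
        for pos in self._positions(term):
            candidates &= self.rows[pos]    # AND the k rows -> docs whose filters contain the term
        return [n for i, n in enumerate(self.doc_names) if candidates >> i & 1]

index = BitSlicedBloomIndex()
index.add_document("doc-a", ["bloom", "filter"])
index.add_document("doc-b", ["inverted", "index"])
print(index.query("bloom"))   # ['doc-a'] (false positives possible, false negatives not)
```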
I thought that was going to be a link to Google.com