An Overengineered Solution to `sort | uniq -c` with 25x Throughput (hist)
Some notes on the improvements:
1. Using csv (serde) for writing leads to some big gains.
2. Arena allocation of incoming keys + storing references in the hashmap instead of owned values heavily reduced the number of allocations and improved cache efficiency (I'm guessing; I did not measure).
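A minimal sketch of the idea in note 2, assuming `bumpalo` for the arena and a plain std `HashMap` (the actual hist-rs code may differ):

```
// Sketch: arena-allocated keys with borrowed references stored in the map
// (assumes the bumpalo crate; not necessarily the hist-rs implementation).
use bumpalo::Bump;
use std::collections::HashMap;
use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let arena = Bump::new();                       // all key bytes live here
    let mut counts: HashMap<&str, u64> = HashMap::new();

    for line in io::stdin().lock().lines() {
        let line = line?;
        match counts.get_mut(line.as_str()) {
            Some(n) => *n += 1,                    // repeat key: no allocation
            None => {
                // First sighting: copy the key into the arena once and keep
                // only a &str reference in the map.
                let key: &str = arena.alloc_str(&line);
                counts.insert(key, 1);
            }
        }
    }

    eprintln!("{} distinct keys", counts.len());
    Ok(())
}
```

Because the map stores only `&str` references, the per-key cost is one arena copy plus a pointer/length in the table, rather than a separately heap-allocated `String` per key. (`lines()` here still allocates a scratch `String` per input line; a reusable `read_line` buffer would avoid even that.)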
There is some regex functionality and some table filtering built in as well.
happy hacking
The author created a Rust tool, hist-rs, that outperforms the traditional 'sort | uniq -c' command by 25x, sparking a discussion on optimization techniques and trade-offs.
Snapshot generated from the HN discussion
Discussion activity
- Very active discussion: 109 comments loaded, averaging 18.2 per period
- Story posted: Oct 22, 2025 at 6:26 PM EDT
- First comment: Oct 23, 2025 at 1:24 PM EDT (19h after posting)
- Peak activity: 73 comments on Day 5 (the hottest window of the conversation)
- Latest activity: Nov 4, 2025 at 4:10 PM EST
It's great when you quickly need to see what the distribution of classes in an input stream is. This pops up all the time. Like measuring different types of log messages, counting the variants of a field in a csv, finding the most common word or substring, etc.
There is no text encoding processing, so this only works for single byte encodings. That probably speeds it up a little bit.
Depending on the size of the benchmark input, sort(1) may have done disk-based sorting. What's the size of the benchmark input?
You could do an annotating pass to learn which occurrence of each line is the last one, and then a follow-up pass that prints (or otherwise echoes) only the lines that are the last of their kind. Technically still faster than sorting.
You could also keep the information on last occurrence of each line in the hash map (that's where it's going to be anyway), and once you're done with the first pass sort the map by earliest last occurrence. That will get you the lines in the right order, but you had to do a sort. If the original input was mostly duplicates, this is probably a better approach.
You could also track last occurrence of each line in a separate self-sorting structure. Now you have slightly more overhead while processing the input, and sorting the output is free.
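A minimal sketch of the hash-map variant described above (track the last-occurrence index while reading, then sort the unique lines by that index; illustrative only):

```
// Reads stdin, prints each distinct line once, ordered by where it last
// appeared. The hash map holds the last-occurrence index for each line.
use std::collections::HashMap;
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let mut last_seen: HashMap<String, usize> = HashMap::new();

    // Single pass: remember the index of the last occurrence of each line.
    for (idx, line) in io::stdin().lock().lines().enumerate() {
        last_seen.insert(line?, idx);
    }

    // Sort the map's entries by last occurrence, as described above.
    let mut entries: Vec<(String, usize)> = last_seen.into_iter().collect();
    entries.sort_by_key(|&(_, idx)| idx);

    let mut out = io::stdout().lock();
    for (line, _) in entries {
        writeln!(out, "{line}")?;
    }
    Ok(())
}
```

The self-sorting variant in the last paragraph would replace the final sort with something like a map keyed by last-occurrence index that is updated as the input is read.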
Can't you reverse file | keep first occurrence | reverse output?
It is not difficult to construct scenarios where throughput matters, but IMHO that does not determine engineering vs. overengineering. What matters is whether there are requirements that need to be met. Debating the requirements is possible, but that doesn't take away from whether a solution obtained with reasonable effort meets the spec. Overengineering is about unreasonable effort, which can lead to overshooting the requirements, not about unreasonable requirements.
or you found that you've optimized a game that is unfun to play and thus doesn't sell, even tho it runs fast...
yep. The stakeholder (who is paying the money) asks why the prototype can't just be "fixed up" and be sold for money, instead of paying for more dev time to rewrite. There's no answer that they can be satisfied with.
I worked on licensed titles for a while, and in that area the quality of a title and whether it sells were largely uncorrelated haha!
Could you explain that, if you have the time? Is that for writing the output lines? Is actual CSV functionality used? That crate says "Fast CSV parsing with support for serde", so I'm especially confused how that helps with writing.
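For what it's worth, the `csv` crate ships a serde-aware `Writer` alongside its `Reader`; a hypothetical sketch of what the output side could look like with it (an assumption for illustration, not the author's confirmed code):

```
// Hypothetical sketch: writing (count, key) rows with the csv crate's
// serde-based Writer. Buffering plus serialize() avoids hand-formatting
// and flushing each output line. Not taken from hist-rs itself.
use std::error::Error;
use std::io;

fn write_counts(counts: &[(u64, String)]) -> Result<(), Box<dyn Error>> {
    let mut wtr = csv::WriterBuilder::new()
        .delimiter(b'\t')              // tab-separated output
        .from_writer(io::stdout());
    for (count, key) in counts {
        wtr.serialize((count, key))?;  // serde serializes the tuple as a record
    }
    wtr.flush()?;
    Ok(())
}

fn main() -> Result<(), Box<dyn Error>> {
    write_counts(&[(3, "foo".to_string()), (1, "bar".to_string())])
}
```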
[1] https://clickhouse.com/docs/operations/utilities/clickhouse-...
Jet motorcycles but motorcycles
> The easiest way to download the latest version is with the following command:
> curl https://clickhouse.com/ | sh
Like, sure, there is some risk downloading a binary or running an arbitrary installer. But this is just nuts.
Please don't say that. It denigrates the work of all the packagers that actually keep our supply chains clean. At least in the major distributions such as Red Hat/Fedora and Debian/Ubuntu.
The distro model is far from perfect and there are still plenty of ways to insert malware into the process, but it certainly is far better than running binaries directly from a web page. You have no idea who has access to that page and its mirrors and what their motives are. The binary isn't even signed, let alone reviewed by anyone!
Previously discussed here: https://news.ycombinator.com/item?id=11348798
Article now resides here: https://www.davidhaney.io/npm-left-pad-have-we-forgotten-how...
It's literally exactly the same thing
They want to own the claims made.
It likely won’t matter much here, but invoking cat is unnecessary.
Piping the file straight into `sort` will do the job just fine. GNU's sort also has a few flags controlling buffer size and parallelism; those may matter more (see https://www.gnu.org/software/coreutils/manual/html_node/sort...).

You're right that the `cat` is unnecessary - and removing it actually gave some marginal gains for the naive solution. I've updated the benchmarks to show this.
Cheers
https://github.com/donatj/unic
It may not be the fairest comparison, because with these random FASTQs I'm generating, the vast majority of the input is unique, so it could be overloading the cuckoo filter.
Would be interesting to try this out
https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
https://github.com/majek/mmuniq
```
clang \
  -g -ggdb -O3 \
  -Wall -Wextra -Wpointer-arith \
  -D_FORTIFY_SOURCE=2 -fPIE \
  mmuniq.c \
  -lm \
  -Wl,-z,now -Wl,-z,relro \
  -o mmuniq
mmuniq.c:1:10: fatal error: 'byteswap.h' file not found
    1 | #include <byteswap.h>
      |          ^~~~~~~~~~~~
1 error generated.
make: *** [mmuniq] Error 1
```
Arguably, this will be slower in most cases, but the reason for the rejection is wasting developer time (not to mention time to test for correctness) re-developing something that is already available in the OS.
However, I've never found a use for it. Apparently it was written for the Version 7 Unix build system to sort libraries for passing to the linker, and it's still used. [1][2] But the few times I've needed a topological sort, it was part of a much larger problem where shell scripting was inappropriate, and implementing it from scratch using a typical sort routine isn't that difficult. Still, I'm waiting for an excuse to use it someday, hopefully in something high visibility so I can blow people's minds.
[1] https://github.com/openbsd/src/blob/17290de/share/mk/bsd.lib... [2] https://github.com/NetBSD/src/blob/7d8184e/share/mk/bsd.lib....
Not to mention that these days people often ask ChatGPT "what's the best way to do this" before proceeding, and whatever you ask in interviews is completely irrelevant.
It is exactly for these reasons we never ask such questions in our interviews. There are much more important aspects of a candidate to evaluate.
It costs me more effort to read and understand a screenful of unfamiliar code than the equivalent "sort -k 1.1" or "uniq" while skimming through a shell script. This adds up.
> It costs me more effort to read and understand ...
You don't need to. LLMs are meant for that. You probably will roll your own script anyway and none of this matters.
You are worrying about things that, in my experience, do not make any noticeable impact on overall productivity.
For example my python interpreter imports my custom List and Path classes and I could just do the following to get the same result:
List(List(Path("filepath").read_text_file().splitlines()).group_by_key(lambda x:x).items()).map(lambda x:(len(x[1]),x[0])).sorted()
and if used often enough, it could be made a utility method:
Path("filepath").read_sorted_by_most_common()
So I find it shortsighted to reject someone based on that without giving them a chance to explain their reasoning.
I think generally people really underestimate how much more productive you can be with a good utility library.
Sure, but you can do that for any functionality in any practical language.
What happens when everybody comes to the job with their own utility library and start working on the same codebase?
Would you like it if you had to get up to speed with several utility libraries your coworkers developed for themselves?
A common set of tools, like the Unix commands, makes it easier for people to collaborate. They were put in an official standard for a reason.
Interviewer: "Welcome to your hammer-stuff interview, hope you're ready to show your hammering skills, we see from your resumé you've been hammering for a while now."
Schmuck: "Yeah, I just love to make my bosses rich by hammering things!"
Interviewer: "Great, let's get right into the hammer use ... here's a screw, show me how you'd hammer that."
Schmuck: (Thinks - "Well, of course, I wouldn't normally hammer those; but I know interviewers like to see weird things hammered! Here goes...")
[Hammering commences]
Interviewer: "Well thanks for flying in, but you've failed the interview. We were very impressed that you demonstrated some of the best hammering we've ever seen. But, of course, wanted to see you use a screwdriver here in your hammering interview at We Hammer All The Things."
What is the definition of wasting developer time? If a developer takes a 2 hours break to recover mental power and avoid burnout, is it considered time wasted?
They are tricky and not very portable. Sorting depends on locales and the GNU tools implementation.
This Rust implementation uses a hashmap; if you have a lot of unique values, you will need a lot of RAM.
So in those settings I think it's absolutely worth it
perl -ne 'print if ! $a{$_}++'
Looking at the code, there are a few things I would consider optimizing. I'd start by trying (my) StringZilla for hashing and sorting.
HashBrown collections under the hood use aHash, which is an excellent hash function, but on both short and long inputs, on new CPUs, StringZilla seems faster [0]:
A similar story with sorting strings. Inner loops of arbitrary-length string comparisons often dominate such workloads. Doing it in a more radix-style fashion can 4x your performance [1]. Bear in mind that "compares/s" is a made-up metric here; in reality, I'm deriving it from the duration.
[0] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...
[1] https://github.com/ashvardanian/StringWars?tab=readme-ov-fil...
This is a strange benchmark [0] -- here is what this random FASTQ looks like:
There are going to be very few [*] repeated strings in this 100M line file, since each >seq.X will be unique and there are roughly a trillion random 4-letter (ACGT) strings of length 20. So this is really assessing the performance of how well a hashtable can deal with reallocating after being overloaded.

I did not have enough RAM to run a 100M line benchmark, but the following simple `awk` command performed ~15x faster on a 10M line benchmark (using the same hyperfine setup) versus the naïve `sort | uniq -c`, which isn't bad for something that comes standard with every *nix system.
[0] https://github.com/noamteyssier/hist-rs/blob/main/justfile
[*] Birthday-problem math says about 250, for 50M strings sampled from a pool of ~1T.
There are definitely other benchmarks we could try to test other characteristics as well.
I've actually just added the `awk` implementation you provided to the benchmarks as well.
Cheers!
Which could be totally useful in itself, but not even close to what "sort" is doing.
Did you run sort with a buffer size larger than the data? Your specialized one-pass program is likely faster, but at least then the numbers would mean something.
That said, I don't see what is over-engineered here. It's pretty straightforward and easy to read.
But you're right - it will be limited by RAM in a way the unix tools are not.
I did add a test with sort using a larger buffer size to the benchmarks as well.
Somebody, implement a `uniq --global` switch already. Put it on your resume; it's a legitimate thing to brag about.
You don't really end up using these results in any specific analysis but it's super helpful for troubleshooting tools or edge-cases.
And then there would be countless arguments about whether you have to count the time it takes to stage the data into the cluster as part of the task completion benchmark...
What I'm really optimizing here is the functional equivalent of `sort | uniq -c | sort -n`
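For reference, a minimal, un-optimized Rust rendering of that pipeline's behaviour (a sketch for comparison, not the hist-rs implementation):

```
// Count each distinct input line, then print "count line" sorted by count
// ascending, mirroring `sort | uniq -c | sort -n`.
use std::collections::HashMap;
use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let mut counts: HashMap<String, u64> = HashMap::new();
    for line in io::stdin().lock().lines() {
        *counts.entry(line?).or_insert(0) += 1;
    }

    // `sort -n` on `uniq -c` output == ascending by count.
    let mut entries: Vec<(String, u64)> = counts.into_iter().collect();
    entries.sort_by_key(|&(_, n)| n);

    let mut out = io::stdout().lock();
    for (line, n) in entries {
        writeln!(out, "{n:>7} {line}")?;
    }
    Ok(())
}
```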
Are there any tools that tolerate slight mismatches across lines while combining them (e.g., a timestamp, or only one text word changing)?
I attempted this with a vector DB, but the embeddings calculation for millions of lines is prohibitive, especially on CPU.
https://github.com/JoshRodd/mll