CompileBench: Can AI Compile 22-Year-Old Code?
Posted 3 months ago · Active 3 months ago
quesma.com · Tech · story · High profile
Key topics
Artificial Intelligence
Compilation
Legacy Code
The CompileBench project tests AI's ability to compile 22-year-old code, sparking discussion on AI's potential in handling legacy codebases and complex compilation tasks.
Snapshot generated from the HN discussion
Discussion Activity
- Very active discussion
- First comment: 27m after posting
- Peak period: 49 comments in 0-12h
- Average per period: 10.8
- Comment distribution: 65 data points (based on 65 loaded comments)
Key moments
1. Story posted: Sep 22, 2025 at 8:59 AM EDT (3 months ago)
2. First comment: Sep 22, 2025 at 9:26 AM EDT (27m after posting)
3. Peak activity: 49 comments in 0-12h (hottest window of the conversation)
4. Latest activity: Sep 28, 2025 at 12:51 AM EDT (3 months ago)
ID: 45332814 · Type: story · Last synced: 11/20/2025, 5:39:21 PM
So far in this benchmark we based the tasks on a couple of open-source projects (like curl, jq, GNU Coreutils).
Even on those "simple" projects we managed to make the tasks difficult: Claude Opus 4.1 was the only model to correctly cross-compile curl for arm64 (and make it statically linked) [1].
In the future we'd like to test with projects like FFmpeg or Chromium - those should be much more difficult.
[1] https://www.compilebench.com/curl-ssl-arm64-static/
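For readers curious what that task involves, a recipe along these lines is sketched below as a shell function. Everything here is an assumption for illustration (the toolchain name, the configure flags) and not CompileBench's or Opus's actual solution; in practice, statically linking against OpenSSL usually needs more flags than shown.

```shell
# Hypothetical sketch: cross-compile curl for arm64 as a static binary.
# Assumes an aarch64-linux-gnu cross toolchain and a curl source tarball
# in the current directory; wrapped in a function so nothing runs until
# you actually call it.
build_curl_arm64_static() {
    tar xf curl-*.tar.gz && cd curl-*/ || return 1
    ./configure --host=aarch64-linux-gnu \
                --enable-static --disable-shared \
                --with-openssl \
                CC=aarch64-linux-gnu-gcc \
                LDFLAGS="-static"
    make -j"$(nproc)"
    # sanity check: expect "ELF 64-bit ... ARM aarch64, statically linked"
    file src/curl
}
```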
That's not the (only) problem: even if you take the internet away, we know/assume that all LLMs are heavily trained on public GitHub repositories. They therefore know/remember details of the code and its organization in a way they can't for your private (or new, post-knowledge-cutoff) code.
Come to think of it, in terms of trying to get old code building, the CVS days of Firefox should be interesting... because the first command in that build step is "download the source code" and that CVS server isn't running anymore. And some of the components are downloaded from a CVS tag rather than trunk, and the converted CVS repositories I'm aware of all only converted the trunk and none of the branches or tags.
The entire coreutils task is reduced to one utility (sha1sum), and the test doesn't even try to feed it a real file (just a stdin string) [0]. The same goes for the jq task: there isn't even a JSON file fed to it; what's being verified [1] is barely a calculator.
These projects ship with "make check", please tell the AI to use it.
[0] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...
[1] https://github.com/QuesmaOrg/CompileBench/blob/86d9aeda88a16...
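In that spirit, a minimal "real file" check is easy to script. The snippet below (my own illustration, not part of the benchmark) hashes an actual file and cross-checks it against the stdin path:

```shell
# Feed sha1sum a real file, not just a stdin string, and cross-check
# the file path against the stdin path.
tmp=$(mktemp)
printf 'hello compilebench\n' > "$tmp"
from_file=$(sha1sum "$tmp" | awk '{print $1}')
from_stdin=$(sha1sum < "$tmp" | awk '{print $1}')
rm -f "$tmp"
[ "$from_file" = "$from_stdin" ] && echo "digests match: $from_file"
# For real coverage, run the projects' own suites, e.g. `make check`
# in a coreutils or jq build tree.
```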
I found that "just" there so funny in terms of how far the goalposts have moved over these last few years (as TFA does mention). I personally am certain it would have taken me significantly longer than that to do it myself.
And here's me, after 4 straight days of wrangling an obscure cross-compilation toolchain to resurrect some ill-fated piece of software from year 2011 in a modern embedded environment.
Of course, I will probably do this with OpenAI's option, not $20 of Anthropic API credits.
It does seem the machine is faster than me, since I would have to spend a minute copying each of the --disable-whatever flags for curl.
It's somewhat cool to see a computer do the same half-assed process I do: seeing what linker failures happen and googling for the missing lib flag.
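That trial-and-error loop can even be half-automated. Here's a toy sketch: scrape "undefined reference" symbols out of linker output and map them to candidate -l flags. The function name and the symbol-to-library table are purely illustrative; a real table would be far longer.

```shell
# Toy helper: read linker stderr on stdin, print candidate -l flags.
# The case table below is a tiny illustrative sample, nothing more.
guess_missing_libs() {
    grep 'undefined reference' | tr -d "\`'" | awk '{print $NF}' | sort -u \
    | while read -r sym; do
          case "$sym" in
              SSL_*)           echo "-lssl -lcrypto" ;;
              deflate|inflate) echo "-lz" ;;
              *)               echo "# unknown symbol: $sym - search the web" ;;
          esac
      done
}
```

Usage: `make 2>&1 | guess_missing_libs`, then retry the link with the suggested flags appended.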
I did this upgrade a few times, and it works like a charm for simple stuff (e.g. removing requirements.txt and adding a proper pyproject.toml).
Even in Claude Code, it takes some prompting and a CLAUDE.md so that it consistently runs uv, rather than sometimes using `uv run python`, other times `python3 -m`, and being surprised that some dependency is not available.
But just a copy-paste of a piece of a project of mine
In dire cases, use the PreToolUse hook to inspect/intercept (though it's usually not necessary).
Granted, I haven't tried it for huge projects yet, but after doing that my small-medium sized projects all got ported nicely.
(If you must change the prompt, mention PEP723 as well, it seems to have the same effect as showing a shiny trinket to a magpie ;)
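For anyone unfamiliar with it, PEP 723 lets a script carry its dependency metadata inline, which is what makes the `uv run` workflow work without any pyproject.toml. A minimal example (the filename and contents are made up for illustration):

```shell
# Write a script with inline PEP 723 metadata, then run it.
cat > demo.py <<'EOF'
# /// script
# requires-python = ">=3.9"
# dependencies = []
# ///
import sys
print(f"running on {sys.version_info.major}.{sys.version_info.minor}")
EOF
python3 demo.py     # the metadata is plain comments, so plain python works too
# with uv installed: uv run demo.py   (uv reads the /// script block itself)
```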
Asking Claude Code to build it - literally prompting it "fix whatever needs to be fixed until you get the binary to run" - and waiting ~20 minutes was the best investment of non-time I could do... It definitely felt magical. Claude would tweak headers, `make` it, try to run it, and apply more fixes based on the errors it got back.
Now that I think of it, I regret not opening an issue/PR with its findings...!
(((I then went on to make more vibe-changes to the Doom code and made a video out of those which went semi-viral, which I will now unashamedly plug [1])))
[0] https://github.com/chocolate-doom/chocolate-doom
[1] https://www.youtube.com/watch?v=LcnBXtttF28
I'm also celebrating (although I forgot to do this - my bad!) that this automated discovery (i.e. of how to fix the build system for machines such as mine) could have been brought back to the Chocolate Doom community, and made the software better for everyone.
And finally, I'm also celebrating that this allowed my (if I may speak so boldly) creativity to express itself by helping me quickly bring a funny idea to life and share it, hopefully entertaining the world/making at least one person laugh/chuckle.
I don't see how any of this makes me redundant though. Efficient? Lazy? Both? Neither? But not redundant. I think! :-)
Also, you would own a chisel and the chisel does not spy on you. The "AI" factories are owned by oligopolies and you have to pay a steep monthly fee in order to continue receiving your warez that are derivative works of actually creative people's IP. Also, the "AI" factories know everything you do and ask and what kind of code you write.
Plus, as other commenters have pointed out already, you can run this stuff entirely free from risk of an AI company spying on what you are doing. The models that run locally got really good in the past 12 months, and if they don't work on your own machine you can rent a capable cloud GPU machine for a few bucks an hour.
I feel like there's a real metaphor here. 86+ people did work over two decades to maintain a cross-platform codebase and that "definitely deserves to be commended", but what "definitely felt magical" was Claude bumbling through header tweaks from compilation errors until the project compiled. And in the end what has AI wrought? A viral video but not anything to give back to the original project. Really there are multiple layers here :)
The point was to get it running, not solve world peace. Without AI, the problem might not have been tackled at all.
$ git clone --depth=1 https://github.com/chocolate-doom/chocolate-doom
$ cd c*doom; ls
Ok, there is a CMakeLists.txt, so it's probably a CMake project, so:
$ cmake .
Ok, that seems to work, but three libraries are missing, SDL2_mixer, SDL2_net and FluidSynth, so let's install them:
$ sudo apt install libsdl2-mixer-dev libsdl2-net-dev libfluidsynth-dev
Let's try again:
$ cmake .
Works, so now for compiling:
$ cmake --build . -j $(nproc)
Build completed in a few seconds first try.
https://buildd.debian.org/status/package.php?p=chocolate-doo...
I've mentioned this before, but "sufficiently smart compiler" would be the dream here. Start with high level code or pseudo code, end up with something optimized.
Then, I'd start to trust in its ability to manage context and reliably work through complex tasks.
You are saying that you'd trust the new and unproven technology more if it didn't rely on old and proven technology and instead reinvented everything from scratch. That's a somewhat illogical take.
But by "correct", I meant that it would need to work through such multi-level tasks as a compiler does - semantic analysis, error checking, optimization, and code generation - to reliably translate the source code, not just emit lorem ipsum executables.
It's about as useful as requiring your engineers to forge computers from sand on upwards.
Now if it could fix React Native builds after package upgrades I'd be impressed...
The newer blog posts appear to scan forums like this one for objections ("AI" does not work for legacy code bases) and then create custom "benchmarks" for their sales people to point to if they encounter these objections.