DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking
Posted 4 months ago · Active 4 months ago
qodo.ai · Tech · story
Tone: calm, mixed
Debate intensity: 40/100
Key topics
Code Understanding
Benchmarking
AI
The post introduces DeepCodeBench, a benchmarking framework for evaluating code understanding capabilities, sparking discussion on its effectiveness and comparison to existing solutions like Codex.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 1h
Peak period: 1 comment in 1-2h
Avg / period: 1
Key moments
- 01 Story posted: Sep 11, 2025 at 5:29 AM EDT (4 months ago)
- 02 First comment: Sep 11, 2025 at 6:45 AM EDT (1h after posting)
- 03 Peak activity: 1 comment in 1-2h (hottest window of the conversation)
- 04 Latest activity: Sep 11, 2025 at 1:52 PM EDT (4 months ago)
ID: 45209532 · Type: story · Last synced: 11/20/2025, 4:38:28 PM
The approach they're taking here, working backwards from an OSS repo pull request and reverse-engineering a question, is unusually well thought out for a benchmark.
I haven't dug into more of the dataset questions yet, but the example they give in the blog post of a question generated from the Hugging Face Transformers repo gives me hope that this could actually be a solid benchmark:
> How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?
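The "work backwards from a PR" idea can be sketched as: take a merged PR's title, description, and diff, and ask a model to produce a question whose answer requires reading the touched code. This is a minimal illustration under my own assumptions; the class and function names here are hypothetical, not DeepCodeBench's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Minimal stand-in for the PR context a generator might see."""
    title: str
    description: str
    touched_files: list = field(default_factory=list)
    diff: str = ""


def build_question_prompt(pr: PullRequest) -> str:
    """Assemble an LLM prompt that reverse-engineers a
    codebase-understanding question from the PR's context."""
    files = "\n".join(f"- {path}" for path in pr.touched_files)
    return (
        "You are generating a codebase-understanding benchmark question.\n"
        f"PR title: {pr.title}\n"
        f"PR description: {pr.description}\n"
        f"Files touched:\n{files}\n"
        f"Diff:\n{pr.diff}\n\n"
        "Write one question whose answer requires reading the code this "
        "PR touches, not just the PR text itself."
    )


# Illustrative PR loosely modeled on the quoted example question.
pr = PullRequest(
    title="Avoid shared mutable defaults in fast processor base classes",
    description="Copy class-level default dicts per instance.",
    touched_files=["src/processing_base.py"],
    diff="- self.config = DEFAULTS\n+ self.config = copy.deepcopy(DEFAULTS)",
)
prompt = build_question_prompt(pr)
print(prompt.splitlines()[0])
```

In a real pipeline the prompt would go to an LLM, and the generated question would be paired with the pre-PR repo snapshot as ground truth context; the sketch only shows the prompt-assembly step.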
It would also be nice if the article clearly stated which model settings were used for Claude Code and Codex. Both allow changing the reasoning level, so if the benchmark was run with the default settings, it seems a little unfair: they report a result for their own agent at high reasoning as a separate entry.