DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking
Posted 4 months ago · Active 4 months ago
qodo.ai · Tech · story
Tone: calm, mixed
Debate intensity: 40/100
Key topics
Code Understanding
Benchmarking
AI
The post introduces DeepCodeBench, a benchmarking framework for evaluating code understanding capabilities, sparking discussion on its effectiveness and comparison to existing solutions like Codex.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion
First comment: 1h
Peak period: 1 comment in 1-2h
Avg / period: 1
Key moments
- 01 Story posted: Sep 11, 2025 at 5:29 AM EDT (4 months ago)
- 02 First comment: Sep 11, 2025 at 6:45 AM EDT (1h after posting)
- 03 Peak activity: 1 comment in 1-2h (hottest window of the conversation)
- 04 Latest activity: Sep 11, 2025 at 1:52 PM EDT (4 months ago)
ID: 45209532 · Type: story · Last synced: 11/20/2025, 4:38:28 PM
The approach they're taking here, working backwards from an OSS repo pull request and reverse-engineering a question, is unusually well thought out for a benchmark.
I haven't dug into more of the dataset questions yet, but the example they give in the blog post of a question generated from the Hugging Face Transformers repo gives me hope that this could actually be a solid benchmark:
> How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?
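The "work backwards from a PR" idea can be sketched as: take a merged PR's title, description, and diff, and ask a model to produce a question whose answer requires reading the touched code. This is a minimal illustration under my own assumptions; the class and function names here are hypothetical, not DeepCodeBench's actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    """Minimal stand-in for the PR context a generator might see."""
    title: str
    description: str
    touched_files: list = field(default_factory=list)
    diff: str = ""


def build_question_prompt(pr: PullRequest) -> str:
    """Assemble an LLM prompt that reverse-engineers a
    codebase-understanding question from the PR's context."""
    files = "\n".join(f"- {path}" for path in pr.touched_files)
    return (
        "You are generating a codebase-understanding benchmark question.\n"
        f"PR title: {pr.title}\n"
        f"PR description: {pr.description}\n"
        f"Files touched:\n{files}\n"
        f"Diff:\n{pr.diff}\n\n"
        "Write one question whose answer requires reading the code this "
        "PR touches, not just the PR text itself."
    )


# Illustrative PR loosely modeled on the quoted example question.
pr = PullRequest(
    title="Avoid shared mutable defaults in fast processor base classes",
    description="Copy class-level default dicts per instance.",
    touched_files=["src/processing_base.py"],
    diff="- self.config = DEFAULTS\n+ self.config = copy.deepcopy(DEFAULTS)",
)
prompt = build_question_prompt(pr)
print(prompt.splitlines()[0])
```

In a real pipeline the prompt would go to an LLM, and the generated question would be paired with the pre-PR repo snapshot as ground truth context; the sketch only shows the prompt-assembly step.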
It would also be nice if the article clearly stated which model settings were used for Claude Code and Codex. Both allow changing the reasoning level, so if the benchmark was run with the default settings, it seems a little unfair: they report a result for their own agent at high reasoning as a separate entry.