Why Most AI Coding Benchmarks Are Misleading (COMPASS Paper)
Posted 4 months ago · Active 4 months ago
Source: arxiv.org · Tech · story
Key topics
- AI
- Benchmarking
- Coding Performance
The COMPASS paper challenges the validity of current AI coding benchmarks by comparing LLM coding performance to a large dataset of human submissions, sparking discussion and inviting feedback from the community.
Snapshot generated from the HN discussion
Discussion Activity
- Light discussion
- First comment: 2m after posting
- Peak period: 1 comment in 0-1h
- Avg per period: 1
Key moments
- 01 Story posted: Sep 19, 2025 at 8:08 AM EDT (4 months ago)
- 02 First comment: Sep 19, 2025 at 8:11 AM EDT (2m after posting)
- 03 Peak activity: 1 comment in 0-1h (hottest window of the conversation)
- 04 Latest activity: Sep 19, 2025 at 3:34 PM EDT (4 months ago)
ID: 45300695 · Type: story · Last synced: 11/20/2025, 8:37:21 PM
If you don't mind me asking a more personal question: I would love to go back to uni for a master's in computer science and hopefully help with papers like this one day. Do you have any advice for someone with industry CS experience (SWE) rather than an academic background on making the leap to the academic side? I genuinely love this kind of work, and I already make a decent living, so it's not about the money.