Terminal-Bench 2.0 and Harbor

Posted about 2 months ago | Active about 2 months ago

falcor84

1 point

1 comment

tbench.ai | Tech | story

Key topics

AI Benchmarking

Terminal-Bench

Harbor Tool

Terminal-Bench 2.0 was announced with a new evaluation approach using Harbor, significantly reshuffling the leaderboard and sparking discussion about the implications for assessing AI capabilities.

Snapshot generated from the HN discussion

Discussion Activity

Light discussion

Key moments

01 Story posted: Nov 11, 2025 at 4:59 AM EST (about 2 months ago)
02 First comment: Nov 11, 2025 at 4:59 AM EST (1s after posting)
03 Peak activity: 1 comment in 0-1h (hottest window of the conversation)
04 Latest activity: Nov 11, 2025 at 4:59 AM EST (about 2 months ago)

Discussion (1 comment)

falcor84 (Author)

about 2 months ago

I just saw that Terminal Bench introduced a new evaluation approach based on the new Harbor tool [0] and was surprised to see that it completely reshuffled the leaderboard, with the top 4 places now held by variants of gpt-5, whereas in terminal-bench@1.0 you had to scroll down to the 7th place to see gpt-5.

Does anyone here have any insight on whether this genuinely reflects capabilities better? I'm asking because last I checked, Codex+gpt-5 significantly underperformed Claude Code for my use case.

[0] https://github.com/laude-institute/harbor

ID: 45885723 | Type: story | Last synced: 11/17/2025, 6:00:29 AM