Deep Dive Into G-Eval: How LLMs Evaluate Themselves
Posted 2 months ago · Active 2 months ago
medium.com · Tech · story
Tone: calm, mixed
Debate: 40/100
Key topics
- Large Language Models
- G-Eval
- AI Evaluation
The article explores G-Eval, a method for LLMs to evaluate themselves, sparking discussion on its stability and practical usefulness across different models and runs.
Snapshot generated from the HN discussion
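For context on what the thread is debating, here is a rough Python sketch of the G-Eval recipe: the judge model receives the task description, an evaluation criterion, auto-generated chain-of-thought evaluation steps, and the candidate output, then returns a score on a fixed scale; the paper weights scores by token probabilities, which repeated sampling and averaging loosely approximates. Everything here is illustrative: `call_llm` is a hypothetical stand-in for your provider's client, and the criterion and prompt wording are assumptions, not the paper's exact prompts.

```python
# Minimal sketch of a G-Eval-style judge. `call_llm` is a hypothetical
# stand-in for a chat-completion client; the criterion, scale, and
# prompt wording are illustrative assumptions, not the paper's prompts.
import re
import statistics

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your LLM provider's completion API."""
    raise NotImplementedError("plug in your provider client here")

CRITERION = "Coherence (1-5): the summary should be well-structured and well-organized."

def build_prompt(source: str, summary: str) -> str:
    # G-Eval gives the judge the task, the criterion, auto-generated
    # chain-of-thought evaluation steps, and the inputs, then asks for
    # a numeric score on a fixed scale.
    return (
        "You will be given a source document and a summary.\n"
        f"Evaluation criterion:\n{CRITERION}\n\n"
        "Evaluation steps:\n"
        "1. Read the source document carefully.\n"
        "2. Compare the summary to the source.\n"
        "3. Assign a coherence score from 1 to 5.\n\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\nScore (1-5):"
    )

def g_eval_score(source: str, summary: str, n_samples: int = 20) -> float:
    # The paper weights each possible score by its token probability;
    # sampling the judge n times and averaging approximates that.
    prompt = build_prompt(source, summary)
    scores = []
    for _ in range(n_samples):
        match = re.search(r"[1-5]", call_llm(prompt))
        if match:
            scores.append(int(match.group()))
    if not scores:
        raise ValueError("judge returned no parseable scores")
    return statistics.mean(scores)
```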
Discussion Activity
Light discussion
- First comment: 17m
- Peak period: 5 comments in 0-1h
- Avg / period: 3
Key moments
- Story posted: Nov 5, 2025 at 10:28 AM EST (2 months ago)
- First comment: Nov 5, 2025 at 10:46 AM EST (17m after posting)
- Peak activity: 5 comments in 0-1h (hottest window of the conversation)
- Latest activity: Nov 5, 2025 at 6:59 PM EST (2 months ago)
Discussion (6 comments)
eeasss
2 months ago
2 replies
Are there any LLMs in particular that work best with G-Eval?
zlatkov (Author)
2 months ago
I haven’t come across any research showing that a specific LLM consistently outperforms others for this. It generally works best with strong reasoning models that produce consistent outputs.
lyuata
2 months ago
An LLM benchmark leaderboard for common evals sounds like a fun idea to me.
kirchoni
2 months ago
1 reply
Interesting overview, though I still wonder how stable G-Eval really is across different model families. Auto-CoT helps with consistency, but I’ve seen drift even between API versions of the same model.
zlatkov (Author)
2 months ago
That's true. Even small API or model version updates can shift evaluation behavior. G-Eval helps reduce that variance, but it doesn’t eliminate it completely. I think long-term stability will probably require some combination of fixed reference models and calibration datasets.
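One way to act on the calibration idea mentioned above would be to keep a small fixed set of inputs whose scores were recorded on a known model version, re-score them after any model or API update, and flag the evaluator when the scores shift too far. A minimal sketch, assuming the hypothetical `g_eval_score` from the earlier snippet; the records and tolerance are invented for illustration:

```python
# Sketch of drift detection against a fixed calibration set, reusing the
# hypothetical g_eval_score from the earlier snippet. The records and
# tolerance below are invented values for illustration.
CALIBRATION_SET = [
    # (source, summary, score recorded on the previous model version)
    ("source text A ...", "summary A ...", 4.2),
    ("source text B ...", "summary B ...", 2.7),
]
TOLERANCE = 0.3  # acceptable mean absolute score shift

def evaluator_has_drifted() -> bool:
    """Re-score the calibration set and compare against recorded baselines."""
    shifts = [
        abs(g_eval_score(src, summ) - baseline)
        for src, summ, baseline in CALIBRATION_SET
    ]
    return sum(shifts) / len(shifts) > TOLERANCE
```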
sirlapogkahn
2 months ago
We’ve tried G-Eval, but it hasn’t been super useful in practice. If we run the same input through the same model and the same G-Eval prompt 10 times, we get significantly different results, so you can’t really draw any conclusions from them.
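The run-to-run instability described here is straightforward to measure before trusting any single score: score one fixed input repeatedly and look at the spread. A small sketch under the same assumptions as the earlier snippets (hypothetical `g_eval_score`); if the standard deviation rivals the score differences you care about, single runs are noise:

```python
# Sketch: quantify run-to-run spread of the judge on one fixed input,
# using the hypothetical g_eval_score from the earlier snippet.
import statistics

def run_to_run_spread(source: str, summary: str, repeats: int = 10):
    scores = [g_eval_score(source, summary) for _ in range(repeats)]
    return statistics.mean(scores), statistics.stdev(scores)

# If the stdev rivals the score gaps you want to detect between systems,
# conclusions drawn from a single run are noise, not signal.
```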
View full discussion on Hacker News
ID: 45823885 · Type: story · Last synced: 11/20/2025, 2:33:22 PM