New Paper: A Single Character Can Make or Break Your LLM Evals
Posted 3 months ago
Key topics
LLM Evaluation
AI Research
Natural Language Processing
A new research paper examines how a single character can significantly impact LLM evaluations, sparking discussion on the nuances of AI assessment.
Snapshot generated from the HN discussion
Discussion Activity
- Light discussion
- Peak period: 1 comment (at start)
- Average per period: 1
Key moments
- Story posted: Oct 10, 2025 at 9:09 AM EDT (3 months ago)
- First comment: Oct 10, 2025 at 9:09 AM EDT (0s after posting)
- Peak activity: 1 comment, in the opening window (the hottest stretch of the conversation)
- Latest activity: Oct 10, 2025 at 9:09 AM EDT
ID: 45538593. Type: story. Last synced: 11/17/2025, 11:13:31 AM
Want the full context?
Read the primary article or dive into the live Hacker News thread when you're ready.
- MMLU performance varies by up to ±23% depending on the choice of delimiter across leading open model families (Llama, Qwen, and Gemma).
- Closed models such as GPT-4o are also brittle to the choice of delimiter.
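To make the finding concrete, here is a minimal sketch of what "a single character" means in practice: two MMLU-style prompts with identical content that differ only in the separator character between the question and its answer options. The function and delimiter names below are illustrative assumptions, not taken from the paper or any specific evaluation harness.

```python
# Hypothetical prompt-construction sketch. Swapping the delimiter is the
# kind of one-character change the paper reports can shift benchmark scores.
DELIMITERS = {
    "newline": "\n",
    "space": " ",
    "tab": "\t",
}

def build_prompt(question: str, choices: list[str], delimiter: str) -> str:
    """Assemble an MMLU-style multiple-choice prompt where `delimiter`
    is the only thing separating the question, options, and answer cue."""
    labeled = [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    return question + delimiter + delimiter.join(labeled) + delimiter + "Answer:"

question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris", "Rome"]

prompts = {name: build_prompt(question, choices, d) for name, d in DELIMITERS.items()}
# Every variant carries the same question and options; the prompts differ
# only in the single separator character fed to the model.
```

Replaying the same benchmark with each variant and comparing accuracies is, in essence, the experiment the paper describes across the Llama, Qwen, and Gemma families.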