Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%
Posted 4 months ago · Active 4 months ago
quesma.com · Tech · story · High profile
Key topics: LLM Optimization, Prompt Engineering, AI Benchmarking
The article discusses how rewriting prompts using Claude improved GPT-5-mini's performance by 22% on the Tau² benchmark, sparking discussion on the effectiveness and limitations of this approach.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion: 65 comments. First comment after 53m; peak of 50 comments in the 0-12h window; average of 8.1 comments per period.
Key moments
- Story posted: Sep 17, 2025 at 9:03 AM EDT (4 months ago)
- First comment: Sep 17, 2025 at 9:56 AM EDT (53m after posting)
- Peak activity: 50 comments in the 0-12h window, the hottest stretch of the conversation
- Latest activity: Sep 24, 2025 at 10:44 AM EDT (4 months ago)
ID: 45275354 · Type: story · Last synced: 11/20/2025, 4:11:17 PM
The simplest test would be to make previously “unreachable” tasks succeed through obvious prompt tweaks — like reordering instructions or emphasizing key parts.
That said, my methodology intentionally avoided exposing the model to actual tasks. Instead, I focused on the domain as a whole: refining the instructions so a smaller model could understand and act reliably.
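To make that workflow concrete, here is a minimal sketch of this kind of one-off policy rewrite, assuming the Anthropic Python SDK; the model name, file paths, and rewrite instructions are illustrative placeholders rather than the article's exact setup.

```python
import anthropic

# Instructions for the rewriting model; the wording mirrors the four improvement
# themes mentioned later in the thread but is an illustrative assumption.
REWRITE_INSTRUCTIONS = """\
Rewrite the following agent policy so a smaller model can follow it reliably:
- make structure and flow explicit (clear sections, numbered steps)
- reduce cognitive load (short sentences, one rule per line)
- use actionable, unambiguous language
Do not add, remove, or reinterpret any rule. Return only the rewritten policy."""

def rewrite_policy(policy_text: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: any capable Claude model
        max_tokens=8000,
        messages=[{"role": "user",
                   "content": f"{REWRITE_INSTRUCTIONS}\n\n---\n\n{policy_text}"}],
    )
    return response.content[0].text

# The real telecom policy/manual files are linked just below;
# the paths here are placeholders.
with open("telecom_policy.md") as f:
    rewritten = rewrite_policy(f.read())
with open("telecom_policy_rewritten.md", "w") as f:
    f.write(rewritten)
```

Note that only the domain policy is passed to the rewriting model, never the benchmark tasks themselves, which is the point the author makes about avoiding leakage.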
In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...
Of course these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.
In case someone is not familiar with the framework's methodology, I've written a separate article covering that (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...
1. Structure & Flow
2. AI Agent Optimizations
3. Cognitive Load Reduction
4. Actionable Language

It’s quite amazing because it means programming is fully entering the natural language phase of the timeline.
If you aren’t a solid clear writer, you may not make it in the brave new world.
We already have people praying to the machine gods, so I guess your future is next week?
In fact, according to theory, we're writing executable proofs.
“100 baskets of apples” is easier to hold in your head than “23 baskets of red, small-ish apples, 12 of large red, 6 of any size green...”, but by no means does it permit a clearer view of the Truth.
Different usage of the word "clear".
Not the way most people do it.
Have you not heard of all the AI startups that can turn a 3-word thought into very clearly written prose to be lovingly poured into the waiting mouth of your AI agent?
Soon enough I'm sure we'll start to see programming languages that are geared towards interacting with LLMs.
https://en.wikipedia.org/wiki/Lojban
If the model creators themselves aren't sharing this magic-word bullshittery, then why is anyone spending time on this? It is just going to change with every model release.
Should be available now, although it might take a while for CDN to propagate.
Definitely interesting, thank you!
Is Claude rewriting generic instructions once, or is it rewriting the core task statement each time? If it's the latter, I'm not sure how you prevent information leakage: Claude might easily be "solving" some of the tasks and inserting subtle hints on the approach. I think this result is very interesting if it holds after rewriting only the generic instructions, even if the performance boost is lower.
So no leakage — it wasn’t solving or hinting at any of the specific test cases, since none of the tasks were ever exposed to it.
It definitely makes sense that improving formatting and clarity for these smaller models would really help with performance, but I'm wondering whether GPT-5-mini is already smart enough to handle that reformatting and rewrite the prompt itself before handing it off to another instance of itself.
Overall an awesome article!
Great point. Indeed, my methodology was to treat the prompt refactoring as a one-off task, so I didn't care much about cost/latency.
As for having GPT-5-mini do the rewriting — that’s a really interesting idea. I think the biggest challenge is avoiding cognitive overload. The Tau² agent policies are pretty complex: it’s easy to grasp the overall task, but the detailed rules for each user case aren’t always obvious.
I'm not sure how easy it is to actually overload GPT-5-mini, so that's definitely worth exploring.
https://github.com/mieciu/tau2-bench/pull/1/files
Would be interesting to compare both the benchmark results and the way other models approach the whole refactoring process!
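For readers who want to picture the self-rewrite idea discussed a few comments above, here is a rough sketch in which GPT-5-mini refactors the policy once and a second call acts on the rewritten text. It assumes the OpenAI Python SDK; the prompts and the two-call split are illustrative assumptions, not the article's method.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def self_rewrite(policy_text: str) -> str:
    """One-off refactoring pass performed by the small model itself."""
    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system",
             "content": "You restructure agent policies for clarity. Preserve every "
                        "rule exactly; only improve structure, ordering, and wording."},
            {"role": "user", "content": policy_text},
        ],
    )
    return resp.choices[0].message.content

def agent_turn(rewritten_policy: str, user_message: str) -> str:
    """A single agent turn that uses the refactored policy as the system prompt
    (tool definitions and conversation history omitted for brevity)."""
    resp = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": rewritten_policy},
            {"role": "user", "content": user_message},
        ],
    )
    return resp.choices[0].message.content
```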
That's like being able to see the test before taking it
If a model only performs well once the rules are clarified, that’s still revealing something important about its agency: it’s brittle when policies are ambiguous, but much stronger when they’re structured.
I agree with you that there’s a fine line between genuinely helping the model 'understand' the task and just 'teaching to the test'.
That said, Tau² is framed as a very specific use case — and we showed it can be solved more reliably. At the end of the day, that means we now have an agent built on a cheaper, faster model that still performs its job with higher reliability.
I work at OpenAI and you can partly blame me for our emphasis on Telecom. While we no doubt highlight the evals that make us look good, let me defend why the emphasis on Telecom isn't unprincipled cherry picking.
Telecom was made after Retail and Airline, and fixes some of their problems. In Retail and Airline, the model is graded against a ground truth reference solution. Grading against a reference solution makes grading easier, but has the downside that valid alternative solutions can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why Airline and Retail scores stopped climbing with the latest generations of models and are stuck around 60% / 80%. I'd bet you $100 that a superintelligence would probably plateau around here too, as getting 100% requires perfect guessing of which valid solution is written as the reference solution.
In Telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which may be achieved via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling and some other things too. So Telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% due to brittle grading and other issues.
Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that Telecom is much better than Airline/Retail for measuring tool use.
Incidentally, another thing to keep in mind when critically looking at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, which results in very poor scores. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if your tasks trigger a quirk not present in the eval).
Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982
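To make the grading distinction above concrete, here is a toy contrast between reference-solution matching and outcome-state grading; the data structures are invented for illustration and are not tau2-bench's actual code.

```python
def grade_against_reference(agent_actions: list[dict], reference_actions: list[dict]) -> float:
    """Retail/Airline-style grading: the agent's actions must match one specific
    reference solution, so a different-but-valid solution scores 0."""
    return 1.0 if agent_actions == reference_actions else 0.0

def grade_against_outcome(final_state: dict, expected_state: dict) -> float:
    """Telecom-style grading: only the resulting world state is checked, so any
    sequence of actions that reaches the right outcome gets full credit."""
    return 1.0 if all(final_state.get(k) == v for k, v in expected_state.items()) else 0.0

# Hypothetical example: two action orderings that both restore a customer's line.
reference = [{"tool": "reset_sim", "line": "555-0100"},
             {"tool": "enable_roaming", "line": "555-0100"}]
agent_run = [{"tool": "enable_roaming", "line": "555-0100"},
             {"tool": "reset_sim", "line": "555-0100"}]  # same end state, different order
expected_state = {"line_555_0100": {"active": True, "roaming": True}}
final_state = {"line_555_0100": {"active": True, "roaming": True}}

print(grade_against_reference(agent_run, reference))        # 0.0: valid solution, still penalised
print(grade_against_outcome(final_state, expected_state))   # 1.0: outcome matches, full credit
```

Under outcome-state grading the reordered run still gets full credit, which is the property the comment credits Telecom with.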
Can we get some (hypothetical) examples of ground truths?
For example for the Airline domain, what kind of facts are these ground truth facts? All the airports, the passenger lines between them, etc? Or does it mean detailed knowledge of the airplane manuals for pilots, maintenance, ...?
Prompt changes affect output substantially (just look it up on arXiv); the difficult part is finding an optimal structure that yields the best results. It is a bit expensive to do a lot of testing on your own, so at the moment it all comes down to feel and experience. Then you mix in tool calls, other agent calls, and client functions, and this gets terribly hard to evaluate.
I am still puzzled by how the distance between policies can affect the output, and how a simple retry fixes everything.
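As a minimal sketch of the systematic comparison the comment above calls expensive, one could run each prompt variant over the same task set a few times and compare pass rates; run_task here is a placeholder for whatever actually executes a benchmark task (for example a tau2-bench run).

```python
import statistics
from typing import Callable

def compare_variants(
    variants: dict[str, str],              # variant name -> system prompt text
    task_ids: list[str],                   # tasks to run each variant against
    run_task: Callable[[str, str], bool],  # (prompt, task_id) -> did the task pass?
    trials: int = 4,                       # repeat runs to smooth out nondeterminism
) -> dict[str, float]:
    """Return the mean pass rate for each prompt variant."""
    return {
        name: statistics.mean(
            run_task(prompt, task) for task in task_ids for _ in range(trials)
        )
        for name, prompt in variants.items()
    }
```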
I think this prompt-rewriting technique is the reason "reasoning" models perform well: they know exactly how to rewrite the prompts for a given context.
FWIW, I don't fully trust these benchmarks, because a huge bump like this is unexpected; I would expect OpenAI to have optimised enough not to leave such gaps open.
Into the trash it goes.