Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
Key topics
A new paper proposes a method to improve the efficiency and cost-effectiveness of Large Language Models (LLMs) by routing queries to the most suitable model, sparking discussion on the potential benefits and limitations of this approach.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 29m after posting
- Peak period: 16 comments in 0-3h
- Avg / period: 3.5
Based on 28 loaded comments
Key moments
- Story posted: Aug 22, 2025 at 10:43 AM EDT (4 months ago)
- First comment: Aug 22, 2025 at 11:12 AM EDT (29m after posting)
- Peak activity: 16 comments in 0-3h, the hottest window of the conversation
- Latest activity: Aug 23, 2025 at 9:49 PM EDT (4 months ago)
I've thought for a while that ensembling approaches would become the next stage of LLM development after CoT, since it provides yet another effective, independent axis for scaling laws. Great to see that perspective is taking off. The open weight community has an opportunity to take these ideas and run with them better than OpenAI has.
And then maybe you could just customize and optimize your own model for local use. Almost like mixing and matching different modules. It would be nice to have a model that only knows and does what you need it to.
My understanding is that GPT5 already does this by varying the quantity of CoT done (in addition to the kind of super-model-level routing described in the post), and I strongly suspect it's only going to get more sophisticated.
This approach is much more efficient than the one in the paper from this HN submission, because request-based routing requires you to recalculate the KV cache from scratch as you switch from model to model.
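To make the KV-cache point concrete, here is a minimal sketch (not from the paper; the cache layout and model names are assumptions): cached key/value tensors are tied to one model's weights, so when the router sends a follow-up turn to a different model, the whole conversation has to be prefilled again.

```python
# Toy sketch: a prefill cache keyed by (model_id, conversation prefix).
# Real serving stacks cache per-token key/value tensors; this only shows that
# the cache is model-specific, so routing a later turn to a different model
# means re-prefilling the entire conversation.
kv_cache: dict[tuple[str, str], object] = {}

def prefill(model_id: str, prefix: str) -> None:
    key = (model_id, prefix)
    if key in kv_cache:
        print(f"{model_id}: reusing cached prefill")
    else:
        print(f"{model_id}: recomputing prefill from scratch")
        kv_cache[key] = object()  # stand-in for the real KV tensors

history = "user: draft a travel plan\nassistant: ...\nuser: now add a budget"
prefill("small-model", history)  # first turn on the small model: compute
prefill("small-model", history)  # same model, same prefix: reuse
prefill("large-model", history)  # router switched models: compute again
```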
Yeah, the signals they get will improve things over time. You can do a lot of heavy lifting with embedding models nowadays, get "satisfaction" signals from chats, and adjust your router based on those. It will be weird at first, and some people will complain, but at the end of the day, you don't need IMO-gold levels of thinking to write a fitness plan that the user most likely won't even follow :)
Signal gathering is likely the driver of most of the subsidised model offerings we see today.
Also the paper has some pie chart crimes on page 6.
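As a rough illustration of the embedding-plus-feedback routing idea mentioned above, here is a hedged sketch; the `embed` placeholder, model names, centroid set, and the threshold-nudging rule are all assumptions for illustration, not anything from the paper.

```python
# Sketch: route by embedding similarity to queries the cheap model handled well,
# and nudge the routing threshold from post-hoc "satisfaction" signals.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder for a real embedding model (e.g. a sentence-transformer).
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

# Centroids of queries the cheap model has handled well so far (assumed precomputed).
cheap_ok_centroids = [embed("write me a fitness plan"), embed("summarize this email")]
threshold = 0.3  # cosine-similarity cutoff for sending a query to the cheap model

def route(query: str) -> str:
    q = embed(query)
    sim = max(float(q @ c) for c in cheap_ok_centroids)
    return "cheap-model" if sim >= threshold else "expensive-model"

def record_feedback(query: str, model: str, satisfied: bool) -> None:
    """Adjust routing from thumbs-up/down style signals (toy update rule)."""
    global threshold
    if model == "cheap-model" and not satisfied:
        threshold += 0.05   # be more conservative about the cheap model
    elif model == "expensive-model" and satisfied:
        threshold -= 0.01   # slowly allow more traffic onto the cheap model

choice = record = route("make me a weekly workout schedule")
record_feedback("make me a weekly workout schedule", choice, satisfied=True)
```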
I'm sure that has been documented/tried before, and this almost certainly doesn't work in practice. The typical counter-example is a simple-sounding query that actually requires complex reasoning: because the query is close in the embedding space to other simple-sounding queries, it gets sent to a "dumber" model for efficiency.
I guess in their benchmarks that works out because, from what it sounds like, they do per-dataset clustering, so the embedding clusters may actually be able to capture "complexity levels". However, if you were to mix all datasets into one (as you would for most real-world use cases) and cluster against that, this approach would surely break down.
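A toy sketch of that failure mode, using hand-crafted 2-D vectors in place of real embeddings (the clusters, tiers, and queries are made up, not taken from the paper): a query that sits near "easy" queries in embedding space gets the cheap tier even though answering it well requires more reasoning.

```python
# Nearest-centroid routing: each cluster is mapped to a model tier offline.
import numpy as np

centroids = {
    "casual-chitchat": (np.array([0.9, 0.1]), "cheap-model"),
    "math-proofs":     (np.array([0.1, 0.9]), "expensive-model"),
}

def route(query_embedding: np.ndarray) -> str:
    cluster_name, (_, tier) = min(
        centroids.items(),
        key=lambda kv: float(np.linalg.norm(query_embedding - kv[1][0])),
    )
    print(f"nearest cluster: {cluster_name} -> {tier}")
    return tier

easy_query   = np.array([0.85, 0.15])  # "what's a fun fact about cats?"
tricky_query = np.array([0.80, 0.20])  # sounds like chitchat, actually needs careful reasoning

route(easy_query)    # cheap-model (fine)
route(tricky_query)  # cheap-model (misrouted: complexity isn't visible in the embedding)
```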
In the end, most of those principles are not part of the LLM itself but of the API design in front of it. I understand the goal is to abstract this fact away in order to sell more magic.