Ask HN: Imagine coding LLMs 1M times faster; what uses might there be?
More perf means more attempts in parallel, with some sort of arbiter model deciding which one to pick. This can happen at the token, prompt, or agent level, or all three at once.
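Roughly, I'm picturing something like this toy best-of-N sketch: fire off N completions concurrently and let a second arbiter call pick one. llm_complete and arbiter_pick are hypothetical stand-ins for whatever completion API you'd actually call, not real library functions.

    import asyncio
    import random

    # Hypothetical placeholder for a real completion API call.
    async def llm_complete(prompt: str, temperature: float) -> str:
        await asyncio.sleep(0)  # stand-in for network/model latency
        return f"candidate (temp={temperature:.1f}, seed={random.randint(0, 999)})"

    # Hypothetical placeholder arbiter/judge; in practice this would be another
    # model call that scores or ranks the candidates against the original prompt.
    async def arbiter_pick(prompt: str, candidates: list[str]) -> int:
        await asyncio.sleep(0)
        return random.randrange(len(candidates))

    async def best_of_n(prompt: str, n: int = 8) -> str:
        # Fire off n attempts in parallel; vastly cheaper/faster tokens are
        # what make a large n practical in the first place.
        attempts = await asyncio.gather(
            *(llm_complete(prompt, temperature=0.9) for _ in range(n))
        )
        winner = await arbiter_pick(prompt, attempts)
        return attempts[winner]

    if __name__ == "__main__":
        print(asyncio.run(best_of_n("Refactor this function to be thread-safe: ...")))

The same shape nests: run several of these agents in parallel and put another arbiter on top of them.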
I ask it another question later, or maybe troubleshoot something. It tells me B is causing the problem and that I should do A instead.
However, for dynamic, on-the-fly search queries and that sort of thing, it would be a game changer.
I’d love to have millions of one-off AI-generated code snippets that auto-update things quickly, for personalization at scale.
You could instead use Cerebras to get 3,000 tps, but going by the usage stats, people really don't do that. It's 100x the speed, yet it's not like everyone is rushing to their service.
The speed of LLMs since the advent of MoE has hit a good spot. What we need now is smarter models.
For the moment, though, I'd take a "smarter" but slower model over one that's X times faster. The current models are plenty fast already; they pump out 300-500 LoC files in seconds to minutes. That's plenty of speed.
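Back-of-envelope, assuming roughly 10 output tokens per line of code: a 400-line file is on the order of 4,000 tokens, so at ~100 tokens/s it lands around 40 seconds, which matches that seconds-to-minutes ballpark.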