Adaptive-Learning Speculator System (atlas): Faster LLM Inference
Posted 3 months ago · Active 3 months ago
together.ai · Tech · story
Key topics
LLM Inference
Speculative Decoding
AI Optimization
Together.ai introduces ATLAS, a system that accelerates LLM inference by up to 2.65x using adaptive speculative decoding, sparking discussion on its implications for performance, quality, and pricing.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 3h after posting
Peak period: 19 comments in the 3-6h window
Average per period: 6.7 comments
Comment distribution: 47 data points (based on 47 loaded comments)
Key moments
- 01 Story posted: Oct 12, 2025 at 4:37 AM EDT (3 months ago)
- 02 First comment: Oct 12, 2025 at 7:13 AM EDT (3h after posting)
- 03 Peak activity: 19 comments in the 3-6h window, the hottest stretch of the conversation
- 04 Latest activity: Oct 13, 2025 at 6:14 PM EDT (3 months ago)
ID: 45556474 · Type: story · Last synced: 11/20/2025, 6:24:41 PM
Want the full context?
Read the primary article or dive into the live Hacker News thread when you're ready.
and yet, if you click on: https://openrouter.ai/moonshotai/kimi-k2-0905
You'll see Groq averaging 1,086 tps vs Together doing 59 tps. Groq and Cerebras often feel like the only games in town. I'd love that to be different (because I'd like more models!), but nobody else is coming close right now.
Comparing how quickly gpt-oss-120b runs gives a broader picture: https://openrouter.ai/openai/gpt-oss-120b -- Vertex (Google) and SambaNova do pretty good on it too, but still, the difference between a top provider and an also-ran is giant.
God I love OpenRouter.
What I don't understand is that Groq themselves report 200 tps for the same model: https://console.groq.com/docs/model/moonshotai/kimi-k2-instr...
OpenRouter numbers look fishy.
Not just custom chips, but custom chips which derive much of their performance from enormous amounts of SRAM. There's no denying that approach is fast, but it's also incredibly expensive, and SRAM scaling has slowed to a crawl so it won't get much cheaper any time soon.
AFAIU, speculative decoding (and this fancier version of spec. decoding) does not reduce accuracy.
[0] https://openrouter.ai/moonshotai/kimi-k2-0905/performance
SambaNova should be similar... they've also got a specialized-hardware approach.
I'm currently on the Cerebras Code subscription for like 50 USD a month because it more or less makes the rate limits I used to deal with on other platforms disappear (without making me spend upwards of 100 USD paying per token): https://www.cerebras.ai/blog/introducing-cerebras-code
At the same time, their Qwen Coder 480B model is fine, but I still find myself going for Claude, GPT-5, or Gemini 2.5 Pro for more complex issues (or ones where I need good Latvian), so at least for programming tasks it'd be super cool if they eventually offered more models.
Or had some sort of partnership with Anthropic or whoever, because getting my questions answered at around 500-1500 TPS is really, really pleasant, especially for agentic use cases with code modifications, even if I still bump into the 128k context limit occasionally.
IMO this likely is what you get from running the model correctly as-is (i.e. using the same weight and activation dtype), so Together is not bad.
Moonshot AI themselves and Groq likely use some sampler tricks to eliminate schema validation errors.
So really the only thing this shows is that Nebius, Chutes, and AtlasCloud could be running something else (for example a further-quantized model), or have bugs.
Anyway, Novita is doing significantly better than Together on the vendor-verifier chart, so the low quality must be at least partially Together's fault.
I don't see why that would be true. As I understand, the verifier is checking if the tokens are good-enough, not if they're the exact same tokens it would have selected. The predicted tokens could be consistently slightly worse, which could have a cascading effect to make the overall output a lot worse.
The TLDR/key (from my understanding) is that verifying N tokens can be faster than generating N tokens.
Yes. This is because to generate token n+1 you need token n, and so on, so generating from scratch is a sequential (and thus slow) process. When we verify tokens, we can, for each token, use all preceding tokens as input and check that the output token matches the expectation. Since the full sequence we want to verify already exists, we can run that check for every token in parallel rather than sequentially.
This is why training transformer models is much faster than training RNNs: we do the same thing during training, it's just that the sequence we compare against is the ground truth rather than the output of another model.
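A minimal toy sketch of that point, with a made-up causal "model" standing in for a real transformer (none of this is Together's implementation): verifying a drafted continuation takes one forward pass, because every position's prediction can be checked against the next drafted token at once, while plain generation needs one pass per token.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 100, 16
E = rng.normal(size=(VOCAB, DIM))          # toy embedding table
W = rng.normal(size=(DIM, VOCAB))          # toy output head

def forward(tokens):
    """Toy causal 'model': returns next-token logits for every position.
    Like a transformer, position i only sees tokens[0..i]."""
    hidden = np.cumsum(E[tokens], axis=0) / np.arange(1, len(tokens) + 1)[:, None]
    return hidden @ W                      # shape: (len(tokens), VOCAB)

prompt = [1, 2, 3]
draft = [7, 42, 5, 9]                      # tokens proposed by a small draft model

# --- Verification: ONE forward pass over prompt + draft ---
logits = forward(prompt + draft)
# logits[i] predicts token i+1, so compare against the drafted tokens.
target_preds = logits[len(prompt) - 1 : len(prompt) + len(draft) - 1].argmax(axis=-1)
accepted = 0
for pred, drafted in zip(target_preds, draft):
    if pred != drafted:
        break                              # everything after the first mismatch is discarded
    accepted += 1
print(f"accepted {accepted}/{len(draft)} draft tokens in a single target-model pass")

# --- Generation: one forward pass PER token (sequential) ---
seq = list(prompt)
for _ in range(len(draft)):
    seq.append(int(forward(seq)[-1].argmax()))
```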
That's up to you; it depends on how you implement it and how much you want to prioritize speed at the expense of quality, and it's not an intrinsic attribute of speculative decoding. The verifier checks whether the tokens predicted by the draft model are among the top-k tokens predicted by the full-size model at each step. Set k to 1 and you will only accept exact matches. Set k to > 1 and you will indeed start accepting "good enough" tokens, but you will get faster inference.
But whatever value you choose for k, the technique described in the article applies and will give faster inference at no additional loss compared to a setup without it at the same value of k.
You can do exact verification, and as soon as a token mismatches you reject everything after that token from your draft. Relaxed acceptance techniques measure how wrong that mispredicted token is via some metric, and accept it if it’s close enough. So you get longer draft lengths with higher acceptance rates.
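A small sketch of the acceptance rule the two comments above describe, with the top-k relaxation as one possible "close enough" metric (function names and the example logits are illustrative, not ATLAS's actual rule):

```python
import numpy as np

def accept_draft(target_logits, draft_tokens, k=1):
    """Accept drafted tokens until the first one the target model rejects.

    target_logits[i] are the target model's logits at the position of
    draft_tokens[i] (all obtained in one parallel verification pass).
    k=1  -> exact/greedy matching: the token must be the target's argmax.
    k>1  -> relaxed acceptance: the token only has to land in the target's top-k.
    Everything after the first rejected token is thrown away.
    """
    accepted = []
    for logits, tok in zip(target_logits, draft_tokens):
        topk = np.argsort(logits)[-k:]          # indices of the k highest logits
        if tok not in topk:
            break                               # reject this token and the rest of the draft
        accepted.append(tok)
    return accepted

# Tiny illustration with made-up logits over a 5-token vocabulary.
logits = np.array([
    [0.1, 2.0, 0.3, 0.0, 0.1],   # target strongly prefers token 1
    [1.5, 1.4, 0.1, 0.0, 0.2],   # tokens 0 and 1 are both plausible
    [0.0, 0.1, 0.2, 3.0, 0.1],   # target strongly prefers token 3
])
draft = [1, 1, 0]
print(accept_draft(logits, draft, k=1))  # [1]    -> strict: position 1 is a mismatch
print(accept_draft(logits, draft, k=2))  # [1, 1] -> relaxed: token 1 is in the top-2; position 2 still rejected
```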
That said, I still think some providers are cheating. Please correct me if the test below is flawed.
I generated texts at temperature = 0 vs temperature = 2. At high temperature, the distributions effectively become flatter, meaning the difference between real and draft effective distributions (the D_LK used in theorem 3.5 of 2211.17192) becomes smaller. When T=2, the model speaks complete gibberish, so the effective distribution must be pretty flat. This should mean fewer rejections --> a lot faster speculative decoding. Yet, I see no increase in throughput at all...
However, if you have higher temperature but are still operating under top-k sampling where k is small, I'm not sure it will translate into any noticeable difference, since that makes your actual distributions very much non-uniform.
I didn't set a top-k. So it seems like Together must be doing something weird in their speculative decoding implementation.
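A hedged sketch of the kind of check being described, against an OpenAI-compatible endpoint; the base URL, model id, and prompt are placeholders, and this measures end-to-end throughput rather than acceptance rates directly:

```python
import os, time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",       # assumption: any OpenAI-compatible provider works here
    api_key=os.environ["TOGETHER_API_KEY"],
)

def throughput(temperature, n_tokens=512):
    """Return completion tokens per second for one request at a given temperature."""
    start = time.time()
    resp = client.chat.completions.create(
        model="moonshotai/Kimi-K2-Instruct-0905",  # placeholder model id
        messages=[{"role": "user", "content": "Write a long rambling story."}],
        max_tokens=n_tokens,
        temperature=temperature,
    )
    elapsed = time.time() - start
    return resp.usage.completion_tokens / elapsed

# If speculative acceptance were the bottleneck, the flatter T=2 distribution
# should be noticeably faster; the commenter reports seeing no such increase.
print("T=0:", throughput(0.0), "tok/s")
print("T=2:", throughput(2.0), "tok/s")
```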
TIL. Bit of an aha moment - never understood till now how a big model can verify faster than it can generate
Cool hack though, kudos. Wonder if they can make Groq or Cerebras do the same thing?
A light-weight speculative model adapts to usage, keeping the acceptance rate for the static heavy-weight model within acceptable bounds.
Do they adapt with LoRAs?
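The post doesn't spell out the mechanism (or whether LoRA adapters are involved). A generic sketch of the acceptance-rate feedback loop such a system would need might look like the following, with the actual adaptation step left abstract; the window size, threshold, and trigger logic are assumptions, not ATLAS's design.

```python
from collections import deque

class AcceptanceMonitor:
    """Track the rolling acceptance rate of a draft model and flag when it
    drifts low enough that the drafter should be adapted to recent traffic."""

    def __init__(self, window=1000, threshold=0.6):
        self.results = deque(maxlen=window)   # 1 = draft token accepted, 0 = rejected
        self.threshold = threshold

    def record(self, accepted, proposed):
        """Log the outcome of one verification step (accepted out of proposed draft tokens)."""
        self.results.extend([1] * accepted + [0] * (proposed - accepted))

    @property
    def acceptance_rate(self):
        return sum(self.results) / max(len(self.results), 1)

    def needs_adaptation(self):
        return len(self.results) == self.results.maxlen and self.acceptance_rate < self.threshold

monitor = AcceptanceMonitor()
monitor.record(accepted=3, proposed=5)        # e.g. 3 of 5 drafted tokens survived verification
if monitor.needs_adaptation():
    pass  # fine-tune or swap the lightweight drafter on recent traffic (mechanism unspecified)
```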