A Year of Fast Apply – Our Path to 10k Tokens Per Second

Posted2 months agoActive2 months ago

eborgnia

48 points

6 comments

relace.aiTechstory

calmpositive

Debate

20/100

Performance OptimizationMachine LearningAI Infrastructure

Key topics

Performance Optimization

Machine Learning

AI Infrastructure

The post discusses Relace's achievement of reaching 10k tokens per second with their 'Fast Apply' technology, with commenters exploring the technical details and potential applications.

Snapshot generated from the HN discussion

Discussion Activity

Light discussion

First comment

Peak period

6-7h

Avg / period

1.2

Key moments

01Story posted
Oct 29, 2025 at 1:04 PM EDT
2 months ago
Step 01
02First comment
Oct 29, 2025 at 3:44 PM EDT
3h after posting
Step 02
03Peak activity
2 comments in 6-7h
Hottest window of the conversation
Step 03
04Latest activity
Oct 30, 2025 at 12:33 AM EDT
2 months ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (6 comments)

Showing 6 comments

swyx

2 months ago

1 reply

> To streamline the process while maintaining quality, we built our own internal evaluation tool: a Git-style diff viewer with annotation tools for categorizing merge outcomes.

vibecoding internal eval tools is the single best use case of ai accelerating ai i know of! nice to see

(sorry if this gets asked a lot) - any philsophical/methodology differences to MorphLLM that you'd call out since you seem to be a direct alternative?

eborgniaAuthor

2 months ago

1 reply

Hey, happy to answer! The manual evals we did showed that both morph-v3-fast and morph-v3-large had significantly more smoothing and hallucination behaviors.

It's hard to know for sure because their methods aren't public, but my guess is the dataset they constructed pushes the Fast Apply model to more aggressively fix mistakes introduced by the frontier model in the edit snippet.

This aligns with the fact that their flagship model (morph-v3-large) is 4x slower than ours -- the smoothings/hallucinations are not in the initial code or the edit snippet so they break speculative continuations more frequently. Their 2x faster model (morph-v3-fast) is likely quantized more aggressively (maybe fp4? and run on B200s?) because it exhibits very strange behaviors like hallucinating invalid characters at random points that make the code non-compilable.

From an accuracy POV, auto-smoothing is helpful for fixing obvious mistakes in the edit snippet like missed imports from well known packages. However, it does increase the frequency of code breaking hallucinations like invalid local imports among other functional changes that you might not want a small apply model to perform.

swyx

2 months ago

thank you! referring to it as smoothing is interesting, makes me think of code as a series of bumps in multiple dimensions than discrete tokens.

bn-l

2 months ago

1 reply

You guys have to bring the cost down

eborgniaAuthor

2 months ago

1 reply

How do you expect it to be priced? We do give discounts for high volume users.

bn-l