Cerebras Systems Raises $1.1B Series G
Posted 3 months ago · Active 3 months ago
cerebras.ai · Tech story · High profile
Sentiment: excited / mixed · Debate: 60/100
Key topics
AI Hardware
Cerebras
LLM Performance
Cerebras Systems raises $1.1B Series G funding, with users praising its fast LLM performance but also raising concerns about pricing, reliability, and model availability.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1m after posting
Peak period: 59 comments in 0-12h
Average per period: 15 comments
Comment distribution: 75 data points (based on 75 loaded comments)
Key moments
1. Story posted: Sep 30, 2025 at 11:54 AM EDT (3 months ago)
2. First comment: Sep 30, 2025 at 11:55 AM EDT (1m after posting)
3. Peak activity: 59 comments in 0-12h, the hottest window of the conversation
4. Latest activity: Oct 7, 2025 at 4:34 PM EDT (3 months ago)
ID: 45427111 · Type: story · Last synced: 11/20/2025, 4:50:34 PM
I also wonder why they have not been acquired yet. Or is it intentional?
I will say, their pricing and deployment strategy is a bit murky. Paying $1500-$10,000 per month plus usage costs? I'm assuming it has to do with chasing and optimizing for higher-value contracts and deeper-pocketed customers, hence the minimum monthly spend they require.
I'm not claiming to be an expert, but as a CEO/CTO, there were other providers in the market with relatively comparable inference speed (obviously Cerebras is #1), easier onboarding, and better responsiveness from the people who work there (all of my interactions with Cerebras have been days/weeks late or simply ignored). IMHO, if Cerebras wants to gain more mindshare, they'll have to look into this aspect.
Recently there was a fiasco I saw posted on r/localllama where many of the OpenRouter providers were degraded on benchmarks compared to the base models, implying they are serving up quantized models to save costs but lying to customers about it. Unless you're actually auditing the tokens you're purchasing, you may not be getting what you're paying for even if the T/s and $/token seem better.
Do you have information on this? This seems brand-destroying for both OpenRouter and the model providers.
Yeah wait, why rent chips instead of sell them? Why wouldn't customers want to invest money in competition for cheaper inference hardware? It's not like Nvidia has a blacklist of companies that have bought chips from competitors, or anything. Now that would be crazy! That sure would make this market tough to compete in, wouldn't it. I'm so glad Nvidia is definitely not pressuring companies to not buy from competitors or anything.
1. They’re useless for training in 2025. They were designed for training prior to the LLM explosion. They’re not practical for training anymore because they rely on SRAM, which is not scalable.
2. No one is going to spend the resources to optimize models to run on their SDK and hardware. Open source inference engines don’t optimize for Cerebras hardware.
Given the above two reasons, it makes a lot of sense that no one is investing in their hardware and they have switched to a cloud model selling speed as the differentiator.
It’s not always “Nvidia bad”.
Not 10x. But at greater cost and lower scale.
I thought it was the SRAM scaling that was impressive, no?
1. To achieve high speeds, they put everything on SRAM. I estimated that they needed over $100m of chips just to do Qwen 3 at max context size. You can run the same model with max context size on $1m of Blackwell chips but at a slower speed. Anandtech had an article saying that Cerebras was selling a single chip for around $2-3m. https://news.ycombinator.com/item?id=44658198
2. SRAM has virtually stopped scaling in new nodes. Therefore, new generations of wafer scale chips won’t gain as much as traditional GPUs.
3. Cerebras was designed in the pre-ChatGPT era where much smaller models were being trained. It is practically useless for training in 2025 because of how big LLMs have gotten. It can only do inference but see above 2 problems.
4. To inference very large LLMs economically, Cerebras would need to use external HBM. If it has to reach outside the wafer for memory, the benefits of a wafer scale chip greatly diminish. Remember that the whole idea was to put the entire AI model inside the wafer so memory bandwidth is ultra fast.
5. Chip interconnect technology might make wafer scale chips redundant. TSMC has a roadmap for gluing more than 2 GPU dies together. Nvidia’s Feynman GPUs might have 4 dies glued together. I.e., the sweet spot for large chips might not be wafer scale but perhaps 2, 4, or 8 GPUs together.
6. Nvidia seems to be moving much faster in terms of development and responding to market needs. For example, Blackwell is focused on FP4 inferencing now. I suppose the nature of designing and building a wafer scale chip is more complex than a GPU. Cerebras also needs to wait for new nodes to fully mature so that yields can be higher.
There exists a niche where some applications might need super fast token generation regardless of price. Hedge funds and Wall Street might be good use cases. But it won’t challenge Nvidia in training or large scale inference.
But there are several 1T memories that are still scaling, more or less — eDRAM, MRAM, etc. Is there anything preventing their general architecture from moving to a 1T technology once the density advantages outweigh the need for pipelining to hide access time?
I will point out (again :) ) that this math is completely wrong. There is no need (nor performance gain) to store the entire weights of the model in SRAM. You simply store n transformer blocks on-chip and then stream block l+n from external memory to on-chip when you start computing block l. This completely masks the communication time behind the compute time, and specifically does not require you to buy 100M$ worth of SRAM. This is standard stuff that is done routinely in many scenarios, e.g. FSDP.
https://www.cerebras.ai/blog/cerebras-software-release-2.0-5...
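To make the double-buffering idea above concrete, here is a minimal sketch of the pattern with a one-block lookahead. The `load_to_sram` and `apply_layer` callbacks are hypothetical placeholders, not Cerebras' actual SDK, and threads only stand in for whatever DMA/compute overlap the real hardware provides: the copy of block l+1 is kicked off before block l computes, so the transfer hides behind the matmuls.

```python
import threading

def run_layers_streaming(x, host_weights, load_to_sram, apply_layer):
    """Sketch of double-buffered weight streaming.
    host_weights: list of per-layer weight blobs in external memory
    load_to_sram(w): copies one layer's weights on-chip (the slow part)
    apply_layer(x, w): runs one transformer block with on-chip weights
    """
    n = len(host_weights)
    # Load layer 0 synchronously; every later layer is prefetched in the background.
    buffers = [load_to_sram(host_weights[0]), None]
    for l in range(n):
        cur = buffers[l % 2]
        if l + 1 < n:
            # Start copying the *next* layer into the other buffer slot.
            def prefetch(idx=l + 1, slot=(l + 1) % 2):
                buffers[slot] = load_to_sram(host_weights[idx])
            t = threading.Thread(target=prefetch)
            t.start()
        x = apply_layer(x, cur)   # compute on layer l masks the transfer of layer l+1
        if l + 1 < n:
            t.join()              # by now the copy should already have finished
    return x
```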
There’s no way that could be the case if the technology was competitive.
And beyond pure performance competitiveness, there are many things that make it hard for Cerebras to be actually competitive: can they ship enough chips to meet the needs of large clusters? What about the software stack and the lack of great support compared to Nvidia? The lack of ML engineers who know how to use them, when everyone knows how to use CUDA and there are many things developed on top of it by the community (e.g. Triton).
Just look at the valuation difference between AMD and Nvidia, when AMD is already very competitive. But being 99% of the way there is still not enough for customers that are going to pay 5B$ for their clusters.
However, the whole point is that even HBM is a problem: its available bandwidth is insufficient. So if you’re marrying SRAM and HBM, I would expect the performance gains to be overall modest for models that exceed the available SRAM in a meaningful way.
As a similar example, I have trained video models on ~1000 H100s where the vast majority of parameters are sharded and so need to be fetched over the network before being available in HBM, which is a similar imbalance to the HBM vs SRAM story. We were able to fully mask comms time such that not sharding (if it was even possible) would offer no performance advantage.
There is just no need to have the parameters or KV cache for layer 48 in SRAM when you are currently computing layer 3; you have all the time in the world to move them to SRAM when you get to layer 45, or whatever the math works out to be for your specific model.
In that same thread, a Cerebras executive disputed my $100m number but did not dispute that they store the entire model on SRAM.
They can make chips at cost and claim it isn’t $100m. But Anandtech did estimate/report $2-3m per chip.
Same techniques apply.
> but did not dispute that they store the entire model on SRAM.
No idea what they did or did not do for that specific test (which was about delivering 1800 tokens/sec, though, not simply running Qwen-3), since they didn't provide any detail. I don't think there is any point storing everything in SRAM, even if you do happen to have 100M$ worth of chips lying around in a test cluster at the office, since WSE-3 is designed from the ground up for data parallelism (see [1] section 3.2) and inference is sequential both within a single token generation (you need to go through layer 1 before you can go through layer 2, etc.) and between tokens (autoregressive, so token 1 before token 2). This means most of the weights loaded in SRAM would just be sitting unused most of the time. When they need to be used, they have to be broadcast to all chips from the SRAM of the chip that holds the particular layer you care about; that is extremely fast, but external memory is certainly fast enough to do this if you fetch the layer in advance. So the way to get the best ROI on such a system would be to pack the biggest batch size you can (so, many users' queries) and process them all in parallel, streaming the weights as needed. The more your SRAM is occupied by batch activations and not parameters, the better the compute density and thus $/FLOPS.
You can check the Cerebras docs to see how weight streaming works [2]. From the start, one of the selling points of Cerebras has been the possibility to scale memory independently of compute, and they have developed an entire system specifically for streaming weights from that decoupled memory. Their docs seem to keep things fairly simple by assuming you can only fit one layer in SRAM, and thus they fetch things sequentially, but if you can store at least 2 layers in those 44GB of SRAM then you can simply fetch layer l+1 when layer l starts to compute, completely masking the latency cost. It's possible they already mask the latency even within a single layer by streaming by tiles for matmul, though; it's unclear from their docs. They mention that in passing in [3] section 6.3.
All of their docs are for training, since for the inference play it seems they have pivoted to selling API access rather than chips, but inference is really the same thing, just without the backprop (especially in their case, where they aren't doing pipeline parallelism, where you could claim that doing fwd+backprop gives you better compute density). At the end of the day, whether you are doing training or inference, all you care about is that your cores have the data they need in their registers at the moment they are free to compute, so streaming to SRAM works the same way in both cases.
Ultimately I can't tell you how much it costs to run Qwen-3. You can certainly do it on a single chip + weight streaming, but their specs are just too light on the exact FLOPs and bandwidth to know what the memory movement cost would be in this case (if any), and we don't even know the price of a single chip (everyone is saying 3M$, though, regardless of that comment on the other thread). But I can tell you that your math of doing `model_size/sram_per_chip * chip_cost` just isn't the right way to think about this, and so the 100M$ figure doesn't make sense.
[1]: https://arxiv.org/html/2503.11698v1#S3.
[2]: https://training-api.cerebras.ai/en/2.1.0/wsc/cerebras-basic....
[3]: https://8968533.fs1.hubspotusercontent-na2.net/hubfs/8968533...
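For concreteness, here is the back-of-envelope arithmetic the two sides are implicitly arguing about, using figures cited in this thread (a ~480B-parameter Qwen3 Coder, 44GB of SRAM per WSE-3, roughly 3M$ per chip, FP16 weights). Every constant below is an assumption pulled from the thread, not official pricing.

```python
# Rough arithmetic only; all constants are assumptions taken from the thread above.
params = 480e9            # Qwen3 Coder 480B
bytes_per_param = 2       # FP16
sram_per_chip_gb = 44     # WSE-3 on-chip SRAM
chip_cost = 3e6           # ~3M$ per chip (Anandtech estimate cited earlier)

weights_gb = params * bytes_per_param / 1e9        # ~960 GB of weights

# Naive "everything lives in SRAM" estimate, i.e. model_size/sram_per_chip * chip_cost:
chips_all_sram = weights_gb / sram_per_chip_gb     # ~22 chips for the weights alone
print(f"all-in-SRAM: ~{chips_all_sram:.0f} chips, ~${chips_all_sram * chip_cost / 1e6:.0f}M")
# Max-context KV cache would push this higher still, which is where the >$100m claim comes from.

# Weight-streaming view: only a couple of layers need to be resident at once,
# so a single chip plus decoupled external memory can hold the working set,
# provided the external bandwidth keeps up with compute.
```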
Here is an oversimplified explanation that gets the gist across:
The standard architecture for transformer-based LLMs is as follows: Token Embedding -> N layers, each consisting of an attention sublayer and an MLP sublayer -> Output Embedding.
Most attention implementations use a simple KV caching strategy. In prefill you first calculate the KV cache entries by performing GEMM against the W_K, W_V, W_Q tensors. In the case of token generation, you only need to calculate against the current token. Next comes the quadratic part of attention. You need to calculate softmax(Q K^T)V. This is two matrix multiplications and has a linear cost with respect to the number of entries in the KV cache for generating the next token, as you need to re-read the entire KV cache plus the new entry. For prefill you are processing n tokens, so the cost is quadratic. The KV cache is unique for every user session. It also grows with the size of the context. This means the KV cache is really expensive memory wise. It consumes both memory capacity and bandwidth and it also doesn't permit batching.
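To make the decode step concrete, here is a toy single-head sketch of generating one token against a KV cache (NumPy, no batching; the tensor names are illustrative, not from any particular framework):

```python
import numpy as np

def decode_step(x, W_Q, W_K, W_V, k_cache, v_cache):
    """One token of autoregressive generation with a KV cache.
    x: (d,) hidden state of the current token
    W_Q, W_K, W_V: (d, d) projection matrices
    k_cache, v_cache: (t, d) cached entries for the t previous tokens
    """
    q = x @ W_Q                               # query for the new token only
    k_cache = np.vstack([k_cache, x @ W_K])   # append this token's K entry
    v_cache = np.vstack([v_cache, x @ W_V])   # append this token's V entry

    # Cost is linear in the cache length: every cached K and V row is re-read.
    scores = k_cache @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                   # attention output for this token
    return out, k_cache, v_cache
```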
Meanwhile the MLP sublayer is so boring I won't bother going into the details, but the gist is that you have a simple gating network: two feed-forward layers (the up and gate projections) project the token vector into a higher dimension (i.e. more outputs than inputs), you element-wise multiply these two vectors, and then feed the result into a down projection which reduces it back to the original dimension of the token vector. Since the matrices are always the same, you can process the tokens of multiple users at once.
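A correspondingly boring sketch of that gated MLP sublayer (shapes are illustrative; the activation is an assumption, real models typically use SiLU on the gate):

```python
import numpy as np

def gated_mlp(X, W_up, W_gate, W_down):
    """X: (batch, d) token vectors; tokens from many users can share one batch.
    W_up, W_gate: (d, d_ff) project up to the wider hidden dimension.
    W_down: (d_ff, d) projects back down to the model dimension.
    """
    up = X @ W_up
    gate = np.maximum(X @ W_gate, 0.0)    # ReLU stands in for the usual SiLU here
    return (up * gate) @ W_down           # element-wise product, then down-projection
```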
Now here are the implications of what I wrote above: Prefill is generally compute bound and is therefore mostly uninteresting, or rather, interesting for ASIC designers because FLOPS are cheap and SRAM is expensive. Token generation, meanwhile, is a mix of being memory bandwidth bound and compute bound in the batched case. The MLP layer is trivially parallelized through GEMM based batching. Having lots of SRAM is beneficial for GEMM, but it is not super critical in a double-buffered implementation that performs loading and computation simultaneously, with the memory bandwidth chosen so that both finish roughly at the same time.
What SRAM buys you for GEMM is the following: given two square matrices A, B and their output A*B = C of the same dimension, where A and B are both 1 GiB in size and you have x MiB of SRAM, you tile the GEMM operation so that each sub-matrix is x/3 MiB in size. Let's say x=120 MiB, which means 40 MiB per matrix. You will split the matrices A and B into approximately 25 tiles each. For every tile in A, you have to load all tiles in B. That means 25 (A) + 25*25 (A*B) = 650 load operations of 40 MiB each, for a total of 26000 MiB read. If you double the SRAM you now have 13 tiles of size 80 MiB: 13 + 13*13 = 182 loads, and 182 * 80 MiB = 14560 MiB. Loosely speaking, doubling SRAM reduces the needed memory bandwidth by half. This is boring old linear scaling, because fewer tiles also means bigger tiles, so the quadratic 4x reduction in the number of loads is partially offset by 2x bigger load operations. Having more SRAM is good though.
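Spelling out that tiling arithmetic (same simplified one-dimensional tiling model as above, using exact ceilings, so the totals land slightly above the rounded figures in the comment):

```python
import math

def gemm_read_volume(matrix_mib=1024, sram_mib=120):
    """MiB read from memory to multiply two `matrix_mib`-sized matrices,
    with SRAM split evenly across tiles of A, B and C (x/3 each)."""
    tile_mib = sram_mib / 3
    tiles = math.ceil(matrix_mib / tile_mib)   # tiles per matrix
    loads = tiles + tiles * tiles              # load each A tile once, all of B per A tile
    return loads * tile_mib

print(gemm_read_volume(sram_mib=120))   # ~28000 MiB with 40 MiB tiles
print(gemm_read_volume(sram_mib=240))   # ~14560 MiB with 80 MiB tiles: roughly half
```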
Now onto Flash Attention. If I had to dumb down flash attention, it's a very quirky way of arranging two GEMM operations to reduce the amount of memory allocated to the intermediate C matrix of the first Q*K^T multiplication. Otherwise it is the same as two GEMMs with smaller tiles. Doubling SRAM halves the necessary memory bandwidth.
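A minimal single-query sketch of that idea, i.e. the online-softmax trick that lets you tile Q*K^T and the second GEMM without ever materializing the full score vector (block size and shapes are arbitrary):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    """Compute softmax(q @ K^T / sqrt(d)) @ V one tile at a time.
    q: (d,), K: (n, d), V: (n, dv)."""
    d = q.shape[0]
    m = -np.inf                    # running max of the scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])     # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale the old accumulators to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l
```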
Final conclusion: in the batched multi-user inference case, your goal is to allocate the KV cache to SRAM for attention nodes, achieve as large a batch size as possible for MLP nodes, and use the SRAM to operate on as large tiles as possible. If you achieve both, then the required memory bandwidth scales reciprocally with the amount of SRAM. Storing full tensors in SRAM is not necessary at large batch sizes.
Of course since I only looked at the memory aspects, it shouldn't be left out that you need to evenly match compute and memory resources. Having SRAM on its own doesn't buy you anything really.
Indirectly, it sounds like you're aware of the inference speed? Imagine if it took 2 minutes instead of 10 minutes; that's what the parent means.
I've found that the best way for me to do LLM-assisted coding at this point in time is in a somewhat tight feedback loop. I find myself wanting to refine the code and architectural approaches a fair amount as I see them coming in, and latency matters a lot to me here.
https://github.com/Kilo-Org/kilocode https://www.cerebras.ai/blog/introducing-cerebras-code
Saying "technically" is really underselling the difference in intelligence in my opinion. Claude and Gemini are much, much smarter and I trust them to produce better code, but you honestly can't deny the excellent value that Qwen-3, the inference speed and $50/month for 25M tokens/per day brings to the table.
Since I paid for the Cerebras pro plan, I've decided to force myself to use it as much as possible for the duration of the month for developing my chat app (https://github.com/gitsense/chat), and here are some of my thoughts so far:
- Qwen3 Coder is a lot dumber when it comes to prompting; Gemini and Claude are much better at reading between the lines. However, since the speed is so good, I often don't care, as I can go back to the message, make some simple clarifications, and try again.
- The max context window size of 128k for Qwen 3 Coder 480B on their platform can be a serious issue if you need a lot of documentation or code in context.
- I've never come close to the 25M tokens per day limit for their Pro Plan. The max I am using is 5M/day.
- The inference speed + a capable model like Qwen 3 will open up use cases most people might not have thought of before.
I will probably continue to pay for the $50 plan for these use cases.
1. Applying LLM generated patches
Qwen 3 coder is very much capable of applying patches generated by Sonnet and Gemini. It is slower than what https://www.morphllm.com/ provides but it is definitely fast enough for most people to not care. The cost savings can be quite significant depending on the work.
2. Building context
Since it is so fast and because the 25M token limit per day is such a high limit for me, I am finding myself loading more files into context and just asking Qwen to identify files that I will need and/or summarize things so I can feed it into Sonnet or Gemini to save me significant money.
3. AI Assistant
Due to its blazing speed, you can analyze a lot of data quickly for deterministic searches, and because it can review results at such a great speed, you can do multiple search-and-review loops without feeling like you are waiting forever.
Given what I've experienced so far, I don't think Cerebras can be a serious platform for coding if Qwen 3 Coder is the only available model. Having said that, given the inference speed and Qwen being more than capable, I can see Cerebras becoming a massive cost savings option for many companies and developers, which is where I think they might win a lot of enterprise contracts.
Disappointed quite a bit with this fundraise. They were expected to IPO this year and give us poor retail investors a chance at investing in them.
Memory hierarchy management across HBM/DDR/Flash is much more difficult but necessary to achieve practical inference economics.
[1] https://ieeexplore.ieee.org/abstract/document/9623424
Everyone else went the CoWoS direction, which enables heterogeneous integration and much more cost effective inference.
I think that while being fast, Cerebras is probably not very economical in fleets at scale.
That's the problem.
Unless the majority of the value is on the other end of the curve, it's a tough sell.
OpenAI is still developing their own chips with Broadcom, but they are not operational yet. So for now, they're buying GPUs from Nvidia to build up their own revenue (to later spend it on their own chips).
By 2030, many companies will eventually be looking for alternatives to Nvidia, like Cerebras or Lightmatter, for both training and inference use cases.
For example [0] Meta just acquired a chip startup for this exact reason - "An alternative to training AI systems" and "to cut infrastructure costs linked to its spending on advanced AI tools.".
[0] https://www.reuters.com/business/meta-buy-chip-startup-rivos...
If that's 5 years into the future, that looks bad for Nvidia, if it's >10 years in the future, that doesn't affect Nvidia's current stock price very much.
https://www.cerebras.ai/pricing
$50/month for one person for code (daily token limit), or pay per token, or $1500/month for small teams, or an enterprise agreement (contact for pricing).
Seems high.
(or other people that read the litepaper https://www.extropic.ai/future)
He reads to me like someone who markets better than he does things. I am disinclined to take him as an authority in this space.
How do you believe this is related to Cerebras?
Just tried https://cloud.cerebras.ai and wow, is it fast!
Groq also runs OSS models at speed, which is my preferred way to access Kimi K2 on their free quotas.
Just plug it into a normal chat interface like Jan or Cherry Studio and it's incredibly fast.
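For anyone who wants to script it rather than use a chat UI: the reason tools like Jan or Cherry Studio "just plug in" is that the endpoint speaks an OpenAI-compatible chat completions protocol, so something like the sketch below should work. The base URL and model id here are assumptions; check the Cerebras docs and substitute your own API key.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model id; verify against the Cerebras docs.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="YOUR_CEREBRAS_API_KEY",
)

resp = client.chat.completions.create(
    model="qwen-3-coder-480b",  # placeholder model id
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    stream=True,                # streaming makes the token rate very visible
)

for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```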