Cerebras Systems Raises $1.1B Series G
Posted 3 months ago · Active 3 months ago
cerebras.ai · Tech story · High profile
Sentiment: excited / mixed · Debate: 60/100
Key topics
AI Hardware
Cerebras
LLM Performance
Cerebras Systems raises $1.1B Series G funding, with users praising its fast LLM performance but also raising concerns about pricing, reliability, and model availability.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 1m after posting
Peak period: 59 comments in 0-12h
Average per period: 15 comments
Comment distribution: 75 data points (based on 75 loaded comments)
Key moments
1. Story posted: Sep 30, 2025 at 11:54 AM EDT (3 months ago)
2. First comment: Sep 30, 2025 at 11:55 AM EDT (1m after posting)
3. Peak activity: 59 comments in 0-12h, the hottest window of the conversation
4. Latest activity: Oct 7, 2025 at 4:34 PM EDT (3 months ago)
ID: 45427111 · Type: story · Last synced: 11/20/2025, 4:50:34 PM
I also wonder why they have not been acquired yet. Or is it intentional?
I will say, their pricing and deployment strategy is a bit murky. Paying $1500-$10,000 per month plus usage costs? I'm assuming it has to do with chasing and optimizing for higher-value contracts and deeper-pocketed customers, hence the minimum monthly spend they require.
I'm not claiming to be an expert, but as a CEO/CTO, there were other providers in the market with relatively comparable inference speed (obviously Cerebras is #1), easier onboarding, and better responsiveness from the people who work there (all of my interactions with Cerebras have been days/weeks late or simply ignored). IMHO, if Cerebras wants to gain more mindshare, they'll have to look into this aspect.
Recently there was a fiasco I saw posted on r/localllama where many of the OpenRouter providers were degraded on benchmarks compared to the base models, implying they are serving up quantized models to save costs but lying to customers about it. Unless you're actually auditing the tokens you're purchasing, you may not be getting what you're paying for even if the T/s and $/token seem better.
Do you have information on this? This seems brand-destroying for both OpenRouter and the model providers.
Yeah wait, why rent chips instead of sell them? Why wouldn't customers want to invest money in competition for cheaper inference hardware? It's not like Nvidia has a blacklist of companies that have bought chips from competitors, or anything. Now that would be crazy! That sure would make this market tough to compete in, wouldn't it. I'm so glad Nvidia is definitely not pressuring companies to not buy from competitors or anything.
1. They’re useless for training in 2025. They were designed for training prior to the LLM explosion. They’re not practical for training anymore because they rely on SRAM, which is not scalable.
2. No one is going to spend the resources to optimize models to run on their SDK and hardware. Open source inference engines don’t optimize for Cerebras hardware.
Given the above two reasons, it makes a lot of sense that no one is investing in their hardware and they have switched to a cloud model selling speed as the differentiator.
It’s not always “Nvidia bad”.
Not 10x. But at greater cost and lower scale.
I thought it was the SRAM scaling that was impressive, no?
1. To achieve high speeds, they put everything on SRAM. I estimated that they needed over $100m of chips just to do Qwen 3 at max context size. You can run the same model with max context size on $1m of Blackwell chips but at a slower speed. Anandtech had an article saying that Cerebras was selling a single chip for around $2-3m. https://news.ycombinator.com/item?id=44658198
2. SRAM has virtually stopped scaling in new nodes. Therefore, new generations of wafer scale chips won’t gain as much as traditional GPUs.
3. Cerebras was designed in the pre-ChatGPT era where much smaller models were being trained. It is practically useless for training in 2025 because of how big LLMs have gotten. It can only do inference but see above 2 problems.
4. To inference very large LLMs economically, Cerebras would need to use external HBM. If it has to reach outside the wafer for memory, the benefits of a wafer scale chip greatly diminish. Remember that the whole idea was to put the entire AI model inside the wafer so memory bandwidth is ultra fast.
5. Chip interconnect technology might make wafer scale chips redundant. TSMC has a roadmap for gluing more than 2 GPU dies together. Nvidia’s Feynman GPUs might have 4 dies glued together. I.e., the sweet spot for large chips might not be wafer scale but perhaps 2, 4, or 8 GPUs together.
6. Nvidia seems to be moving much faster in terms of development and responding to market needs. For example, Blackwell is focused on FP4 inferencing now. I suppose the nature of designing and building a wafer scale chip is more complex than a GPU. Cerebras also needs to wait for new nodes to fully mature so that yields can be higher.
There exists a niche where some applications might need super fast token generation regardless of price. Hedge funds and Wall Street might be good use cases. But it won’t challenge Nvidia in training or large scale inference.
But there are several 1T memories that are still scaling, more or less — eDRAM, MRAM, etc. Is there anything preventing their general architecture from moving to a 1T technology once the density advantages outweigh the need for pipelining to hide access time?
I will point out (again :) ) that this math is completely wrong. There is no need (nor performance gain) to store the entire weights of the model in SRAM. You simply store n transformer blocks on-chip and then stream block l+n from external memory to on-chip when you start computing block l. This completely masks the communication time behind the compute time, and specifically does not require you to buy 100M$ worth of SRAM. This is standard stuff that is done routinely in many scenarios, e.g. FSDP.
https://www.cerebras.ai/blog/cerebras-software-release-2.0-5...
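To make the double-buffering idea above concrete, here is a minimal sketch of the pattern with a one-block lookahead. The `load_to_sram` and `apply_layer` callbacks are hypothetical placeholders, not Cerebras' actual SDK, and threads only stand in for whatever DMA/compute overlap the real hardware provides: the copy of block l+1 is kicked off before block l computes, so the transfer hides behind the matmuls.

```python
import threading

def run_layers_streaming(x, host_weights, load_to_sram, apply_layer):
    """Sketch of double-buffered weight streaming.
    host_weights: list of per-layer weight blobs in external memory
    load_to_sram(w): copies one layer's weights on-chip (the slow part)
    apply_layer(x, w): runs one transformer block with on-chip weights
    """
    n = len(host_weights)
    # Load layer 0 synchronously; every later layer is prefetched in the background.
    buffers = [load_to_sram(host_weights[0]), None]
    for l in range(n):
        cur = buffers[l % 2]
        if l + 1 < n:
            # Start copying the *next* layer into the other buffer slot.
            def prefetch(idx=l + 1, slot=(l + 1) % 2):
                buffers[slot] = load_to_sram(host_weights[idx])
            t = threading.Thread(target=prefetch)
            t.start()
        x = apply_layer(x, cur)   # compute on layer l masks the transfer of layer l+1
        if l + 1 < n:
            t.join()              # by now the copy should already have finished
    return x
```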
There’s no way that could be the case if the technology was competitive.
And beyond pure performance competitiveness, there are many things that make it hard for Cerebras to be actually competitive: can they ship enough chips to meet the needs of large clusters? What about the software stack and the lack of great support compared to Nvidia? The lack of ML engineers who know how to use them, when everyone knows how to use CUDA and there are many things developed on top of it by the community (e.g. Triton).
Just look at the valuation difference between AMD and Nvidia, when AMD is already very competitive. But being 99% of the way there is still not enough for customers that are going to pay 5B$ for their clusters.
However, the whole point is that even HBM is a problem: its available bandwidth is insufficient. So if you’re marrying SRAM and HBM, I would expect the performance gains to be overall modest for models that exceed the available SRAM in a meaningful way.
As a similar example, I have trained video models on ~1000 H100s where the vast majority of parameters are sharded and so need to be fetched over the network before being available in HBM, which is a similar imbalance to the HBM vs SRAM story. We were able to fully mask comms time such that not sharding (if it was even possible) would offer no performance advantage.
There is just no need to have the parameters or KV cache for layer 48 in SRAM when you are currently computing layer 3; you have all the time in the world to move them to SRAM when you get to layer 45, or whatever the math works out to be for your specific model.
In that same thread, a Cerebras executive disputed my $100m number but did not dispute that they store the entire model on SRAM.
They can make chips at cost and claim it isn’t $100m. But Anandtech did estimate/report $2-3m per chip.
Same techniques apply.
> but did not dispute that they store the entire model on SRAM.
No idea what they did or did not do for that specific test (which was about delivering 1800 tokens/sec, though, not simply running Qwen-3), since they didn't provide any detail. I don't think there is any point storing everything in SRAM, even if you do happen to have 100M$ worth of chips lying around in a test cluster at the office, since WSE-3 is designed from the ground up for data parallelism (see [1] section 3.2) and inference is sequential both within a single token generation (you need to go through layer 1 before you can go through layer 2, etc.) and between tokens (autoregressive, so token 1 before token 2). This means most of the weights loaded in SRAM would just be sitting unused most of the time. When they need to be used, they have to be broadcast to all chips from the SRAM of the chip that holds the particular layer you care about; that is extremely fast, but external memory is certainly fast enough to do this if you fetch the layer in advance. So the way to get the best ROI on such a system would be to pack the biggest batch size you can (so, many users' queries) and process them all in parallel, streaming the weights as needed. The more your SRAM is occupied by batch activations and not parameters, the better the compute density and thus $/FLOPS.
You can check the Cerebras docs to see how weight streaming works [2]. From the start, one of the selling points of Cerebras has been the possibility to scale memory independently of compute, and they have developed an entire system specifically for streaming weights from that decoupled memory. Their docs seem to keep things fairly simple by assuming you can only fit one layer in SRAM, and thus they fetch things sequentially, but if you can store at least 2 layers in those 44GB of SRAM then you can simply fetch layer l+1 when layer l starts to compute, completely masking the latency cost. It's possible they already mask the latency even within a single layer by streaming by tiles for matmul, though; it's unclear from their docs. They mention that in passing in [3] section 6.3.
All of their docs are for training, since for the inference play it seems they have pivoted to selling API access rather than chips, but inference is really the same thing, just without the backprop (especially in their case, where they aren't doing pipeline parallelism, where you could claim that doing fwd+backprop gives you better compute density). At the end of the day, whether you are doing training or inference, all you care about is that your cores have the data they need in their registers at the moment they are free to compute, so streaming to SRAM works the same way in both cases.
Ultimately I can't tell you how much it costs to run Qwen-3. You can certainly do it on a single chip + weight streaming, but their specs are just too light on the exact FLOPs and bandwidth to know what the memory movement cost would be in this case (if any), and we don't even know the price of a single chip (everyone is saying 3M$, though, regardless of that comment on the other thread). But I can tell you that your math of doing `model_size/sram_per_chip * chip_cost` just isn't the right way to think about this, and so the 100M$ figure doesn't make sense.
[1]: https://arxiv.org/html/2503.11698v1#S3.
[2]: https://training-api.cerebras.ai/en/2.1.0/wsc/cerebras-basic....
[3]: https://8968533.fs1.hubspotusercontent-na2.net/hubfs/8968533...
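For concreteness, here is the back-of-envelope arithmetic the two sides are implicitly arguing about, using figures cited in this thread (a ~480B-parameter Qwen3 Coder, 44GB of SRAM per WSE-3, roughly 3M$ per chip, FP16 weights). Every constant below is an assumption pulled from the thread, not official pricing.

```python
# Rough arithmetic only; all constants are assumptions taken from the thread above.
params = 480e9            # Qwen3 Coder 480B
bytes_per_param = 2       # FP16
sram_per_chip_gb = 44     # WSE-3 on-chip SRAM
chip_cost = 3e6           # ~3M$ per chip (Anandtech estimate cited earlier)

weights_gb = params * bytes_per_param / 1e9        # ~960 GB of weights

# Naive "everything lives in SRAM" estimate, i.e. model_size/sram_per_chip * chip_cost:
chips_all_sram = weights_gb / sram_per_chip_gb     # ~22 chips for the weights alone
print(f"all-in-SRAM: ~{chips_all_sram:.0f} chips, ~${chips_all_sram * chip_cost / 1e6:.0f}M")
# Max-context KV cache would push this higher still, which is where the >$100m claim comes from.

# Weight-streaming view: only a couple of layers need to be resident at once,
# so a single chip plus decoupled external memory can hold the working set,
# provided the external bandwidth keeps up with compute.
```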
Here is an oversimplified explanation that gets the gist across:
The standard architecture for transformer-based LLMs is as follows: Token Embedding -> N layers, each consisting of an attention sublayer and an MLP sublayer -> Output Embedding.
Most attention implementations use a simple KV caching strategy. In prefill you first calculate the KV cache entries by performing GEMM against the W_K, W_V, W_Q tensors. In the case of token generation, you only need to calculate against the current token. Next comes the quadratic part of attention. You need to calculate softmax(Q K^T)V. This is two matrix multiplications and has a linear cost with respect to the number of entries in the KV cache for generating the next token, as you need to re-read the entire KV cache plus the new entry. For prefill you are processing n tokens, so the cost is quadratic. The KV cache is unique for every user session. It also grows with the size of the context. This means the KV cache is really expensive memory wise. It consumes both memory capacity and bandwidth and it also doesn't permit batching.
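To make the decode step concrete, here is a toy single-head sketch of generating one token against a KV cache (NumPy, no batching; the tensor names are illustrative, not from any particular framework):

```python
import numpy as np

def decode_step(x, W_Q, W_K, W_V, k_cache, v_cache):
    """One token of autoregressive generation with a KV cache.
    x: (d,) hidden state of the current token
    W_Q, W_K, W_V: (d, d) projection matrices
    k_cache, v_cache: (t, d) cached entries for the t previous tokens
    """
    q = x @ W_Q                               # query for the new token only
    k_cache = np.vstack([k_cache, x @ W_K])   # append this token's K entry
    v_cache = np.vstack([v_cache, x @ W_V])   # append this token's V entry

    # Cost is linear in the cache length: every cached K and V row is re-read.
    scores = k_cache @ q / np.sqrt(len(q))
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    out = weights @ v_cache                   # attention output for this token
    return out, k_cache, v_cache
```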
Meanwhile the MLP sublayer is so boring I won't bother going into the details, but the gist is that you have a simple gating network: two feed-forward layers (the up and gate projections) project the token vector into a higher dimension (i.e. more outputs than inputs), you element-wise multiply these two vectors, and then feed the result into a down projection which reduces it back to the original dimension of the token vector. Since the matrices are always the same, you can process the tokens of multiple users at once.
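A correspondingly boring sketch of that gated MLP sublayer (shapes are illustrative; the activation is an assumption, real models typically use SiLU on the gate):

```python
import numpy as np

def gated_mlp(X, W_up, W_gate, W_down):
    """X: (batch, d) token vectors; tokens from many users can share one batch.
    W_up, W_gate: (d, d_ff) project up to the wider hidden dimension.
    W_down: (d_ff, d) projects back down to the model dimension.
    """
    up = X @ W_up
    gate = np.maximum(X @ W_gate, 0.0)    # ReLU stands in for the usual SiLU here
    return (up * gate) @ W_down           # element-wise product, then down-projection
```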
Now here are the implications of what I wrote above: Prefill is generally compute bound and is therefore mostly uninteresting, or rather, interesting for ASIC designers because FLOPS are cheap and SRAM is expensive. Token generation, meanwhile, is a mix of being memory bandwidth bound and compute bound in the batched case. The MLP layer is trivially parallelized through GEMM based batching. Having lots of SRAM is beneficial for GEMM, but it is not super critical in a double-buffered implementation that performs loading and computation simultaneously, with the memory bandwidth chosen so that both finish roughly at the same time.
What SRAM buys you for GEMM is the following: given two square matrices A, B and their output A*B = C of the same dimension, where A and B are both 1 GiB in size and you have x MiB of SRAM, you tile the GEMM operation so that each sub-matrix is x/3 MiB in size. Let's say x=120 MiB, which means 40 MiB per matrix. You will split the matrices A and B into approximately 25 tiles each. For every tile in A, you have to load all tiles in B. That means 25 (A) + 25*25 (A*B) = 650 load operations of 40 MiB each, for a total of 26000 MiB read. If you double the SRAM you now have 13 tiles of size 80 MiB: 13 + 13*13 = 182 loads, and 182 * 80 MiB = 14560 MiB. Loosely speaking, doubling SRAM reduces the needed memory bandwidth by half. This is boring old linear scaling, because fewer tiles also means bigger tiles, so the quadratic 4x reduction in the number of loads is partially offset by 2x bigger load operations. Having more SRAM is good though.
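Spelling out that tiling arithmetic (same simplified one-dimensional tiling model as above, using exact ceilings, so the totals land slightly above the rounded figures in the comment):

```python
import math

def gemm_read_volume(matrix_mib=1024, sram_mib=120):
    """MiB read from memory to multiply two `matrix_mib`-sized matrices,
    with SRAM split evenly across tiles of A, B and C (x/3 each)."""
    tile_mib = sram_mib / 3
    tiles = math.ceil(matrix_mib / tile_mib)   # tiles per matrix
    loads = tiles + tiles * tiles              # load each A tile once, all of B per A tile
    return loads * tile_mib

print(gemm_read_volume(sram_mib=120))   # ~28000 MiB with 40 MiB tiles
print(gemm_read_volume(sram_mib=240))   # ~14560 MiB with 80 MiB tiles: roughly half
```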
Now onto Flash Attention. If I had to dumb down flash attention, it's a very quirky way of arranging two GEMM operations to reduce the amount of memory allocated to the intermediate C matrix of the first Q*K^T multiplication. Otherwise it is the same as two GEMMs with smaller tiles. Doubling SRAM halves the necessary memory bandwidth.
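A minimal single-query sketch of that idea, i.e. the online-softmax trick that lets you tile Q*K^T and the second GEMM without ever materializing the full score vector (block size and shapes are arbitrary):

```python
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    """Compute softmax(q @ K^T / sqrt(d)) @ V one tile at a time.
    q: (d,), K: (n, d), V: (n, dv)."""
    d = q.shape[0]
    m = -np.inf                    # running max of the scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])     # running weighted sum of V rows
    for start in range(0, K.shape[0], block):
        s = K[start:start + block] @ q / np.sqrt(d)
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)  # rescale the old accumulators to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[start:start + block]
        m = m_new
    return acc / l
```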
Final conclusion: in the batched multi-user inference case, your goal is to allocate the KV cache to SRAM for attention nodes, achieve as large a batch size as possible for MLP nodes, and use the SRAM to operate on as large tiles as possible. If you achieve both, then the required memory bandwidth scales reciprocally with the amount of SRAM. Storing full tensors in SRAM is not necessary at large batch sizes.
Of course since I only looked at the memory aspects, it shouldn't be left out that you need to evenly match compute and memory resources. Having SRAM on its own doesn't buy you anything really.
Indirectly, it sounds like you're aware of the inference speed? Imagine if it took 2 minutes instead of 10 minutes; that's what the parent means.
I've found that the best way for me to do LLM-assisted coding at this point in time is in a somewhat tight feedback loop. I find myself wanting to refine the code and architectural approaches a fair amount as I see them coming in, and latency matters a lot to me here.
https://github.com/Kilo-Org/kilocode https://www.cerebras.ai/blog/introducing-cerebras-code
Saying "technically" is really underselling the difference in intelligence in my opinion. Claude and Gemini are much, much smarter and I trust them to produce better code, but you honestly can't deny the excellent value that Qwen-3, the inference speed and $50/month for 25M tokens/per day brings to the table.
Since I paid for the Cerebras pro plan, I've decided to force myself to use it as much as possible for the duration of the month for developing my chat app (https://github.com/gitsense/chat), and here are some of my thoughts so far:
- Qwen3 Coder is a lot dumber when it comes to prompting; Gemini and Claude are much better at reading between the lines. However, since the speed is so good, I often don't care, as I can go back to the message, make some simple clarifications, and try again.
- The max context window size of 128k for Qwen 3 Coder 480B on their platform can be a serious issue if you need a lot of documentation or code in context.
- I've never come close to the 25M tokens per day limit for their Pro Plan. The max I am using is 5M/day.
- The inference speed + a capable model like Qwen 3 will open up use cases most people might not have thought of before.
I will probably continue to pay for the $50 plan for these use cases.
1. Applying LLM generated patches
Qwen 3 coder is very much capable of applying patches generated by Sonnet and Gemini. It is slower than what https://www.morphllm.com/ provides but it is definitely fast enough for most people to not care. The cost savings can be quite significant depending on the work.
2. Building context
Since it is so fast and because the 25M token limit per day is such a high limit for me, I am finding myself loading more files into context and just asking Qwen to identify files that I will need and/or summarize things so I can feed it into Sonnet or Gemini to save me significant money.
3. AI Assistant
Due to its blazing speed, you can analyze a lot of data quickly for deterministic searches, and because it can review results at such a great speed, you can do multiple search-and-review loops without feeling like you are waiting forever.
Given what I've experienced so far, I don't think Cerebras can be a serious platform for coding if Qwen 3 Coder is the only available model. Having said that, given the inference speed and Qwen being more than capable, I can see Cerebras becoming a massive cost savings option for many companies and developers, which is where I think they might win a lot of enterprise contracts.
Disappointed quite a bit with this fundraise. They were expected to IPO this year and give us poor retail investors a chance at investing in them.
Memory hierarchy management across HBM/DDR/Flash is much more difficult but necessary to achieve practical inference economics.
[1] https://ieeexplore.ieee.org/abstract/document/9623424
Everyone else went the CoWoS direction, which enables heterogeneous integration and much more cost effective inference.
I think that while being fast, Cerebras is probably not very economical in fleets at scale.
That's the problem.
Unless the majority of the value is on the other end of the curve, it's a tough sell.
OpenAI is still developing their own chips with Broadcom, but they are not operational yet. So for now, they're buying GPUs from Nvidia to build up their own revenue (to later spend it on their own chips).
By 2030, many companies will eventually be looking for alternatives to Nvidia, like Cerebras or Lightmatter, for both training and inference use cases.
For example [0] Meta just acquired a chip startup for this exact reason - "An alternative to training AI systems" and "to cut infrastructure costs linked to its spending on advanced AI tools.".
[0] https://www.reuters.com/business/meta-buy-chip-startup-rivos...
If that's 5 years into the future, that looks bad for Nvidia, if it's >10 years in the future, that doesn't affect Nvidia's current stock price very much.
https://www.cerebras.ai/pricing
$50/month for one person for code (daily token limit), or pay per token, or $1500/month for small teams, or an enterprise agreement (contact for pricing).
Seems high.
(or other people that read the litepaper https://www.extropic.ai/future)
He reads to me like someone who markets better than he does things. I am disinclined to take him as an authority in this space.
How do you believe this is related to Cerebras?
Just tried https://cloud.cerebras.ai and wow, is it fast!
Groq also runs OSS models at speed, which is my preferred way to access Kimi K2 on their free quotas.
Just plug it into a normal chat interface like Jan or Cherry Studio and it's incredibly fast.
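For anyone who wants to script it rather than use a chat UI: the reason tools like Jan or Cherry Studio "just plug in" is that the endpoint speaks an OpenAI-compatible chat completions protocol, so something like the sketch below should work. The base URL and model id here are assumptions; check the Cerebras docs and substitute your own API key.

```python
from openai import OpenAI

# Assumed OpenAI-compatible endpoint and model id; verify against the Cerebras docs.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key="YOUR_CEREBRAS_API_KEY",
)

resp = client.chat.completions.create(
    model="qwen-3-coder-480b",  # placeholder model id
    messages=[{"role": "user", "content": "Write a haiku about fast inference."}],
    stream=True,                # streaming makes the token rate very visible
)

for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```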