NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference
Posted 3 months ago · Active 3 months ago
lmsys.org · Tech · Story · High profile
Skeptical / mixed
Debate
80/100
Key topics
NVIDIA DGX Spark
AI Inference
Hardware Review
The NVIDIA DGX Spark is reviewed, with the community discussing its performance, pricing, and value proposition for local AI inference, raising concerns about its competitiveness with other hardware options.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 3h after posting
Peak period: 17 comments in the 3-6h window
Avg / period: 6.2 comments
Comment distribution: 93 data points
Based on 93 loaded comments
Key moments
- 01 Story posted: Oct 13, 2025 at 9:07 PM EDT (3 months ago)
- 02 First comment: Oct 14, 2025 at 12:19 AM EDT (3h after posting)
- 03 Peak activity: 17 comments in the 3-6h window, the hottest stretch of the conversation
- 04 Latest activity: Oct 15, 2025 at 10:47 PM EDT (3 months ago)
ID: 45575127 · Type: story · Last synced: 11/20/2025, 6:42:50 PM
For inference, the DGX Spark does not look like a good choice, as there are cheaper alternatives with better performance.
Then there's the Mac Studio, which outdoes them in all respects except FP8 and FP4 support. As someone on Reddit put it: https://old.reddit.com/r/LocalLLaMA/comments/1n0xoji/why_can...
The DGX seems vastly more capable.
Or you can just ask the ollama people about the ollama problems. Ollama is (or was) just a Go wrapper around llama.cpp.
https://www.ebay.com/sch/i.html?_nkw=mac+studio+m3+ultra+512...
So 38.54 t/s on 120B? Have you tested filling the context too?
Example: looking at blk.0.attn_k.weight, it's q8_0, as are a number of other layers:
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...
Looking at the same weight on Ollama, it's BF16:
https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
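For anyone who wants to verify this locally: per-tensor quantization types can be read straight out of a GGUF file with the gguf Python package that ships alongside llama.cpp. A minimal sketch; the file path is a placeholder for whichever GGUF you have downloaded.

```python
# A sketch: list the per-tensor quantization types stored in a GGUF file.
# Assumes the `gguf` package (pip install gguf); the path is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("gpt-oss-20b.gguf")  # placeholder local path
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum value (e.g. Q8_0, BF16, F32)
    print(f"{tensor.name:40s} {tensor.tensor_type.name:6s} shape={list(tensor.shape)}")
```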
People who keep pushing Apple gear tend to forget that Apple has decided that what the industry considers standard, proprietary or not, isn't made available on their hardware.
Even if Metal is actually a cool API to program for.
It's called a de facto standard, which you can check in your favourite dictionary.
Still, a PC with a 5090 will in many cases give much better bang for the buck, except when limited by the slower speed of the main memory.
The greater bandwidth available when accessing the entire 128 GB of memory is the only advantage of the NVIDIA DGX; a cheaper PC with a discrete GPU has a faster GPU, a faster CPU, and faster local GPU memory.
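To put the bandwidth point in perspective: single-stream decode is roughly memory-bound, so a ceiling estimate is just bandwidth divided by bytes read per token. A back-of-envelope sketch with illustrative figures, not measurements from the review:

```python
# Back-of-envelope decode ceiling: single-stream decode is roughly memory-bound,
# so tokens/s is about usable bandwidth divided by bytes read per token.
# All figures below are illustrative assumptions, not measurements from the review.

def decode_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_param: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param  # weights touched per generated token
    return bandwidth_gbs * 1e9 / bytes_per_token

# gpt-oss-120b is MoE with roughly 5B active parameters per token (assumption).
print(decode_tps(273, 5, 1.0))   # ~273 GB/s LPDDR5X (Spark-class), ~8-bit weights: ~55 t/s ceiling
print(decode_tps(1792, 5, 1.0))  # ~1792 GB/s GDDR7 (5090), if the active weights fit in VRAM: ~358 t/s
```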
Things are changing rapidly, and there is a not-insignificant chance that it'll seem like a big waste of money within 12 months.
If SOCAMM2 is used, it will still probably top out somewhere in the 512-768 GB/s bandwidth range, unless LPDDR6X/LPDDR7X or SOCAMM2 turns out to be that much better; SOCAMM on the DGX Station is just 384 GB/s with LPDDR5X.
The form factor will stay constrained for the near future, but will probably retain the highest compute for its size.
The only way there will be a difference is if Intel or AMD step on the gas, and that still gives this maybe 2-3 years of a lead, with another 2 on top; unless they already have something cooking, it isn't going to happen.
Maybe a company is working on something totally different in secret that we can't even imagine. The amount of £ being thrown into this space at the moment is enormous.
To me it seems like you're paying more than twice the price mostly for CUDA compatibility.
Running gpt-oss-120b with an RTX 5090 and 2/3 of the experts offloaded to system RAM (less than half of the memory bandwidth of this thing), my machine gets ~4100tps prefill and ~40tps decode.
Your spreadsheet shows the Spark getting ~94tps prefill and ~11tps decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar, or the Spark a touch faster.
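For context, the expert-offload setup described above is usually done with llama.cpp's tensor-override options. A hedged sketch; the flag names assume a recent llama.cpp build and the model path is a placeholder, so check `llama-server --help` on your version:

```python
# A sketch of the expert-offload setup: keep the MoE expert FFN tensors in system
# RAM while the rest of the model sits on the GPU. Flag names assume a recent
# llama.cpp build (check `llama-server --help`); the model path is a placeholder.
import subprocess

cmd = [
    "llama-server",
    "-m", "gpt-oss-120b-mxfp4.gguf",              # placeholder model path
    "--n-gpu-layers", "999",                       # offload all layers that fit...
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # ...but pin expert tensors to CPU RAM
    "--ctx-size", "16384",
]
# Newer builds also expose a simpler `--n-cpu-moe N` knob for offloading N layers'
# worth of experts; which option exists depends on the build.
subprocess.run(cmd, check=True)
```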
tl;dr it gets absolutely smashed by Strix Halo, at half the price.
The only thing that might be interesting about this DGX Spark is that its prefill manages to be faster due to better compute. I haven't compared the numbers yet, but they are included in the article.
1. Virtually every model that you'd run was developed on Nvidia gear and will run on Spark.
2. Spark has fast-as-hell interconnects, the sort of interconnects that one would want to use in an actual AI DC, so you can use more than one Spark at the same time, use RDMA, and actually start to figure out how things work the way they do and why. You can do a lot with 200 Gb of interconnect.
$1,295.00
https://www.balticnetworks.com/products/mikrotik-crs812-ddq-...
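Ganging Sparks together over that 200 Gb fabric is, at the software level, ordinary multi-node collective setup. A minimal sketch with PyTorch and NCCL; the hostname and port are placeholders, and whether NCCL actually uses RDMA over the ConnectX link depends on the installed driver and network stack:

```python
# A minimal two-node all-reduce over the fabric, launched with torchrun on each
# Spark. Hostname and port are placeholders; whether NCCL actually goes over RDMA
# on the ConnectX link depends on the installed driver/network stack.
#
#   torchrun --nnodes=2 --nproc-per-node=1 --node-rank=<0|1> \
#            --master-addr=spark-0 --master-port=29500 allreduce_demo.py
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")  # reads rank/world size from torchrun's env vars
    rank = dist.get_rank()
    torch.cuda.set_device(0)

    x = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(x)                        # default op is SUM across all ranks
    print(f"rank {rank}: {x[0].item()}")      # prints 3.0 on both ranks with 2 nodes

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```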
This is insanely slow given its 200+GB/s memory bandwidth. As a comparison, I've tested GPT OSS 120B on Strix Halo and it obtains 420tps prefill and >40tps decode.
Could I write code that runs on Spark and effortlessly run it on a big GB300 system with no code changes?
It's designed to be a local dev machine for Nvidia server products. It has the same software and hardware stack as enterprise Nvidia hardware. That's what it is designed for.
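On the portability question: what you write is ordinary CUDA/PyTorch either way, so a snippet like the one below runs unchanged on a Spark or a GB300 node, with only the reported device differing. Whether a given prebuilt library ships kernels for both architectures is a separate packaging question; this is just an illustrative sketch:

```python
# The same PyTorch/CUDA code runs on a Spark (GB10) or a GB300 node; only the
# reported device name and compute capability differ. Whether a given prebuilt
# wheel ships kernels for both architectures is a separate packaging question.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

a = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
c = a @ b  # cuBLAS (or the CPU fallback) picks the kernels per architecture
print(c.shape)
```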
Wait for M5 series Macs for good value local inferencing. I think the M5 Pro/Max are going to be very good values.
[1] (Updated) NVIDIA Jetson AGX Thor Developer Kit to Launch in Mid-August with 2070 TFLOPS AI Performance, Priced at $3499:
https://linuxgizmos.com/updated-nvidia-jetson-agx-thor-devel...
[2] AAEON Announces BOXER-8741AI with NVIDIA Jetson Thor T5000 Module:
https://linuxgizmos.com/aaeon-announces-boxer-8741ai-with-nv...
I am still amazed at how many companies buy a ton of DGX boxes and are then surprised that Nvidia does not have a Kubernetes-native platform for training and inference across all the DGX machines. The Run.ai acquisition did not change anything: all the work of integrating with distributed training frameworks like Ray, or with scalable inference platforms like KServe/vLLM, is still left to the user.
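To make the "left to the user" point concrete, the glue typically looks something like a Ray Serve deployment wrapping a vLLM engine. A minimal hedged sketch; the model name, GPU count, and route are placeholders, and a real multi-node DGX deployment needs far more than this:

```python
# A minimal Ray Serve deployment wrapping a vLLM engine: roughly the glue that
# is left to the user. Model name, GPU count and route are placeholders.
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self) -> None:
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

    async def __call__(self, request) -> dict:
        prompt = (await request.json())["prompt"]
        params = SamplingParams(max_tokens=256, temperature=0.7)
        output = self.llm.generate([prompt], params)[0]
        return {"text": output.outputs[0].text}

serve.run(Generator.bind(), route_prefix="/generate")
```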
I guess the next one I'm looking out for is the Orange Pi AI Studio Pro. It should have 192 GB of RAM, so it would be able to run Qwen3 235B, and even though it's DDR4, it's nearly double the bandwidth of the Spark.
Admittedly I'm not a huge fan of Debian; I'd likely end up going Arch on this one.
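On whether 192 GB is actually enough for Qwen3 235B: it comes down to bits per weight. A rough sketch, ignoring KV cache and runtime overhead, with approximate bits-per-weight figures for common quant formats (assumptions, not exact values):

```python
# Rough weight-memory estimate for a 235B-parameter model at common quantizations.
# Bits-per-weight values are approximate, and KV cache / runtime overhead is
# ignored, so real requirements are higher.
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.85), ("~3.5-bit", 3.5)]:
    print(f"{label:9s} ~{weight_gb(235, bits):5.0f} GB")
# Q8_0 (~250 GB) would not fit in 192 GB; ~4-bit quants (~143 GB) leave room for context.
```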
>Also, if you're in the U.S.,
I'm not.
> I'd much rather stick with nVidia that has an ecosystem (even Apple for that matter), than touch a system like this off of Alibaba.
I get that. Realistically I'm waiting for Medusa Halo, some affordable datacenter card, something.
(photo for reference: https://www.wwt.com/api-new/attachments/5f033e355091b0008017...)
No doubt that’s present here too somehow.
Gotta cut off something important so you’ll spend more on the next more expensive product.
DGX Spark: pp 1723.07/s, tg 38.55/s
Ryzen AI Max+ 395: pp 711.67/s, tg 40.25/s
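Numbers like these typically come from llama.cpp's llama-bench tool. A hedged sketch of collecting pp/tg figures for a model; the flag spellings assume a recent build and the model path is a placeholder:

```python
# A sketch of collecting pp (prompt processing) and tg (token generation) rates
# with llama.cpp's llama-bench. Flag names assume a recent build; the model path
# is a placeholder.
import subprocess

subprocess.run(
    [
        "llama-bench",
        "-m", "gpt-oss-120b-mxfp4.gguf",  # placeholder model path
        "-p", "2048",                      # prompt tokens, gives the pp rate
        "-n", "32",                        # generated tokens, gives the tg rate
    ],
    check=True,
)
```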
Is it worth the money?
I somehow expected the Spark to be the 'God in a Box' moment for local AI, but it feels like they went for trying to sell multiple units instead.
I'd be more tempted by a second-hand 128 GB M2 Ultra at ~800 GB/s, but the prices here are still high, and I'm not sure the Spark is going to convince people to part with those, unless we see some M5 boxes with gluttonous amounts of RAM soon. An easy way for Apple to catch up again.
It would be interesting to swap out Ollama for LM Studio and use their built-in MLX support and see the difference.
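For reference, LM Studio's MLX path can be approximated directly with the mlx-lm package. A hedged sketch; the model repo id below is hypothetical, and the generate() signature can differ between mlx-lm versions:

```python
# A sketch of running a model through MLX directly, roughly what LM Studio's MLX
# backend does under the hood. The repo id is hypothetical, and the generate()
# signature can differ between mlx-lm versions.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/some-model-4bit")  # hypothetical repo id
text = generate(
    model,
    tokenizer,
    prompt="Explain KV caching in one paragraph.",
    max_tokens=128,
    verbose=True,  # prints generation speed in tokens/s
)
print(text)
```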
a) what is the noise level? In that small box, it should be immense?
b) how many frames do we get in Q3A at max. resolution and will it be able to run Crysis? ;-) LOL (SCNR)