NVIDIA DGX Spark In-Depth Review: A New Standard for Local AI Inference
Posted 3 months ago · Active 3 months ago
lmsys.org · Tech · Story · High profile
Skeptical / mixed
Debate
80/100
Key topics
NVIDIA DGX Spark
AI Inference
Hardware Review
The NVIDIA DGX Spark is reviewed, with the community discussing its performance, pricing, and value proposition for local AI inference, raising concerns about its competitiveness with other hardware options.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: 3h after posting
Peak period: 17 comments in the 3-6h window
Avg / period: 6.2 comments
Comment distribution: 93 data points
Based on 93 loaded comments
Key moments
- 01 Story posted: Oct 13, 2025 at 9:07 PM EDT (3 months ago)
- 02 First comment: Oct 14, 2025 at 12:19 AM EDT (3h after posting)
- 03 Peak activity: 17 comments in the 3-6h window, the hottest stretch of the conversation
- 04 Latest activity: Oct 15, 2025 at 10:47 PM EDT (3 months ago)
ID: 45575127 · Type: story · Last synced: 11/20/2025, 6:42:50 PM
For inference, the DGX Spark does not look like a good choice, as there are cheaper alternatives with better performance.
Then there's the Mac Studio, which outdoes them in all respects except FP8 and FP4 support. As someone on Reddit put it: https://old.reddit.com/r/LocalLLaMA/comments/1n0xoji/why_can...
The DGX seems vastly more capable.
Or you can just ask the ollama people about the ollama problems. Ollama is (or was) just a Go wrapper around llama.cpp.
https://www.ebay.com/sch/i.html?_nkw=mac+studio+m3+ultra+512...
So 38.54 t/s on 120B? Have you tested filling the context too?
Example: looking at blk.0.attn_k.weight, it's q8_0, as are a number of other layers:
https://huggingface.co/ggml-org/gpt-oss-20b-GGUF/tree/main?s...
Looking at the same weight on Ollama, it's BF16:
https://ollama.com/library/gpt-oss:20b/blobs/e7b273f96360
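For anyone who wants to verify this locally: per-tensor quantization types can be read straight out of a GGUF file with the gguf Python package that ships alongside llama.cpp. A minimal sketch; the file path is a placeholder for whichever GGUF you have downloaded.

```python
# A sketch: list the per-tensor quantization types stored in a GGUF file.
# Assumes the `gguf` package (pip install gguf); the path is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("gpt-oss-20b.gguf")  # placeholder local path
for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum value (e.g. Q8_0, BF16, F32)
    print(f"{tensor.name:40s} {tensor.tensor_type.name:6s} shape={list(tensor.shape)}")
```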
People who keep pushing Apple gear tend to forget that Apple has decided that what the industry considers standard, proprietary or not, isn't made available on their hardware.
Even if Metal is actually a cool API to program for.
It's called a de facto standard, which you can check in your favourite dictionary.
Still, a PC with a 5090 will in many cases give much better bang for the buck, except when limited by the slower speed of the main memory.
The greater bandwidth available when accessing the entire 128 GB of memory is the only advantage of the NVIDIA DGX; a cheaper PC with a discrete GPU has a faster GPU, a faster CPU, and faster local GPU memory.
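To put the bandwidth point in perspective: single-stream decode is roughly memory-bound, so a ceiling estimate is just bandwidth divided by bytes read per token. A back-of-envelope sketch with illustrative figures, not measurements from the review:

```python
# Back-of-envelope decode ceiling: single-stream decode is roughly memory-bound,
# so tokens/s is about usable bandwidth divided by bytes read per token.
# All figures below are illustrative assumptions, not measurements from the review.

def decode_tps(bandwidth_gbs: float, active_params_b: float, bytes_per_param: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param  # weights touched per generated token
    return bandwidth_gbs * 1e9 / bytes_per_token

# gpt-oss-120b is MoE with roughly 5B active parameters per token (assumption).
print(decode_tps(273, 5, 1.0))   # ~273 GB/s LPDDR5X (Spark-class), ~8-bit weights: ~55 t/s ceiling
print(decode_tps(1792, 5, 1.0))  # ~1792 GB/s GDDR7 (5090), if the active weights fit in VRAM: ~358 t/s
```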
Things are changing rapidly, and there is a not-insignificant chance that it'll seem like a big waste of money within 12 months.
If SOCAMM2 is used, it will still probably top out somewhere in the 512-768 GB/s bandwidth range, unless LPDDR6X/LPDDR7X or SOCAMM2 turns out to be that much better; SOCAMM on the DGX Station is just 384 GB/s with LPDDR5X.
The form factor will stay constrained for the near future, but will probably retain the highest compute for its size.
The only way there will be a difference is if Intel or AMD step on the gas, and that still gives this maybe 2-3 years of a lead, with another 2 on top; unless they already have something cooking, it isn't going to happen.
Maybe a company is working on something totally different in secret that we can't even imagine. The amount of £ being thrown into this space at the moment is enormous.
To me it seems like you're paying more than twice the price mostly for CUDA compatibility.
Running gpt-oss-120b with an RTX 5090 and 2/3 of the experts offloaded to system RAM (less than half of the memory bandwidth of this thing), my machine gets ~4100tps prefill and ~40tps decode.
Your spreadsheet shows the Spark getting ~94tps prefill and ~11tps decode.
Now, it's expected that my machine should slaughter this thing in prefill, but decode should be very similar, or the Spark a touch faster.
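For context, the expert-offload setup described above is usually done with llama.cpp's tensor-override options. A hedged sketch; the flag names assume a recent llama.cpp build and the model path is a placeholder, so check `llama-server --help` on your version:

```python
# A sketch of the expert-offload setup: keep the MoE expert FFN tensors in system
# RAM while the rest of the model sits on the GPU. Flag names assume a recent
# llama.cpp build (check `llama-server --help`); the model path is a placeholder.
import subprocess

cmd = [
    "llama-server",
    "-m", "gpt-oss-120b-mxfp4.gguf",              # placeholder model path
    "--n-gpu-layers", "999",                       # offload all layers that fit...
    "--override-tensor", r"\.ffn_.*_exps\.=CPU",   # ...but pin expert tensors to CPU RAM
    "--ctx-size", "16384",
]
# Newer builds also expose a simpler `--n-cpu-moe N` knob for offloading N layers'
# worth of experts; which option exists depends on the build.
subprocess.run(cmd, check=True)
```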
tl;dr it gets absolutely smashed by Strix Halo, at half the price.
The only thing that might be interesting about this DGX Spark is that its prefill manages to be faster due to better compute. I haven't compared the numbers yet, but they are included in the article.
1. Virtually every model that you'd run was developed on Nvidia gear and will run on Spark.
2. Spark has fast-as-hell interconnects, the sort of interconnects that one would want to use in an actual AI DC, so you can use more than one Spark at the same time, use RDMA, and actually start to figure out how things work the way they do and why. You can do a lot with 200 Gb of interconnect.
$1,295.00
https://www.balticnetworks.com/products/mikrotik-crs812-ddq-...
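Ganging Sparks together over that 200 Gb fabric is, at the software level, ordinary multi-node collective setup. A minimal sketch with PyTorch and NCCL; the hostname and port are placeholders, and whether NCCL actually uses RDMA over the ConnectX link depends on the installed driver and network stack:

```python
# A minimal two-node all-reduce over the fabric, launched with torchrun on each
# Spark. Hostname and port are placeholders; whether NCCL actually goes over RDMA
# on the ConnectX link depends on the installed driver/network stack.
#
#   torchrun --nnodes=2 --nproc-per-node=1 --node-rank=<0|1> \
#            --master-addr=spark-0 --master-port=29500 allreduce_demo.py
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")  # reads rank/world size from torchrun's env vars
    rank = dist.get_rank()
    torch.cuda.set_device(0)

    x = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(x)                        # default op is SUM across all ranks
    print(f"rank {rank}: {x[0].item()}")      # prints 3.0 on both ranks with 2 nodes

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```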
This is insanely slow given its 200+GB/s memory bandwidth. As a comparison, I've tested GPT OSS 120B on Strix Halo and it obtains 420tps prefill and >40tps decode.
Could I write code that runs on Spark and effortlessly run it on a big GB300 system with no code changes?
It's designed to be a local dev machine for Nvidia server products. It has the same software and hardware stack as enterprise Nvidia hardware. That's what it is designed for.
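On the portability question: what you write is ordinary CUDA/PyTorch either way, so a snippet like the one below runs unchanged on a Spark or a GB300 node, with only the reported device differing. Whether a given prebuilt library ships kernels for both architectures is a separate packaging question; this is just an illustrative sketch:

```python
# The same PyTorch/CUDA code runs on a Spark (GB10) or a GB300 node; only the
# reported device name and compute capability differ. Whether a given prebuilt
# wheel ships kernels for both architectures is a separate packaging question.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

a = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device=device, dtype=torch.bfloat16)
c = a @ b  # cuBLAS (or the CPU fallback) picks the kernels per architecture
print(c.shape)
```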
Wait for M5 series Macs for good value local inferencing. I think the M5 Pro/Max are going to be very good values.
[1] (Updated) NVIDIA Jetson AGX Thor Developer Kit to Launch in Mid-August with 2070 TFLOPS AI Performance, Priced at $3499:
https://linuxgizmos.com/updated-nvidia-jetson-agx-thor-devel...
[2] AAEON Announces BOXER-8741AI with NVIDIA Jetson Thor T5000 Module:
https://linuxgizmos.com/aaeon-announces-boxer-8741ai-with-nv...
I am still amazed at how many companies buy a ton of DGX boxes and are then surprised that Nvidia does not have a Kubernetes-native platform for training and inference across all the DGX machines. The Run.ai acquisition did not change anything: all the work of integrating with distributed training frameworks like Ray, or with scalable inference platforms like KServe/vLLM, is still left to the user.
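To make the "left to the user" point concrete, the glue typically looks something like a Ray Serve deployment wrapping a vLLM engine. A minimal hedged sketch; the model name, GPU count, and route are placeholders, and a real multi-node DGX deployment needs far more than this:

```python
# A minimal Ray Serve deployment wrapping a vLLM engine: roughly the glue that
# is left to the user. Model name, GPU count and route are placeholders.
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(num_replicas=1, ray_actor_options={"num_gpus": 1})
class Generator:
    def __init__(self) -> None:
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

    async def __call__(self, request) -> dict:
        prompt = (await request.json())["prompt"]
        params = SamplingParams(max_tokens=256, temperature=0.7)
        output = self.llm.generate([prompt], params)[0]
        return {"text": output.outputs[0].text}

serve.run(Generator.bind(), route_prefix="/generate")
```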
I guess the next one I'm looking out for is the Orange Pi AI Studio Pro. It should have 192 GB of RAM, so it would be able to run Qwen3 235B, and even though it's DDR4, it's nearly double the bandwidth of the Spark.
Admittedly I'm not a huge fan of Debian; I'd likely end up going Arch on this one.
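On whether 192 GB is actually enough for Qwen3 235B: it comes down to bits per weight. A rough sketch, ignoring KV cache and runtime overhead, with approximate bits-per-weight figures for common quant formats (assumptions, not exact values):

```python
# Rough weight-memory estimate for a 235B-parameter model at common quantizations.
# Bits-per-weight values are approximate, and KV cache / runtime overhead is
# ignored, so real requirements are higher.
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

for label, bits in [("Q8_0", 8.5), ("Q4_K_M", 4.85), ("~3.5-bit", 3.5)]:
    print(f"{label:9s} ~{weight_gb(235, bits):5.0f} GB")
# Q8_0 (~250 GB) would not fit in 192 GB; ~4-bit quants (~143 GB) leave room for context.
```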
>Also, if you're in the U.S.,
I'm not.
> I'd much rather stick with nVidia that has an ecosystem (even Apple for that matter), than touch a system like this off of Alibaba.
I get that. Realistically I'm waiting for Medusa Halo, some affordable datacenter card, something.
(photo for reference: https://www.wwt.com/api-new/attachments/5f033e355091b0008017...)
No doubt that’s present here too somehow.
Gotta cut off something important so you’ll spend more on the next more expensive product.
DGX Spark: pp 1723.07/s, tg 38.55/s
Ryzen AI Max+ 395: pp 711.67/s, tg 40.25/s
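Numbers like these typically come from llama.cpp's llama-bench tool. A hedged sketch of collecting pp/tg figures for a model; the flag spellings assume a recent build and the model path is a placeholder:

```python
# A sketch of collecting pp (prompt processing) and tg (token generation) rates
# with llama.cpp's llama-bench. Flag names assume a recent build; the model path
# is a placeholder.
import subprocess

subprocess.run(
    [
        "llama-bench",
        "-m", "gpt-oss-120b-mxfp4.gguf",  # placeholder model path
        "-p", "2048",                      # prompt tokens, gives the pp rate
        "-n", "32",                        # generated tokens, gives the tg rate
    ],
    check=True,
)
```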
Is it worth the money?
I somehow expected the Spark to be the 'God in a Box' moment for local AI, but it feels like they went for trying to sell multiple units instead.
I'd be more tempted by a second-hand 128 GB M2 Ultra at ~800 GB/s, but the prices here are still high, and I'm not sure the Spark is going to convince people to part with those, unless we see some M5 boxes with gluttonous amounts of RAM soon. An easy way for Apple to catch up again.
It would be interesting to swap out Ollama for LM Studio and use their built-in MLX support and see the difference.
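For reference, LM Studio's MLX path can be approximated directly with the mlx-lm package. A hedged sketch; the model repo id below is hypothetical, and the generate() signature can differ between mlx-lm versions:

```python
# A sketch of running a model through MLX directly, roughly what LM Studio's MLX
# backend does under the hood. The repo id is hypothetical, and the generate()
# signature can differ between mlx-lm versions.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/some-model-4bit")  # hypothetical repo id
text = generate(
    model,
    tokenizer,
    prompt="Explain KV caching in one paragraph.",
    max_tokens=128,
    verbose=True,  # prints generation speed in tokens/s
)
print(text)
```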
a) what is the noise level? In that small box, it should be immense?
b) how many frames do we get in Q3A at max. resolution and will it be able to run Crysis? ;-) LOL (SCNR)