Nvidia DGX Spark: When Benchmark Numbers Meet Production Reality
Key topics
The article recounts the author's hands-on experience with Nvidia's DGX Spark, covering both impressive performance and several problems (including GPU inference issues) that sparked a lively discussion among commenters about the product's strengths and weaknesses.
Snapshot generated from the HN discussion
Discussion Activity
- Activity level: very active discussion
- First comment: 2h after posting
- Peak period: 91 comments in the 0-12h window
- Average per period: 23.2 comments
- Based on 116 loaded comments
Key moments
- Story posted: Oct 26, 2025 at 1:53 PM EDT (2 months ago)
- First comment: Oct 26, 2025 at 3:30 PM EDT (2h after posting)
- Peak activity: 91 comments in the 0-12h window (the hottest stretch of the conversation)
- Latest activity: Oct 31, 2025 at 2:37 AM EDT (2 months ago)
My job just got me and our entire team a DGX Spark. I'm impressed at the ease of use for Ollama models I couldn't run on my laptop. gpt-oss:120b is shockingly better than what I thought it would be from running the 20b model on my laptop.
The DGX has changed my mind about the future being small specialized models.
Are you shocked because that isn't your experience?
From the article it sounds like Ollama runs CPU inference, not GPU inference. Is that the case for you?
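One hedged way to settle that question yourself (not something shown in the thread) is to poll GPU utilization while Ollama is generating: near-zero utilization during decode suggests a CPU path, sustained high utilization suggests the GPU path is active. A minimal sketch, assuming `nvidia-smi` is on the PATH:

```python
import subprocess, time

def gpu_utilization() -> int:
    # Ask nvidia-smi for the current GPU utilization as a bare number.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip().splitlines()[0])

if __name__ == "__main__":
    # Start this loop, then fire a prompt at Ollama from another terminal.
    for _ in range(30):
        print(f"GPU util: {gpu_utilization()}%")
        time.sleep(1)
```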
I was not familiar with the hardware, so I was disappointed there wasn't a picture of the device. Tried to skim the article and it's a mess. Inconsistent formatting and emoji without a single graph to visualize benchmarks.
I bet the input to the LLM would have been more interesting.
It looks like it worked? Why's it say this?
> Verdict: Inference speed scales proportionally with model size.
Author only tried one model size and it's faster than NVIDIA's reported speed at a larger model. Not really a "Verdict".
> Verdict: 4-bit quantization is production-viable.
That's not really something you can conclude from messing around with it and saying you like the outputs.
> GPU Inference is Fundamentally Broken
Probably not? It probably just doesn't work in llama.cpp right now? Takes a while reading this to work out they tried ollama and then later llama.cpp, which I'd guess is basically testing llama.cpp twice. Actually I don't even believe that, I'm sure author ran into errors that might be a pain to figure out, but there's no evidence it's worse than that.
But then it says this is the "root cause":
Am I to believe GPU inference is really fundamentally broken? I'm not seeing the case made here, just claims. At this point the LLM seems to have gotten confused about whether it's talking about the memory fragmentation issue or the GPU inference issue. But it's hard to believe anything from this point on in the post.
I have H100s to myself, and access to more GPUs than I know what to do with in national clusters.
The Spark is much more fun. And I'm more productive. With two of them, you can debug shallow NCCL/MPI problems before hitting a real cluster. I sincerely love Slurm, but there's nothing like a personal computer.
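For context on what that two-box debugging can look like, here is a minimal NCCL smoke test, purely a sketch and not the commenter's setup; it assumes PyTorch with CUDA and torchrun on both machines, and the IP and port are placeholders:

```python
# nccl_smoke.py
# Launch on both Sparks, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 \
#            --rdzv_backend=c10d --rdzv_endpoint=<spark1-ip>:29500 nccl_smoke.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # reads env vars set by torchrun
    rank = dist.get_rank()
    world = dist.get_world_size()
    torch.cuda.set_device(0)                  # one GPU per Spark
    t = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # every rank should end up with the same sum
    print(f"rank {rank}/{world}: all_reduce ok, t[0]={t[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

If this hangs or errors across the two boxes, the NCCL/network configuration is the problem, which is exactly the class of shallow issue the commenter wants to shake out before touching a real cluster.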
As for debugging, that's where you should be allowed to spin up a small testing cluster on-demand. Why can't you do that with your slurm access?
It's remarkable what can now be done on a whisper-quiet little box. I hope the Strix Halos will be just as much fun, and they should be, so long as Flash Attention works.
Fair, thanks for the answer.
The bane of my existence...
Even ignoring GPU details, the Spark is an awesome, quiet little powerhouse of an arm64 workstation that is 100% Linux-first.
Curious, though, how you offer iDRAC to customers. Do you have another out-of-band BMC for the iDRAC? Or is this in an internal engineering context?
We rent bare metal on-demand, and our whole business is offering compute that you probably wouldn't be able to host in your house ($), as if you owned it yourself.
So, we made it so that users can get access into the BMC and modify the box however they want. When they are done, we've automated the reset as well. Fully self-service.
($) These boxes are very expensive, weigh 350 lbs, sound like a jet engine, and consume ~10 kW.
But it's still not quite like having exclusive access to resources when you want them. So I can see it both ways.
But please have your LLM post writer be less verbose and repetitive. This is like the stock output from any LLM, where it describes in detail and then summarizes back and forth over multiple useless sections. Please consider a smarter prompt and post-editing…
Nah. Do you have first-hand experience with Strix Halo? At less than €1,600 for a 128 GB configuration it manages >45 tokens/s with gpt-oss-120b, which is faster than the DGX Spark at a fraction of the cost.
There are official benchmarks of the Spark running multiple models just fine on llama.cpp
https://github.com/ggml-org/llama.cpp/discussions/16578
Wow. Where do I sign up?
It would be cheaper to buy up a dozen 3060s and build a custom PC around them than to buy the Spark.
Given the extreme advantage they have with CUDA and the whole AI/ML ecosystem, barely matching Apple’s M-ultra speeds is a choice…
Apple benchmarks: https://github.com/ggml-org/llama.cpp/discussions/4167
(cited from https://lmsys.org/blog/2025-10-13-nvidia-dgx-spark/)
The reason we use Ryzens is because we run Linux with almost no problems on them.
The userspace side is where AI is difficult with AMD. Almost all of the community is built around Nvidia tooling first, others second (if at all).
SHOULDA
Practically, if the goal is 100% about AI and cloud isn't an option for some reason, both options are likely "a great way to waste a couple grand trying to save a couple grand" as you'd get 7x the performance and likely still feel it's a bit slow on larger models using an RTX Pro 6000. I say this as a Ryzen AI Max+ 395 owner, though I got mine because it's the closest thing to an x86 Apple Silicon laptop one can get at the moment.
https://www.anaconda.com/blog/python-nvidia-dgx-spark-first-...
Isn't this the same architecture that Apple's Mx implements, from a memory perspective?
> If you start Python and ask it how many CPU cores you have, it will count both kinds of cores and report 20
> Note that because of the speed difference between the cores, you will want to ensure there is some form of dynamic scheduling in your application that can load balance between the different core types.
Sounds like a new type of hell where I now not only need to manage the threads themselves, but also have to take into account what type of core they run on, and Python straight up reports them as the same.
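A rough sketch of the dynamic scheduling that note recommends, using only the standard library; the workload is a stand-in, and `os.cpu_count()` indeed lumps the fast and slow cores together:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def work(chunk):
    # Stand-in for a real per-chunk task.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    print("cpu_count:", os.cpu_count())  # counts fast and slow cores alike
    chunks = [range(i * 100_000, (i + 1) * 100_000) for i in range(200)]
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        # map() hands chunks to workers as they free up, so a worker stuck on a
        # slow core simply processes fewer chunks: coarse dynamic load balancing.
        total = sum(pool.map(work, chunks, chunksize=1))
    print("total:", total)
```

The point is to avoid statically splitting work into one equal slice per core, which would leave the fast cores idle while the slow ones finish their share.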
Two things need to happen for me to get excited about this:
1. It stimulates other manufacturers into building their own DGX-class workstations.
2. This all eventually gets shipped in a decent laptop product.
As much as it pains me, until that happens, it still seems like Apple Silicon is the more viable option, if not the most ethical.
Besides that though, I don't see how Nvidia is particularly non-ethical. They cooperate with Khronos, provide high-quality Linux and BSD drivers free of charge, and don't deliberately block third parties from writing drivers to support new standards. From a relativist standpoint that's as sanctimonious as server hardware gets.
Specifically WRT Mellanox, Nvidia's behavior was more petty than callous.
And yes... yes it is.
> ARM64 Architecture: Not x86_64 (limited ML ecosystem maturity)
> No PyTorch wheels for ARM64+CUDA (must use Docker)
> Most ML tools optimized for x86
No evidence for any of this whatsoever. The author just asked Claude/claude code to write their article and it just plain hallucinated some rubbish.
Like in Upstream Color: https://www.youtube.com/watch?v=zfDyEr8Ykcg
https://cookbook.openai.com/articles/gpt-oss/run-nvidia
Really? Less RAM bw than an Epyc CPU? And 4x to 8x less than a consumer GPU?
How come this doesn’t massively limit LLM inference speeds?
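Memory bandwidth does set the decode ceiling; a back-of-envelope sketch with assumed, approximate numbers (a quoted ~273 GB/s for the Spark, and gpt-oss-120b being MoE with only about 5 B active parameters per token at roughly 4-bit):

```python
# All numbers are assumptions for illustration, not measurements.
bandwidth_gb_s = 273      # approximate LPDDR5X bandwidth of the DGX Spark
active_params_b = 5.1     # MoE: only ~5.1B of gpt-oss-120b's params are active per token
bytes_per_param = 0.5     # ~4-bit weights

gb_read_per_token = active_params_b * bytes_per_param   # ~2.6 GB read per decoded token
ceiling_tok_s = bandwidth_gb_s / gb_read_per_token
print(f"decode ceiling ~ {ceiling_tok_s:.0f} tok/s")    # ~100 tok/s upper bound
# KV-cache traffic, activations, and overhead push real throughput well below
# this ceiling, which is roughly where the tens-of-tokens/s figures in the
# thread sit.
```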
Key corrections:
- Ollama GPU usage: I was wrong. It IS using the GPU (verified 96% utilization). My "CPU-optimized backend" claim was incorrect.
- FP16 vs BF16: enum caught the critical gap: I trained with BF16, tested inference with FP16 (broken), but never tested BF16 inference. "GPU inference fundamentally broken" was overclaimed. It should be "FP16 has issues, BF16 untested (likely works)."
- llama.cpp: veber-alex's official benchmark link proves it works. My issues were likely version-specific, not representative.
- ARM64+CUDA maturity: bradfa was right about the Jetson history. ARM64+CUDA is mature. The new combination is Blackwell+ARM64, not ARM64+CUDA itself.
The HN community caught my incomplete testing, overclaimed conclusions, and factual errors.
Ship early, iterate publicly, accept criticism gracefully.
Thanks especially to enum, veber-alex, bradfa, furyofantares, stuckinhell, jasonjmcghee, eadwu, and renaudr. The article is significantly better now.
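A minimal sketch of the BF16-vs-FP16 comparison the correction above says was never run: load the same checkpoint in each dtype and compare outputs on the same prompt. The checkpoint path and prompt are placeholders, and it assumes the Hugging Face transformers library on a CUDA build of PyTorch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "path/to/finetuned-checkpoint"   # placeholder, not the author's path
prompt = "Summarize the DGX Spark in one sentence."

tok = AutoTokenizer.from_pretrained(model_dir)
for dtype in (torch.bfloat16, torch.float16):
    # Load the same weights in each dtype and generate deterministically.
    model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=dtype).to("cuda")
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(dtype, "->", tok.decode(out[0], skip_special_tokens=True))
    del model
    torch.cuda.empty_cache()   # free memory before loading the next dtype
```

If the BF16 run produces sensible text while the FP16 run degenerates, that isolates the problem to the dtype rather than to GPU inference in general.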
But if that's not the case, then yeah, it's a crappy practice and I'd hate to see it spread any further than it already has.
Is that version correct?
Asking because (in Ollama terms) it's positively ancient. 0.12.6 being the most recent release (currently).
I'm guessing it _might_ make a difference, as the Ollama crowd do seem to be changing things, adding new features and optimisations (etc) quite often.
For example, that 0.12.6 version is where initial experimental support for Vulkan (i.e. Intel Xe GPUs) was added, and in my testing that worked. Not that Vulkan support would do anything in your case. ;)
Ryzen Max 395+ gets you 55 tok/s [1]
[1] https://www.reddit.com/r/LocalLLaMA/comments/1nabcek/anyone_...
It is also a standard UEFI+ACPI system; one Reddit user even reported that they were able to boot Fedora 42 and install the open kernel modules with no problems. The overall delta/number of specific patches in the Canonical 6.17-nvidia tree was pretty small when I looked (the current kernel is 6.11). That, and the likelihood that the consumer variant will support Windows, hopefully bodes well for its upstream Linux compatibility.
To be fair, most of this is also true of Strix Halo from what I can tell (most benchmarks put the DGX furthest ahead at prompt processing and a bit ahead at raw token output, but the software is still buggy and Blackwell is still a bumpy ride overall, so it might get better). But I think it's mostly the pricing that is holding it back. I'm curious what the consumer variant will be priced at.
I haven't exactly bisected the issue, but I'm pretty sure convolutions are broken on sm_121 after a certain size: I'm getting a 20x memory blowup from a convolution after a 2x batch size increase _only_ on the DGX Spark.
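A hypothetical repro sketch for that kind of check, comparing peak CUDA memory for the same convolution at batch size N and 2N; the shapes and sizes are illustrative placeholders, not the commenter's actual workload, and it assumes PyTorch with CUDA:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda().half()

for batch in (8, 16):   # a 2x batch size step
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch, 256, 128, 128, device="cuda", dtype=torch.float16)
    y = conv(x)
    torch.cuda.synchronize()
    print(f"batch {batch}: peak {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
# Peak memory should roughly double with batch size; a ~20x jump seen only on
# sm_121 would point at the cuDNN/driver path rather than the model itself.
```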
I haven't had any problems with inference, but I also don't use the transformers library that much.
llama.cpp was working for openai-oss last time I checked and on release, not sure if something broke along the way.
I don't exactly know if memory fragmentation is something fixable on the driver side. This might just be a problem with the kernel's policy and the GPL, which prevent them from interfering with the memory subsystem at the granularity they'd like (see ZFS and its page table antics), or so my thinking goes.
If you've done stuff on WSL, you've seen similar issues, and you can work around them by running a service that periodically compacts and cleans memory; I have it run every hour. Note that this does impact at least CPU performance and memory allocation speed, but I have not had any issues with long training runs with it in place (24h+). (Assuming that is even the issue: I have never tried without it, and put that service in place as soon as I got the machine because of my experience on WSL.)
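The comment doesn't show the actual service, but a minimal sketch of the kind of hourly compact-and-clean job it describes could look like this; it assumes Linux procfs and root privileges, and would typically be run under systemd or cron rather than as a bare loop:

```python
import os
import time

def compact_and_drop_caches():
    os.sync()                                         # flush dirty pages first
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")                                # drop page, dentry, and inode caches
    with open("/proc/sys/vm/compact_memory", "w") as f:
        f.write("1\n")                                # ask the kernel to compact free memory

if __name__ == "__main__":
    while True:
        compact_and_drop_caches()
        time.sleep(3600)                              # once an hour, as in the comment
```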