Nvidia DGX Spark: Great Hardware, Early Days for the Ecosystem
Posted 3 months ago · Active 2 months ago
simonwillison.net · Tech story · High profile
Sentiment: skeptical / mixed · Debate: 70/100
Key topics
Nvidia DGX Spark
AI Hardware
GPU Computing
The Nvidia DGX Spark is a powerful AI hardware solution, but its ecosystem is still in its early days and faces challenges with software compatibility and value proposition.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 2h after posting
Peak period: 101 comments in Day 1
Average per period: 22.2 comments
Comment distribution: 111 data points (based on 111 loaded comments)
Key moments
- 01 Story posted: Oct 14, 2025 at 8:49 PM EDT (3 months ago)
- 02 First comment: Oct 14, 2025 at 10:50 PM EDT (2h after posting)
- 03 Peak activity: 101 comments in Day 1, the hottest window of the conversation
- 04 Latest activity: Oct 26, 2025 at 2:36 AM EDT (2 months ago)
ID: 45586776 · Type: story · Last synced: 11/20/2025, 6:45:47 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
And of course there's the totally random and inconsistent support outside of the few dedicated cards, which is honestly why CUDA is the de facto standard everyone measures against: you could run CUDA applications, if slowly, even on the lowest-end Nvidia cards, like the Quadro NVS series (think lowest-end GeForce chip, but often paired with more display outputs and business-oriented support for users who didn't need fast 3D). And you generally still can run core CUDA code, within the last few generations, on everything from the smallest mobile chip to the biggest datacenter behemoth.
I kinda lost track; this whole thread reminded me how hopeful I was about playing with GPGPU on my then-new X1600.
https://learn.microsoft.com/en-us/cpp/parallel/amp/cpp-amp-c...
But maybe this will change? Software issues somehow?
It also runs CUDA, which is useful
plus apparently some of the early benchmarks were run with Ollama and should be disregarded
Management becomes layers upon layers of bash scripts which ends up calling a final batch script written by Mellanox.
They'll catch up soon, but you end up having to stay strictly on their release cycle always.
Lots of effort.
I'm running VLLM on it now and it was as simple as:
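Roughly, the NGC recipe amounts to something like this sketch (the image tag is a placeholder here; use the one listed on the catalog page linked below):
    # Pull and run NVIDIA's vLLM container from NGC, exposing the GPU and the API port
    # (tag is illustrative -- take the real one from the NGC catalog page)
    docker run --gpus all -it --rm -p 8000:8000 nvcr.io/nvidia/vllm:<tag-from-ngc-page>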
(That recipe is from https://catalog.ngc.nvidia.com/orgs/nvidia/containers/vllm?v... ) And then, in the Docker container:
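Inside the container it's essentially vLLM's OpenAI-compatible server, something like:
    # Start vLLM's OpenAI-compatible API server; Qwen/Qwen3-0.6B is the small default model mentioned below
    vllm serve Qwen/Qwen3-0.6B --host 0.0.0.0 --port 8000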
The default model it loads is Qwen/Qwen3-0.6B, which is tiny and fast to load. I'm curious where you find its main value, and how it would fit within your tooling and use cases compared to other hardware?
From the inference benchmarks I've seen, an M3 Ultra always comes out on top.
Installation instructions: https://github.com/comfyanonymous/ComfyUI#nvidia
It's a webUI that'll let you try a bunch of different, super powerful things, including easily doing image and video generation in lots of different ways.
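For reference, the manual install from that README boils down to roughly this on an Nvidia box; treat it as a sketch and follow the linked instructions for the exact PyTorch/CUDA wheel:
    git clone https://github.com/comfyanonymous/ComfyUI
    cd ComfyUI
    # install a CUDA-enabled PyTorch build first (see the README for the exact command)
    pip install -r requirements.txt
    python main.py   # then open the web UI it prints, usually http://127.0.0.1:8188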
It was really useful to me when benchmarking stuff at work on various gear, e.g. L4 vs A40 vs H100 vs 5th-gen EPYC CPUs, etc.
I should be allowed to do stupid things when I want. Give me an override!
(Because Docker doesn't do this by default, best practice is to create a non-root user in your Dockerfile and run as that)
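A minimal sketch of that practice, assuming a Debian/Ubuntu-based image (image and username are just examples):
    FROM ubuntu:24.04
    # Create an unprivileged user and switch to it so the container doesn't run as root
    RUN useradd --create-home --shell /bin/bash appuser
    USER appuser
    WORKDIR /home/appuser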
Can people please not listen to this terrible advice that gets repeated so often, especially in Australian IT circles, somehow, by young, naive folks.
You really need to talk to your accountant here.
It's probably under 25% in deduction at double the median wage, and a little over at triple, and that's *only* if you are using the device entirely for work, as in it sits in an office and nowhere else. If you are using it personally, you open yourself up to all sorts of drama if and when the ATO ever decides to audit you for making a $6k AUD claim for a computing device beyond what you normally use to do your job.
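Back-of-the-envelope version of that point; the marginal rate and work-use share below are hypothetical, and real rules (e.g. depreciation over effective life for items over $300) make it less favourable still:
    # Rough value of a deduction: you only get back price * work_use * marginal_rate
    price = 6000          # AUD, the device
    work_use = 0.6        # hypothetical fraction of genuine work use
    marginal_rate = 0.37  # hypothetical marginal tax rate
    saved = price * work_use * marginal_rate
    print(f"Tax saved: ~${saved:.0f}, still out of pocket: ~${price - saved:.0f}")
    # -> Tax saved: ~$1332, still out of pocket: ~$4668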
I'm sure I'll get downvoted for this, but this common misunderstanding about tax deductions does remind me of a certain Seinfeld episode :)
Kramer: It's just a write off for them
Jerry: How is it a write off?
Kramer: They just write it off
Jerry: Write it off what?
Kramer: Jerry all these big companies they write off everything
Jerry: You don't even know what a write off is
Kramer: Do you?
Jerry: No. I don't
Kramer: But they do and they are the ones writing it off
Even if what you are saying is correct, the discount is just smaller. That's compared to no discount at all on compute/GPU rental, unless your company purchases it.
I, for example, have some healthcare research projects with personally identifiable data, and in these times it's simpler for the users to trust my company than my company plus some overseas company and its associated government.
I'm looking forward to GLM 4.6 Air - I expect that one should be pretty excellent, based on experiments with a quantized version of its predecessor on my Mac. https://simonwillison.net/2025/Jul/29/space-invaders/
The 120B model is better but too slow since I only have 16GB VRAM. That model runs decently[1] on the Spark.
[1]: https://news.ycombinator.com/item?id=45576737
I'd be pissed if I paid this much for hardware and the performance was this lacklustre while also being kneecapped for training
Obviously, even with ConnectX, it's only 240 GiB of VRAM, so no big models can be trained.
But if FP4 means 4-bit floating point, and the hardware capability of the DGX Spark is effectively only in FP4, then yes, it was nonsense to wish it could have been used for training. But that wasn't obvious from Nvidia's advertising.
Also, the other reviews I've seen point out that inference speed is slower than a 5090 (or on par with a 4090 with some tailwind), so the big difference here (other than core counts) is the large chunk of "unified" memory. Still seems like a tricky investment in an age where a Mac will outlive everything else you care to put on a desk and AMD has semi-viable APUs with equivalent memory architectures (even if ROCm is… well… not all there yet).
Curious to compare this with cloud-based GPU costs, or (if you really want on-prem and fully private) the returns from a more conventional rig.
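A crude break-even sketch for that comparison; the rental rate is an assumption, and it ignores power, resale value, and the fact you wouldn't be renting identical hardware:
    # Hours of cloud GPU rental that the purchase price would buy
    device_price = 3999         # USD list price
    cloud_rate_per_hour = 1.50  # hypothetical on-demand rate for a comparable GPU
    hours = device_price / cloud_rate_per_hour
    print(f"~{hours:.0f} hours, i.e. ~{hours / 24:.0f} days of continuous rental")
    # -> ~2666 hours, i.e. ~111 days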
It's not comparable to 4090 inference speed. It's significantly slower, because of the lack of MXFP4 models out there. Even compared to a Ryzen AI 395 (ROCm / Vulkan) on gpt-oss-120B mxfp4, the DGX somehow manages to lose on token generation (prompt processing is faster, though).
> Still seems like a tricky investment in an age where a Mac will outlive everything else you care to put on a desk and AMD has semi-viable APUs with equivalent memory architectures (even if ROCm is… well… not all there yet).
ROCm (v7) for APUs has actually come a long way, mostly thanks to community effort; it's quite competitive and more mature. It's still not totally user-friendly, but it doesn't break between updates (I know the bar is low, but that was the status a year ago). So, in comparison, the Strix Halo offers a lot of value for your money if you need a cheap, compact inference box.
Haven't tested fine-tuning/training yet, but in theory it's supported. And don't forget that the APU is extremely performant for "normal" tasks (Threadripper level) compared to the CPU of the DGX Spark.
I have no immediate numbers for prefill, but the memory bandwidth is ~4x greater on a 4090, which will lead to ~4x faster decode.
For inference decode, bandwidth is the main limitation, so if running LLMs is your use case you should probably get a Mac instead.
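The rough arithmetic behind that: decode is memory-bandwidth bound, so tokens/sec is roughly bandwidth divided by the bytes touched per generated token, about the size of the (quantized) weights. The numbers below are approximate spec-sheet values, not benchmarks, and the result is an upper bound:
    # tokens/sec ~= memory bandwidth / bytes read per generated token (~model weight size)
    def est_decode_tps(bandwidth_gb_s, model_gb):
        return bandwidth_gb_s / model_gb

    model_gb = 16  # e.g. a ~30B model at 4-bit, small enough to fit a 24GB card
    for name, bw in [("DGX Spark (~273 GB/s)", 273), ("RTX 4090 (~1008 GB/s)", 1008)]:
        print(f"{name}: ~{est_decode_tps(bw, model_gb):.0f} tok/s upper bound")
    # ratio: 1008 / 273 ~= 3.7x, i.e. the "~4x" above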
The Mac Studio is a more appropriate comparison. There is not yet a DGX laptop, though.
I can do that with a laptop too. And with a dedicated GPU. Or a blade in a data center. I thought the feature of the DGX was that you can throw it in a backpack.
At list price, it's 1,000 USD cheaper: 3,699 vs 4,699. I know "a lot" is relative, but that's a lot for me, for sure.
Why not?
Now that you bring it up, the M3 Ultra Mac Studio goes up to 512GB for about a $10k config with around 850 GB/s of bandwidth, for those who "need" a near-frontier large model. I think 4x the RAM is not quite worth more than doubling the price, especially if MoE support gets better, but it's interesting that you can get a DeepSeek R1 quant running on prosumer hardware.
https://github.com/apple/containerization/
Running some other distro on this device is likely to require quite some effort.
What I found to be a good solution was using Spack: https://spack.io/ It allows you to download/build the full toolchain of stuff you need for whatever architecture you are on: all dependencies, compilers (GCC, CUDA, MPI, etc.), compiled Python packages, and so on. And if you need to add a new recipe for something, it is really easy.
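For a feel of what "adding a recipe" means, here's a minimal hypothetical package.py in Spack's Python DSL; the package name, URL, and checksum are made up:
    # package.py for a hypothetical "mytool" package
    from spack.package import *

    class Mytool(CMakePackage):
        """Example package to illustrate Spack's recipe DSL (not a real package)."""
        homepage = "https://example.com/mytool"
        url = "https://example.com/mytool-1.0.tar.gz"

        version("1.0", sha256="0000000000000000000000000000000000000000000000000000000000000000")  # placeholder
        variant("cuda", default=False, description="Build with CUDA support")
        depends_on("cmake@3.20:", type="build")
        depends_on("cuda", when="+cuda")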
For the fellow Brits - you can tell this was named by Americans!!!
This is a high-level overview by one of the Spack authors, from an HN post back in 2023 (the top comment of a 100-comment thread), including a link to the original Spack paper [1]:
At a very high level, Spack has:
* Nix's installation model and configuration hashing
* Homebrew-like packages, but in a more expressive Python DSL, and with more versions/options
* A very powerful dependency resolver that doesn't just pick from a set of available configurations -- it configures your build according to possible configurations.
You could think of it like Nix with dependency resolution, but with a nice Python DSL. There is more on the "concretizer" (resolver) and how we've used ASP for it here:
* "Using Answer Set Programming for HPC Dependency Solving", https://arxiv.org/abs/2210.08404
[1] Spack – scientific software package manager for supercomputers, Linux, and macOS (100 comments):
https://news.ycombinator.com/item?id=35237269
Beyond that, it seems like the 395 in practice smashes the DGX Spark in inference speed for most models. I haven't seen NVFP4 comparisons yet and would be very interested to.
I don't think there are any models supporting NVFP4 yet, but we shall probably start seeing them.
Can anyone explain this? Does this machine have multiple CPU architectures?
Is that true? Nvidia Jetson is quite mature now, and runs on ARM.
The DGX Spark is completely overpriced for its performance compared to a single RTX 5090.
I don't think the 5090 could do that with only 32GB of VRAM, could it?
That's the use case, not running LLMs efficiently, and you can't do that with an RTX 5090.
P.S. exploded view from the horse's mouth: https://www.nvidia.com/pt-br/products/workstations/dgx-spark...
You CAN build one yourself, but for people wanting to get started this could be a really viable option.
Perhaps less so though with Apple's M5? Let's see...
https://flopper.io/gpu/nvidia-dgx-spark