Dell's Version of the DGX Spark Fixes Pain Points
Key topics
Dell's tweaked version of Nvidia's DGX Spark is sparking debate, with some commenters praising the company's efforts to fix existing pain points, while others remain skeptical about dealing with Dell's firmware updates. The discussion highlights a divide between those who value the DGX Spark's niche role in AI research and development, and those who argue that alternatives such as Apple's Macs or AMD's Strix Halo offer better performance or value. Notably, owners of the DGX Spark defend its unique strengths, pointing out that its capabilities are tailored to specific use cases that aren't directly comparable to other hardware. As AI research continues to gain momentum, this conversation feels particularly relevant, shedding light on the trade-offs and specialized needs of this rapidly evolving field.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
- First comment: 1h after posting
- Peak period: 20 comments in 6-9h
- Avg / period: 8
- Based on 88 loaded comments
Key moments
- Story posted: Jan 1, 2026 at 2:11 PM EST (6d ago)
- First comment: Jan 1, 2026 at 3:14 PM EST (1h after posting)
- Peak activity: 20 comments in 6-9h (hottest window of the conversation)
- Latest activity: Jan 3, 2026 at 4:45 AM EST (4d ago)
Then there is the networking. While Strix Halo systems come with two USB4 40 Gbit/s ports, it's difficult to
a) connect more than 3 devices when each machine has only two ports, and
b) get more than 23 Gbit/s or so, if you're lucky.
Something like Apple's RDMA via Thunderbolt would be great to have on Strix Halo…
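For anyone wanting to reproduce that throughput number, here is a minimal measurement sketch, assuming iperf3 is installed on both machines and "iperf3 -s" is already running on the peer; the address and stream count are placeholders:

```python
# Measure actual throughput over the USB4/Thunderbolt network link with iperf3.
# Assumes iperf3 is installed on both ends and "iperf3 -s" is running on the peer.
import json
import subprocess

PEER = "10.0.0.2"   # placeholder address of the iperf3 server on the other box
STREAMS = 8         # parallel TCP streams; a single stream rarely saturates the link

result = subprocess.run(
    ["iperf3", "-c", PEER, "-P", str(STREAMS), "-t", "10", "-J"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)
gbits = report["end"]["sum_received"]["bits_per_second"] / 1e9
print(f"Measured: {gbits:.1f} Gbit/s over {STREAMS} streams")
```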
It's not really intended to be a great value box for running LLMs at home. Jeff Geerling talks about this in the article.
If I want to do hobby/amateur AI research, fine-tune models, and learn the tooling, I'm better off with the DG10 than with AMD's or Apple's systems.
The Strix Halo machines look nice. I'd like one of those too. Especially if/when they ever get around to getting it into a compelling laptop.
But I ordered the ASUS Ascent DG10 machine (since it was more easily available for me than the other versions of these) because I want to play around with fine tuning open weight models, learning tooling, etc.
That and I like the idea of having a (non-Apple) Aarch64 linux workstation at home.
Now if the courier would just get their shit together and actually deliver the thing...
I ended up going with the Asus DG10 because if the goal is to "learn me some AI tooling" I didn't want to have to add "learn me some only recently and shallowly supported-in-linux AMD tooling" to the mix.
I hate NVIDIA -- the company -- but in this case it comes down to pure self-interest in that I want to add some of this stuff to my employable skill set, and NVIDIA ships the machine with all the pieces I need right in the OS distribution.
Plus I have a bias for ARM over x86.
Long run I'm sure I'll end up with a Strix Halo type machine in my collection at some point.
But I also expect those machines to not drop in price, and perhaps even go up, as right now the 128GB of RAM in them is worth the price of the whole machine.
I've used it to fine-tune 20+ models in the last couple of weeks. Neither a Mac nor a Strix Halo even tries to compete.
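The commenter doesn't say which stack they use; a minimal sketch of one common fine-tuning path on a box like this (Hugging Face Transformers + PEFT LoRA), with a placeholder base model and a local JSONL dataset assumed:

```python
# Hypothetical minimal LoRA fine-tune; model name and dataset path are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

BASE = "Qwen/Qwen2.5-1.5B"  # placeholder small open-weight model
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

# Attach low-rank adapters so only a small fraction of the weights actually train.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # rows of {"text": ...}
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=dataset.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           bf16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```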
I'm trying to better understand the trade-offs, or whether it depends on the workload.
AMD’s own marketing numbers suggest the NPU is about 50 TOPS out of 126 TOPS of total compute for the platform. Even if you hand-wave everything else away, that caps the theoretical upside at ~1.6× (126 / 76).
But that assumes:
1. Your workload maps cleanly onto the NPU’s 8-bit fast path.
2. There’s no overhead coordinating the iGPU + NPU (which seems... optimistic).
My expectation is the real-world gain won't be very significant, but I'd love to be proven wrong!
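A quick sanity check on that ceiling, using only the marketing numbers quoted above:

```python
# Back-of-envelope upper bound for adding the NPU on Strix Halo,
# using AMD's marketing split of ~50 NPU TOPS out of ~126 TOPS total.
total_tops = 126
npu_tops = 50
without_npu = total_tops - npu_tops            # what the iGPU/CPU side already delivers

best_case = total_tops / without_npu           # assumes perfect overlap and an INT8-friendly workload
print(f"Theoretical ceiling: {best_case:.2f}x")  # ~1.66x
```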
So while I think the Strix Halo is a mostly useless machine for any kind of AI, and the Spark is actually useful, I don't think pure inference is a good use case for either of them.
It probably only makes sense as a dev kit for larger cloud hardware.
But the nicest addition Dell made, in my opinion, is the retro '90s UNIX workstation-style wallpaper: https://jasoneckert.github.io/myblog/grace-blackwell/
https://www.fsi-embedded.jp/contents/uploads/2018/11/DELLEMC...
I liked the idea until the final specs came out.
On one hand, /r/localllama doesn't like the Spark for running models (too little compute and memory bandwidth). On the other, I am a CUDA developer and find it overpriced.
Finally, 128 GB of DDR5 was $1,000 and is now $3,200, so I bet the DGX Spark will double or triple in price soon, too.
These devices are for AI R&D. If you need to build models or fine tune them locally they're great.
That said, I run GPT-OSS 120B on mine and it's 'fine'. I spend some time waiting on it, but the fact that I can run such a large model locally at a "reasonable" speed is still kind of impressive to me.
It's REALLY fast for diffusion as well. If you're into image/video generation it's kind of awesome. All that compute really shines for workloads that aren't memory-speed bound.
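The commenter doesn't say which pipeline they run; a sketch of what a compute-bound diffusion workload looks like with Hugging Face Diffusers, with the SDXL checkpoint as an illustrative stand-in:

```python
# Illustrative diffusion workload; the checkpoint name is a placeholder, not
# necessarily what the commenter runs on their GB10 box.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.bfloat16
).to("cuda")

# Each denoising step is dominated by matmuls/convolutions rather than weight
# streaming, so it leans on the Spark's compute more than its memory bandwidth.
image = pipe("a retro 90s UNIX workstation on a desk",
             num_inference_steps=30).images[0]
image.save("out.png")
```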
If I wanted to, I could go on eBay, buy a bunch of parts, build my own system, install my own OS, compile a bunch of junk, tinker with config files for days, and then fire up an extra generator to cope with the 2-4x higher power requirements. For all that work I might save a couple of grand and will be able to actually do less with it. Or... I could just buy a GB10 device and turn it on.
It comes preconfigured to run headless and use the NVIDIA ecosystem. Mine has literally never had a monitor attached to it. NVIDIA has guides and playbooks, preconfigured Docker containers, and documentation to get me up and developing in minutes to hours instead of days or weeks. If it breaks I just factory reset it. On top of that it has the added benefit of 200GbE QSFP networking that would cost $1,500 on its own. If I decide I need more oomph and want a cluster, I just buy another one and connect them, then copy/paste the instructions from NVIDIA.
Sometimes a penny saved is a dollar lost.
Not really, it isn't, because it's deliberately gimped and doesn't support the same feature set as the datacenter GPUs[1]. So as a professional development box to e.g. write CUDA kernels before you burn valuable B200 time, it's completely useless. You're much better off getting an RTX 6000 or two, which is also gimped, but at least is much faster.
[1] -- https://github.com/NVIDIA/dgx-spark-playbooks/issues/22
It does seem really shady that they'd claim it has 5th-gen tensor cores and then not support the full feature set. I searched through the Spark forums, and as that poster said, nobody is answering the question.
I'm really curious to see how things shift when the M5 Ultra, with "tensor" matmul functionality in the GPU cores, rolls out. That should be a several-fold speedup for that platform.
Are you doing this with vLLM, or some other model-running library/setup?
(1) https://github.com/ggml-org/llama.cpp/discussions/15396
I thought the last one was a toy until I tried it with a full 1.2-megabyte repomix project dump. It actually works quite well for general code comprehension across the whole codebase, CI scripts included.
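For reference, the vLLM path the question refers to would look roughly like this when run offline; the model name and prompt are placeholders:

```python
# Minimal vLLM offline-inference sketch; the model name is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b")
params = SamplingParams(temperature=0.2, max_tokens=512)

outputs = llm.generate(
    ["Summarize what the CI scripts in this repo do: <repomix dump here>"], params)
print(outputs[0].outputs[0].text)
```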
I recently needed an LLM to batch process some queries for me. I ran an ablation on 20+ models from OpenRouter to find the best one. Guess which ones got 100% accuracy? GPT-5-mini, Grok-4.1-fast and... Llama 4 Scout. For comparison, DeepSeek v3.2 got 90%, and the community darling GLM-4.5-Air got 50%. Even the newest GLM-4.7 only got 70%.
Of course, this is just an anecdotal single datapoint which doesn't mean anything, but it shows that Llama 4 is probably underrated.
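A rough sketch of that kind of ablation, assuming an OpenRouter API key; the model slugs, test cases, and exact-match scoring are illustrative, not the commenter's actual setup:

```python
# Run the same batch of queries against several OpenRouter models and score them.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="sk-or-...")

MODELS = ["openai/gpt-5-mini", "x-ai/grok-4.1-fast", "meta-llama/llama-4-scout"]
CASES = [("What is 12 * 7? Answer with the number only.", "84")]  # (prompt, expected)

for model in MODELS:
    correct = 0
    for prompt, expected in CASES:
        reply = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}])
        correct += expected in reply.choices[0].message.content
    print(f"{model}: {correct / len(CASES):.0%}")
```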
Sounds interesting; can you suggest any good discussions of this (on the web)?
The desktop RTX 5090 has 1792 GB/s of memory bandwidth, partly thanks to its 512-bit bus, compared to the DGX Spark's 256-bit bus and 273 GB/s.
The RTX 5090 has 32G of VRAM vs the 128G of “VRAM” in the DGX Spark, which is really unified memory.
The RTX 5090 also has 21,760 CUDA cores vs 6,144 in the DGX Spark (about 3.5× as many), and with the much higher bandwidth you have a better shot at keeping them fed. So for embarrassingly parallel workloads the 5090 crushes the Spark.
So if you need to fit big models into VRAM and don’t care about speed too much because you are for example, building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer.
If you need speed and 32G of VRAM is plenty, and you don’t care about modeling network interconnections in production, then the RTX 5090 is what you want.
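A first-order way to turn those specs into expected single-stream decode speed (token generation is usually bandwidth-bound, since each token streams the active weights once); the model sizes and 8-bit quantization below are assumptions for illustration:

```python
# Rough decode-speed estimate: tokens/s ~= memory bandwidth / bytes read per token.
def decode_tok_per_s(bandwidth_gb_s: float, model_params_b: float,
                     bytes_per_param: float = 1.0) -> float:
    """Assumes each generated token streams the whole (quantized) dense weight set once."""
    bytes_per_token = model_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# (name, memory bandwidth GB/s, assumed dense 8-bit model size in billions of params)
# MoE models with fewer active parameters per token decode proportionally faster.
for name, bw, model_b in [("RTX 5090", 1792, 30), ("DGX Spark", 273, 120)]:
    print(f"{name}: ~{decode_tok_per_s(bw, model_b):.0f} tok/s on a {model_b}B 8-bit model")
```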
It isn't, because it's a different architecture than the datacenter hardware. They're both called "Blackwell", but that's a lie[1] and you still need a "real" datacenter Blackwell card for development work. (For example, you can't configure/tune vLLM on the Spark, then move it to a B200 and expect it to just work, etc.)
[1] -- https://github.com/NVIDIA/dgx-spark-playbooks/issues/22
The nvfuser code doesn't even call it sm_100 vs. sm_120: NVIDIA's internal nomenclature seems to be 2CTA/1CTA; it's a bin. So there are fewer MMA tilings in the released ISA as of 13.1 / r85 44.
The mnemonic tcgen05.mma doesn't mean anything by itself; it's lowered onto real SASS. FWIW, the people I know doing their own drivers say the whole ISA is there, but it doesn't matter.
The family of mnemonics that hits the "Jensen Keynote" path is roughly here: https://docs.nvidia.com/cuda/parallel-thread-execution/#warp....
10x path is hot today on Thor, Spark, 5090, 6000, and data center.
Getting it to trigger reliably on real tilings?
Well that's the game just now. :)
Because the official NVidia stance is definitely that TMEM, etc. is not supported and doesn't work.
...I don't suppose you have a link to a repo with code that can trigger any of this officially forbidden functionality?
Put this in Nsight Compute: https://github.com/NVIDIA/cutlass/blob/main/examples/79_blac...
(I said 83, it's 79).
If you want to know what NVIDIA really thinks, watch this repo: https://github.com/nVIDIA/fuser. The Polyhedral Wizards at play. All the big not-quite-Fields players are splashing around there. I'm doing lean4 proofs of a bunch of their stuff. https://v0-straylight-papers-touchups.vercel.app
It works now. It's just not the PTX mnemonic that you want to see.
Anyhow, be that as it may, I was talking about the PTX mnemonics and such because I'd like to use this functionality from my own, custom kernels, and not necessarily only indirectly by triggering whatever lies at the bottom of NVidia's abstraction stack.
So what's your endgame with your proofs? You wrote "the breaking point was implementing an NVFP4 matmul" - so do you actually intend to implement an NVFP4 matmul? (: If so I'd be very much interested; personally I'm definitely still in the "cargo-cults from CUTLASS examples" camp, but would love something more principled.
https://chipsandcheese.com/p/inside-nvidia-gb10s-memory-subs...
If you still have the hardware (this and the Mac cluster), can you PLEASE get some advice and run some actually useful benchmarks?
Batching on a single consumer GPU often results in 3-4x the throughput. We have literally no idea what that looks like on a cluster, because you aren't performing useful benchmarks.
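A sketch of the kind of batched-throughput benchmark being asked for, run against any OpenAI-compatible server (vLLM, llama.cpp's server, etc.); the endpoint URL and model name are placeholders:

```python
# Fire N concurrent requests and report aggregate generated tokens/s per batch size.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://spark-0:8000/v1", api_key="none")  # placeholder endpoint
PROMPT = "Write a 200-word summary of the history of RISC workstations."

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model="served-model",  # placeholder: whatever the server is hosting
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def bench(concurrency: int) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    print(f"batch={concurrency:3d}  {sum(tokens) / elapsed:7.1f} tok/s aggregate")

async def main() -> None:
    for concurrency in (1, 4, 16, 64):
        await bench(concurrency)

asyncio.run(main())
```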
https://www.dell.com/en-us/shop/desktop-computers/dell-pro-m...