Nvidia DGX Spark and Apple Mac Studio = 4x Faster LLM Inference with EXO 1.0
Posted 3 months ago · Active 3 months ago
blog.exolabs.net · Tech · story
Sentiment: calm, positive
Debate: 60/100
Key topics
LLM Inference
Nvidia DGX
Apple Mac Studio
AI Hardware
The article discusses how combining Nvidia DGX Spark with Apple Mac Studio using EXO 1.0 achieves 4x faster LLM inference, sparking discussion on the benefits and limitations of this setup for various AI workloads.
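As a rough illustration of how a mixed cluster like this is typically used (not something shown in the article), exo fronts the models behind an OpenAI-compatible chat-completions API. The endpoint URL, port, and model name in the sketch below are assumptions for illustration only.

```python
# Minimal sketch: querying an exo-served model through an assumed
# OpenAI-compatible chat-completions endpoint on the local network.
import requests

EXO_ENDPOINT = "http://localhost:52415/v1/chat/completions"  # assumed exo API address

payload = {
    "model": "llama-3.1-70b",  # placeholder model name
    "messages": [
        {"role": "user", "content": "Summarize the benefits of splitting prefill and decode across devices."}
    ],
    "temperature": 0.2,
}

resp = requests.post(EXO_ENDPOINT, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```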
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 22m after posting
Peak period: 6 comments in 0-2h
Avg comments per period: 3.3
Comment distribution: 20 data points (based on 20 loaded comments)
Key moments
- 01 Story posted: Oct 16, 2025 at 7:30 PM EDT (3 months ago)
- 02 First comment: Oct 16, 2025 at 7:52 PM EDT (22m after posting)
- 03 Peak activity: 6 comments in 0-2h (hottest window of the conversation)
- 04 Latest activity: Oct 17, 2025 at 4:49 PM EDT (3 months ago)
ID: 45611912 · Type: story · Last synced: 11/20/2025, 12:41:39 PM
Yes, it's a thing that works.
Now I'm trying to stop myself from finding an excuse to spend upwards of $30k on compute hardware...
Reading the article, I wished for a device that just does both things well. On that topic, it may be noteworthy that Apple's just-released M5 reportedly improves TTFT performance roughly 3.5x over the M4, according to their claims!
There are an enormous number of use cases where the prompt is large and the expected output is small.
E.g. providing data for the LLM to analyze, after which it gives a simple yes/no Boolean response. Or selecting a single enum value from a set.
This pattern seems far more valuable in practice than the common, lazy open-ended chat-style implementations (lazy from a product perspective).
Obviously decode will be important for code generation or search, but that's such a small set of possible applications, and you'll probably always do better being on the latest models in the cloud.
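To make the "large prompt, tiny output" pattern described in the comment above concrete, here is a minimal sketch assuming an OpenAI-compatible chat-completions server; the endpoint, model name, and category set are illustrative placeholders, not taken from the article or the thread. The point is that prefill dominates the cost while the decoded output is capped at a handful of tokens.

```python
# Sketch: classify a large document into a single enum value.
# Prefill (the big prompt) does almost all the work; decode is a few tokens.
import requests

API_URL = "http://localhost:52415/v1/chat/completions"  # assumed OpenAI-compatible server
CATEGORIES = ["bug_report", "feature_request", "question", "spam"]

def classify(document: str) -> str:
    prompt = (
        "Classify the following text into exactly one of these categories: "
        f"{', '.join(CATEGORIES)}.\n"
        "Reply with the category name only.\n\n"
        f"{document}"
    )
    payload = {
        "model": "llama-3.1-70b",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 5,   # keeps decode cost negligible
        "temperature": 0,
    }
    resp = requests.post(API_URL, json=payload, timeout=120)
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"].strip()
    return answer if answer in CATEGORIES else "unknown"

print(classify("The app crashes whenever I rotate the screen on Android 14."))
```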