Big Gpus Don't Need Big Pcs

Posted13 days agoActive8 days ago

mikece

281 points

122 comments

jeffgeerling.comTech DiscussionstoryHigh profile

informativeneutral

Debate

20/100

GPU AllocationCPU DesignCompatibility

Key topics

GPU Allocation

CPU Design

Compatibility

The debate around whether big GPUs need big PCs has sparked a lively discussion, with many users sharing their experiences running powerful GPUs on compact machines. Some commenters, like jonahbenton, have successfully used eGPUs with spare laptops, while others, like 3eb7988a1663, are reconsidering their daily drivers in favor of low-power mini PCs. However, not everyone is convinced, with samuelknight noting that their 8-core Ryzen desktop outperforms their mini PC for CPU-intensive tasks, highlighting the importance of TDP limits and cooling systems, as pointed out by loeg. As the discussion unfolds, it becomes clear that the answer depends on specific use cases, with some users, like jasonwatkinspdx, finding that a $200 NUC is sufficient for basic tasks, while others require more powerful machines.

Snapshot generated from the HN discussion

Discussion Activity

Very active discussion

First comment

35m

Peak period

0-12h

Avg / period

21.7

Comment distribution130 data points

Loading chart...

Based on 130 loaded comments

Key moments

01Story posted
Dec 20, 2025 at 12:49 PM EST
13 days ago
Step 01
02First comment
Dec 20, 2025 at 1:24 PM EST
35m after posting
Step 02
03Peak activity
66 comments in 0-12h
Hottest window of the conversation
Step 03
04Latest activity
Dec 25, 2025 at 8:00 PM EST
8 days ago
Step 04

Generating AI Summary...

Analyzing up to 500 comments to identify key contributors and discussion patterns

Discussion (122 comments)

Showing 130 comments

3eb7988a1663

13 days ago

8 replies

Datapoints like this really make me reconsider my daily driver. I should be running one of those $300 mini PCs at <20W. With ~flat CPU performance, would be fine for the next 10 years. Just remote into my beefy workstation when I actually need to do real work. Browsing the web, watching videos, even playing some games is easily within their wheelhouse.

samuelknight

13 days ago

2 replies

Switching from my 8-core ryzen minipc to an 8-core ryzen desktop makes my unit tests run way faster. TDP limits can tip you off to very different performance envelopes in otherwise similar spec CPUs.

adrian_b

13 days ago

1 reply

A full-size desktop computer will always be much faster for any workload that fully utilizes the CPU.

However, a full-size desktop computer seldom makes sense as a personal computer, i.e. as the computer that interfaces to a human via display, keyboard and graphic pointer.

For most of the activities done directly by a human, i.e. reading & editing documents, browsing Internet, watching movies and so on, a mini-PC is powerful enough. The only exception is playing games designed for big GPUs, but there are many computer users who are not gamers.

In most cases the optimal setup is to use a mini-PC as your personal computer and a full-size desktop as a server on which you can launch any time-consuming tasks, e.g. compilation of big software projects, EDA/CAD simulations, testing suites etc.

The desktop used as server can use Wake-on-LAN to stay powered off when not needed and wake up whenever it must run some task remotely.

whatevaa

12 days ago

Not everything supports remoting well. For example, many IDE's. Unless you run RDP, with whole graphical session on remote.

Also, having to buy two computers also costs money. It makes sense to use 1 for both use cases if you have to buy the desktop anyway.

loeg

13 days ago

Even if you could cool the full TDP in a micro PC, in a full size desktop you might be able to use a massive AIO radiator with fans running at very slow, very quiet speeds instead of jet turbine howl in the micro case. The quiet and ease of working in a bigger space are mostly a good tradeoff for a slightly larger form factor under a desk.

jasonwatkinspdx

13 days ago

For just basic windows desktop stuff, a $200 NUC has been good enough for like 15 years now.

themafia

12 days ago

> I should be running one of those $300 mini PCs at <20W.

Yes. They're basically laptop chips at this point. The thermals are worse but the chips are perfectly modern and can handle reasonably large workloads. I've got an 8 core Ryzen 7 with Radeon 780 Graphics and 96GB of DDR5. Outside of AAA gaming this thing is absolutely fine.

The power draw is a huge win for me. It's like 6W at idle. I live remotely so grid power is somewhat unreliable and saving watts when using solar batteries extends their lifetime massively. I'm thrilled with them.

ivanjermakov

12 days ago

Another benefit is low noise. Many consider fan noise under load to be the most important property of a workstation.

nottorp

12 days ago

That's why I use a M2 (not even pro) Mac Mini as a terminal and remote into other boxes when needed.

reactordev

13 days ago

[delayed]

ekropotin

13 days ago

As experiment, I decided to try using proxmox VM with eGPU and usb bus bypassed to it, as my main PC for browsing and working on hobby projects.

It’s just 1 vCPU with 4 Gb ram, and you know what? It’s more than enough for these needs. I think hardware manufactures falsely convinced us that every professional needs beefy laptop to be productive.

PunchyHamster

12 days ago

Slapping $300 worth of solar panels on your roof/balcony will probably get you ahead on power usage

yjftsjthsd-h

13 days ago

7 replies

I've been kicking this around in my head for a while. If I want to run LLMs locally, a decent GPU is really the only important thing. At that point, the question becomes, roughly, what is the cheapest computer to tack on the side of the GPU? Of course, that assumes that everything does in fact work; unlike OP I am barely in a position to understand eg. BAR problems, let alone try to fix them, so what I actually did was build a cheap-ish x86 box with a half-decent GPU and called it a day:) But it still is stuck in my brain: there must be a more efficient way to do this, especially if all you need is just enough computer to shuffle data to and from the GPU and serve that over a network connection.

zeusk

13 days ago

1 reply

Get the DGX Spark computers? They’re exactly what you’re trying to build.

Gracana

13 days ago

1 reply

They’re very slow.

geerlingguy

13 days ago

1 reply

They're okay, generally, but slow for the price. You're more paying for the ConnectX-7 networking than inference performance.

Gracana

13 days ago

Yeah, I wouldn’t complain if one dropped in my lap, but they’re not at the top of my list for inference hardware.

Although... Is it possible to pair a fast GPU with one? Right now my inference setup for large MoE LLMs has shared experts in system memory, with KV cache and dense parts on a GPU, and a Spark would do a better job of handling the experts than my PC, if only it could talk to a fast GPU.

tcdent

13 days ago

2 replies

We're not yet to the point where a single PCIe device will get you anything meaningful; IMO 128 GB of ram available to the GPU is essential.

So while you don't need a ton of compute on the CPU you do need the ability address multiple PCIe lanes. A relatively low-spec AMD EPYC processor is fine if the motherboard exposes enough lanes.

skhameneh

13 days ago

1 reply

There is plenty that can run within 32/64/96gb VRAM. IMO models like Phi-4 are underrated for many simple tasks. Some quantized Gemma 3 are quite good as well.

There are larger/better models as well, but those tend to really push the limits of 96gb.

FWIW when you start pushing into 128gb+, the ~500gb models really start to become attractive because at that point you’re probably wanting just a bit more out of everything.

tcdent

13 days ago

2 replies

IDK all of my personal and professional projects involve pushing the SOTA to the absolute limit. Using anything other than the latest OpenAI or Anthropic model is out of the question.

Smaller open source models are a bit like 3d printing in the early days; fun to experiment with but really not that valuable for anything other than making toys.

Text summarization, maybe? But even then I want a model that understands the complete context and does a good job. Even things like "generate one sentence about the action we're performing" I usually find I can just incorporate it into the output schema of a larger request instead of making a separate request to a smaller model.

xyzzy123

13 days ago

3 replies

It seems to me like the use case for local GPUs is almost entirely privacy.

If you buy a 15k AUD rtx 6000 96GB, that card will _never_ pay for itself on a gpt-oss:120b workload vs just using openrouter - no matter how many tokens you push through it - because the cost of power in Australia means you cannot generate tokens cheaper than the cloud even if the card were free.

joefourier

13 days ago

There’s a few more considerations:

- You can use the GPU for training and run your own fine tuned models

- You can have much higher generation speeds

- You can sell the GPU on the used market in ~2 years time for a significant portion of its value

- You can run other types of models like image, audio or video generation that are not available via an API, or cost significantly more

- Psychologically, you don’t feel like you have to constrain your token spending and you can, for instance, just leave an agent to run for hours or overnight without feeling bad that you just “wasted” $20

- You won’t be running the GPU at max power constantly

15155

13 days ago

Or censorship avoidance

girvo

13 days ago

> because the cost of residential power in Australia

This so doesn't really matter to your overall point which I agree with but:

The rise of rooftop solar and home battery energy storage flips this a bit now in Australia, IMO. At least where I live, every house has a solar panel on it.

Not worth it just for local LLM usage, but an interesting change to energy economics IMO!

popalchemist

13 days ago

1 reply

This is simply not true. Your heuristic is broken.

The recent Gemma 3 models, which are produced by Google (a little startup - heard of em?) outperform the last several OpenAI releases.

Closed does not necessarily mean better. Plus the local ones can be finetuned to whatever use case you may have, won't have any inputs blocked by censorship functionality, and you can optimize them by distilling to whatever spec you need.

Anyway all that is extraneous detail - the important thing is to decouple "open" and "small" from "worse" in your mind. The most recent Gemma 3 model specifically is incredible, and it makes sense, given that Google has access to many times more data than OpenAI for training (something like a factor of 10 at least). Which is of course a very straightforward idea to wrap your head around, Google was scrapign the internet for decades before OpenAI even entered the scene.

So just because their Gemma model is released in an open-source (open weights) way, doesn't mean it should be discounted. There's no magic voodoo happening behind the scenes at OpenAI or Anthropic; the models are essentially of the same type. But Google releases theirs to undercut the profitability of their competitors.

tcdent

13 days ago

This one? https://artificialanalysis.ai/models/gemma-3-27b

p1necone

13 days ago

3 replies

I'm holding out for someone to ship a gpu with dimm slots on it.

tymscar

13 days ago

3 replies

DDR5 is a couple of orders of magnitude slower than really good vram. That’s one big reason.

dawnerd

13 days ago

1 reply

But it would still be faster than splitting the model up on a cluster though, right? But I’ve also wondered why they haven’t just shipped gpus like cpus.

cogman10

13 days ago

1 reply

Man I'd love to have a GPU socket. But it'd be pretty hard to get a standard going that everyone would support. Look at sockets for CPUs, we barely had cross over for like 2 generations.

But boy, a standard GPU socket so you could easily BYO cooler would be nice.

estimator7292

12 days ago

The problem isn't the sockets. It costs a lot to spec and build new sockets, we wouldn't swap them for no reason.

The problem is that the signals and features that the motherboard and CPU expect are different between generations. We use different sockets on different generations to prevent you plugging in incompatible CPUs.

We used to have cross-generational sockets in the 386 era because the hardware supported it. Motherboards weren't changing so you could just upgrade the CPU. But then the CPUs needed different voltages than before for performance. So we needed a new socket to not blow up your CPU with the wrong voltage.

That's where we are today. Each generation of CPU wants different voltages, power, signals, a specific chipset, etc. Within the same +-1 generation you can swap CPUs because they're electrically compatible.

To have universal CPU sockets, we'd need a universal electrical interface standard, which is too much of a moving target.

AMD would probably love to never have to tool up a new CPU socket. They don't make money on the motherboard you have to buy. But the old motherboards just can't support new CPUs. Thus, new socket.

zrm

12 days ago

1 reply

DDR5 is ~8GT/s, GDDR6 is ~16GT/s, GDDR7 is ~32GT/s. It's faster but the difference isn't crazy and if the premise was to have a lot of slots then you could also have a lot of channels. 16 channels of DDR5-8200 would have slightly more memory bandwidth than RTX 4090.

tymscar

12 days ago

1 reply

Yeah, so DDR5 is 8GT and GDDR7 is 32GT. Bus width is 64 vs 384. That already makes the VRAM 4*6 (24) times faster.

You can add more channels, sure, but each channel makes it less and less likely for you to boot. Look at modern AM5 struggling to boot at over 6000 with more than two sticks.

So you’d have to get an insane six channels to match the bus width, at which point your only choice to be stable would be to lower the speed so much that you’re back to the same orders of magnitude difference, really.

Now we could instead solder that RAM, move it closer to the GPU and cross-link channels to reduce noise. We could also increase the speed and oh, we just invented soldered-on GDDR…

zrm

12 days ago

> Bus width is 64 vs 384.

The bus width is the number of channels. They don't call them channels when they're soldered but 384 is already the equivalent of 6. The premise is that you would have more. Dual socket Epyc systems already have 24 channels (12 channels per socket). It costs money but so does 256GB of GDDR7.

> Look at modern AM5 struggling to boot at over 6000 with more than two sticks.

The relevant number is the number of sticks per channel.

cogman10

13 days ago

For AI, really good isn't really a requirement. If a middle ground memory module could be made, then it'd be pretty appealing.

anon25783

13 days ago

Would that be worth anything, though? What about the overhead of clock cycles needed for loading from and storing to RAM? Might not amount to a net benefit for performance, and it could also potentially complicate heat management I bet.

kristianp

13 days ago

A single CAMM might suit better.

dist-epoch

13 days ago

3 replies

This problem was already solved 10 years ago - crypto mining motherboards, which have a large number of PCIe slots, a CPU socket, one memory slot, and not much else.

> Asus made a crypto-mining motherboard that supports up to 20 GPUs

https://www.theverge.com/2018/5/30/17408610/asus-crypto-mini...

jsheard

13 days ago

1 reply

Those only give each GPU a single PCIe lane though, so you'll have a bad time if your application needs to shuffle any significant amount of data to and from the GPUs. Running many GPUs with many lanes each is much more expensive.

dist-epoch

13 days ago

1 reply

After you load the weights into the GPU and keep the KV cache there too, you don't need any other significant traffic.

numpad0

13 days ago

[delayed]

skhameneh

13 days ago

In theory, it’s only sufficient for pipeline parallel due to limited lanes and interconnect bandwidth.

Generally, scalability on consumer GPUs falls off between 4-8 GPUs for most. Those running more GPUs aren’t typically using a higher quantity of smaller GPUs for cost effectiveness.

zozbot234

13 days ago

M.2 is mostly just a different form factor for PCIe anyway.

binsquare

13 days ago

3 replies

I run a crowd sourced website to collect data on the best and cheapest hardware setup for local LLM here: https://inferbench.com/

Source code: https://github.com/BinSquare/inferbench

nodja

13 days ago

1 reply

Cool site, I noticed the 3090 is on there twice.

https://inferbench.com/gpu/NVIDIA%20GeForce%20RTX%203090

https://inferbench.com/gpu/NVIDIA%20RTX%203090

binsquare

13 days ago

Oh nice catch, I'll fix that

kilpikaarna

13 days ago

1 reply

Nice! Though for older hardware it would be nice if the price reflected the current second hand market (harder to get data for, I know). Eg. Nvidia RTX 3070 ranks as second best GPU in tok/s/$ even at the MSRP of $499. But you can get one for half that now.

binsquare

13 days ago

Great idea - I've added it by manually browsing ebay for that initial data.

So it's just a static value in this hardware list: https://github.com/BinSquare/inferbench/blob/main/src/lib/ha...

Let me know if you know of a better way, or contribute :D

jsight

13 days ago

1 reply

It seems like verification might need to be improved a bit? I looked at Mistral-Large-123B. Someone is claiming 12 tokens/sec on a single RTX 3090 at FP16.

Perhaps some filter could cut out submissions that don't really make sense?

binsquare

8 days ago

Great idea - took a bit to figure out how to implement this.

I came up with a plausibility check based on the model's memory requirements: https://github.com/BinSquare/inferbench/blob/main/src/lib/pl...

So now on the submission page - it has a warning + an automate flag count for volunteers to double check:

```This configuration seems unlikely

Model requires ~906GB VRAM but only 32GB available (28.3x over). This likely requires significant CPU offload which would severely impact performance.

You can still submit, but your result will be flagged for review.```

zh3

13 days ago

One data point, I stuck a couple of ebay RTX3090s into a retired 2012 Asus workstation (P8Z77WS) with i3770 and 32Gb RAM; don't have numbers but moving the setup to a 9900X with 64Gb (and much higher PCIe bandwidth) made little to no difference in perceived performance. Switching models (what is loaded on the GPUs) is much quicker though.

Eisenstein

13 days ago

There is a whole section in here on how to spec out a cheap rig and what to look for:

* https://jabberjabberjabber.github.io/Local-AI-Guide/

seanmcdirmid

13 days ago

And you don’t want to go the M4 Max/M3 Ultra route? It works well enough for most mid sized LLMs.

Wowfunhappy

13 days ago

1 reply

I really would have liked to see gaming performance, although I realize it might be difficult to find a AAA game that supports ARM.

3eb7988a1663

13 days ago

You might have to thread the needle to find a game which does not bottleneck on the CPU.

kristjansson

13 days ago

3 replies

Really why have the PCI/CPU artifice at all? Apple and Nvidia have the right idea: put the MPP on the same die/package as the CPU.

bigyabai

13 days ago

> put the MPP on the same die/package as the CPU.

That would help in latency-constrained workloads, but I don't think it would make much of a difference for AI or most HPC applications.

PunchyHamster

12 days ago

We need low power but high PCIE lane count CPUs for that. Just purely for shoving models from NVMe to GPU

systemtest

13 days ago

Perhaps we can include a small ARM processor on the GPU die plus some USB-ports on the back. So that user can decide to run the GPU as a stand-alone computer or a PCIe plug-in GPU for a big PC.

numpad0

13 days ago

4 replies

[delayed]

nodja

13 days ago

1 reply

> Reportedly the inter-layer transfer sizes are in kilobyte ranges and PCIe x1 is plenty or something.

Not an expert, but napkin math tells me that more often that not this will be in the order of megabytes—not kilobytes—since it scales with sequence length.

Example: Qwen3 30B has a hidden state size of 5120, even if quantized to 8 bits that's 5120 bytes per token. It would pass the MB boundary with just a little over 200 tokens. Still not much of an issue when a single PCIe lane is ~2GB/s.

I think device to device latency is more of an issue here, but I don't know enough to assert that with confidence.

remexre

13 days ago

For each token generated, you only send one token’s worth between layers; the previous tokens are in the KV cache.

syntaxing

13 days ago

1 reply

Theres some technical implementations that makes it more efficient like EXO [1]. Jeff Geerling recently did a review on a 4 MAC Studio cluster with RDMA support and you can see that EXO has a noticeable advantage [2].

[1] https://github.com/exo-explore/exo [2] https://www.youtube.com/watch?v=x4_RsUxRjKU

sgt

12 days ago

1 reply

At this point I'd consider a cluster of top specced Mac Studio's to be worth while in production. I just need to host them properly in a rack and in a co-lo data center.

syntaxing

12 days ago

Honestly, I genuinely can see the value if you want to host something internally for sensitive and important information. I really hope the M5 ultra with matmul accelerators will knock this out of the park. With the way way RAM is trending, a Mac Studio cluster will become more enticing.

scotty79

12 days ago

1 reply

> Not sure what was unexpected about the multi GPU part. It's very well known that most LLM frameworks including llama.cpp splits models by layers, which has sequential dependency, and so multi GPU setups are completely stalled

Oh, I thought the point of transformers was being able to split the load veritcally to avoid seqential dependancies. Is it true just for training or not at all?

sailingparrot

12 days ago

Just for training and processing the existing context (pre fill phase). But when doing inference a token t has to be sampled before t+1 can so it’s still sequential

zozbot234

13 days ago

> Looking for a way to break up tasks for LLMs so that there will be multiple tasks to run concurrently would be interesting, maybe like creating one "manager" and few "delegated engineers" personalities.

This is pretty much what "agents" are for. The manager model constructs prompts and contexts that the delegated models can work on in parallel, returning results when they're done.

jauntywundrkind

13 days ago

1 reply

PCIe 3.0 is the nice easy convenient generation where 1 lane = 1GBps. Given the overhead, thats pretty close to 10Gb ethernet speeds (low latency though).

I do wonder how long the cards are going to need host systems at all. We've already seen GPUs with m.2 sad attached! Radeon Pro SSG hails back from 2016! You still need a way to get the model on that in the first place to get work in and out, but a 1Gbe and small RISC-V chip (which Nvidia already uses formanagement cores) could suffice. Maybe even an rpi on the card. https://www.techpowerup.com/224434/amd-announces-the-radeon-...

Given the gobs of memory cards have, they probably don't even need storage; they just need big pipes. Intel had 100Gbe on their Xeon & Xeon Phi cores (10x what we saw here!) in 2016! GPUs that just plug into the switch and talk across 400Gbe or UltraEthernet or switched CXL, that run semi independently, feel so sensible, so not outlandish. https://www.servethehome.com/next-generation-interconnect-in...

It's far off for now, but flash makers are also looking at radically many channel flash, which can provide absurdly high GB/s, High Bandwidth Flash. And potentially integrated some extremely parallel tensorcores on each channel. Switching from DRAM to flash for AI processing could be a colossal win for fitting large models cost effectively (& perhaps power efficiently) while still having ridiculous gobs of bandwidth. With that possible win of doing processing & filtering extremely near to the data too. https://www.tomshardware.com/tech-industry/sandisk-and-sk-hy...

zozbot234

13 days ago

Sure, but a "card" that doesn't need a host system is just an APU.

Avlin67

13 days ago

2 replies

tired of jeff glinglin everywhere...

manarth

12 days ago

1 reply

I personally find his work and his posts interesting, and enjoy seeing them pop up on HN.

If you prefer not to see his posts on the HN list pages, a practical solution is to use a browser extension (such as Stylus) to customise the HN styling to hide the posts.

Here is a specific CSS style which will hide submissions from Jeff's website:

    tr.submission:has(td a[href="from?site=jeffgeerling.com"]),
    tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr,
    tr.submission:has(td a[href="from?site=jeffgeerling.com"]) + tr + tr {
      opacity: 0.05
    }

In this example, I've made it almost invisible, whilst it still takes up space on the screen (to avoid confusion about the post number increasing from N to N+2). You could use { display: none } to completely hide the relevant posts.

The approach can be modified to suit any origin you prefer to not come across.

The limitation is that the style modification may need refactoring if HN changes the markup structure.

mythoughtsexact

11 days ago

You're awesome, thank you.

I stopped following this guy back in 2015 when he straight up forked all of my ansible roles and then published everything to Ansible Galaxy before mine were even complete, tested and ready to be published, and only for me to find that the same day they were all forked by him a new Github organization with the name of the org I had used in my roles had been registered and then squatted, it completely turned me off to his methods.

mjh2539

12 days ago

I only ever see him on HN. He's smart, kind, and talks about interesting things. Are you sure what you're feeling isn't envy?

omneity

13 days ago

1 reply

I wish for a hardware + software solution to enable direct PCIe interconnect using lanes independent from the chipset/CPU. A PCIe mesh of sorts.

With the right software support from say pytorch this could suddenly make old GPUs and underpowered PCs like in TFA into very attractive and competitive solutions for training and inference.

snuxoll

13 days ago

1 reply

PCIe already allows DMA between peers on the bus, but, as you pointed out, the traces for the lanes have to terminate somewhere. However, it doesn't have to be the CPU (which is, of course, the PCIe root in modern systems) handling the traffic - a PCIe switch may be used to facilitate DMA between devices attached to it, if it supports routing DMA traffic directly.

ComputerGuru

13 days ago

1 reply

That’s what happened in TFA.

omneity

12 days ago

1 reply

You're right. Let me correct myself: a hobbyist-friendly hardware solution. Dolphin's PCIe switches cost more than 8 RTX 3090 on a Threadripper machine.

ComputerGuru

11 days ago

Jeff forgot to mention that in his post!

Waterluvian

13 days ago

8 replies

At what point do the OEMs begin to realize they don’t have to follow the current mindset of attaching a GPU to a PC and instead sell what looks like a GPU with a PC built into it?

nightshift1

13 days ago

1 reply

Exactly. With the Intel-Nvidia partnership signed this September, I expect to see some high-performance single-board computers being released very soon. I don't think the atx form-factor will survive another 30 years.

bostik

12 days ago

3 replies

One should also remember that NVidia does have organisational experience on designing and building CPUs[0].

They were a pretty big deal back in ~2010, and I have to admit I didn't know that Tegra was powering Nintendo Switch.

0: https://en.wikipedia.org/wiki/Tegra

goku12

12 days ago

I had a Xolo Tegra Note 7 tablet (marketed in the US as EVGA Tegra Note 7) in around 2013. I preordered it as far as I remember. It had a Tegra 4 SoC with quad core Cortex A15 CPU and a 72 core GeForce GPU. Nvidia used to claim that it is the fastest SoC for mobile devices at the time.

To this day, it's the best mobile/Android device I ever owned. I don't know if it was the fastest, but it certainly was the best performing one I ever had. UI interactions were smooth, apps were fast on it, screen was bright, touch was perfect and still had long enough battery backup. The device felt very thin and light, but sturdy at the same time. It had a pleasant matte finish and a magnetic cover that lasted as long as the device did. It spolied the feel of later tablets for me.

It had only 1 GB RAM. We have much more powerful SoCs today. But nothing ever felt that smooth (iPhone is not considered). I don't know why it was so. Perhaps Android was light enough for it back then. Or it may have had a very good selection and integration of subcomponents. I was very disappointed when Nvidia discontinued the Tegra SoC family and tablets.

miladyincontrol

12 days ago

I'd argue their current CPUs aren't to be discounted either. Much as people love to crown Apple's M-series chips as the poster child of what arm can do, Nvidia's grace CPUs too trade blows with the best of the best.

It leaves one to wonder what could be if they had any appetite for devices more in the consumer realm of things.

jsheard

12 days ago

That said, their attempts at designing CPU cores stalled out a while ago. It's been nearly a decade since they last shipped an in-house CPU core, all of their recent SOCs have just used ARMs reference designs.

pjmlp

12 days ago

2 replies

So basically going back to the old days of Amiga and Atari, in a certain sense, when PCs could only display text.

goku12

12 days ago

2 replies

I'm not familiar with that history. Could you elaborate?

pjmlp

12 days ago

2 replies

In the home computer universe, such computers were the first ones having a programmable graphics unit that did more than paste the framebuffer into the screen.

While the PCs were still displaying text, or if you were lucky to own an Hercules card, gray text, or maybe a CGA one, with 4 colours.

While the Amigas, which I am more confortable with, were doing this in the mid-80's:

https://www.youtube.com/watch?v=x7Px-ZkObTo

https://www.youtube.com/watch?v=-ga41edXw3A

The original Amiga 1000, had on its motherboard, later reduced to fit into an Amiga 500,

Motorola 68000 CPU, a programmable sounds chip with DMA channels (Paula), and a programable blitter chip (Agnus aka early GPUs).

You would build in RAM the audio, or graphics instructions for the respetive chipset, set the DMA parameters, and let them lose.

nnevatie

12 days ago

Hey! I had an Amiga 1000 back in the day - it was simply awesome.

goku12

12 days ago

Thanks! Early computing history is very interesting (I know that this wasn't the earliest). They also sometimes explain certain odd design decisions that are still followed today.

estimator7292

12 days ago

1 reply

In the olden days we didn't have GPUs, we had "CRT controllers".

What it offered you was a page of memory where each byte value mapped to a character in ROM. You feed in your text and the controller fetches the character pixels and puts them on the display. Later we got ASCII box drawing characters. Then we got sprite systems like the NES, where the Picture Processing Unit handles loading pixels and moving sprites around the screen.

Eventually we moved on to raw framebuffers. You get a big chunk of memory and you draw the pixels yourself. The hardware was responsible for swapping the framebuffers and doing the rendering on the physical display.

Along the way we slowly got more features like defining a triangle, its texture, and how to move it, instead of doing it all in software.

Up until the 90s when the modern concept of a GPU coalesced, we were mainly pushing pixels by hand onto the screen. Wild times.

The history of display processing is obviously a lot more nuanced than that, it's pretty interesting if that's your kind of thing.

pjmlp

12 days ago

Small addendum, there was already stuff like TMS34010 in the 1980's, just not at home.

cmrdporcupine

12 days ago

1 reply

Those machines multiplexed the bus to split access to memory, because RAM speeds were competitive with or faster than the CPU bus speed. The CPU and VDP "shared" the memory, but only because CPUs were slow enough to make that possible.

We have had the opposite problem for 35+ years at this point. The newer architecture machines like the Apple machines, the GB10, the AI 395+ do share memory between GPU and CPU but in a different way, I believe.

I'd argue with memory becoming suddenly much more expensive we'll probably see the opposite trend. I'm going to get me one of these GB10 or Strix Halo machines ASAP because I think with RAM prices skyrocketing we won't be seeing more of this kind of thing in the consumer market for a long time. Or at least, prices will not be dropping any time soon.

pjmlp

12 days ago

You are right, hence my "in a certain sense", because I was too lazy to point out the differences between a motherboard having everything there without pluggable graphics unit[0], and having everything now inside of a single chip.

[0] - Not fully correct, as there are/were extensions cards that override the bus, thus replacing one of the said chips, on Amiga case.

animal531

12 days ago

4 replies

It's funny how ideas come and go. I made this very comment here on Hacker News probably 4-5 years ago and received a few down votes for it at the time (albeit that I was thinking of computers in general).

It would take a lot of work to make a GPU do current CPU type tasks, but it would be interesting to see how it changes parallelism and our approach to logic in code.

goku12

12 days ago

2 replies

> I made this very comment here on Hacker News probably 4-5 years ago and received a few down votes for it at the time

HN isn't always very rational about voting. It will be a loss if you judge any idea on their basis.

> It would take a lot of work to make a GPU do current CPU type tasks

In my opinion, that would be counterproductive. The advantage of GPUs is that they have a large number of very simple GPU cores. Instead, just do a few separate CPU cores on the same die, or on a separate die. Or you could even have a forest of GPU cores with a few CPU cores interspersed among them - sort of like how modern FPGAs have logic tiles, memory tiles and CPU tiles spread out on it. I doubt it would be called a GPU at that point.

Den_VR

12 days ago

2 replies

As I recall, Gartner made the outrageous claim that upwards of 70% of all computing will be “AI” in some number of years - nearly the end of cpu workloads.

deliciousturkey

12 days ago

3 replies

I'd say over 70% of all computing is already been non-CPU for years. If you look at your typical phone or laptop SoC, the CPU is only a small part. The GPU takes the majority of area, with other accelerators also taking significant space. Manufacturers would not spend that money on silicon, if it was not already used.

goku12

12 days ago

1 reply

> I'd say over 70% of all computing is already been non-CPU for years.

> If you look at your typical phone or laptop SoC, the CPU is only a small part.

Keep in mind that the die area doesn't always correspond to the throughput (average rate) of the computations done on it. That area may be allocated for a higher computational bandwidth (peak rate) and lower latency. Or in other words, get the results of a large number of computations faster, even if it means that the circuits idle for the rest of the cycles. I don't know the situation on mobile SoCs with regards to those quantities.

deliciousturkey

12 days ago

This is true, and my example was a very rough metric. But the computation density per area is actually way, way higher on GPU's compared to CPU's. CPU's only spend a tiny fraction of their area doing actual computation.

swiftcoder

12 days ago

> If you look at your typical phone or laptop SoC, the CPU is only a small part

In mobile SoCs a good chunk of this is power efficiency. On a battery-powered device, there's always going to be a tradeoff to spend die area making something like 4K video playback more power efficient, versus general purpose compute

Desktop-focussed SKUs are more liable to spend a metric ton of die area on bigger caches close to your compute.

PunchyHamster

12 days ago

If going by raw operations done, if the given workload uses 3d rendering for UI that's probably true for computers/laptops. Watching YT video is essentially CPU pushing data between internet and GPU's video decoder, and to GPU-accelerated UI.

yetihehe

12 days ago

Looking at home computers, most of "computing" when counted as flops is done by gpus anyway, just to show more and more frames. Processors are only used to organise all that data to be crunched up by gpus. The rest is browsing webpages and running some word or excel several times a month.

zozbot234

12 days ago

1 reply

GPU compute units are not that simple, the main difference with CPU is that they generally use a combination of wide SIMD and wide SMT to hide latency, as opposed to the power-intensive out-of-order processing used by CPU's. Performing tasks that can't take advantage of either SIMD or SMT on GPU compute units might be a bit wasteful.

goku12

12 days ago

> GPU compute units are not that simple

Perhaps I should have phrased it differently. CPU and GPU cores are designed for different types of loads. The rest of your comment seems similar to what I was imagining.

Still, I don't think that enhancing the GPU cores with CPU capabilities (OOE, rings, MMU, etc from your examples) is the best idea. You may end up with the advantages of neither and the disadvantages of both. I was suggesting that you could instead have a few dedicated CPU cores distributed among the numerous GPU cores. Finding the right balance of GPU to CPU cores may be the key to achieving the best performance on such a system.

deliciousturkey

12 days ago

1 reply

HN in general is quite clueless about topics like hardware, high performance computing, graphics, and AI performance. So you probably shouldn't care if you are downvoted, especially if you honestly know you are being correct.

philistine

12 days ago

People on here tend to act as if 20% of all computers sold were laptops, when it’s the reverse.

sharpneli

12 days ago

1 reply

Is there any need for that? Just have a few good CPUs there and you’re good to go.

As for how the HW looks like we already know. Look at Strix Halo as an example. We are just getting bigger and bigger integrated GPUs. Most of the flops on that chip is the GPU part.

amelius

12 days ago

I still would like to see a general GPU back end for LLVM just for fun.

PunchyHamster

12 days ago

It would just make everything worse. Some (if anything, most) tasks are far less paralleliseable than typical GPU loads.

themafia

12 days ago

2 replies

At this point what you really need is an incredibly powerful heatsink with some relatively small chips pressed against it.

whywhywhywhy

12 days ago

Transhcan Mac Pro was this idea, triangular heatsink core cpu+gpu+gpu for each side

jnwatson

12 days ago

If you disassemble a modern GPU, that's what you'll find. 95% by weight of a GPU card is cooling related.

lizknope

12 days ago

1 reply

The vast majority of computers sold today have a CPU / GPU integrated together in a single chip. Most ordinary home users don't care about GPU or local AI performance that much.

In this video Jeff is interested in GPU accelerated tasks like AI and Jellyfin. His last video was using a stack of 4 Mac Studios connected by Thunderbolt for AI stuff.

https://www.youtube.com/watch?v=x4_RsUxRjKU

The Apple chips have both power CPU and GPU cores but also have a huge amount of memory (512GB) directly connected unlike most Nvidia consumer level GPUs that have far less memory.

onion2k

12 days ago

1 reply

Most ordinary home users don't care about GPU or local AI performance that much.

Right now, sure. There's a reason why chip manufacturers are adding AI pipelines, tensor processors, and 'neural cores' though. They believe that running small local models are going to be a popular feature in the future. They might be right.

swiftcoder

12 days ago

1 reply

It's mostly marketing gimmicks though - they aren't adding anywhere near enough compute for that future. The tensor cores in an "AI ready" laptop from a year ago are already pretty much irrelevant as far as inferencing current-generation models go.

zozbot234

12 days ago

NPU/Tensor cores are actually very useful for prompt pre-processing, or really any ML inference task that isn't strictly bandwidth limited (because you end up wasting a lot of bandwidth on padding/dequantizing data to a format that the NPU can natively work with, whereas a GPU can just do that in registers/local memory). Main issue is the limited support in current ML/AI inference frameworks.

not_the_fda

11 days ago

They already have. They have the Jetson Line: https://en.wikipedia.org/wiki/Nvidia_Jetson

amelius

12 days ago

Maybe at the point where you can run Python directly on the GPU.

At which point the GPU becomes the new CPU.

cmrdporcupine

12 days ago

I mean, that's kind of what's going on at a certain level with the AMD Strix Halo, the NVIDIA GB10, and the newer Apple machines.

In the sense that the RAM is fully integrated, anyways.

moebrowne

12 days ago

I'd be interested to see if workloads like Folding@home could be efficiently run this way. I don't think they need a lot of bandwidth.

alecco

13 days ago

Wide interconnects to load and sync do matter, though.

nailherwithrust

12 days ago

Thats what SHE said.

kgeist

13 days ago

What about constrained decoding (with JSON schemas)? I noticed my vLLM instance is using 1 CPU 100%.

systemtest

13 days ago

Would be nice if GPUs had a small ARM processor on the card. So you can run it as a standalone computer if you only use it for LLMs. And you can still plug it in a big PC if you need more CPU/RAM.

The additional $40 is negligible on a $1500 card, especially if it means you don't need a host anymore. Perhaps made a separate version of the card that has the standalone feature.

haritha-j

12 days ago

I currently have a £500 laptop hooked up to an egpu box with a £700 gpu. It's not a bad setup.

jonahbenton

13 days ago

So glad someone did this. Have been running big gpus on egpus connected to spare laptops and thinking why not pis.

yoan9224

12 days ago

The most interesting takeaway for me is that PCIe bandwidth really doesn't bottleneck LLM inference for single-user workloads. You're essentially just shuttling the model weights once, then the GPU churns through tokens using its own VRAM.

This is huge for home lab setups. You can run a Pi 5 with a high-end GPU via external enclosure and get 90% of the performance of a full workstation at a fraction of the power draw and cost.

The multi-GPU results make sense too - without tensor parallelism, you're just pipeline parallelism across layers, which is inherently sequential. The GPUs are literally sitting idle waiting for the previous layer's output. Exo and similar frameworks are trying to solve this but it's still early days.

For anyone considering this: watch out for ResizeBAR requirements. Some older boards won't work at all without it.

lostmsu

13 days ago

[delayed]

pjmlp

12 days ago

Of course, just go to any computer store where most gamer setups on affordable bugets go with the combo "beefy GPU + an i5", instead of using an i7 or i9 Intel CPUs.

View full discussion on Hacker News

ID: 46338016Type: storyLast synced: 12/23/2025, 5:35:30 PM

Want the full context?

Jump to the original sources

Read the primary article or dive into the live Hacker News thread when you're ready.

Open link View on HN