25L Portable NV-Linked Dual 3090 LLM Rig
Posted 4 months ago · Active 4 months ago
reddit.com · Tech · story · High profile
Tone: calm, mixed · Debate: 60/100
Key topics
LLM
GPU Computing
Hardware Builds
A Reddit post showcases a 25L portable NV-linked dual 3090 LLM rig, sparking discussion on its feasibility, cost, and potential applications, as well as technical concerns and comparisons to other builds.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: N/A
Peak period: 58 comments (Day 5)
Avg / period: 19.3
Comment distribution: 116 data points
Based on 116 loaded comments
Key moments
- 01 Story posted: Sep 19, 2025 at 8:06 AM EDT (4 months ago)
- 02 First comment: Sep 19, 2025 at 8:06 AM EDT (0s after posting)
- 03 Peak activity: 58 comments in Day 5 (hottest window of the conversation)
- 04 Latest activity: Sep 28, 2025 at 4:08 AM EDT (4 months ago)
ID: 45300668 · Type: story · Last synced: 11/20/2025, 4:38:28 PM
Even in 2025 it's cool how solid a setup dual 3090s still is. NVLink is an absolute must, but it's incredibly powerful. I'm able to run the latest Mistral thinking models and relatively powerful YOLO-based VLMs like the ones Roboflow is based on.
Curious if anyone else is still using 3090s or has feedback on scaling up to 4-6 3090s.
Thanks everyone ;)
I used:
https://c-payne.com/
Very high quality and manageable prices.
The most important figure is the power consumed per token generated. You can optimize for that and get to a reasonably efficient system, or you can maximize token generation speed and end up with two times the power consumption for very little gain. You also will likely need to have a way to get rid of excess heat and all those fans get loud. I stuck the system in my garage, that made the noise much more manageable.
https://www.tomshardware.com/news/asus-blower-rtx3090
is the model that I have.
A used 3090 is around $900 on eBay; a used RTX 6000 Ada is around $5k.
Four 3090s are slower at inference and worse at training than one RTX 6000.
4x 3090 would consume 1400W at load.
An RTX 6000 would consume 300W at load.
If you, god forbid, live in California and your power averages 45 cents per kWh, 4x 3090 would cost $1500+ more per year to operate than a single RTX 6000 [0].
[0] Back-of-the-napkin/ChatGPT calculation of running the GPUs at load for 8 hours per day.
Note: I own a PC with a 3090, but if I had to build an AI training workstation, I would seriously consider cost to operate and resale value (per component).
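A quick sanity check of that $1500 figure, as a minimal sketch in Python using the wattages, the $0.45/kWh rate, and the 8 hours/day at load quoted above:

```
# Back-of-the-napkin yearly electricity cost, using the numbers quoted above:
# 1400W for 4x 3090, 300W for one RTX 6000, $0.45/kWh, 8 hours/day at load.
RATE_PER_KWH = 0.45
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

def yearly_cost(watts):
    kwh_per_year = watts / 1000 * HOURS_PER_DAY * DAYS_PER_YEAR
    return kwh_per_year * RATE_PER_KWH

cost_3090s = yearly_cost(1400)    # ~$1,840/year
cost_rtx6000 = yearly_cost(300)   # ~$394/year
print(f"difference: ${cost_3090s - cost_rtx6000:,.0f}/year")  # ~$1,445/year
```

Which lands in the same ballpark as the $1500+ estimate.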
Since you're exploring options just for fun, out of curiosity: would you rent it out whenever you're not using it yourself, so it's not just sitting idle? (It could get noisy.) You'd be able to use your computer for other work at the same time and stop whenever you wanted to use it yourself.
Just checked vast.ai. I will be losing money with 3090 at my electricity cost and making a tiny bit with rtx 6000.
Like with boats, it's probably better to rent GPUs than to buy them.
Pictures, or it never happened! :D
Think of the max wattage like a car's max horsepower: a car might make 350 HP, but that doesn't mean it stays at 350 HP all day long; there's a curve to it. At the low end it might be making 170 HP, and you need to floor the gas pedal to get to that 350 HP. Same with these GPUs. Most people calculate gas mileage by finding how much gas a car consumes at its peak and say, oh, 6 mpg when it's making 350 HP, so with your 20-gallon tank you have a range of 120 miles. Which obviously isn't true.
It's in my main workstation, and my idea was to always have Ollama running locally. The problem is that once I have a (large-ish) model running, all my VRAM is almost full and GPU struggles to do things like playing back a YouTube video.
Lately I haven't used local AI much, also because I stopped using any coding AIs (as they wasted more time than they saved), I stopped doing local image generations (the AI image generation hype is going down), and for quick questions I just ask ChatGPT, mostly because I also often use web search and other tools, which are quicker on their platform.
Unfortunately, my CPU (5900X) doesn't have an iGPU.
Over the last 5 years, iGPUs fell somewhat out of fashion. Now they may actually make a lot of sense, as there is a clear use case that keeps the dedicated GPU always in use for something other than gaming (and gaming is different, because you don't often multitask while gaming).
I do expect to see a surge in iGPU popularity, or maybe a software improvement to allow having a model always available without constantly hogging the VRAM.
Tim Dettmers' amazing GPU blog post posits that NVLink doesn't start to become useful until you're at 128+ GPUs:
https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
I'm imagining a cluster of directional microphones, and then I don't know if it's better to perform some sort of band-pass filtering first, since it's so computationally cheap, or whether it's better to just feed everything into the model directly. No idea.
I guess my first thought was just that sound from a drone is likely reliably detectable at a greater distance than visuals; they're so small, and a 180-degree by 180-degree hemisphere of pixels is a lot to process.
Fun problem either way.
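For the cheap band-pass filtering idea, a minimal sketch in Python (assuming SciPy is available, a single mono microphone channel, and a guessed 100 Hz to 8 kHz band where rotor noise and its harmonics might sit; the band edges are placeholders, not measured values):

```
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(audio, fs, low_hz=100.0, high_hz=8000.0, order=4):
    """Cheap pre-filter: keep only the band where drone rotor noise is
    expected before handing the signal to a heavier detection model."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, audio)

# Synthetic stand-in for one directional microphone channel.
fs = 44_100
t = np.arange(fs) / fs
mic_channel = np.sin(2 * np.pi * 1200 * t) + 0.3 * np.random.randn(fs)
filtered = bandpass(mic_channel, fs)
```

Whether this actually beats feeding raw audio into the model is the empirical question raised above.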
Should one get lucky and guess the next valid block, that pays the entire month's electricity; since an electric space heater would already be consuming exactly the same kWh as this GPU, there is no "negative cost" to operate.
This machine/GPU used to be my main workhorse, and still has ollama3.2 available — but even with HBM, 8GB of VRAM isn't really relevant in LLM-land.
Am also considering setting up Home Assistant with LLM support again.
I wanted to speak with businesses in my local area but no one took me up on it.
1) [I thought] The page is blocking cut & paste. Super annoying!
2) The exact mainboard is not specified. There are 4 different boards called "ASUS ROG Strix X670E Gaming", and some of them only have one PCIe x16 slot. None of them can do PCIe x8 when using two GPUs.
3) The shopping link for the mainboard leads to the "ASUS ROG Strix X670E-E Gaming" model. This model can only run the 2nd PCIe 5.0 port at x4 speeds. The RTX 3090 can only do PCIe 4.0, of course, so it will run at PCIe 4.0 x4. If you choose a desktop mainboard for two GPUs, make sure it can run at PCIe x8 speeds when both GPU slots are in use! Having NVLink between the GPUs is not a replacement for a fast connection between the CPU+RAM and the GPU and its VRAM (rough bandwidth numbers are sketched after this list).
4) Despite having a last-modified date of September 22nd, he is using his rig mostly with rather outdated or small LLMs, and his benchmarks do not mention their quantization, which makes them useless. They also seem not to be benchmarks at all, but "estimates". Perhaps the headline should be changed to reflect this?
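To put rough numbers on point 3, a minimal sketch (assuming roughly 2 GB/s of theoretical bandwidth per PCIe 4.0 lane and a hypothetical 24 GB of weights to move into VRAM):

```
# Rough PCIe transfer-time comparison for loading model weights into VRAM.
# Assumptions: ~2 GB/s per PCIe 4.0 lane (theoretical), 24 GB of weights.
GB_PER_S_PER_LANE = 2.0
MODEL_GB = 24.0

for lanes in (4, 8, 16):
    bandwidth = lanes * GB_PER_S_PER_LANE
    print(f"PCIe 4.0 x{lanes}: ~{bandwidth:.0f} GB/s, "
          f"~{MODEL_GB / bandwidth:.1f} s to load {MODEL_GB:.0f} GB")
# x4: ~8 GB/s (~3.0 s), x8: ~16 GB/s (~1.5 s), x16: ~32 GB/s (~0.8 s)
```

Real-world rates will be lower, but the relative gap between x4 and x8 is the point.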
A 2x 3090 build is okay for inference, but even with NVLink you're a bit handicapped for training. You're much better off getting a 4090 48GB from China for $2.5k and just using that. Example: https://www.alibaba.com/trade/search?keywords=4090+48gb&pric...
Also, this phrasing is concerning:
> WARNING - these components don't fit if you try to copy this build. The bottom GPU is resting on the Arctic p12 slim fans at the bottom of the case and pushing up on the GPU. Also the top arctic p14 Max fans don't have mounting points for half of their screw holes, and are in place by being very tightly wedged against the motherboard, case, and PSU. Also, there's probably way too much pressure on the pcie cables coming off the gpus when you close the glass.
There are patched drivers for enabling P2P, but if I remember correctly, they are still slower than having NVLink.
I wish AMD and Intel Arc would step up their game.
Look at this: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t... https://www.sparkle.com.tw/files/20250618145718157.pdf
I'm thinking about a low-budget system, which will use:
1. X99 D8 MAX LGA2011-3 motherboard: it has 4 PCIe 3.0 x16 slots and dual CPU sockets. They are priced around $260 including both CPUs.
2. 4x AMD MI50 32GB cards: they are old now, but they have 32 GB of VRAM each and can be sourced at $110 each.
The whole setup would not cost more than $1000. Is it the right build, or can something more performant be built within this budget?
It seems to be a Radeon VII on an Mi50 board, which should technically work. It immediately hangs the first time an OpenCL kernel is run, and doesn't come back up until I reboot. It's possible my issues are due to Mesa or driver config, but I'd strongly recommend buying one to test before going all in.
There are a lot of cheap SXM2 V100s and adapter boards out now, which should perform very well. The adapters unfortunately weren't available when I bought my hardware, or I would have scooped up several.
The 32GB V100s with heatsink are like $600 each, so that would be $1500 or so for a one-off 64GB GPU setup that is less performant overall than a single 3090.
To use the second pair of PCIe slots, you _must_ have two CPUs installed. Just saying, in case someone finds a board with just one CPU socket populated.
I've been running Don't F* With Paste for years for this:
https://chromewebstore.google.com/detail/dont-f-with-paste/n...
Back in my Amiga days we had PowerSnap [1], which did the bargain-basement version of OCR: check the font settings of the window you wanted to cut and paste from, and try to match the font to the bitmap, letting you copy and paste from apps that didn't support it, or from UI elements you normally couldn't.
These days, just throwing the image at an AI model would be far more resilient...
I think we've gotten to the point where it would be hard to compose an image that humans can read but an AI model can't, and easy to compose an image an AI can read but humans can't, so I suspect the only option for your marketing department will be to try to prompt inject the AI into buying your product.
(Oh, look, I have written nearly this same comment once before, 11 years ago, on HN[2] - I was wrong about how it worked, and Orgre was right, and my follow up reply appears to be closer to what it actually does)
[1] https://aminet.net/package/util/cdity/PowerSnap22a
[2] https://news.ycombinator.com/item?id=7631161
Forgive a noob question: I thought the connection to the GPU was actually fairly unimportant once the model was loaded, because sending input to the model and getting a response is low bandwidth? So it might matter if you're changing models a lot or doing a model that can work on video, but otherwise I thought it didn't really matter.
Some boards can run a 5950X in name only, while others can comfortably run it close to double its spec power all day. VRMs are a real differentiator for this tier of hardware.
(If anyone can comment on the airflow required for 400-500W Epyc CPUs with the tiny VRM heatsinks that Supermicro uses, I'm all ears.)
Is that important for this workload? I thought most of the effort was spent processing data on the card rather than moving data on or off of it?
I'm surprised that a "truly offline" workplace allows servers to be taken home and being connected to the internet.
I know some Antarctic research stations (like McMurdo for example) still have connectivity restrictions depending on time-of-day, and I wouldn't be surprised if they also had mirrors of these sort of things, and/or dual-3090 rigs for llama.cpp in the off hours.
I've kept a single GPU to still be able to play a bit with light local models, but not anymore for serious use.
The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.
The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.
I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.
or ~2.2M tk/day. This is how we should be thinking about it imho.
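For reference, that figure follows from sustaining roughly 25 tk/s around the clock:

```
tokens_per_day = 25 * 60 * 60 * 24
print(f"{tokens_per_day:,} tokens/day")  # 2,160,000, i.e. roughly 2.2M tk/day
```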
I also tried to use it with Claude Code via claude-code-router, and it's pretty fast. Roo Code uses bigger contexts, so it's quite a bit slower than Claude Code in general, but I like the workflow better.
This is my snippet for llama-swap:
```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M --split-mode layer --tensor-split 0.48,0.52 --flash-attn on -c 82000 --ubatch-size 512 --cache-type-k q4_1 --cache-type-v q4_1 -ngl 99 --threads -1 --port ${PORT} --host 0.0.0.0 --no-mmap -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K -ngld 99 --kv-unified
```
I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?
I read a lot of good comments on r/localllama, with most people suggesting qwen3 coder 30ba3b, but I never got it to work as well as GLM 4.5 air Q1.
As for using Q2: it will fit in VRAM but only with a very small context, or it will spill over to RAM with quite an impact on speed depending on your setup. I have slow DDR4 RAM, and going for Q1 has been a good compromise for me, but YMMV.
Been looking for more details about software configs on https://llamabuilds.ai
It's a transparent proxy that automatically launches your selected model with your preferred inference server, so you don't need to manually start/stop the server when you want to switch models.
So, let's say I have configured Roo Code to use qwen3 30ba3b as the orchestrator and GLM 4.5 Air as the coder: Roo Code would call the proxy server with model "qwen3" when using orchestrator mode, and the proxy would then kill the llama.cpp instance running qwen3 and restart it with "glm4.5air" when the coder is needed.
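A minimal sketch of what that looks like from the client side, assuming llama-swap is listening on localhost:8080 with its OpenAI-compatible endpoint and that the config defines models named "qwen3" and "glm4.5air" (the address and model names are placeholders from the example above, not from the original post):

```
import requests

LLAMA_SWAP_URL = "http://localhost:8080/v1/chat/completions"  # assumed proxy address

def ask(model, prompt):
    # llama-swap picks the backend from the "model" field and (re)starts the
    # matching llama-server instance before forwarding the request.
    resp = requests.post(LLAMA_SWAP_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

plan = ask("qwen3", "Outline the steps to add a health check endpoint.")  # orchestrator
code = ask("glm4.5air", "Implement this plan:\n" + plan)                  # coder
```

Each switch costs a model reload, so frequent back-and-forth between roles adds latency.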
Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload MoE layers to the CPU? I can run gpt-oss-120b (5.1B active) on my 4080 and get a usable ~20 tk/s. Had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running
One comment in that thread mentions getting almost 30tk/s from gpt-oss-120b on a 3090 with llama.cpp compared to 8tk/s with ollama.
This feature is limited to MoE models, but those seem to be gaining traction with gpt-oss, glm-4.5, and qwen3
It's pretty good.
One of the observations is how much difference memory speed and bandwidth makes, even for CPU inference. Obviously a CPU isn't going to match a GPU for inference speed, but it's an affordable way to run much larger models than you can fit in 24GB or even 48GB of VRAM. If you do run inference on a CPU, you might benefit from some of the same memory optimizations made by gamers: favoring low-latency overclocked RAM.
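A rough way to see why bandwidth dominates: if generation is memory-bandwidth bound, each token streams roughly the active weights once, so an upper bound on tokens/s is bandwidth divided by the size of the active weights. A minimal sketch with illustrative numbers (the bandwidth figures and the 40 GB model size are assumptions, not benchmarks):

```
# Upper-bound tokens/s if generation is memory-bandwidth bound:
# each generated token streams (roughly) the active weights once.
def max_tokens_per_s(bandwidth_gb_s, active_weights_gb):
    return bandwidth_gb_s / active_weights_gb

configs = {
    "Dual-channel DDR4-3200 (~50 GB/s)": 50,
    "Dual-channel DDR5-6000 (~90 GB/s)": 90,
    "RTX 3090 GDDR6X (~936 GB/s)": 936,
}
for name, bw in configs.items():
    print(f"{name}: ~{max_tokens_per_s(bw, 40):.1f} tk/s for 40 GB of active weights")
```

MoE models with a small active parameter count are what make the CPU numbers usable in practice.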
I built a dual 3090 rig, and this point was why I spent a long time looking for a case where the GPUs could fit side by side with a little gap for airflow.
I eventually went with a SilverStone GD11 HTPC, which is a PC case for building a media centre, but it's huge inside, has a front fan that takes up 75% of the width of the case, and also allows the GPUs to stand upright so they don't sag and pull on their thin metal supports.
Highly recommend for a dual GPU build! If you can get dual 5090s instead of 3090s (good luck!) you'd even be able to get "good" airflow in this case.
What's so special about this one?
Oh look, here's one for $43K: https://www.llamabuilds.ai/build/a16zs-personal-ai-workstati...
5 more comments available on Hacker News