25L Portable NV-Linked Dual 3090 LLM Rig
Posted 4 months ago · Active 4 months ago
reddit.com · Tech · story · High profile
Tone: calm, mixed · Debate: 60/100
Key topics
LLM
GPU Computing
Hardware Builds
A Reddit post showcases a 25L portable NV-linked dual 3090 LLM rig, sparking discussion on its feasibility, cost, and potential applications, as well as technical concerns and comparisons to other builds.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: N/A
Peak period: 58 comments (Day 5)
Avg / period: 19.3
Comment distribution: 116 data points
Based on 116 loaded comments
Key moments
- 01 Story posted: Sep 19, 2025 at 8:06 AM EDT (4 months ago)
- 02 First comment: Sep 19, 2025 at 8:06 AM EDT (0s after posting)
- 03 Peak activity: 58 comments in Day 5 (hottest window of the conversation)
- 04 Latest activity: Sep 28, 2025 at 4:08 AM EDT (4 months ago)
ID: 45300668 · Type: story · Last synced: 11/20/2025, 4:38:28 PM
Even in 2025 it's cool how solid a setup dual 3090s still is. NVLink is an absolute must, but it's incredibly powerful. I'm able to run the latest Mistral thinking models and relatively powerful YOLO-based VLMs like the ones Roboflow is based on.
Curious if anyone else is still using 3090s or has feedback on scaling up to 4-6 3090s.
Thanks everyone ;)
I used:
https://c-payne.com/
Very high quality and manageable prices.
The most important figure is the power consumed per token generated. You can optimize for that and get to a reasonably efficient system, or you can maximize token generation speed and end up with two times the power consumption for very little gain. You also will likely need to have a way to get rid of excess heat and all those fans get loud. I stuck the system in my garage, that made the noise much more manageable.
https://www.tomshardware.com/news/asus-blower-rtx3090
is the model that I have.
A used 3090 is around $900 on eBay; a used RTX 6000 Ada is around $5k.
Four 3090s are slower at inference and worse at training than one RTX 6000.
4x 3090 would consume 1400W at load.
An RTX 6000 would consume 300W at load.
If you, god forbid, live in California and your power averages 45 cents per kWh, 4x 3090 would cost $1500+ more per year to operate than a single RTX 6000 [0].
[0] Back-of-the-napkin/ChatGPT calculation of running the GPUs at load for 8 hours per day.
Note: I own a PC with a 3090, but if I had to build an AI training workstation, I would seriously consider cost to operate and resale value (per component).
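A quick sanity check of that $1500 figure, as a minimal sketch in Python using the wattages, the $0.45/kWh rate, and the 8 hours/day at load quoted above:

```
# Back-of-the-napkin yearly electricity cost, using the numbers quoted above:
# 1400W for 4x 3090, 300W for one RTX 6000, $0.45/kWh, 8 hours/day at load.
RATE_PER_KWH = 0.45
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 365

def yearly_cost(watts):
    kwh_per_year = watts / 1000 * HOURS_PER_DAY * DAYS_PER_YEAR
    return kwh_per_year * RATE_PER_KWH

cost_3090s = yearly_cost(1400)    # ~$1,840/year
cost_rtx6000 = yearly_cost(300)   # ~$394/year
print(f"difference: ${cost_3090s - cost_rtx6000:,.0f}/year")  # ~$1,445/year
```

Which lands in the same ballpark as the $1500+ estimate.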
Since you're exploring options just for fun, out of curiosity: would you rent it out whenever you're not using it yourself, so it's not just sitting idle? (It could get noisy.) You'd be able to use your computer for other work at the same time and stop whenever you wanted to use it yourself.
Just checked vast.ai. I will be losing money with 3090 at my electricity cost and making a tiny bit with rtx 6000.
Like with boats, it's probably better to rent GPUs than to buy them.
Pictures, or it never happened! :D
Think of the max wattage like a car's max horsepower: a car might make 350 HP, but that doesn't mean it stays at 350 HP all day long; there's a curve to it. At the low end it might be making 170 HP, and you need to floor the gas pedal to get to that 350 HP. Same with these GPUs. Most people calculate gas mileage by finding how much gas a car consumes at its peak and say, oh, 6 mpg when it's making 350 HP, so with your 20-gallon tank you have a range of 120 miles. Which obviously isn't true.
It's in my main workstation, and my idea was to always have Ollama running locally. The problem is that once I have a (large-ish) model running, all my VRAM is almost full and GPU struggles to do things like playing back a YouTube video.
Lately I haven't used local AI much, also because I stopped using any coding AIs (as they wasted more time than they saved), I stopped doing local image generations (the AI image generation hype is going down), and for quick questions I just ask ChatGPT, mostly because I also often use web search and other tools, which are quicker on their platform.
Unfortunately, my CPU (5900X) doesn't have an iGPU.
Over the last 5 years, iGPUs fell somewhat out of fashion. Now they may actually make a lot of sense, as there is a clear use case that keeps the dedicated GPU always in use for something other than gaming (and gaming is different, because you don't often multitask while gaming).
I do expect to see a surge in iGPU popularity, or maybe a software improvement to allow having a model always available without constantly hogging the VRAM.
Tim Dettmers' amazing GPU blog post posits that NVLink doesn't start to become useful until you're at 128+ GPUs:
https://timdettmers.com/2023/01/30/which-gpu-for-deep-learni...
I'm imagining a cluster of directional microphones, and then I don't know if it's better to perform some sort of band-pass filtering first, since it's so computationally cheap, or whether it's better to just feed everything into the model directly. No idea.
I guess my first thought was just that sound from a drone is likely reliably detectable at a greater distance than visuals; they're so small, and a 180-degree by 180-degree hemisphere of pixels is a lot to process.
Fun problem either way.
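For the cheap band-pass filtering idea, a minimal sketch in Python (assuming SciPy is available, a single mono microphone channel, and a guessed 100 Hz to 8 kHz band where rotor noise and its harmonics might sit; the band edges are placeholders, not measured values):

```
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(audio, fs, low_hz=100.0, high_hz=8000.0, order=4):
    """Cheap pre-filter: keep only the band where drone rotor noise is
    expected before handing the signal to a heavier detection model."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, audio)

# Synthetic stand-in for one directional microphone channel.
fs = 44_100
t = np.arange(fs) / fs
mic_channel = np.sin(2 * np.pi * 1200 * t) + 0.3 * np.random.randn(fs)
filtered = bandpass(mic_channel, fs)
```

Whether this actually beats feeding raw audio into the model is the empirical question raised above.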
Should one get lucky and guess the next valid block, that pays the entire month's electricity; since an electric space heater would already be consuming exactly the same kWh as this GPU, there is no "negative cost" to operate.
This machine/GPU used to be my main workhorse, and still has ollama3.2 available — but even with HBM, 8GB of VRAM isn't really relevant in LLM-land.
Am also considering setting up Home Assistant with LLM support again.
I wanted to speak with businesses in my local area but no one took me up on it.
1) [I thought] The page is blocking cut & paste. Super annoying!
2) The exact mainboard is not specified. There are 4 different boards called "ASUS ROG Strix X670E Gaming", and some of them only have one PCIe x16 slot. None of them can do PCIe x8 when using two GPUs.
3) The shopping link for the mainboard leads to the "ASUS ROG Strix X670E-E Gaming" model. This model can only run the 2nd PCIe 5.0 port at x4 speeds. The RTX 3090 can only do PCIe 4.0, of course, so it will run at PCIe 4.0 x4. If you choose a desktop mainboard for two GPUs, make sure it can run at PCIe x8 speeds when both GPU slots are in use! Having NVLink between the GPUs is not a replacement for a fast connection between the CPU+RAM and the GPU and its VRAM (rough bandwidth numbers are sketched after this list).
4) Despite having a last-modified date of September 22nd, he is using his rig mostly with rather outdated or small LLMs, and his benchmarks do not mention their quantization, which makes them useless. They also seem not to be benchmarks at all, but "estimates". Perhaps the headline should be changed to reflect this?
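To put rough numbers on point 3, a minimal sketch (assuming roughly 2 GB/s of theoretical bandwidth per PCIe 4.0 lane and a hypothetical 24 GB of weights to move into VRAM):

```
# Rough PCIe transfer-time comparison for loading model weights into VRAM.
# Assumptions: ~2 GB/s per PCIe 4.0 lane (theoretical), 24 GB of weights.
GB_PER_S_PER_LANE = 2.0
MODEL_GB = 24.0

for lanes in (4, 8, 16):
    bandwidth = lanes * GB_PER_S_PER_LANE
    print(f"PCIe 4.0 x{lanes}: ~{bandwidth:.0f} GB/s, "
          f"~{MODEL_GB / bandwidth:.1f} s to load {MODEL_GB:.0f} GB")
# x4: ~8 GB/s (~3.0 s), x8: ~16 GB/s (~1.5 s), x16: ~32 GB/s (~0.8 s)
```

Real-world rates will be lower, but the relative gap between x4 and x8 is the point.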
A 2x 3090 build is okay for inference, but even with NVLink you're a bit handicapped for training. You're much better off getting a 4090 48GB from China for $2.5k and just using that. Example: https://www.alibaba.com/trade/search?keywords=4090+48gb&pric...
Also, this phrasing is concerning:
> WARNING - these components don't fit if you try to copy this build. The bottom GPU is resting on the Arctic p12 slim fans at the bottom of the case and pushing up on the GPU. Also the top arctic p14 Max fans don't have mounting points for half of their screw holes, and are in place by being very tightly wedged against the motherboard, case, and PSU. Also, there's probably way too much pressure on the pcie cables coming off the gpus when you close the glass.
There are patched drivers for enabling P2P, but if I remember correctly, they are still slower than having NVLink.
I wish AMD and Intel Arc would step up their game.
Look at this: https://www.maxsun.com/products/intel-arc-pro-b60-dual-48g-t... https://www.sparkle.com.tw/files/20250618145718157.pdf
I'm thinking about a low-budget system, which will use:
1. X99 D8 MAX LGA2011-3 motherboard: it has 4 PCIe 3.0 x16 slots and dual CPU sockets. They are priced around $260 including both CPUs.
2. 4x AMD MI50 32GB cards: they are old now, but they have 32 GB of VRAM each and can be sourced at $110 each.
The whole setup would not cost more than $1000. Is it the right build, or can something more performant be built within this budget?
It seems to be a Radeon VII on an Mi50 board, which should technically work. It immediately hangs the first time an OpenCL kernel is run, and doesn't come back up until I reboot. It's possible my issues are due to Mesa or driver config, but I'd strongly recommend buying one to test before going all in.
There are a lot of cheap SXM2 V100s and adapter boards out now, which should perform very well. The adapters unfortunately weren't available when I bought my hardware, or I would have scooped up several.
The 32GB V100s with heatsink are like $600 each, so that would be $1500 or so for a one-off 64GB GPU setup that is less performant overall than a single 3090.
To use the second pair of PCIe slots, you _must_ have two CPUs installed. Just saying, in case someone finds a board with just one CPU socket populated.
I've been running Don't F* With Paste for years for this:
https://chromewebstore.google.com/detail/dont-f-with-paste/n...
Back in my Amiga days we had PowerSnap [1], which did the bargain-basement version of OCR: check the font settings of the window you wanted to cut and paste from, and try to match the font to the bitmap, letting you copy and paste from apps that didn't support it, or from UI elements you normally couldn't.
These days, just throwing the image at an AI model would be far more resilient...
I think we've gotten to the point where it would be hard to compose an image that humans can read but an AI model can't, and easy to compose an image an AI can read but humans can't, so I suspect the only option for your marketing department will be to try to prompt inject the AI into buying your product.
(Oh, look, I have written nearly this same comment once before, 11 years ago, on HN[2] - I was wrong about how it worked, and Orgre was right, and my follow up reply appears to be closer to what it actually does)
[1] https://aminet.net/package/util/cdity/PowerSnap22a
[2] https://news.ycombinator.com/item?id=7631161
Forgive a noob question: I thought the connection to the GPU was actually fairly unimportant once the model was loaded, because sending input to the model and getting a response is low bandwidth? So it might matter if you're changing models a lot or doing a model that can work on video, but otherwise I thought it didn't really matter.
Some boards can run a 5950X in name only, while others can comfortably run it close to double its spec power all day. VRMs are a real differentiator for this tier of hardware.
(If anyone can comment on the airflow required for 400-500W Epyc CPUs with the tiny VRM heatsinks that Supermicro uses, I'm all ears.)
Is that important for this workload? I thought most of the effort was spent processing data on the card rather than moving data on or off of it?
I'm surprised that a "truly offline" workplace allows servers to be taken home and being connected to the internet.
I know some Antarctic research stations (like McMurdo for example) still have connectivity restrictions depending on time-of-day, and I wouldn't be surprised if they also had mirrors of these sort of things, and/or dual-3090 rigs for llama.cpp in the off hours.
I've kept a single GPU to still be able to play a bit with light local models, but not anymore for serious use.
The issue is not that it's slow. 20-30 tk/s is perfectly acceptable to me.
The issue is that the quality of the models that I'm able to self-host pales in comparison to that of SOTA hosted models. They hallucinate more, don't follow prompts as well, and simply generate overall worse quality content. These are issues that plague all "AI" models, but they are particularly evident on open weights ones. Maybe this is less noticeable on behemoth 100B+ parameter models, but to run those I would need to invest much more into this hobby than I'm willing to do.
I still run inference locally for simple one-off tasks. But for anything more sophisticated, hosted models are unfortunately required.
or ~2.2M tk/day. This is how we should be thinking about it imho.
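For reference, that figure follows from sustaining roughly 25 tk/s around the clock:

```
tokens_per_day = 25 * 60 * 60 * 24
print(f"{tokens_per_day:,} tokens/day")  # 2,160,000, i.e. roughly 2.2M tk/day
```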
I also tried to use it with Claude Code via claude-code-router, and it's pretty fast. Roo Code uses bigger contexts, so it's quite a bit slower than Claude Code in general, but I like the workflow better.
This is my snippet for llama-swap:
```
models:
  "glm45-air":
    healthCheckTimeout: 300
    cmd: |
      llama.cpp/build/bin/llama-server -hf unsloth/GLM-4.5-Air-GGUF:IQ1_M --split-mode layer --tensor-split 0.48,0.52 --flash-attn on -c 82000 --ubatch-size 512 --cache-type-k q4_1 --cache-type-v q4_1 -ngl 99 --threads -1 --port ${PORT} --host 0.0.0.0 --no-mmap -hfd mradermacher/GLM-4.5-DRAFT-0.6B-v3.0-i1-GGUF:Q6_K -ngld 99 --kv-unified
```
I see that the Q2 version is around 42GB, which might be doable on 2x 3090s, even if some of it spills over to CPU/RAM. Have you tried Q2?
I read a lot of good comments on r/localllama, with most people suggesting qwen3 coder 30ba3b, but I never got it to work as well as GLM 4.5 air Q1.
As for using Q2: it will fit in VRAM but only with a very small context, or it will spill over to RAM with quite an impact on speed depending on your setup. I have slow DDR4 RAM, and going for Q1 has been a good compromise for me, but YMMV.
Been looking for more details about software configs on https://llamabuilds.ai
It's a transparent proxy that automatically launches your selected model with your preferred inference server, so you don't need to manually start/stop the server when you want to switch models.
So, let's say I have configured Roo Code to use qwen3 30ba3b as the orchestrator and GLM 4.5 Air as the coder: Roo Code would call the proxy server with model "qwen3" when using orchestrator mode, and the proxy would then kill the llama.cpp instance running qwen3 and restart it with "glm4.5air" when the coder is needed.
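A minimal sketch of what that looks like from the client side, assuming llama-swap is listening on localhost:8080 with its OpenAI-compatible endpoint and that the config defines models named "qwen3" and "glm4.5air" (the address and model names are placeholders from the example above, not from the original post):

```
import requests

LLAMA_SWAP_URL = "http://localhost:8080/v1/chat/completions"  # assumed proxy address

def ask(model, prompt):
    # llama-swap picks the backend from the "model" field and (re)starts the
    # matching llama-server instance before forwarding the request.
    resp = requests.post(LLAMA_SWAP_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

plan = ask("qwen3", "Outline the steps to add a health check endpoint.")  # orchestrator
code = ask("glm4.5air", "Implement this plan:\n" + plan)                  # coder
```

Each switch costs a model reload, so frequent back-and-forth between roles adds latency.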
Have you tried newer MoE models with llama.cpp's recent '--n-cpu-moe' option to offload MoE layers to the CPU? I can run gpt-oss-120b (5.1B active) on my 4080 and get a usable ~20 tk/s. Had to upgrade my system RAM, but that's easier. https://github.com/ggml-org/llama.cpp/discussions/15396 has a bit on getting that running
One comment in that thread mentions getting almost 30tk/s from gpt-oss-120b on a 3090 with llama.cpp compared to 8tk/s with ollama.
This feature is limited to MoE models, but those seem to be gaining traction with gpt-oss, glm-4.5, and qwen3
It's pretty good.
One of the observations is how much difference memory speed and bandwidth makes, even for CPU inference. Obviously a CPU isn't going to match a GPU for inference speed, but it's an affordable way to run much larger models than you can fit in 24GB or even 48GB of VRAM. If you do run inference on a CPU, you might benefit from some of the same memory optimizations made by gamers: favoring low-latency overclocked RAM.
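A rough way to see why bandwidth dominates: if generation is memory-bandwidth bound, each token streams roughly the active weights once, so an upper bound on tokens/s is bandwidth divided by the size of the active weights. A minimal sketch with illustrative numbers (the bandwidth figures and the 40 GB model size are assumptions, not benchmarks):

```
# Upper-bound tokens/s if generation is memory-bandwidth bound:
# each generated token streams (roughly) the active weights once.
def max_tokens_per_s(bandwidth_gb_s, active_weights_gb):
    return bandwidth_gb_s / active_weights_gb

configs = {
    "Dual-channel DDR4-3200 (~50 GB/s)": 50,
    "Dual-channel DDR5-6000 (~90 GB/s)": 90,
    "RTX 3090 GDDR6X (~936 GB/s)": 936,
}
for name, bw in configs.items():
    print(f"{name}: ~{max_tokens_per_s(bw, 40):.1f} tk/s for 40 GB of active weights")
```

MoE models with a small active parameter count are what make the CPU numbers usable in practice.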
I built a dual 3090 rig, and this point was why I spent a long time looking for a case where the GPUs could fit side by side with a little gap for airflow.
I eventually went with a SilverStone GD11 HTPC, which is a PC case for building a media centre, but it's huge inside, has a front fan that takes up 75% of the width of the case, and also allows the GPUs to stand upright so they don't sag and pull on their thin metal supports.
Highly recommend for a dual GPU build! If you can get dual 5090s instead of 3090s (good luck!) you'd even be able to get "good" airflow in this case.
What's so special about this one?
Oh look, here's one for $43K: https://www.llamabuilds.ai/build/a16zs-personal-ai-workstati...
5 more comments available on Hacker News