Run 35B LLMs on Dual Pascal GPUs with QLoRA
Hi HN,
I built a system to run 35B-parameter language models on older Pascal GPUs (P100 +
GTX 1080 Ti) using multi-GPU memory spillover.
Problem: Most LLM inference tools (Ollama, LM Studio) are limited to a single GPU's
VRAM (roughly 13B models max on a 16GB card). If you have multiple older GPUs, the
second one sits idle.
Solution: Multi-GPU + CPU memory spillover with QLoRA 4-bit quantization. The system
automatically distributes layers across GPU0 → GPU1 → CPU RAM, enabling 35B models on
hardware that normally maxes out at 13B.
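For the curious, the core of the spillover is Accelerate's device mapping. A minimal
sketch with Transformers (the memory caps here are illustrative, not the repo's exact
values; tune them to your cards):

    # Sketch: spill model layers across GPU0 -> GPU1 -> CPU RAM.
    # Caps sit below physical VRAM so activations and KV cache still fit.
    import torch
    from transformers import AutoModelForCausalLM

    max_memory = {0: "15GiB", 1: "10GiB", "cpu": "48GiB"}

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-14B-Instruct",
        device_map="auto",          # Accelerate assigns layers in order
        max_memory=max_memory,
        torch_dtype=torch.float16,  # Pascal has no bf16; use fp16
    )
    print(model.hf_device_map)      # see which layers landed where

Anything that doesn't fit under the GPU caps is offloaded to CPU RAM, which is why
CodeLlama-34B still runs (slowly) in the benchmarks below.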
Benchmarks (P100 16GB + GTX 1080 Ti 11GB):
- Qwen-14B: 13.7 tokens/sec (9.4GB VRAM)
- OPT-30B: 5.4 tokens/sec (15.2GB VRAM)
- CodeLlama-34B: 0.8 tokens/sec (16.7GB VRAM)
Quick start:

    docker pull rickeshtn/large-model-international_release:latest

    docker run -it --rm --runtime=nvidia --gpus all --ipc=host \
      --ulimit memlock=-1 --ulimit stack=268435456 \
      -v $(pwd):/workspace -e HF_HOME=/workspace/model_cache \
      rickeshtn/large-model-international_release:latest \
      python /app/interactive_chat.py --model-name Qwen/Qwen2.5-14B-Instruct
Technical details:
- QLoRA 4-bit NF4 quantization (75% memory reduction; see the sketch after this list)
- HuggingFace Transformers + Accelerate + bitsandbytes
- Automatic device mapping with CPU offload
- Interactive chat with conversation persistence
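The quantization side is the standard bitsandbytes path. Roughly (the exact settings
in the repo may differ; the llm_int8_enable_fp32_cpu_offload flag is what lets a
quantized model spill to CPU):

    # Sketch: 4-bit NF4 load + generation, common QLoRA-style settings.
    import torch
    from transformers import (
        AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    )

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 from the QLoRA paper
        bnb_4bit_compute_dtype=torch.float16,  # fp16 compute on Pascal
        bnb_4bit_use_double_quant=True,        # also quantize the quant constants
        llm_int8_enable_fp32_cpu_offload=True, # allow offloaded layers on CPU
    )

    name = "Qwen/Qwen2.5-14B-Instruct"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, quantization_config=bnb, device_map="auto"
    )

    prompt = "Explain CPU offloading in one sentence."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))

Back-of-envelope on the 75%: a 34B model at fp16 is ~68GB of weights; at 4 bits it's
~17GB, which is how it fits in 27GB of combined VRAM (plus a small overhead for the
quantization constants).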
GitHub: https://github.com/rickeshtn/locallm-pascal
Docker Hub: https://hub.docker.com/r/rickeshtn/large-model-international_release
34 users already running it. Happy to answer technical questions!