Run 35B LLMs on Dual Pascal GPUs with QLoRA
Hi HN,
I built a system to run 35B-parameter language models on older Pascal GPUs (P100 +
GTX 1080 Ti) using multi-GPU memory spillover.
Problem: Most LLM inference tools (Ollama, LM Studio) are limited to a single GPU's
VRAM (roughly 13B models max on a 16GB card). If you have multiple older GPUs, the
second one sits idle.
Solution: Multi-GPU + CPU memory spillover with QLoRA 4-bit quantization. The system
automatically distributes layers across GPU0 → GPU1 → CPU RAM, enabling 35B models on
hardware that normally maxes out at 13B.
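For the curious, the core of the spillover is Accelerate's device mapping. A minimal
sketch with Transformers (the memory caps here are illustrative, not the repo's exact
values; tune them to your cards):

    # Sketch: spill model layers across GPU0 -> GPU1 -> CPU RAM.
    # Caps sit below physical VRAM so activations and KV cache still fit.
    import torch
    from transformers import AutoModelForCausalLM

    max_memory = {0: "15GiB", 1: "10GiB", "cpu": "48GiB"}

    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/Qwen2.5-14B-Instruct",
        device_map="auto",          # Accelerate assigns layers in order
        max_memory=max_memory,
        torch_dtype=torch.float16,  # Pascal has no bf16; use fp16
    )
    print(model.hf_device_map)      # see which layers landed where

Anything that doesn't fit under the GPU caps is offloaded to CPU RAM, which is why
CodeLlama-34B still runs (slowly) in the benchmarks below.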
Benchmarks (P100 16GB + GTX 1080 Ti 11GB):
- Qwen-14B: 13.7 tokens/sec (9.4GB VRAM)
- OPT-30B: 5.4 tokens/sec (15.2GB VRAM)
- CodeLlama-34B: 0.8 tokens/sec (16.7GB VRAM)
Quick start:

    docker pull rickeshtn/large-model-international_release:latest

    docker run -it --rm --runtime=nvidia --gpus all --ipc=host \
      --ulimit memlock=-1 --ulimit stack=268435456 \
      -v $(pwd):/workspace -e HF_HOME=/workspace/model_cache \
      rickeshtn/large-model-international_release:latest \
      python /app/interactive_chat.py --model-name Qwen/Qwen2.5-14B-Instruct
Technical details:
- QLoRA 4-bit NF4 quantization (75% memory reduction; see the sketch after this list)
- HuggingFace Transformers + Accelerate + bitsandbytes
- Automatic device mapping with CPU offload
- Interactive chat with conversation persistence
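The quantization side is the standard bitsandbytes path. Roughly (the exact settings
in the repo may differ; the llm_int8_enable_fp32_cpu_offload flag is what lets a
quantized model spill to CPU):

    # Sketch: 4-bit NF4 load + generation, common QLoRA-style settings.
    import torch
    from transformers import (
        AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    )

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 from the QLoRA paper
        bnb_4bit_compute_dtype=torch.float16,  # fp16 compute on Pascal
        bnb_4bit_use_double_quant=True,        # also quantize the quant constants
        llm_int8_enable_fp32_cpu_offload=True, # allow offloaded layers on CPU
    )

    name = "Qwen/Qwen2.5-14B-Instruct"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(
        name, quantization_config=bnb, device_map="auto"
    )

    prompt = "Explain CPU offloading in one sentence."
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    print(tok.decode(out[0], skip_special_tokens=True))

Back-of-envelope on the 75%: a 34B model at fp16 is ~68GB of weights; at 4 bits it's
~17GB, which is how it fits in 27GB of combined VRAM (plus a small overhead for the
quantization constants).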
GitHub: https://github.com/rickeshtn/locallm-pascal
Docker Hub: https://hub.docker.com/r/rickeshtn/large-model-international_release
34 users already running it. Happy to answer technical questions!