Sharing a Base Model in GPU VRAM Across Multiple Inference Stack Processes [video]
Posted 4 months ago · youtube.com · Tech story
Key topics
GPU Optimization
AI Inference
VRAM Management
The post shares a demo of the WoolyAI GPU Hypervisor, which enables sharing a base model in GPU VRAM across multiple inference stack processes, increasing per-GPU capacity while keeping each stack isolated. The poster's comment outlines the benefits and potential applications of this approach.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion · First comment: N/A · Peak period: 1 comment (Start) · Avg per period: 1
Key moments
- Story posted: Sep 3, 2025 at 3:48 PM EDT (4 months ago)
- First comment: Sep 3, 2025 at 3:48 PM EDT (0s after posting)
- Peak activity: 1 comment at the start, the hottest window of the conversation
- Latest activity: Sep 3, 2025 at 3:48 PM EDT (4 months ago)
ID: 45119743 · Type: story · Last synced: 11/17/2025, 10:09:41 PM
Why this matters
Higher capacity: Share the base model in VRAM and add more adapters or vertical inference stacks per GPU without increasing memory usage (see the sketch after this list).
Isolation & control: Each stack is its own process with independent batching and SLA-aware scheduling.
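As a point of reference, here is a minimal single-process sketch of the adapter-sharing idea using Hugging Face Transformers and PEFT; the base model name and adapter paths are placeholders, not taken from the demo. The base weights sit in VRAM once, and each LoRA adapter adds only its own small set of weights:

```python
# Minimal sketch: one copy of the base weights in VRAM, several LoRA adapters on top.
# Model name and adapter paths are placeholders, not from the original post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.float16, device_map="auto"
)

# Attach two adapters to the same frozen base; only the adapter weights add VRAM.
model = PeftModel.from_pretrained(base, "adapters/billing", adapter_name="billing")
model.load_adapter("adapters/support", adapter_name="support")

def generate(adapter: str, prompt: str) -> str:
    model.set_adapter(adapter)  # switch which LoRA delta is applied
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(generate("billing", "Summarize this invoice dispute:"))
```

Within one process this is roughly what vLLM's multi-LoRA support automates; the hypervisor approach in the post extends the same base-weight sharing across separate processes.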
While vLLM supports multiple adapters in a single vLLM process, many teams need predictable per-adapter SLAs. Running independent stacks that share a base model in VRAM makes it possible to meet those SLAs while keeping everything on the same GPU.
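Here is a sketch of that independent-stacks pattern using vLLM's OpenAI-compatible server, one process per adapter, each with its own batching and port. The flags follow the vLLM CLI, but exact flag names can vary by version, and the model name, adapter paths, ports, and memory split are placeholders. Without cross-process VRAM sharing, each process loads its own full copy of the base weights; that duplication is what a shared base model in VRAM removes:

```python
# Sketch: one vLLM server process per adapter ("independent stacks").
# Each process gets its own batching, scheduling, and port, but without
# cross-process sharing each one also loads a full copy of the base weights.
# Model name, adapter paths, ports, and memory split are placeholders.
import subprocess

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder base model
STACKS = [
    ("billing", "adapters/billing", 8000),
    ("support", "adapters/support", 8001),
]

procs = [
    subprocess.Popen([
        "vllm", "serve", BASE,
        "--enable-lora",
        "--lora-modules", f"{name}={path}",
        "--port", str(port),
        "--gpu-memory-utilization", "0.45",  # naive static split of one GPU
    ])
    for name, path, port in STACKS
]

for p in procs:
    p.wait()
```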
The demo runs LoRA inference with PyTorch, but the same approach applies with vLLM. If you’re scaling LoRA inference across business units or model variants and need predictable latency without overprovisioning GPUs, I’d love your feedback. Comment or DM to chat.
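For the vLLM side mentioned above, a minimal offline multi-LoRA sketch using vLLM's LoRARequest API (model and adapter path are placeholders, not from the demo):

```python
# Minimal sketch of LoRA inference in a single vLLM process.
# Model and adapter path are placeholders, not from the demo.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=4)
params = SamplingParams(max_tokens=64)

# Each request can name a different adapter; the base weights stay shared in VRAM.
outputs = llm.generate(
    ["Summarize this invoice dispute:"],
    params,
    lora_request=LoRARequest("billing", 1, "adapters/billing"),
)
print(outputs[0].outputs[0].text)
```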