Sharing a Base Model in GPU VRAM Across Multiple Inference Stack Processes [video]
Posted 4 months ago · youtube.com · Tech story
Key topics
GPU Optimization
AI Inference
VRAM Management
The post shares a demo of the WoolyAI GPU Hypervisor, which enables sharing a base model in GPU VRAM across multiple inference stack processes, increasing per-GPU capacity while keeping each stack isolated. The poster's comment outlines the benefits and potential applications of this approach.
Snapshot generated from the HN discussion
Discussion Activity
Light discussion · First comment: N/A · Peak period: 1 comment (Start) · Avg per period: 1
Key moments
- Story posted: Sep 3, 2025 at 3:48 PM EDT (4 months ago)
- First comment: Sep 3, 2025 at 3:48 PM EDT (0s after posting)
- Peak activity: 1 comment at the start, the hottest window of the conversation
- Latest activity: Sep 3, 2025 at 3:48 PM EDT (4 months ago)
ID: 45119743 · Type: story · Last synced: 11/17/2025, 10:09:41 PM
Why this matters
Higher capacity: Share the base model in VRAM and add more adapters or vertical inference stacks per GPU without increasing memory usage (see the sketch after this list).
Isolation & control: Each stack is its own process with independent batching and SLA-aware scheduling.
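As a point of reference, here is a minimal single-process sketch of the adapter-sharing idea using Hugging Face Transformers and PEFT; the base model name and adapter paths are placeholders, not taken from the demo. The base weights sit in VRAM once, and each LoRA adapter adds only its own small set of weights:

```python
# Minimal sketch: one copy of the base weights in VRAM, several LoRA adapters on top.
# Model name and adapter paths are placeholders, not from the original post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.float16, device_map="auto"
)

# Attach two adapters to the same frozen base; only the adapter weights add VRAM.
model = PeftModel.from_pretrained(base, "adapters/billing", adapter_name="billing")
model.load_adapter("adapters/support", adapter_name="support")

def generate(adapter: str, prompt: str) -> str:
    model.set_adapter(adapter)  # switch which LoRA delta is applied
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

print(generate("billing", "Summarize this invoice dispute:"))
```

Within one process this is roughly what vLLM's multi-LoRA support automates; the hypervisor approach in the post extends the same base-weight sharing across separate processes.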
While vLLM supports multiple adapters in a single vLLM process, many teams need predictable per-adapter SLAs. Running independent stacks that share a base model in VRAM makes it possible to meet those SLAs while keeping everything on the same GPU.
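Here is a sketch of that independent-stacks pattern using vLLM's OpenAI-compatible server, one process per adapter, each with its own batching and port. The flags follow the vLLM CLI, but exact flag names can vary by version, and the model name, adapter paths, ports, and memory split are placeholders. Without cross-process VRAM sharing, each process loads its own full copy of the base weights; that duplication is what a shared base model in VRAM removes:

```python
# Sketch: one vLLM server process per adapter ("independent stacks").
# Each process gets its own batching, scheduling, and port, but without
# cross-process sharing each one also loads a full copy of the base weights.
# Model name, adapter paths, ports, and memory split are placeholders.
import subprocess

BASE = "meta-llama/Llama-2-7b-hf"  # placeholder base model
STACKS = [
    ("billing", "adapters/billing", 8000),
    ("support", "adapters/support", 8001),
]

procs = [
    subprocess.Popen([
        "vllm", "serve", BASE,
        "--enable-lora",
        "--lora-modules", f"{name}={path}",
        "--port", str(port),
        "--gpu-memory-utilization", "0.45",  # naive static split of one GPU
    ])
    for name, path, port in STACKS
]

for p in procs:
    p.wait()
```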
The demo runs LoRA inference with PyTorch, but the same approach applies with vLLM. If you’re scaling LoRA inference across business units or model variants and need predictable latency without overprovisioning GPUs, I’d love your feedback. Comment or DM to chat.
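For the vLLM side mentioned above, a minimal offline multi-LoRA sketch using vLLM's LoRARequest API (model and adapter path are placeholders, not from the demo):

```python
# Minimal sketch of LoRA inference in a single vLLM process.
# Model and adapter path are placeholders, not from the demo.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True, max_loras=4)
params = SamplingParams(max_tokens=64)

# Each request can name a different adapter; the base weights stay shared in VRAM.
outputs = llm.generate(
    ["Summarize this invoice dispute:"],
    params,
    lora_request=LoRARequest("billing", 1, "adapters/billing"),
)
print(outputs[0].outputs[0].text)
```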