Show HN: LLMKube – Kubernetes for Local LLMs with GPU Acceleration
Mood: supportive
Sentiment: positive
Category: tech
Key topics: Kubernetes, LLMs, GPU Acceleration
Why this exists: Regulated industries (healthcare, defense, finance) need air-gapped LLM deployments, but existing tools are either single-node only (Ollama) or lack GPU optimization and SLO enforcement. LLMKube bridges the gap.
What's working:
- ~14x speedup with NVIDIA GPUs (64 tok/s on Llama 3.2 3B vs. 4.6 tok/s on CPU)
- One command: llmkube deploy llama-3b --gpu handles CUDA setup, GPU scheduling, and layer offloading automatically (see the end-to-end sketch after this list)
- Production observability: Prometheus + Grafana + DCGM GPU metrics out of the box
- OpenAI-compatible API endpoints
- Terraform configs for GKE GPU clusters with auto-scale to zero
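To make the workflow concrete, here is a minimal end-to-end sketch. The deploy command is quoted from the post; everything after it is an assumption — the pod label, service name, and port are guesses (llama.cpp's server conventionally listens on 8080, and "OpenAI-compatible" implies the standard /v1/chat/completions request shape), so adjust them to whatever llmkube actually creates.

```sh
# Deploy Llama 3.2 3B with GPU acceleration (command quoted from the post).
llmkube deploy llama-3b --gpu

# Assumption: the operator creates a pod and service named after the model.
kubectl get pods -l app=llama-3b
kubectl port-forward svc/llama-3b 8080:8080 &

# OpenAI-compatible endpoint: the standard chat-completions request shape.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3b",
    "messages": [{"role": "user", "content": "Say hello in five words."}]
  }'
```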
Tech: Kubernetes CRDs, llama.cpp with CUDA, NVIDIA GPU Operator, cost-optimized spot instances (~$50-150/mo for dev workloads).
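The post mentions Kubernetes CRDs but does not show a schema, so the manifest below is a purely hypothetical sketch of what a model-deployment custom resource of this kind often looks like. The apiVersion, kind, and every field name are invented for illustration and will differ from LLMKube's actual CRD.

```sh
# Hypothetical custom resource -- the group, kind, and fields below are
# illustrative guesses, not LLMKube's real schema.
cat <<'EOF' | kubectl apply -f -
apiVersion: llmkube.example.com/v1alpha1
kind: ModelDeployment
metadata:
  name: llama-3b
spec:
  model: llama-3.2-3b        # GGUF model served by llama.cpp
  gpu:
    enabled: true            # schedule onto an NVIDIA GPU node
    layers: all              # offload all model layers to the GPU
  resources:
    limits:
      nvidia.com/gpu: 1      # single-GPU deployment (matches v0.2.0 scope)
EOF
```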
Status: v0.2.0 is production-ready for single-GPU deployments on standard K8s clusters; multi-GPU and multi-node model sharding are on the roadmap.
Apache 2.0 licensed. Would love feedback from anyone running LLMs in production!
Website: https://llmkube.com
Summary: The author introduces LLMKube, a Kubernetes operator for deploying GPU-accelerated LLMs in production, highlighting its features and benefits for regulated industries. (Snapshot generated from the HN discussion.)