
Show HN: LLMKube – Kubernetes for Local LLMs with GPU Acceleration

1 point
0 comments

Mood: supportive

Sentiment: positive

Category: tech

Key topics: Kubernetes, LLMs, GPU Acceleration

Hi HN! I built LLMKube, a Kubernetes operator for deploying GPU-accelerated LLMs in production. One command gets you from zero to inference with full observability.

Why this exists: Regulated industries (healthcare, defense, finance) need air-gapped LLM deployments, but existing tools are either single-node only (Ollama) or lack GPU optimization and SLO enforcement. LLMKube bridges the gap.

What's working:

- 17x speedup with NVIDIA GPUs (64 tok/s on Llama 3.2 3B vs 4.6 tok/s CPU)

- One command: llmkube deploy llama-3b --gpu (auto CUDA setup, scheduling, layer offloading)

- Production observability: Prometheus + Grafana + DCGM GPU metrics out of the box

- OpenAI-compatible API endpoints (see the example request after this list)

- Terraform configs for GKE GPU clusters with auto-scale to zero
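Because the endpoints are OpenAI-compatible, any standard client should work against a deployed model. Below is a minimal sketch using the official openai Python client; the base URL, port-forward command, and model name are illustrative assumptions, not LLMKube defaults.

    # pip install openai
    from openai import OpenAI

    # Point the stock OpenAI client at the LLMKube service,
    # e.g. after: kubectl port-forward svc/llama-3b 8080:80 (placeholder names).
    client = OpenAI(
        base_url="http://localhost:8080/v1",  # assumed local endpoint
        api_key="not-needed",                 # local deployments typically ignore the key
    )

    response = client.chat.completions.create(
        model="llama-3.2-3b",  # assumed model identifier
        messages=[{"role": "user", "content": "Explain what a Kubernetes operator does."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)

Swapping base_url is the only change relative to calling the hosted OpenAI API, which is what makes this compatibility useful for air-gapped deployments.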

Tech: Kubernetes CRDs, llama.cpp with CUDA, NVIDIA GPU Operator, and cost-optimized spot instances (~$50-150/mo for dev workloads).
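To make the CRD-based flow concrete, here is a hedged sketch of creating a model deployment as a custom resource with the Kubernetes Python client. The group, version, kind, plural, and spec fields are purely hypothetical placeholders, not LLMKube's actual schema; in practice the llmkube CLI creates the equivalent object for you.

    # pip install kubernetes
    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    api = client.CustomObjectsApi()

    # Hypothetical custom resource; all names below are placeholders, not LLMKube's real CRD.
    model = {
        "apiVersion": "llmkube.example.com/v1alpha1",
        "kind": "LLMDeployment",
        "metadata": {"name": "llama-3b"},
        "spec": {
            "model": "llama-3.2-3b",
            "gpu": True,       # request CUDA layer offloading
            "replicas": 1,
        },
    }

    api.create_namespaced_custom_object(
        group="llmkube.example.com",
        version="v1alpha1",
        namespace="default",
        plural="llmdeployments",
        body=model,
    )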

Status: v0.2.0 production-ready for single-GPU deployments on standard K8s clusters. Multi-GPU and multi-node model sharding on the roadmap.

Apache 2.0 licensed. Would love feedback from anyone running LLMs in production!

Website: https://llmkube.com

GitHub: https://github.com/Defilan/LLMKube

The author introduces LLMKube, a Kubernetes operator for deploying GPU-accelerated LLMs in production, highlighting its features and benefits for regulated industries.

Snapshot generated from the HN discussion


Discussion (0 comments)

Discussion hasn't started yet.

ID: 45968719 · Type: story · Last synced: 11/18/2025, 4:50:41 PM
