Show HN: Small hardware box that runs local LLMs and exposes an OpenAI API
Mood: calm
Sentiment: positive
Category: tech
Key topics: LLMs, AI Hardware, OpenAI API
Right now the box boots into a very simple web UI where you choose a model and start using it. The API follows the OpenAI format for chat completions and embeddings. It can run different models depending on the hardware you pick: either a Jetson Orin Nano or an x86 mini-PC with a GPU. It stores data locally, supports basic RAG indexing, and is only exposed on the LAN by default.
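For concreteness, here is a minimal sketch in Python of what "follows the OpenAI format" would mean for a client on the same LAN: it points the official OpenAI SDK at the box instead of api.openai.com. The address, port, and model names are placeholders I made up, not the box's actual defaults.

    # Minimal sketch: point the official OpenAI Python client at the box
    # instead of api.openai.com. Host, port, and model names below are
    # illustrative placeholders, not the box's real defaults.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://192.168.1.50:8080/v1",  # LAN address of the box (example)
        api_key="not-needed-locally",            # a local server may ignore the key
    )

    # Chat completion against whichever model is currently loaded
    chat = client.chat.completions.create(
        model="llama3",  # example model name
        messages=[{"role": "user", "content": "Summarize our on-call runbook."}],
    )
    print(chat.choices[0].message.content)

    # Embeddings through the same OpenAI-shaped endpoint
    emb = client.embeddings.create(
        model="nomic-embed-text",  # example embedding model
        input="local-first inference appliance",
    )
    print(len(emb.data[0].embedding))

Because the shapes match the OpenAI spec, existing SDKs and tools that accept a base URL override should work without code changes.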
A few things still aren’t working. There’s no multi-user rate limiting yet. The RAG quality is basic and I’m still improving chunking and reranking. The Orin runs hot under heavy load, so thermal performance needs work. It’s also still a prototype rather than a finished consumer product.
On the technical side, it runs containerized model servers using Ollama and some custom runners. Models load through GGUF or TensorRT-LLM depending on the hardware. The API layer follows the OpenAI spec. The RAG pipeline uses local embeddings and a vector database. The software stack is a mix of TypeScript and Python.
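To make the retrieve-then-generate flow concrete, the sketch below is my own rough approximation rather than the box's actual pipeline: it embeds a query through the local OpenAI-style embeddings endpoint, uses a naive in-memory cosine-similarity lookup standing in for the real vector database, and passes the best-matching chunk into a chat completion. Endpoint, model names, and documents are illustrative.

    # Rough sketch of the RAG flow: local embeddings + retrieval + chat.
    # The in-memory cosine search stands in for the real vector database;
    # endpoint, model names, and documents are placeholders.
    import numpy as np
    from openai import OpenAI

    client = OpenAI(base_url="http://192.168.1.50:8080/v1", api_key="local")

    docs = [
        "The Orin Nano variant has limited unified memory for larger models.",
        "RAG indexing runs locally; nothing leaves the LAN by default.",
    ]

    def embed(texts):
        resp = client.embeddings.create(model="nomic-embed-text", input=texts)
        return np.array([d.embedding for d in resp.data])

    doc_vecs = embed(docs)

    def top_chunk(question):
        # Cosine similarity between the query vector and each document vector
        q = embed([question])[0]
        scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        return docs[int(scores.argmax())]

    question = "Does my data stay on the local network?"
    context = top_chunk(question)
    answer = client.chat.completions.create(
        model="llama3",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    print(answer.choices[0].message.content)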
I’m looking for feedback from anyone who has built or deployed local inference before. I’m trying to understand what thermal and power issues you’ve run into, whether a drop-in OpenAI-compatible box is actually useful to small teams, what hardware setups I should consider, and any honest critiques of the idea.
The author is showcasing a small hardware box that runs local LLMs and exposes an OpenAI API, but there is no discussion or feedback from the community yet.