A Guide to Local Coding Models
Key topics
The debate around local coding models is heating up, with a recent guide sparking discussion on the cost-effectiveness of self-hosting versus relying on cloud services like Claude. Commenters share their experiences and tips for setting up local models, with some suggesting affordable hardware configurations, such as dual 3060 GPUs, to get decent performance. While some users are optimistic about the potential of local models, others point out that their effectiveness depends on the specific use case and the data they were trained on: tasks farther from the training data require more specificity in prompting to get good results. As the capabilities of local models continue to improve, the conversation highlights the trade-offs between cost, performance, and customization.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 33m after posting
- Peak period: 73 comments in 0-6h
- Avg / period: 17.8 comments
- Based on 160 loaded comments
Key moments
- 01 Story posted: Dec 21, 2025 at 3:55 PM EST (12 days ago)
- 02 First comment: Dec 21, 2025 at 4:28 PM EST (33m after posting)
- 03 Peak activity: 73 comments in 0-6h, the hottest window of the conversation
- 04 Latest activity: Dec 24, 2025 at 1:15 AM EST (10 days ago)
I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.
Cheap tier is dual 3060 12G. Runs 24B Q6 and 32B Q4 at 16 tok/sec. The limitation is VRAM for large context: 1000 lines of code is ~20k tokens, and a 32k-token context is ~10G of VRAM.
Expensive tier is dual 3090 or 4090 or 5090. You'd be able to run 32B Q8 with large context, or a 70B Q6.
For software, llama.cpp and llama-swap. GGUF models from HuggingFace. It just works.
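For a concrete picture of what that stack looks like, here is a minimal sketch of serving one of those GGUF models with llama.cpp's llama-server; the model filename is illustrative rather than something named in the comment, and llama-swap would sit in front of this to switch models on demand.

    # Serve a 32B Q4 GGUF with a 32k context, offloading all layers to the GPUs.
    # llama-server exposes an OpenAI-compatible API that coding tools can point at.
    llama-server -m qwen2.5-coder-32b-instruct-q4_k_m.gguf -c 32768 -ngl 99 --port 8080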
If you need more than that, you're into enterprise hardware with 4+ PCIe slots, which costs as much as a car and draws the power of a small country. You're better off just paying for Claude Code.
https://vast.ai/hosting#gpu-farms-homelabs
https://simonwillison.net/
Indeed, his self-hosting inspired me to get Qwen3:32B working locally with Ollama. It fits nicely on my M1 Pro 32GB (running Asahi). Output is a nice read-along speed and I haven't felt the need for anything more powerful.
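If you want to reproduce that setup, the Ollama side is essentially a one-liner; the model tag below matches the comment, but adjust it if your local catalog names it differently.

    # Pulls the model on first run, then drops into an interactive chat.
    # Ollama also serves an API on localhost:11434 that editors and agents can use.
    ollama run qwen3:32b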
I'd be more tempted by a maxed-out M2 Ultra as an upgrade, versus a tower with dedicated GPU cards. The unified memory just feels right for this task. Although I noticed the second-hand value of those machines jumped massively in the last few months.
I know that people turn their noses up at local LLMs, but it more than does the job for me. Plus I decided on a New Year's resolution of no more subscriptions / Big-AdTech freebies.
I've noticed that I need to be a lot more specific in those cases, up to the point where being more specific is slowing me down, partially because I don't always know what the right thing is.
Are people really doing that?
If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic. The OpenAI one in particular is a great deal, because Codex usage costs a whole lot less than Claude's.
The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
YMMV based on the kinds of side projects you do, but it's definitely been cheaper for me in the long run to pay by token, and the flexibility it offers is great.
(I also have the same MBP the author has and have used Aider with Qwen locally.)
Sonnet 4.5 is great for vibe coding. You can give it a relatively vague prompt and it will take the initiative to interpret it in a reasonable way. This is good for non-programmers who just want to give the model a vague idea and end up with a working, sensible product.
But I usually do not want that. I do not want the model to take liberties and be creative; I want the model to do precisely what I tell it and nothing more. In my experience, the GPT-5.x models are a better fit for that way of working.
I just can't accept how slow codex is, and that you can't really use it interactively because of that. I prefer to just watch Claude code work and stop it once I don't like the direction it's taking.
Codex models tend to be extremely good at following instructions, to the point that they won't do any additional work unless you ask for it. GPT-5.1 and GPT-5.2, on the other hand, are a little bit more creative.
Models from Anthropic are a lot more loosey-goosey with instructions, and you need to keep an eye on them much more often.
I'm using models from both providers interchangeably all the time, depending on the task at hand. No real preference for one over the other; they're just specialized for different things.
> The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
These are the same people, by and large. What I have seen is users who purely vibe code everything, run into the limits of the $20/mo plans, and pay up for the more expensive ones. Essentially they're trading learning to code (and time, in some cases; it's not always faster to vibe code than to do it yourself) for money.
I don't pay $100 to "vibe code" and "learn to program" or "avoid learning to program."
I pay $100 so I can get my projects done without having to hire people.
Restoring a bit of balance to things.
I review all of it, but hand write little of it. It's bizarre how I've ended up here, but yep.
That said, I wouldn't / don't trust it with something from scratch, I only trust it to do that because I built -- by hand -- a decent foundation for it to start from.
But I've not found that to be true at all. My actually-engineered processes, where I care the most, are where I push tokens the hardest, mostly because I'm using LLMs in many places in the SDLC.
When I'm vibing it's just a single agent sort of puttering along. It uses far fewer tokens.
I said "by and large" ie generally speaking. As I mentioned before, the exception does not invalidate the trend. I assume HN is more heavily weighted towards non-vibe-coders using up tokens like me and you but again, that's the exception to what I see online elsewhere.
Programming has always been about levels of abstraction, and the people who see LLM-generated code as “cheating” are the same people who argued that you can't write good code with a compiler. Luddites, who will time and time again be proven wrong by the passage of time.
If you're doing mostly smaller changes, you can go all day on the $20 Claude plan without hitting the limits, especially if you need to thoroughly review the AI's changes for correctness instead of relying on automated tests.
Claude Code is a whole lot less generous though.
I haven't tried agentic coding as I haven't set it up in a container yet, and I'm not going to YOLO my system (doing stuff via chat and a utility to copy and paste directories and files has gotten me pretty far over the last year and a half).
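The comment doesn't describe that utility, but a minimal sketch of the idea (a hypothetical script, not the commenter's actual tool) is just concatenating files with headers so the result can be pasted into a chat:

    #!/bin/sh
    # dump-files: print every file under the given paths with a header line,
    # producing one blob of text that can be pasted into a chat window.
    find "$@" -type f | while read -r f; do
        printf '===== %s =====\n' "$f"
        cat "$f"
        echo
    done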
If I wasn't only using it for side projects I'd have to cough up the $200 out of necessity.
https://geminicli.com/docs/faq/
> What is the privacy policy for using Gemini Code Assist or Gemini CLI if I’ve subscribed to Google AI Pro or Ultra?
> To learn more about your privacy policy and terms of service governed by your subscription, visit Gemini Code Assist: Terms of Service and Privacy Policies.
> https://developers.google.com/gemini-code-assist/resources/p...
The last page only links to generic Google policies. If they didn't train on it, they could've easily said so, which they've done in other cases - e.g. for Google Studio and CLI they clearly say "If you use a billed API key we don't train, else we train". Yet for the Pro and Ultra subscriptions they don't say anything.
If any Googlers read this, and you don't train on paying Pro/Ultra, you need to state this clearly somewhere as you've done with other products. Until then the assumption should be that you do train on it.
https://docs.github.com/en/copilot/reference/ai-models/model...
I originally thought they only supported the previous generation models i.e. Claude Opus 4.1 and Gemini 2.5 Pro based on the copy on their pricing page [1] but clicking through [2] shows that they support far more models.
[1] https://github.com/features/copilot#pricing
[2] https://github.com/features/copilot/plans#compare
Lately Copilot has been getting access to new frontier models the same day they release elsewhere. That wasn't the case months ago (GPT 5.1). But annoyingly you have to explicitly enable each new model.
And this is for hobby / portfolio projects.
Do you mean that users should start a new chat for every new task, to save tokens? Thanks.
That hasn't been true with Opus 4.5. I usually hit my limit after an hour of intense sessions.
1. Do you start off using the Claude Code CLI, then when you hit limits, you switch to the GitHub Copilot CLI to finish whatever it is you are working on?
2. Or, you spend most of your time inside VSCode so the model switching happens inside an IDE?
3. Or, you are more of a strict browser-only user, like antirez :)?
The $20 Anthropic plan is only enough to whet my appetite; I can't finish anything.
I pay for the $100 Anthropic plan, and keep a $20 Codex plan in my back pocket for getting additional review and analysis on top of what Opus cooks up.
And I have a few small $ of misc credits in DeepSeek and Kimi K2 AI services mainly to try them out, and for tasks that aren't as complicated, and for writing my own agent tools.
$20 Claude doesn't go very far.
My monthly spend on ai models is < $1
I'm not cheap, just ahead of the curve. With the collapse in inference costs, everything will get to this point eventually.
Also I've put in my 30 years of tech learning, so I might not need them as much as others. I'll basically pipe a command's output, or even a whole man page, into my tool and ask a question. Things I used to do intensively I now do lazily. My tool can read stdin, send it to an LLM, and do a couple of nice things with the reply. Not exactly RAG, but most man pages fit into the context window so it's okay.
If you aren't using coding models you aren't ahead of the curve.
There are free coding models. I use them heavily. They are ok but only partial substitutes for frontier models.
Alright.
Some people, with some tasks, get great results
But me, with my tasks, I need to maintain provenance and accountability over the code. I can't just have AI fly by the seat of its pants.
I can get into lots of detail on this. If you have seen tools and setups I have done you'd realize why it doesn't work for me.
I've spent money, the results for me, with my tasks, have not been the right decision.
Could you please elaborate on this? Do I get this right that you can set up your command line so that you can pipe something to a command that sends it, together with a question, to an LLM? Or did you just mean that metaphorically? Sorry if this is a stupid question.
Actually for many cases the LLM already knows enough. For more obscure cases, piping in a --help output is also sometimes enough.
Example: pipe a tool's --help output into a small ai command along with your question, where ai could be a simple shell script combining the argument with stdin.
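A minimal sketch of such a script, assuming the llm CLI (the same tool used in the .gitignore example below) is installed and configured; the usage line is illustrative:

    #!/bin/sh
    # ai: combine the question given as arguments with whatever arrives on stdin.
    # usage: tar --help | ai "how do I extract a single file from an archive?"
    llm "$*"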
> I'm not cheap
You're cheap. It's okay. We're all developers here. It's a safe space.
llm 'output a .gitignore file for typical python project that I can pipe into the actual file ' > .gitignore
I'm not convinced.
I'm convinced you don't value your time. As Simon said, throw $20-$100/mo at it, get the best state-of-the-art models with "near 0" setup, and move on.
0: https://tealdeer-rs.github.io/tealdeer/
Not a serious question but I thought it's an interesting way of looking at value.
I used to sell cars in SF. Some people wouldn't negotiate over $50 on a $500 a month lease because their apartment was $4k anyway.
Other people WOULD negotiate over $50 because their apartment was $4k.
On the other hand, Claude has been nothing but productive for me.
I’m also confused why you don’t assume people have the intelligence to only upgrade when needed. Isn’t that what we’re all doing? Why would you assume people would immediately sign up for the most expensive plan that they don’t need?
For most of my work I only need the LLM to perform a structured search of the codebase or to refactor something faster than I can type, so the $20/month plan is fine for me.
But for someone trying to get the LLM to write code for them, I could see the $20/month plans being exhausted very quickly. My experience with trying “vibecoding” style app development, even with highly detailed design documents and even providing test case expected output, has felt like lighting tokens on fire at a phenomenal rate. If I don’t interrupt every couple of commands and point out some mistake or wrong direction it can spin seemingly for hours trying to deal with one little problem after another. This is less obvious when doing something basic like a simple React app, but becomes extremely obvious once you deviate from material that’s represented a lot in training materials.
With Gemini/Antigravity, there’s the added benefit of switching to Claude Code Opus 4.5 and Google is waaaay more generous than Claude.
So having subscribed to all three at their lowest subscriptions (for $60/mo) I get the best of each one and never run out of quota. I’ve also got a couple of open-source model subscriptions but I’ve barely had the chance to use them since Codex and Gemini got so good (and generous).
The fact that OpenAI is only spending 30% of their revenue on servers and inference despite being so generous is just mind boggling to me. I think the good times are likely going to last.
This entire comment is confusing. Why are you buying the $200/month plan if you’re only using 10% of it?
I rotate providers. My comment above applies to all of them. It really depends on the work you’re doing and the codebase. There are tasks where I can get decent results and barely make the usage bar move. There are other tasks where I’ve seen the usage bar jump over 20% for the session before I get any usable responses back. It really depends.
This is why it’s confusing, though. Why start with the highest plan as the starting point when it’s so easy to upgrade?
I’m just a simple dude trying to optimize his life.
For context, this was a few months ago when GPT 5 was still new and I was constantly hitting o3 limits. It was an experiment to see if it could pay for itself. It most certainly can but I realized that I just don’t need it.
You should also queue up many "continue ur work" type messages.
Note: I’m using the $20 plan for this! With codex-5.2-medium most of the time (previously codex-5.1-max-medium). For my work projects, Gemini 3 and Antigravity Claude Opus 4.5 are doing the heavy lifting at the moment, which frees up codex :) I usually have it running constantly in a second tab.
The only way I can now justify Pro is if I am developing multiple parallel projects with codex alone. But that isn’t the case for me. I am happier having a mix of agents to work with.
I've been doing something like this with the basic Gemini subscription using Antigravity. I end up hitting the Gemini 3 Pro High quota many times but then I can still use Claude Opus 4.5 on it!
Ah, I missed this part. Yes, this is basically what I would recommend today as well. Buy a couple of different frontier model provider basic subscriptions. See which works better on what problems. For me, I use them all. For someone else it might be codex alone. Ymmv but totally worth exploring!
It's worth noting that the Claude subscription seems notably less generous than the others.
Also there are good free options for code review.
It could take longer, but save your subscription tokens.
leo dicaprio snapping gif
These kinds of articles should focus on the use case, because mileage may vary depending on the maturity of the idea, testing, and a host of other factors.
If the app, service, or whatever is unproven, that's a sunk cost on a MacBook versus 4 weeks to validate an idea, which is a pretty long time.
If the idea is sound, then run it on the MacBook :)
Incidentally, wondering if anyone has seen this approach of asking Claude to manage Codex:
https://www.reddit.com/r/codex/comments/1pbqt0v/using_codex_...
And when pressed on “this doesn't make sense, are you sure this works?” they ask the model to answer, it gets it wrong, and they leave it at that.
In my experience Cursor is nicer to work with than the OpenAI/Anthropic CLI tools.
When I consider it against my other hobbies, $100 is pretty reasonable for a month of supply. That being said, I wouldn’t do it every month. Just the months I need it.
Sure am. Capacity to finish personal projects has tripled for a mere $200/month. Would purchase again.
From what my team tells me, it's not a great deal since it's so far behind Claude in capabilities and IDE integration.
LM Studio can run both MLX and GGUF models but does so from an Ollama style (but more full-featured) macOS GUI. They also have a very actively maintained model catalog at https://lmstudio.ai/models
but people should use llama.cpp instead
and why should that affect usage? it's not like ollama users fork the repo before installing it.