DeepSeek-V3.1
Original: DeepSeek-v3.1
Key topics
Regulars are buzzing about DeepSeek-V3.1, a new AI model that's stirring up the leaderboard scene with its strong performance. Commenters are sharing their hands-on experiences, with some raving about its high-quality results and others comparing its capabilities to industry heavyweights like Gemini 2.5 Pro. A lively debate is unfolding around the validity of benchmarking tests, with some users questioning their objectivity and others pointing out that top companies often tailor custom agents to ace specific benchmarks. As users dig into DeepSeek-V3.1's strengths and weaknesses, a consensus is emerging that the model is a force to be reckoned with.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 55m after posting
Peak period: 74 comments in 0-12h
Avg / period: 22.9
Based on 160 loaded comments
Key moments
- Story posted: Aug 21, 2025 at 3:06 PM EDT (4 months ago)
- First comment: Aug 21, 2025 at 4:01 PM EDT (55m after posting)
- Peak activity: 74 comments in the 0-12h window, the hottest stretch of the conversation
- Latest activity: Aug 28, 2025 at 3:36 PM EDT (4 months ago)
https://www.tbench.ai/leaderboard
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any 3rd party hosted model when you test your benchmark, which means you can't have GPT5, claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
So no.
They're not secret.
Your api key is linked to your credit card, which is linked to your identity.
…but hey, you're right.
Let's just trust them not to be cheating. Cool.
There are plenty of other benchmarks that disagree with these. That said, from my experience most of these benchmarks are trash. Use the model yourself, apply your own set of problems, and see how well it fares.
I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by a human with rubrics). Would love for you to check them out and share your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
https://eval.16x.engineer/evals/coding
or this:
```
<|tool▁calls▁begin|><|tool▁call▁begin|>execute_shell<|tool▁sep|>{"command": "pwd && ls -la"}<|tool▁call▁end|><|tool▁calls▁end|>
```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT-5, and GLM 4.5 don't do that. To accommodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
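For what it's worth, tolerating that markup in a small agent is mostly a parsing problem. A rough Python sketch follows; the delimiter strings are copied from the quoted output above (treat the exact characters as an assumption and check the model's chat template), and `extract_tool_calls` is an illustrative helper, not part of any SDK.
```
import json
import re

# Matches: <|tool▁call▁begin|>NAME<|tool▁sep|>{...json args...}<|tool▁call▁end|>
DEEPSEEK_CALL = re.compile(
    r"<\|tool▁call▁begin\|>(?P<name>.+?)<\|tool▁sep\|>(?P<args>\{.*?\})<\|tool▁call▁end\|>",
    re.DOTALL,
)

def extract_tool_calls(text: str) -> list[dict]:
    """Return [{'name': ..., 'arguments': {...}}] for each embedded call."""
    return [
        {"name": m.group("name").strip(), "arguments": json.loads(m.group("args"))}
        for m in DEEPSEEK_CALL.finditer(text)
    ]

print(extract_tool_calls(
    '<|tool▁calls▁begin|><|tool▁call▁begin|>execute_shell<|tool▁sep|>'
    '{"command": "pwd && ls -la"}<|tool▁call▁end|><|tool▁calls▁end|>'
))
# -> [{'name': 'execute_shell', 'arguments': {'command': 'pwd && ls -la'}}]
```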
One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. Wasn't until June Gemini could do diff-string-in-JSON better than 30% of the time and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Geminis was a clown show, you had to post process regular text completions to parse out any diffs)
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; it's guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
There is a substantial performance cost to nuking logits; the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). Because the cost is so high, the default in the API* is to not artificially lower other logits, and to only do that if the first inference attempt yields a token invalid in the compiled grammar.
Similarly, I was hoping to be on target w/r/t what strict mode is in an API, and am sort of describing the "outer loop" of sampling.
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
For OpenAI, you can just pass in the json_schema to activate it, no library needed. For direct LLM interfacing you will need to host your own LLM or use a cloud provider that allows you to hook in, but someone else may need to correct me on this.
If anyone is using anything other than Outlines, please let us know.
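For reference, the no-library OpenAI route mentioned above looks roughly like this; the model name and schema are just placeholders.
```
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string"},
        "score": {"type": "integer"},
    },
    "required": ["verdict", "score"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Rate this diff from 1-10."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "review", "schema": schema, "strict": True},
    },
)
print(resp.choices[0].message.content)  # output conforms to the schema
```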
Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
And then connect to llama.cpp via browser at localhost:8080 for the WebUI (it's basic but does the job; screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has an OpenAI-compatible API.
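Because of that OpenAI-compatible API, the stock openai client can point at the local server. A minimal sketch (the model name is just a label; a single-model llama-server typically doesn't enforce it):
```
from openai import OpenAI

# Only base_url (and a dummy key) change compared to talking to OpenAI.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-14B-Q6_K",  # label only for a single-model server
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```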
There's also ROCm. That's not working for me in LM Studio at the moment. I used that early last year to get some LLMs and stable diffusion running. As far as I can tell, it was faster before, but Vulkan implementations have caught up or something - so much so that the mucking about often isn't worth it. I believe ROCm is hit or miss for a lot of people, especially on Windows.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
llama.cpp probably is too, but I haven't tried it with a bigger model yet.
While the memory bandwidth is decent, you do actually need to do matmuls and other compute operations for LLMs, which, again, it's pretty slow at.
[0]: https://old.reddit.com/r/LocalLLaMA/comments/1dcdit2/p40_ben...
[1]: https://developer.nvidia.com/cuda-gpus
I'm not a programmer, though - engineering manager.
# -ngl 99 offloads all layers to the GPU; -ot ".ffn_.*_exps.=CPU" keeps the MoE expert tensors in system RAM
./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"
More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1
There is a way to convert to Q8_0, BF16, F16 without compiling llama.cpp, and it's enabled if you use `FastModel` and not `FastLanguageModel`.
Essentially I try `sudo apt-get`; if that fails, then `apt-get`; and if all fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`.
See https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
I was thinking about whether I can do it during the pip install or via setup.py, which would do the apt-get instead.
As a fallback, I'll probably for now remove shell executions and just warn the user
The current solution hopefully is in between - ie sudo is gone, apt-get will run only after the user agrees by pressing enter, and if it fails, it'll tell the user to read docs on installing llama.cpp
Usually you don't make assumptions on the host OS, just try to find the things you need and if not, fail, ideally with good feedback. If you want to provide the "hack", you can still do it, but ideally behind a flag, `allow_installation` or something like that. This is, if you want your code to reach broader audiences.
Some people may prefer using whatever llama.cpp in $PATH, it's okay to support that, though I'd say doing so may lead to more confused noob users spam - they may just have an outdated version lurking in $PATH.
Doing so makes the unsloth wheel platform-dependent. If this is too much of a burden, then maybe you can just package the llama.cpp binary and have it on PyPI, like how the scipy folks maintain https://pypi.org/project/cmake/ on PyPI (yes, you can `pip install cmake`), and then depend on it (maybe in an optional group, I see you already have a lot due to cuda shit).
I'm still working on it, but sadly I'm not a packaging person so progress has been nearly zero :(
From how I interpreted it, he meant you could create a new python package, this would effectively be the binary you need.
In your current package, you could depend on the new one, and through that - pull in the binary.
This would let you easily decouple your package from the binary, too - so it'd be easy to update the binary to the latest version without pushing a new version of your original package.
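If anyone wants to try that split, a bare-bones sketch of such a carrier package could look like the following; the name `llama-cpp-bin` and the layout are hypothetical, and the wheel would need to be built once per platform (e.g. `python setup.py bdist_wheel --plat-name manylinux2014_x86_64`).
```
# setup.py for a hypothetical "llama-cpp-bin" carrier package: the wheel's only
# job is to ship a prebuilt llama.cpp binary and tell callers where it lives.
from setuptools import setup

setup(
    name="llama-cpp-bin",      # hypothetical name, not an existing PyPI project
    version="0.1.0",
    packages=["llama_cpp_bin"],
    # prebuilt binaries are copied into llama_cpp_bin/bin/ before building
    package_data={"llama_cpp_bin": ["bin/llama-cli", "bin/llama-server"]},
    include_package_data=True,
)
```
The package itself could then expose a tiny helper (e.g. resolving the binary's path via importlib.resources) that the main package calls through subprocess.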
I've maintained release pipelines before and handled packaging in a previous job, but I'm not particularly into the python ecosystem, so take this with a grain of salt: an approach would be
Pip packages:
I was trying to see if I could pre-compile some llama.cpp binaries then save them as a zip file (I'm a noob sorry) - but I definitely need to investigate further on how to do python pip binaries
- Try to find a prebuilt binary and download it.
- See if you can compile from source if a compiler is installed.
- If no compiler: prompt to install via sudo apt, explaining why, and also give the option to abort so the user can install a compiler themselves.
This isn't perfect, but limits the cases where prompting is necessary.
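A rough Python sketch of that ordering, with made-up helper names (none of this is unsloth's actual code):
```
import shutil
import sys
from typing import Optional

REQUIRED_TOOLS = ("cmake", "gcc", "g++", "curl")

def find_prebuilt() -> Optional[str]:
    """Prefer a llama-cli that is already on PATH."""
    return shutil.which("llama-cli")

def can_compile() -> bool:
    """Only attempt a source build if the whole toolchain is present."""
    return all(shutil.which(tool) for tool in REQUIRED_TOOLS)

def prepare_llama() -> str:
    binary = find_prebuilt()
    if binary:
        return binary
    if can_compile():
        # ... run the cmake configure/build steps from the docs here ...
        return "./llama.cpp/build/bin/llama-cli"
    sys.exit(
        "No C/C++ toolchain found. Please run:\n"
        "  sudo apt-get install build-essential cmake curl libcurl4-openssl-dev\n"
        "(or your distro's equivalent), then re-run."
    )
```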
But I do agree that maybe, for better security, PyPI should check for commands and warn.
(1) Removed and disabled sudo
(2) Installing via apt-get will ask for the user's permission via input()
(3) Added an error if llama.cpp fails, with instructions to manually compile llama.cpp
Again apologies on my dumbness and thanks for pointing it out!
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:
Is either sort of change potentially agreeable enough that you'd be happy to review it?
1. So I added a `check_llama_cpp` which checks whether llama.cpp already exists and, if so, uses the prebuilt one: https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
2. Yes I like the idea of determining distro
3. Agreed on bailing - I was also thinking of doing a Python input() with a 30-second waiting period for apt-get, if that's ok? We tell the user we will apt-get some packages (only if apt exists) (no sudo), and after 30 seconds, it'll just error out (see the sketch below).
4. I will remove sudo immediately (ie now), and temporarily just do (3)
But more than happy to fix this asap - again sorry on me being dumb
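A sketch of the 30-second consent prompt from point (3) above, assuming a POSIX terminal (select() on stdin does not behave this way on Windows); the package list is the one quoted earlier in the thread:
```
import select
import subprocess
import sys

PACKAGES = ["build-essential", "cmake", "curl", "libcurl4-openssl-dev"]

def ask_to_install(timeout: int = 30) -> bool:
    sys.stdout.write(
        f"About to run: apt-get install -y {' '.join(PACKAGES)}\n"
        f"Press Enter to continue, Ctrl-C to abort ({timeout}s timeout)... "
    )
    sys.stdout.flush()
    ready, _, _ = select.select([sys.stdin], [], [], timeout)
    if not ready:
        print("\nNo response; skipping install. See the docs for building llama.cpp manually.")
        return False
    sys.stdin.readline()
    return True

if __name__ == "__main__":
    if ask_to_install():
        # No sudo: if the user lacks the rights, apt-get itself fails with a clear error.
        subprocess.run(["apt-get", "install", "-y", *PACKAGES], check=False)
```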
- Determine the command that has to be run by the algorithm above.
This does most of the work a user would otherwise have to do to figure out what has to be installed on their system.
- Ask whether to run the command automatically.
This allows the “software should never install dependencies by itself” crowd to say no and figure out further steps, while allowing people who just want it to work to get on with their task as quickly as possible (who do you think there are more of?).
I think it would be fine to print out the command and force the user to run it themselves, but it would bring little material gain at the cost of some of your users’ peace (“oh no it failed, what is it this time ...”).
(1) Removed and disabled sudo
(2) Installing via apt-get will ask for the user's permission via input()
(3) Added an error if llama.cpp fails, with instructions to manually compile llama.cpp
I would just ask the user to install the package, and _maybe_ show the command line to install it (but never run it).
That said, it does at least seem like these recent changes are a large step in the right direction.
---
* in terms of what the standard approach should be, we live in an imperfect world and package management has been done "wrong" in many ecosystems, but in an ideal world I think the "correct" solution here should be:
(1) If it's an end user tool it should be a self contained binary or it should be a system package installed via the package manager (which will manage any ancillary dependencies for you)
(2) If it's a dev tool (which, if you're cloning a cpp repo & building binaries, it is), it should not touch anything systemwide. Whatsoever.
This often results in a README with manual instructions to install deps, but there are many good automated ways to approach this. E.g. for CPP this is a solved problem with Conan Profiles. However that might incur significant maintenance overhead for the Unsloth guys if it's not something the ggml guys support. A dockerised build is another potential option here, though that would still require the user to have some kind of container engine installed, so still not 100% ideal.
(2) I might make the message on installing llama.cpp more informative - ie instead of redirecting people to the docs on manual compilation (https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-...), I might actually print out a longer message in the Python cell entirely.
Yes we're working on Docker! https://hub.docker.com/r/unsloth/unsloth
That will be nice too, though I was more just referring to simply doing something along the lines of this in your current build:
(likely mounting & calling a sh file instead of passing individual commands)
---
Although I do think getting the ggml guys to support Conan (or monkey patching your own llama conanfile in before building) might be an easier route.
Quietly installing stuff at runtime is shady for sure, but why not if I consent?
But I'm working on more cross platform docs as well!
I'm working with the AMD folks to make the process easier, but it looks like first I have to move off from pyproject.toml to setup.py (allows building binaries)
Please, please, never silently attempt to mutate the state of my machine, that is not a good practice at all and will break things more often than it will help because you don't know how the machine is set up in the first place.
But yes agreed there won't be any more random package installs sorry!
NixOS is such a great way to expose code doing things it shouldn't be doing.
But when it was failing on my original idea, it kept trying dumb things that weren't really even Nix after a while.
llama.cpp also unfortunately cannot quantize matrices whose dimensions are not a multiple of 256 (e.g. 2880).
The 165GB quant will need a 24GB GPU + 141GB of RAM for reasonably fast inference, or a Mac.
Was that document almost exclusively written with LLMs? I looked at it last night (~8 hours ago) and it was riddled with mistakes, most egregious was that the "Run with Ollama" section had instructions for how to install Ollama, but then the shell commands were actually running llama.cpp, a mistake probably no human would make.
Do you have any plans on disclosing how much of these docs are written by humans vs not?
Regardless, thanks for the continued release of quants and weights :)
But in the docs I see things like
Wouldn't this explain that? (Didn't look too deep)
```
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
```
but then Ollama is above it:
```
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
    merged_file.gguf
```
I'll edit the area to say you first have to install llama.cpp
```
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
    merged_file.gguf
```
Ollama only accepts merged GGUFs (not split ones), hence the command.
All docs are made by humans (primarily my brother and me), just sometimes there might be some typos (sorry in advance)
I'm also uploading Ollama-compatible versions directly so `ollama run` can work (it'll take a few more hours).
You guys already do a lot for the local LLM community and I appreciate it.
103 more comments available on Hacker News