DeepSeek-V3.1
Original: DeepSeek-v3.1
Key topics
Regulars are buzzing about DeepSeek-V3.1, a new AI model that's stirring up the leaderboard scene with its strong performance. Commenters are sharing their hands-on experiences, with some raving about its high-quality results and others comparing its capabilities to industry heavyweights like Gemini 2.5 Pro. A lively debate is unfolding around the validity of benchmarking tests, with some users questioning their objectivity and others pointing out that top companies often tailor custom agents to ace specific benchmarks. As users dig into DeepSeek-V3.1's strengths and weaknesses, a consensus is emerging that the model is a force to be reckoned with.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 55m after posting
Peak period: 74 comments in 0-12h
Avg / period: 22.9
Based on 160 loaded comments
Key moments
- Story posted: Aug 21, 2025 at 3:06 PM EDT (4 months ago)
- First comment: Aug 21, 2025 at 4:01 PM EDT (55m after posting)
- Peak activity: 74 comments in the 0-12h window, the hottest stretch of the conversation
- Latest activity: Aug 28, 2025 at 3:36 PM EDT (4 months ago)
https://www.tbench.ai/leaderboard
Looks like it doesn't get close to GPT-5, Claude 4, or GLM-4.5, but still does reasonably well compared to other open weight models. Benchmarks are rarely the full story though, so time will tell how good it is in practice.
We made objective systems turn out subjective answers… why the shit would anyone think objective tests would be able to grade them?
"We totally promise that when we run your benchmark against our API we won't take the data from it and use to be better at your benchmark next time"
:P
If you want to do it properly you have to avoid any 3rd party hosted model when you test your benchmark, which means you can't have GPT5, claude, etc. on it; and none of the benchmarks want to be 'that guy' who doesn't have all the best models on it.
So no.
They're not secret.
Your api key is linked to your credit card, which is linked to your identity.
…but hey, you're right.
Let's just trust them not to be cheating. Cool.
There are plenty of other benchmarks that disagree with these. That said, from my experience most of these benchmarks are trash. Use the model yourself, apply your own set of problems, and see how well it fares.
I don't consider myself super special. I think it should be doable to create a benchmark that beats me having to test every single new model.
I also publish my own evals on new models (using coding tasks that I curated myself, without tools, rated by a human with rubrics). Would love for you to check them out and share your thoughts:
Example recent one on GPT-5:
https://eval.16x.engineer/blog/gpt-5-coding-evaluation-under...
All results:
https://eval.16x.engineer/evals/coding
or this:
```
<|tool▁calls▁begin|><|tool▁call▁begin|>execute_shell<|tool▁sep|>{"command": "pwd && ls -la"}<|tool▁call▁end|><|tool▁calls▁end|>
```
Prompting it to use the right format doesn't seem to work. Claude, Gemini, GPT-5, and GLM 4.5 don't do that. To accommodate DeepSeek, the tiny agent that I'm building will have to support all the weird formats.
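For what it's worth, tolerating that markup in a small agent is mostly a parsing problem. A rough Python sketch follows; the delimiter strings are copied from the quoted output above (treat the exact characters as an assumption and check the model's chat template), and `extract_tool_calls` is an illustrative helper, not part of any SDK.
```
import json
import re

# Matches: <|tool▁call▁begin|>NAME<|tool▁sep|>{...json args...}<|tool▁call▁end|>
DEEPSEEK_CALL = re.compile(
    r"<\|tool▁call▁begin\|>(?P<name>.+?)<\|tool▁sep\|>(?P<args>\{.*?\})<\|tool▁call▁end\|>",
    re.DOTALL,
)

def extract_tool_calls(text: str) -> list[dict]:
    """Return [{'name': ..., 'arguments': {...}}] for each embedded call."""
    return [
        {"name": m.group("name").strip(), "arguments": json.loads(m.group("args"))}
        for m in DEEPSEEK_CALL.finditer(text)
    ]

print(extract_tool_calls(
    '<|tool▁calls▁begin|><|tool▁call▁begin|>execute_shell<|tool▁sep|>'
    '{"command": "pwd && ls -la"}<|tool▁call▁end|><|tool▁calls▁end|>'
))
# -> [{'name': 'execute_shell', 'arguments': {'command': 'pwd && ls -la'}}]
```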
One of the funnier info scandals of 2025 has been that only Claude was even close to properly trained on JSON file edits until o3 was released, and even then it needed a bespoke format. Geminis have required using a non-formalized diff format by Aider. Wasn't until June Gemini could do diff-string-in-JSON better than 30% of the time and until GPT-5 that an OpenAI model could. (Though v4a, as OpenAI's bespoke edit format is called, is fine because it at least worked well in tool calls. Geminis was a clown show, you had to post process regular text completions to parse out any diffs)
I thought the APIs in use generally interface with backend systems supporting logit manipulation, so there is no need to reject and reinference anything; it's guaranteed right the first time because any token that would be invalid has a 0% chance of being produced.
I guess for the closed commercial systems that's speculative, but all the discussion of the internals of the open source systems I’ve seen has indicated that and I don't know why the closed systems would be less sophisticated.
There is a substantial performance cost to nuking logits; the open source internals discussion may have glossed over that for clarity (see github.com/llama.cpp/... below). Because the cost is so high, the default in the API* is to not artificially lower other logits, and to only do that if the first inference attempt yields a token invalid in the compiled grammar.
Similarly, I was hoping to be on target w/r/t what strict mode is in an API, and am sort of describing the "outer loop" of sampling.
* blissfully, you do not have to implement it manually anymore - it is a parameter in the sampling params member of the inference params
* "the grammar constraints applied on the full vocabulary can be very taxing. To improve performance, the grammar can be applied only to the sampled token..and nd only if the token doesn't fit the grammar, the grammar constraints are applied to the full vocabulary and the token is resampled." https://github.com/ggml-org/llama.cpp/blob/54a241f505d515d62...
For OpenAI, you can just pass in the json_schema to activate it, no library needed. For direct LLM interfacing you will need to host your own LLM or use a cloud provider that allows you to hook in, but someone else may need to correct me on this.
If anyone is using anything other than Outlines, please let us know.
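For reference, the no-library OpenAI route mentioned above looks roughly like this; the model name and schema are just placeholders.
```
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

schema = {
    "type": "object",
    "properties": {
        "verdict": {"type": "string"},
        "score": {"type": "integer"},
    },
    "required": ["verdict", "score"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Rate this diff from 1-10."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "review", "schema": schema, "strict": True},
    },
)
print(resp.choices[0].message.content)  # output conforms to the schema
```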
Pricing: https://openrouter.ai/deepseek/deepseek-chat-v3.1
I'm considering buying a Linux workstation lately and I want it full AMD. But if I can just plug an NVIDIA card via an eGPU card for self-hosting LLMs then that would be amazing.
llama-server.exe -ngl 99 -m Qwen3-14B-Q6_K.gguf
And then connect to llama.cpp via browser at localhost:8080 for the WebUI (it's basic but does the job; screenshots can be found on Google). You can connect more advanced interfaces to it because llama.cpp actually has an OpenAI-compatible API.
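Because of that OpenAI-compatible API, the stock openai client can point at the local server. A minimal sketch (the model name is just a label; a single-model llama-server typically doesn't enforce it):
```
from openai import OpenAI

# Only base_url (and a dummy key) change compared to talking to OpenAI.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-14B-Q6_K",  # label only for a single-model server
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```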
There's also ROCm. That's not working for me in LM Studio at the moment. I used that early last year to get some LLMs and stable diffusion running. As far as I can tell, it was faster before, but Vulkan implementations have caught up or something - so much so that the mucking about often isn't worth it. I believe ROCm is hit or miss for a lot of people, especially on Windows.
Though I have to ask: why two eGPUs? Is the LLM software smart enough to be able to use any combination of GPUs you point it at?
llama.cpp probably is too, but I haven't tried it with a bigger model yet.
While the memory bandwidth is decent, you do actually need to do matmuls and other compute operations for LLMs, which, again, it's pretty slow at.
[0]: https://old.reddit.com/r/LocalLLaMA/comments/1dcdit2/p40_ben...
[1]: https://developer.nvidia.com/cuda-gpus
I'm not a programmer, though - engineering manager.
# -ngl 99 offloads all layers to the GPU; -ot ".ffn_.*_exps.=CPU" keeps the MoE expert tensors in system RAM
./llama.cpp/llama-cli -hf unsloth/DeepSeek-V3.1-GGUF:UD-Q2_K_XL -ngl 99 --jinja -ot ".ffn_.*_exps.=CPU"
More details on running + optimal params here: https://docs.unsloth.ai/basics/deepseek-v3.1
There is a way to convert to Q8_0, BF16, F16 without compiling llama.cpp, and it's enabled if you use `FastModel` and not `FastLanguageModel`.
Essentially I try `sudo apt-get`; if that fails, then `apt-get`; and if all fails, it just fails. We need `build-essential cmake curl libcurl4-openssl-dev`.
See https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
You just fail and print a nice error message telling the user exactly what they need to do, including the exact apt command or whatever that they need to run.
I was thinking about whether I can do it during the pip install or via setup.py, which would do the apt-get instead.
As a fallback, I'll probably for now remove shell executions and just warn the user
The current solution hopefully is in between - ie sudo is gone, apt-get will run only after the user agrees by pressing enter, and if it fails, it'll tell the user to read docs on installing llama.cpp
Usually you don't make assumptions on the host OS, just try to find the things you need and if not, fail, ideally with good feedback. If you want to provide the "hack", you can still do it, but ideally behind a flag, `allow_installation` or something like that. This is, if you want your code to reach broader audiences.
Some people may prefer using whatever llama.cpp in $PATH, it's okay to support that, though I'd say doing so may lead to more confused noob users spam - they may just have an outdated version lurking in $PATH.
Doing so makes the unsloth wheel platform-dependent. If this is too much of a burden, then maybe you can just package the llama.cpp binary and have it on PyPI, like how the scipy folks maintain https://pypi.org/project/cmake/ on PyPI (yes, you can `pip install cmake`), and then depend on it (maybe in an optional group, I see you already have a lot due to cuda shit).
I'm still working on it, but sadly I'm not a packaging person so progress has been nearly zero :(
From how I interpreted it, he meant you could create a new python package, this would effectively be the binary you need.
In your current package, you could depend on the new one, and through that - pull in the binary.
This would let you easily decouple your package from the binary, too - so it'd be easy to update the binary to the latest version without pushing a new version of your original package.
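If anyone wants to try that split, a bare-bones sketch of such a carrier package could look like the following; the name `llama-cpp-bin` and the layout are hypothetical, and the wheel would need to be built once per platform (e.g. `python setup.py bdist_wheel --plat-name manylinux2014_x86_64`).
```
# setup.py for a hypothetical "llama-cpp-bin" carrier package: the wheel's only
# job is to ship a prebuilt llama.cpp binary and tell callers where it lives.
from setuptools import setup

setup(
    name="llama-cpp-bin",      # hypothetical name, not an existing PyPI project
    version="0.1.0",
    packages=["llama_cpp_bin"],
    # prebuilt binaries are copied into llama_cpp_bin/bin/ before building
    package_data={"llama_cpp_bin": ["bin/llama-cli", "bin/llama-server"]},
    include_package_data=True,
)
```
The package itself could then expose a tiny helper (e.g. resolving the binary's path via importlib.resources) that the main package calls through subprocess.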
I've maintained release pipelines before and handled packaging in a previous job, but I'm not particularly into the python ecosystem, so take this with a grain of salt: an approach would be
Pip packages:
I was trying to see if I could pre-compile some llama.cpp binaries then save them as a zip file (I'm a noob sorry) - but I definitely need to investigate further on how to do python pip binaries
- Try to find a prebuilt binary and download it.
- See if you can compile from source if a compiler is installed.
- If no compiler: prompt to install via sudo apt, explaining why, and also give the option to abort so the user can install a compiler themselves.
This isn't perfect, but limits the cases where prompting is necessary.
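A rough Python sketch of that ordering, with made-up helper names (none of this is unsloth's actual code):
```
import shutil
import sys
from typing import Optional

REQUIRED_TOOLS = ("cmake", "gcc", "g++", "curl")

def find_prebuilt() -> Optional[str]:
    """Prefer a llama-cli that is already on PATH."""
    return shutil.which("llama-cli")

def can_compile() -> bool:
    """Only attempt a source build if the whole toolchain is present."""
    return all(shutil.which(tool) for tool in REQUIRED_TOOLS)

def prepare_llama() -> str:
    binary = find_prebuilt()
    if binary:
        return binary
    if can_compile():
        # ... run the cmake configure/build steps from the docs here ...
        return "./llama.cpp/build/bin/llama-cli"
    sys.exit(
        "No C/C++ toolchain found. Please run:\n"
        "  sudo apt-get install build-essential cmake curl libcurl4-openssl-dev\n"
        "(or your distro's equivalent), then re-run."
    )
```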
But I do agree that maybe, for better security, PyPI should check for commands and warn.
(1) Removed and disabled sudo
(2) Installing via apt-get will ask for the user's permission via input()
(3) Added an error if llama.cpp fails, with instructions to manually compile llama.cpp
Again apologies on my dumbness and thanks for pointing it out!
Imo it's best to just depend on the required fork of llama.cpp at build time (or not) according to some configuration. Installing things at runtime is nuts (especially if it means modifying the existing install path). But if you don't want to do that, I think this would also be an improvement:
Is either sort of change potentially agreeable enough that you'd be happy to review it?
1. So I added a `check_llama_cpp` which checks whether llama.cpp already exists and, if so, uses the prebuilt one: https://github.com/unslothai/unsloth-zoo/blob/main/unsloth_z...
2. Yes I like the idea of determining distro
3. Agreed on bailing - I was also thinking of doing a Python input() with a 30-second waiting period for apt-get, if that's ok? We tell the user we will apt-get some packages (only if apt exists) (no sudo), and after 30 seconds, it'll just error out (see the sketch below).
4. I will remove sudo immediately (ie now), and temporarily just do (3)
But more than happy to fix this asap - again sorry on me being dumb
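A sketch of the 30-second consent prompt from point (3) above, assuming a POSIX terminal (select() on stdin does not behave this way on Windows); the package list is the one quoted earlier in the thread:
```
import select
import subprocess
import sys

PACKAGES = ["build-essential", "cmake", "curl", "libcurl4-openssl-dev"]

def ask_to_install(timeout: int = 30) -> bool:
    sys.stdout.write(
        f"About to run: apt-get install -y {' '.join(PACKAGES)}\n"
        f"Press Enter to continue, Ctrl-C to abort ({timeout}s timeout)... "
    )
    sys.stdout.flush()
    ready, _, _ = select.select([sys.stdin], [], [], timeout)
    if not ready:
        print("\nNo response; skipping install. See the docs for building llama.cpp manually.")
        return False
    sys.stdin.readline()
    return True

if __name__ == "__main__":
    if ask_to_install():
        # No sudo: if the user lacks the rights, apt-get itself fails with a clear error.
        subprocess.run(["apt-get", "install", "-y", *PACKAGES], check=False)
```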
- Determine the command that has to be run by the algorithm above.
This does most of the work a user would otherwise have to do to figure out what has to be installed on their system.
- Ask whether to run the command automatically.
This allows the “software should never install dependencies by itself” crowd to say no and figure out further steps, while allowing people who just want it to work to get on with their task as quickly as possible (who do you think there are more of?).
I think it would be fine to print out the command and force the user to run it themselves, but it would bring little material gain at the cost of some of your users’ peace (“oh no it failed, what is it this time ...”).
(1) Removed and disabled sudo
(2) Installing via apt-get will ask for the user's permission via input()
(3) Added an error if llama.cpp fails, with instructions to manually compile llama.cpp
I would just ask the user to install the package, and _maybe_ show the command line to install it (but never run it).
That said, it does at least seem like these recent changes are a large step in the right direction.
---
* in terms of what the standard approach should be, we live in an imperfect world and package management has been done "wrong" in many ecosystems, but in an ideal world I think the "correct" solution here should be:
(1) If it's an end user tool it should be a self contained binary or it should be a system package installed via the package manager (which will manage any ancillary dependencies for you)
(2) If it's a dev tool (which, if you're cloning a cpp repo & building binaries, it is), it should not touch anything systemwide. Whatsoever.
This often results in a README with manual instructions to install deps, but there are many good automated ways to approach this. E.g. for CPP this is a solved problem with Conan Profiles. However that might incur significant maintenance overhead for the Unsloth guys if it's not something the ggml guys support. A dockerised build is another potential option here, though that would still require the user to have some kind of container engine installed, so still not 100% ideal.
(2) I might make the message on installing llama.cpp more informative - ie instead of redirecting people to the docs on manual compilation (https://docs.unsloth.ai/basics/troubleshooting-and-faqs#how-...), I might actually print out a longer message in the Python cell entirely.
Yes we're working on Docker! https://hub.docker.com/r/unsloth/unsloth
That will be nice too, though I was more just referring to simply doing something along the lines of this in your current build:
(likely mounting & calling a sh file instead of passing individual commands)
---
Although I do think getting the ggml guys to support Conan (or monkey patching your own llama conanfile in before building) might be an easier route.
Quietly installing stuff at runtime is shady for sure, but why not if I consent?
But I'm working on more cross platform docs as well!
I'm working with the AMD folks to make the process easier, but it looks like first I have to move off from pyproject.toml to setup.py (allows building binaries)
Please, please, never silently attempt to mutate the state of my machine, that is not a good practice at all and will break things more often than it will help because you don't know how the machine is set up in the first place.
But yes agreed there won't be any more random package installs sorry!
NixOS is such a great way to expose code doing things it shouldn't be doing.
But when it was failing on my original idea, it kept trying dumb things that weren't really even Nix after a while.
llama.cpp also unfortunately cannot quantize matrices whose dimensions are not a multiple of 256 (e.g. 2880).
The 165GB quant will need a 24GB GPU + 141GB of RAM for reasonably fast inference, or a Mac.
Was that document almost exclusively written with LLMs? I looked at it last night (~8 hours ago) and it was riddled with mistakes, most egregious was that the "Run with Ollama" section had instructions for how to install Ollama, but then the shell commands were actually running llama.cpp, a mistake probably no human would make.
Do you have any plans on disclosing how much of these docs are written by humans vs not?
Regardless, thanks for the continued release of quants and weights :)
But in the docs I see things like
Wouldn't this explain that? (Didn't look too deep)
```
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggerganov/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
```
but then Ollama is above it:
```
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
    merged_file.gguf
```
I'll edit the area to say you first have to install llama.cpp
```
./llama.cpp/llama-gguf-split --merge \
    DeepSeek-V3.1-GGUF/DeepSeek-V3.1-UD-Q2_K_XL/DeepSeek-V3.1-UD-Q2_K_XL-00001-of-00006.gguf \
    merged_file.gguf
```
Ollama only accepts merged GGUFs (not split ones), hence the command.
All docs are made by humans (primarily my brother and me), just sometimes there might be some typos (sorry in advance)
I'm also uploading Ollama-compatible versions directly so `ollama run` can work (it'll take a few more hours).
You guys already do a lot for the local LLM community and I appreciate it.
103 more comments available on Hacker News