Qwen3-VL
Posted 4 months ago · Active 3 months ago
Source: qwen.ai · Tech story · High profile
Sentiment: excited, positive
Debate intensity: 20/100
Key topics
Artificial Intelligence
Multimodal Models
Open Source
The Qwen team released Qwen3-VL, a state-of-the-art multimodal model with impressive performance, sparking enthusiasm and discussion among the HN community about its capabilities and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: N/A
Peak period: 53 comments (12-18h)
Avg / period: 17.8 comments
Comment distribution: 160 data points (based on 160 loaded comments)
Key moments
01. Story posted: Sep 23, 2025 at 4:59 PM EDT (4 months ago)
02. First comment: Sep 23, 2025 at 4:59 PM EDT (0s after posting)
03. Peak activity: 53 comments in 12-18h (hottest window of the conversation)
04. Latest activity: Sep 26, 2025 at 8:04 PM EDT (3 months ago)
ID: 45352672 · Type: story · Last synced: 11/20/2025, 7:50:26 PM
Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.
Relevant comparison is on page 15: https://arxiv.org/abs/2509.17765
Some of the reasons could be:
- Mitigating US AI supremacy
- Commodifying AI to push innovation forward and sell platforms to run it, e.g. if the iPhone wins local intelligence, it benefits China, because China manufactures those phones
- The talent war inside China
- Softening sentiment against China in the US
- They're just awesome people
- And many more
Thank you for including that option in your list! F#ck cynicism.
If all of that stuff becomes free, the money will just move a few layers up to all of the companies whose cost structure has suddenly been cut dramatically.
There is no commoditization of expensive technology that results in a net loss of market value. It just moves around.
Watching the US stock market implode from a bubble inflated by investors over here who don't realize this is happening will be a nice bonus for them, I guess, and constantly shipping open SOTA models will speed that along.
And some uses of LLMs are intensely political; think of a student using an LLM to learn about the causes of the civil war. I can understand a country wanting their own LLMs for the same reason they write their own history textbooks.
By releasing the weights they can get free volunteer help, win hearts and minds with their open approach, weaken foreign corporations, give their citizens robust performance in their native language, and exercise narrative control - all at the same time.
That pretty much translates to blank checks from the party without much oversight or expected ROI. The approach is basically brute-forcing something into existence rather than letting it grow organically. China is notorious for this approach: ghost cities, high-speed rail to nowhere, solar panel production in the face of a huge glut.
Ultimately though, there is an expectation that AI will serve the goals of the party, after all it is their trillions that will be funding it. I guess the core difference is that in the US AI is expected to generate profit, in China it is expected to generate controlled social cohesion.
I actually claim something even stronger, which is it’s what’s in your heart that really determines if you’re American :)
They also released Qwen3-VL Plus [1] today alongside Qwen3-VL 235B [2], and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model compared to Qwen-VL-Plus.
Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.
You know it's bad when OpenAI has a clearer naming scheme.
[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
"they" in this sentence probably applies to all "AI" companies.
Even the naming/versioning of OpenAI models is ridiculous, and then you can never find out which is actually better for your needs. Every AI company writes several paragraphs of fluffy text with lots of hand waving, saying how this model is better for complex tasks while this other one is better for difficult tasks.
For example many have switched to qwen3 models but some still vastly prefer the reasoning and output of QwQ (a qwen2.5 model).
And the difference between them: those with "plus" are closed-weight; you can only access them through their API. The others are open-weight, so if they fit your use case, and if the want or need ever arises, you can download them, use them, and even fine-tune them locally, even if Qwen no longer offers access to them.
It's such a simple question: "For someone who does not want to run the model locally, what is the difference between these 2 models on the API?" and yet nobody can answer that question.
This "just" is incorrect.
The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334
(Also I hate this "The Chinese" thing. Do we say "The British" if it came from a DeepMind team in the UK? Or what if there are Chinese-born US citizens working in Paris for Mistral?
Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)
To me it was a positive assessment; I adore their craftsmanship and persistence in moving forward over a long period of time.
Yes.
This is what really grinds my gears about American AI and American technology in general lately, as an American myself. We used to do that! But over the last 10-15 years, it seems like all this country can do is try to throw more and more resources at something instead of optimizing what we already have.
Download more RAM for this progressive web app.
Buy a Threadripper CPU to run this game that looks worse than the ones you played on the Nintendo GameCube in the early 2000s.
Generate more electricity (hello Elon Musk).
Y'all remember your algorithms classes from college, right? Why not apply that here? Because China is doing just that, and frankly making us look stupid by comparison.
It falls short on benchmarks compared to other SOTA models, but the real-world experience is different.
Today I tried a handful of really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me the bounding boxes to improve tesseract.
And it spat it out.
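For anyone curious what wiring that together looks like, here is a minimal sketch of feeding model-reported boxes to tesseract - assuming Qwen returns pixel [x1, y1, x2, y2] coordinates; the field names and numbers below are made up:

    from PIL import Image
    import pytesseract

    # Hypothetical output from Qwen3-VL: pixel bounding boxes per invoice field.
    boxes = {
        "invoice_number": (812, 40, 1180, 92),
        "total": (900, 1410, 1185, 1468),
    }

    page = Image.open("invoice.jpg")
    for field, box in boxes.items():
        crop = page.crop(box)  # (left, upper, right, lower)
        # Tesseract on a tight crop skips the page-layout analysis that
        # tends to fail on invoices with unusual layouts.
        print(field, pytesseract.image_to_string(crop).strip())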
Intuitively it makes sense. The best sources tend to be either moderately polite (professional language) or 4chan-like (rude, biased, but honest).
1: https://arxiv.org/pdf/2402.14531
When that fails, "shut the fuck up" always seems to do the trick.
I've found it more useful to keep it polite and "professional" and restart the conversation if we've begun going around in circles.
And besides, if I make a habit of behaving badly with LLMs, there's a good chance that I'll do it without thinking at some point and get in trouble.
The latest update to Gemini Live draws real-time bounding boxes around objects it's talking about; it's pretty neat.
Honestly, I'm such a noob in this space. I had one project I needed to do and didn't want to do it by hand, which would have taken 2 days, so I spent 5 trying to get a script to do it for me.
If the model supports "vision" or "sound", that tool makes it relatively painless to take your input file + text and feed it to the model.
[0]: https://lmstudio.ai/
LM Studio isn't as "set it and forget it" as Ollama is, and it does have a bit of a learning curve. But if you're doing any kind of AI development and you don't want to mess around with writing llama-cpp scripts all the time, it really can't be beat (for now).
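For reference, a minimal sketch of the image-plus-text call against LM Studio's OpenAI-compatible local server (localhost:1234 is its default; the model id below is a placeholder for whatever you've actually loaded):

    import base64
    from openai import OpenAI

    # LM Studio exposes an OpenAI-compatible API at localhost:1234 by default.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

    with open("photo.jpg", "rb") as f:
        b64 = base64.b64encode(f.read()).decode()

    resp = client.chat.completions.create(
        model="qwen3-vl-8b",  # placeholder: use the id LM Studio shows for your model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)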
And these contractors were relatively good operators compared to most.
CV != AI Vision
gpt-4o would breeze through your poor images.
Construction invoices are not great.
I think it was largely a formatting issue. Like some of these invoices have nonsense layouts. Perhaps Qwen works well because it doesn't assume left to right, top to bottom? Just speculating though
My current go-to test is to ask the LLM to construct a charging solution for my MacBook Pro with the model on it, but sadly, I and the Pro have been sent to 15th-century Florence with no money and no charger. I explain I only have two to three hours of inference time, which can be spread out, but in that time I need to construct a working charging solution.
So far GPT-5 Pro has been by far the best, not just in its electrical specifications (drawings of a commutator), but it generated instructions for jewelers and blacksmiths in what it claims is 15th-century Florentine Italian, and furnished a year-by-year set of events with trading/banking predictions, plus a short rundown of how to get to the right folks in the Medici family... it was comprehensive.
Generally, models suggest building an alternating-current setup, rectifying it down to 5V DC, and trickle charging over the USB-C pins that allow trickle charging. There's a lot of variation in how they suggest we get to DC power, and oftentimes not a lot of help on key questions, like, say, "how do I know I don't have too much voltage using only 15th-century tools?"
Qwen3-VL is a mixed bag. It's the only model other than GPT-5 I've talked to that suggested building a voltaic pile, estimated the voltage generated by the number of plates, gave me some tests to check voltage (lick a lemon, touch your tongue; mild tingling - good; strong tingling - remove a few plates), and was overall helpful.
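As a back-of-envelope check on the pile sizing (my own rough numbers, not the model's): a zinc-copper cell in brine gives somewhere around 0.8-1.0 V open circuit, so:

    # Rough estimate only; real open-circuit voltage depends on the electrolyte
    # and electrode purity, both dubious in 15th-century Florence.
    volts_per_cell = 0.9  # Zn-Cu in brine, approximate
    target_volts = 5.0    # nominal USB trickle-charge rail
    cells = target_volts / volts_per_cell
    print(f"~{cells:.0f} cells; stack a couple extra and trim by tongue-test")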
On the other hand, its money-making strategy was laughable: predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis.
Anyway, interesting showing, definitely real, and definitely useful.
I love this! Simple and probably effective (or would get you killed for witchcraft)
I have a few images of animals with an extra limb photoshopped onto them: a dog with a leg coming out of its stomach, or a cat with two front right legs.
Like every other model I have tested, it insists that the animals have their anatomically correct number of limbs. Even when I point out there is a leg coming from the dog's stomach, it will push back and insist I am confused, insisting it counted again and there are definitely only 4. Qwen took it a step further: even after I told it the image was edited, it told me it wasn't and there were only 4 limbs.
Very difficult for even SOTA to go against data that is as well-represented as bipedal humanoids.
https://mordenstar.com/blog/edits-with-nanobanana
It could probably identify extra limbs in your pictures if you too made a million example images to train it on, but until then it will keep failing. And of course you'll get to keep making millions more example images for every other issue you run into.
[0] https://huggingface.co/datasets/allenai/pixmo-clocks
[1] https://files.catbox.moe/ocbr35.jpg
Although I'm agAInst steps towards AGI, it feels safer to have these things running locally and disconnected from each other than some giant GW cloud agentic data centers connected to everyone and everything.
(edit: corrected mistake w.r.t. the system's GPU)
Edit: just took a look at Amazon. The GMKtec EVO-X2 AI, which is the AMD Ryzen AI Max+ 395 with 128GB of RAM, is 3k euros. A Mac M4 Max with 16 cores and 128 gigs is 4.4k euros. Damn, Europe. If you go with the M4 Max with 14 cores, but still 16 cores of "Neural Engine"... ah, you can't get 128GB of RAM then. Classic Apple :)
Edit 2: looked at the GMKtec site itself; the machine is 2k euros there. Damn, Amazon.
MoE models can still be pretty fast. As are smaller models.
(This is mostly a warning for anyone who is enamored by the idea of running these things locally to make sure to test it before you spend a lot of money.)
Currently I'd probably say the Nvidia RTX Pro 6000 is a challenger if you want local models. It "only" has 96GB of VRAM, but it's very fast (1800 GB/s). If you can fit the model on it and it's good enough for your use case, then it's probably worth it even at $10k.
This is greatly phrased. Love it.
Also, it ships with Windows 11, which is a big no from me.
But if this is the start of hardware capable of running big local models, it looks quite hopeful. A 2nd-hand M2 Studio with 128GB (which I can use Asahi on) is currently ~3600 euros:
https://es.wallapop.com/item/mac-studio-m2-ultra-1tb-ssd-128...
Assuming I don’t want to run it on a CPU, what are my options to run it at home under $10k?
Or if my only option is to run the model with CPU (vs GPU or other specialized HW), what would be the best way to use that 10k? vLLM + Multiple networked (10/25/100Gbit) systems?
You probably don't need fp16. Most models can be quantized down to q8 with minimal loss of quality. Models can usually be quantized to q4 or even below and run reasonably well, depending on what you expect out of them.
Even at q8, you'll need around 235GB of memory. An Nvidia RTX 5090 has 32GB of VRAM and has an official price of about $2000, but usually retails for more. If you can find them at that price, you'd need eight of them to run a 235GB model entirely in VRAM, and that doesn't include a motherboard and CPU that can handle eight GPUs. You could look for old mining rigs built from RTX 3090s or P40s. Otherwise, I don't see much prospect for fitting this much data into VRAM on consumer GPUs for under $10k.
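The arithmetic behind those figures, as a quick sketch (weights only; KV cache and activations add more on top):

    params = 235e9  # Qwen3-VL-235B total parameter count
    for quant, bytes_per_param in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
        gb = params * bytes_per_param / 1e9
        print(f"{quant}: ~{gb:.0f} GB of weights")
    # fp16: ~470 GB, q8: ~235 GB, q4: ~118 GB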
Without NVLink, you're going to take a massive performance hit running a model distributed over several computers. It can be done, and there's research into optimizing distributed models, but the throughput is a significant bottleneck. For now, you really want to run on a single machine.
You can get pretty good performance out of a CPU. The key is memory bandwidth. Look at server or workstation class CPUs with a lot of DDR5 memory channels that support a high MT/s rate. For example, an AMD Ryzen Threadripper 7965WX has eight DDR5 memory channels at up to 5200 MT/s and retails for about $2500. Depending on your needs, this might give you acceptable performance.
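The bandwidth math for that option, roughly (each DDR5 channel is 64 bits wide, i.e. 8 bytes per transfer):

    channels = 8            # Threadripper 7965WX memory channels
    mts = 5200e6            # transfers per second per channel
    bytes_per_transfer = 8  # 64-bit channel width
    bandwidth = channels * mts * bytes_per_transfer / 1e9
    print(f"Peak: ~{bandwidth:.0f} GB/s")  # ~333 GB/s, vs ~1800 GB/s on a 5090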
Lastly, I'd question whether you really need to run this at home. Obviously, this depends on your situation and what you need it for. Any investment you put into hardware is going to depreciate significantly in just a few years. $10k of credits in the cloud will take you a long way.
It seems to me that small/medium sized players would still need a third party to get inference going on these frontier-quality models, and we're not in a fully self-owned self-hosted place yet. I'd love to be proven wrong though.
https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/inferen... seems comparable to the Framework Desktop and reputable - they didn't just quote a number, they showed benchmark output.
I get far more than 3 t/s for a 70B model on normal non-unified RAM, so that figure seems implausibly low for a unified memory architecture like Halo.
It's typically ok for MoE models but if you try to run something non-MoE the speed will plummet. In that same thread there are people getting 50 tok/s on MoE models and 5 on non MoE. (https://www.reddit.com/r/LocalLLaMA/comments/1n79udw/comment...)
And while it has unified memory the memory is quite slow. 250GB/s compared to 500+ for M4 Max or 1800 GB/s for a 5090. So it's fast for a CPU, but pretty slow for a GPU.
(That said, there are not a lot of cheap options for running large models locally. They all have significant compromises.)
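One rough rule of thumb behind those numbers: decode is mostly memory-bandwidth-bound, because each generated token streams all active weights through memory once. A sketch, assuming q8 weights and ignoring KV-cache traffic, so treat the outputs as order-of-magnitude only:

    def est_tokens_per_sec(bandwidth_gb_s, active_params_b, bytes_per_param=1.0):
        # Each token reads roughly active_params * bytes_per_param from memory.
        return bandwidth_gb_s / (active_params_b * bytes_per_param)

    print(est_tokens_per_sec(250, 22))  # 235B-A22B MoE on Strix Halo: ~11 tok/s
    print(est_tokens_per_sec(250, 70))  # dense 70B on the same box: ~3.6 tok/s

This is why MoE models hold up on bandwidth-limited hardware while dense models of similar total size plummet: only the active experts are read per token.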
My comment was a bit unfortunate as it implied I didn't agree with yours, sorry for that. I simply want to clarify that there's a difference between 'GPU memory' and 'system memory'.
The Frame.work desktop is a nice deal. I wouldn't buy the Ryzen AI+ myself; from what I read it maxes out at about 60 tokens/sec, which is low for my use cases.
https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507
Now with this I will use it to identify and caption meal pictures and user pictures for other workflows. Very cool!
I would love to see a comparison to the latest GLM model. I would also love to see no one use OSWorld ever again; it's a deeply flawed benchmark.
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
> resolvectl query qwen.ai
> qwen.ai: resolve call failed: DNSSEC validation failed: no-signature
And
https://dnsviz.net/d/qwen.ai/dnssec/ shows
aliyunga0019.com/DNSKEY: No response was received from the server over UDP (tried 4 times). See RFC 1035, Sec. 4.2. (8.129.152.246, UDP_-_EDNS0_512_D_KN)
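If you want to reproduce the failure outside systemd-resolved, a minimal sketch with dnspython (assumes pip install dnspython): query a validating resolver with the DO bit set and check the AD flag:

    import dns.flags
    import dns.resolver

    resolver = dns.resolver.Resolver()
    resolver.nameservers = ["8.8.8.8"]        # a validating public resolver
    resolver.use_edns(0, dns.flags.DO, 1232)  # DO bit: request DNSSEC records

    try:
        answer = resolver.resolve("qwen.ai", "A")
        # AD (Authenticated Data) set means the resolver validated the chain.
        print("validated:", bool(answer.response.flags & dns.flags.AD))
    except dns.resolver.NoNameservers as e:
        # A broken chain (like the missing aliyunga0019.com DNSKEY response
        # above) typically surfaces as SERVFAIL from validating resolvers.
        print("SERVFAIL, likely DNSSEC validation failure:", e)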