Qwen3-Omni: Native Omni AI Model for Text, Image and Video
Posted: 3 months ago · Active: 3 months ago
Source: github.com · Tech story · High profile
Sentiment: excited / positive · Debate: 20/100
Key topics
Artificial Intelligence
Multimodal Models
Language Processing
The Qwen3-Omni AI model is a native omni AI model that can process text, image, and video, and has been generating excitement among the HN community for its capabilities and potential applications.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
First comment: 55m after posting
Peak period: 99 comments in the 0-12h window
Avg / period: 15.8
Comment distribution: 142 data points (based on 142 loaded comments)
Key moments
1. Story posted: Sep 22, 2025 at 1:50 PM EDT (3 months ago)
2. First comment: Sep 22, 2025 at 2:45 PM EDT (55m after posting)
3. Peak activity: 99 comments in the 0-12h window, the hottest stretch of the conversation
4. Latest activity: Sep 28, 2025 at 3:36 AM EDT (3 months ago)
ID: 45336989 · Type: story · Last synced: 11/22/2025, 11:00:32 PM
Multi-modal audio models are a lot less common. GPT-4o was meant to be able to do this natively from the start but they ended up shipping separate custom models based on it for their audio features. As far as I can tell GPT-5 doesn't have audio input/output at all - the OpenAI features for that still use GPT-4o-audio.
I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space for all three, but I expect it probably does.
For example, beyond video->text->llm and video->embedding in llm, you can also have an llm controlling/guiding a separate video extractor.
See this paper for a pretty thorough overview.
Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., & Xu, C. (2025). Video Understanding with Large Language Models: A Survey (No. arXiv:2312.17432). arXiv. https://doi.org/10.48550/arXiv.2312.17432
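To make that third pattern concrete, here is a rough Python sketch of an LLM steering a separate frame extractor; `ask_llm` and `caption_frame` are hypothetical stand-ins for whatever chat model and image captioner you have available, not anything taken from the survey above.

```python
# Sketch: an LLM guiding a separate video extractor rather than ingesting the
# whole video. The LLM sees cheap coarse captions first, then asks for the
# specific segments it wants examined more closely.
import json
import cv2  # pip install opencv-python


def caption_frame(frame) -> str:
    """Placeholder: run any image captioner / VLM on a single frame."""
    return "a person standing at a whiteboard"  # stub


def ask_llm(prompt: str) -> str:
    """Placeholder: call whatever chat model is steering the pipeline."""
    return json.dumps({"segments": [[10.0, 20.0]]})  # stub


def answer_about_video(path: str, question: str) -> str:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))

    # Pass 1: coarse sampling, one caption roughly every 10 seconds.
    coarse = []
    for idx in range(0, total, int(fps * 10)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            coarse.append({"t": idx / fps, "caption": caption_frame(frame)})

    # Pass 2: the LLM decides which time ranges deserve a denser look.
    plan = json.loads(ask_llm(
        f"Question: {question}\nCoarse captions: {json.dumps(coarse)}\n"
        'Return JSON {"segments": [[start_s, end_s], ...]} worth re-examining.'
    ))

    detail = []
    for start, end in plan["segments"]:
        for idx in range(int(start * fps), int(end * fps), int(fps)):
            cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
            ok, frame = cap.read()
            if ok:
                detail.append({"t": idx / fps, "caption": caption_frame(frame)})

    cap.release()
    return ask_llm(f"Question: {question}\nDetailed captions: {json.dumps(detail)}")
```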
> "Bonjour, pourriez-vous me dire comment se rendreà la place Tian'anmen?"
translation: "Hello, could you tell me how to get to Tiananmen Square?"
a bold choice!
e.g. if something similar happened in Trafalgar Square, I expect it would still be primarily a major square in London to me, not oh my god they must be referring to that awful event. (In fact I think it was targeted in the 7/7 bombings for example.)
Or a better example to go with your translation - you can refer to the Bastille without 'boldly' invoking the history of its storming in the French Revolution.
No doubt the US media has referred to the Capitol without boldness many times since 6 Jan '21.
"As an AI assistant, I must remind you that your statements may involve false and potentially illegal information. Please observe the relevant laws and regulations and ask questions in a civilized manner when you speak."
I wonder if we'll see a macOS port soon - currently it very much needs an NVIDIA GPU as far as I can tell.
I'm pretty happy about that - I was worried it'd be another 200B+.
It has an entertaining selection of different voices, including:
*Dylan* - A teenager who grew up in Beijing's hutongs
*Peter* - Tianjin crosstalk, professionally supporting others
*Cherry* - A sunny, positive, friendly, and natural young lady
*Ethan* - A sunny, warm, energetic, and vigorous boy
*Eric* - A Sichuan Chengdu man who stands out from the crowd
*Jada* - The fiery older sister from Shanghai
"... A comprehensive evaluation was performed on a suite of models, including Qwen3-Omni-30B-A3B- Instruct, Qwen3-Omni-30B-A3B-Thinking, and two in-house developed variants, designated Qwen3- Omni-Flash-Instruct and Qwen3-Omni-Flash-Thinking. These “Flash” models were designed to improve both computational efficiency and performance efficacy, integrating new functionalities, notably the support for various dialects. ..."
In Russian, Ryan sounds like a westerner who started reading Russian words a month ago.
Dylan sounds somewhat authentic, while everyone else speaks Russian with varying degrees of a heavy Asian accent.
Depending on the architecture, this is something you could feasibly have in your house in a couple of years, or in an expensive "AI toaster".
Ever since ChatGPT added this feature I've been waiting for anyone else to catch up.
There are tons of hands-free situations, like cooking, where this would be amazing ("read the next step please, my hands are covered in raw pork", "how much flour for the roux", "crap, I don't have any lemons, what can I substitute")
The Chinese are going to end up owning the AI market if the American labs don't start competing on open weights. Americans may end up in a situation where they have some $1000-2000 device at home with an open Chinese model running on it, if they care about privacy or owning their data. What a turn of events!
For the actual AI, I basically set up three Docker containers: one for speech-to-text[5], one for text-to-speech[6], and then ollama[7] for the LLM itself. After that it's just a matter of pointing HomeAssistant at the various services, as it has built-in support for all of these things. (A rough compose sketch follows the links below.)
1. https://www.adafruit.com/product/5835
2. https://esphome.io/
3. https://gist.github.com/tedivm/2217cead94cb41edb2b50792a8bea...
4. https://github.com/BigBobbas/ESP32-S3-Box3-Custom-ESPHome/
5. https://github.com/rhasspy/wyoming-faster-whisper
6. https://github.com/rhasspy/wyoming-piper
7. https://ollama.com/
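For anyone wanting a concrete starting point, here is a rough compose sketch of that three-container layout; the image tags, ports, model name, and voice name are assumptions based on each linked project's defaults, so check the READMEs above before relying on them.

```yaml
# Rough sketch of the three services described above (image tags, ports,
# and model/voice names are assumptions -- verify against each README).
services:
  whisper:                      # speech-to-text (wyoming-faster-whisper)
    image: rhasspy/wyoming-whisper
    command: --model base-int8 --language en
    ports: ["10300:10300"]
  piper:                        # text-to-speech (wyoming-piper)
    image: rhasspy/wyoming-piper
    command: --voice en_US-lessac-medium
    ports: ["10200:10200"]
  ollama:                       # the LLM itself
    image: ollama/ollama
    ports: ["11434:11434"]
    volumes: ["ollama:/root/.ollama"]
volumes:
  ollama:
```

HomeAssistant's Wyoming and Ollama integrations can then be pointed at those three ports.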
The nails in the video made me laugh
I've never set up any AI system. Would you say setting up such a self-hosted AI is at a point now where an AI novice can get an AI system installed and integrated with an existing Home Assistant install in a couple hours?
Claude code is your friend.
I run Proxmox on an old Dell R710 in my closet that hosts my HomeAssistant VM (amongst others), and I've set up my "gaming" PC (which hasn't done any gaming in quite some time) to dual boot (Windows or Debian/Proxmox) and just keep it booted into Debian as another Proxmox node. That PC also has a 4070 Super set up to pass through to a VM, and on that VM I've got various services utilizing the GPU. This includes some that are utilized by my Hetzner bare-metal servers for things like image/text embeddings as well as local LLM use (though rather minimal due to VRAM constraints), and some image/video object detection with my security cameras (slowly working on a remote water-gun turret to keep the raccoons from trying to eat the kittens that stray cats keep having in my driveway/workshop).
Install Claude Code (or opencode, it's also good), use Opus (get the Max plan), give it a directory it can use as its working directory (don't open it in ~/Documents and just start doing things), and prompt it with something as simple as this:
"I have an existing home assistant setup at home and I'd like to determine what sort of self-hosted AI I could setup and integrate with that home assistant install - can you help me get started? Please also maintain some notes in .md files in this working directory with those note files named and organized as you see appropriate so that we can share relevant context and information with future sessions. (example: Hardware information, local urls, network layout, etc) If you're unsure of something, ask me questions. Do not perform any destructive actions without first confirming with me."
Plan mode. _ALWAYS_ use plan mode to get the task set up. If there's something about the plan you don't like, say no and give it notes; it will return with a new plan. Eventually agree to the plan when it's right, then work through that plan outside plan mode, but if it gets off the plan, go back into plan mode to get the/a plan set, then again let it go and just steer it in regular mode.
I don't have the Max plan, but on the Pro plan I tried for a month, I was able to blow through my 5-hour limit with a single prompt (with a 70k-token codebase attached as context). The idea of paying so much money to get a few questions per "workday" seems insane to me.
Like, I don't know if my account is broken or everyone else just uses things differently. I use claude code, I have it hard-stuck to Opus 4.1 - I don't even touch Sonnet. I _abuse_ the context - I used to /compact early or /clear often depending on the task... but these days (Opus seems much better with nearly full context than Sonnet was) if I'm still on the same task/group of tasks or I think that the current context would be useful for the next thing/task/step I don't even /compact anymore. I've found that if I just run it right up to full and let it auto /compact it does a _really_ good job picking up where it left off. (Which wasn't always the case) Point being - I'm exclusively using Opus 4.1 while also constantly cycling through and maxing out context only to restart with /compact'd context so it's not even starting empty and just keep going.
Hours a day like this. Never hit a limit. (I've said elsewhere that I do believe the general time I work, which is late evening and early morning in North America, has something to do with this, but I don't actually know.)
Or, ask somebody who already has it set up working.
That way you can get certain results, without guessing around why it works for them and not for you.
(I, too, am interested in the grandparent poster's setup.)
Keep up the good hacking - it's been fun to play with this stuff!
Wouldn't worry about that, I'm pretty sure the government is going to ban running Chinese tech in this space sooner or later. And we won't even be able to download it.
Not saying any of the bans will make any kind of sense, but I'm pretty sure they're gonna say this is a "strategic" space. And everything else will follow from there.
Download Chinese models while you can.
There are a lot of things in the US Constitution. But the Supreme Court is the final arbiter, and they're moving closer and closer to "whatever you say, big daddy."
[1] https://www.oyez.org/cases/1789-1850/3us386
Whatever a given bill does is precisely what its authors, who are almost never elected by any constituency on Earth, intend for it to do.
What percentage of the people that you know are able to install python and dependencies plus the correct open weights models?
I'd wager most of your parents can't do it.
Most "normies" wouldn't even know what a local model even is, let alone how to install a GPU.
Granted, based on how annoyingly chill we are with advertisements and government surveillance, I suppose this desire for privacy never extended beyond the neighbors.
I think HN vastly overestimates the market for something like this. Yes, there are some people who would spend $2,000 to avoid having prompts go to any cloud service.
However, most people don’t care. Paying $20 per month for a ChatGPT subscription is a bargain and they automatically get access to new versions as they come.
I think the at-home self hosting hobby is interesting, but it’s never going to be a mainstream thing.
Case in point: I give Gmail OAuth access to nobody. I nearly got burned once and I really don’t want my entire domain nuked. But I want to be able to have an LLM do things only LLMs can do with my email.
“Find all emails with ‘autopay’ in the subject from my utility company for the past 12 months, then compare it to the prior year’s data.” GPT-OSS-20b tried its best but got the math obviously wrong. Qwen happily made the tool calls and spat out an accurate report, and even offered to make a CSV for me.
Surely if you can’t trust npm packages or MS not to hand out god tokens to anyone who asks nicely, you shouldn’t trust a random MCP server with your credentials or your model. So I had Kilocode build my own. For that use case, local models just don’t quite cut it. I loaded $10 into OpenRouter, told it what I wanted, and selected GPT-5 because it’s half off this week. 45 minutes, $0.78, and a few manual interventions later I had a working Gmail MCP that is my very own. It gave me some great instructions on how to configure an OAuth app in GCP, and I was able to get it running queries within minutes from my local models.
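As a rough illustration of how small such a server can be, here is a sketch of a single read-only Gmail search tool built on the MCP Python SDK and the official Gmail API client; the token path, scopes, and tool shape are assumptions for illustration, not the commenter's actual code.

```python
# Minimal sketch of a self-hosted Gmail MCP server exposing one read-only
# search tool. Assumes an OAuth client already exists in GCP and user
# credentials were saved to token.json (path and scopes are assumptions).
# pip install mcp google-auth google-api-python-client
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build
from mcp.server.fastmcp import FastMCP

SCOPES = ["https://www.googleapis.com/auth/gmail.readonly"]
mcp = FastMCP("gmail-search")


def gmail_service():
    creds = Credentials.from_authorized_user_file("token.json", SCOPES)
    return build("gmail", "v1", credentials=creds)


@mcp.tool()
def search_messages(query: str, max_results: int = 25) -> list[dict]:
    """Search Gmail with a standard query string, e.g.
    'subject:autopay newer_than:12m'."""
    svc = gmail_service()
    resp = svc.users().messages().list(
        userId="me", q=query, maxResults=max_results
    ).execute()
    out = []
    for ref in resp.get("messages", []):
        msg = svc.users().messages().get(
            userId="me", id=ref["id"], format="metadata",
            metadataHeaders=["Subject", "From", "Date"],
        ).execute()
        headers = {h["name"]: h["value"] for h in msg["payload"]["headers"]}
        out.append({"id": ref["id"], **headers, "snippet": msg.get("snippet", "")})
    return out


if __name__ == "__main__":
    mcp.run()  # stdio transport by default; point your MCP client at this script
```

Because the scope is read-only and the token never leaves the box, the blast radius is much smaller than handing OAuth access to a third-party service.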
There is a consumer play on the horizon for a ~$2,499-$5,000 box that can run your personal staff of agents. We need about one more generation of models and another generation of low-to-mid inference hardware to make it commercially feasible to turn a profit. It would need to pay for itself easily in the lives of its adopters. Then the mass market could open up. A more obvious path goes through SMBs who care about control and data sovereignty.
If you’re curious, my power bill is up YoY, but there was a rate hike, definitely not my 4090;).
Moving further, if the OAuth Token confers access to the rest of a user's Google suite, any information in Drive can be compromised. If the token has broader access to a Google Workspace account, there's room for inspecting, modifying, and destroying important information belonging to multiple users. If it's got admin privileges, a third party can start making changes to the org's configuration at large, sending spam from the domain to tank its reputation while earning a quick buck, or engage in phishing on internal users.
The next step would be racking up bills in Google's Cloud, but that's hopefully locked behind a different token. All the same, a bit of lateral movement goes a long way ;)
My greatest concern for local AI solutions like this is the centrality of email and the obvious security concerns surrounding email auth.
All of these things have a dark side, though; but it's likely unnecessary for me to elaborate on that.
are we the baddies??
Even more so now that Trump is in command.
That's not to say they are being selfish, or to judge in any way the morality of their actions. But because of that incentive, you can't logically infer moral agency in their decision to release open-weights, IP-free CPUs, etc.
In the case of technology like RISC, pretty much all the value add is unprotected, so they can sell those products in the US/EU without issue.
open your eyes.
So did you run the model offline on your own computer and get realtime audio?
Can you tell me the GPU or specifications you used?
I inquired with ChatGPT:
https://chatgpt.com/share/68d23c2c-2928-800b-bdde-040d8cb40b...
It seems to need around a $2,500 GPU; do you have one?
I tried Qwen online via its website interface a few months ago, and found it to be very good.
I've run some offline models including Deepseek-R1 70B on CPU (pretty slow, my server has 128 GB of RAM but no GPU) and I'm looking into what kind of setup I would need to run an offline model on GPU myself.
At the top of the README of the GitHub repository, there are a few links to demos where you can try the model.
> It seems it needs around a $2,500 GPU
You can get a used RTX 3090 for about $700, which has the same amount of VRAM as the RTX 4090 in your ChatGPT response.
But as far as I can tell, quantized inference implementations for this model do not exist yet.
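A weights-only back-of-the-envelope estimate shows why people are waiting on quantized builds; this is a rule-of-thumb sketch that ignores the KV cache and runtime overhead.

```python
# Rough VRAM estimate for weights alone: params * bytes-per-param.
# Real usage is higher (KV cache, activations, runtime overhead).
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * 1e9 * (bits_per_param / 8) / 1e9

for bits in (16, 8, 4):
    print(f"30B @ {bits}-bit ~= {weight_vram_gb(30, bits):.0f} GB")
# 16-bit ~= 60 GB  (multi-GPU or a large workstation card)
# 8-bit  ~= 30 GB  (still over a 24 GB 3090/4090)
# 4-bit  ~= 15 GB  (fits a 24 GB card, with room left for the KV cache)
```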
By analogy, you can legally charge for copies of your custom Linux distribution, but what's the point when all the others are free?
China has nothing to lose and everything to gain by releasing stuff openly.
Once China figures out how to make high-performance FPGA chips really cheap, it's game over for the US. The only power the US has is over GPU supply... and even then it's pretty weak.
Not to mention NVIDIA crippling its own country with low-VRAM cards. China is taking older cards, stripping the RAM, and using it to upgrade other older cards.
I was obviously really naive. Either way, it gets me excited any time I see progress with OCR. I should give this a try against my (small) dataset.
All I'm saying is I'm excited to try Qwen to see if it outperforms my gnarly algorithm.
I wonder how hard it would be to turn the open-source model into a realtime model.
This is an older link, but they listed two different sections here, commercial and open source models: https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
For the realtime multimodal, I'm not seeing the open source models tab: https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
Pasted here for your own judgement:
*Title: The Last Lightbulb*
The power had been out for three days. Rain drummed against the windows of the old cabin, and the only light came from a flickering candle on the kitchen table.
Maggie, wrapped in a wool blanket, squinted at the last working flashlight. “We’ve got one bulb left, Jack. One.”
Jack, hunched over a board game he’d dug out of the closet, didn’t look up. “Then don’t turn it on unless you’re reading Shakespeare or delivering a baby.”
She rolled her eyes. “I need it to find the can opener. I’m not eating cold beans with my fingers again.”
Jack finally glanced up, grinning. “You did that yesterday and called it ‘rustic dining.’”
“Desperate times,” she muttered, clicking the flashlight on. The beam cut through the gloom—and immediately began to dim.
“No—!” Jack lunged, but too late. The light sputtered… then died.
Silence. Then Maggie sighed. “Well. There goes civilization.”
Jack leaned back, chuckling. “Guess we’re officially cavemen now.”
“Cavewoman,” she corrected, fumbling in the dark. “And I’m going to bed. Wake me when the grid remembers we exist.”
As she shuffled off, Jack called after her, “Hey—if you find the can opener in the dark, you’re officially magic.”
A pause. Then, from down the hall: “I found socks that match. That’s basically witchcraft.”
Jack smiled into the dark. “Goodnight, witch.”
“Goodnight, caveman.”
Outside, the rain kept falling. Inside, the dark didn’t feel so heavy anymore.
— The End —
Even though Gemini 2.0 Flash is quite old, I still like it. Very cheap (each second of audio is just 32 tokens), it supports even more languages, and it's non-reasoning so very fast, with big rate limits.
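For scale, simple arithmetic on that 32 tokens-per-second figure:

```python
# Audio token counts at the rate quoted above (32 tokens per second).
TOKENS_PER_SECOND = 32
print(TOKENS_PER_SECOND * 60)       # 1,920 tokens for a minute of audio
print(TOKENS_PER_SECOND * 60 * 60)  # 115,200 tokens for a full hour
```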
https://www.youtube.com/watch?v=_zdOrPju4_g
I was led to believe that it was possible by the first image here: https://qwen.ai/blog?id=1f04779964b26eacd0025e68698258faacc7... which shows a voice output (top left) next to the written-out detail of the thinking mode
It asked me follow up questions, and then took forever. I finally clicked the button to see what was happening and it had started outputting Chinese in the middle.
I gave up on it.
If we had some aggregated cluster-reasoning mechanism, when would 8x 30B models running on an H100 server outperform a single 240B model on the same server in terms of accuracy?
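One simple form of aggregated cluster reasoning is self-consistency-style majority voting across the eight smaller models; here is a hedged sketch against an OpenAI-compatible endpoint, where the base URL and model names are placeholders.

```python
# Sketch: majority vote across several small models served behind an
# OpenAI-compatible endpoint (e.g. vLLM). URL and model names are placeholders.
from collections import Counter
import requests

BASE_URL = "http://localhost:8000/v1/chat/completions"  # placeholder
MODELS = [f"qwen3-30b-replica-{i}" for i in range(8)]    # placeholder names


def ask(model: str, question: str) -> str:
    resp = requests.post(BASE_URL, json={
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.7,
    }, timeout=120)
    return resp.json()["choices"][0]["message"]["content"].strip()


def majority_answer(question: str) -> str:
    votes = Counter(ask(m, question) for m in MODELS)
    answer, count = votes.most_common(1)[0]
    return f"{answer}  (agreement: {count}/{len(MODELS)})"
```

Exact-string voting only helps for short, extractive answers; free-form outputs would need an answer-normalization or judge step before the vote means anything, which is part of why the 8x-30B-vs-240B question doesn't have a clean answer.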
Impressive what these local models are now capable of.
1 more comment available on Hacker News