NeuTTS Air – Open-Source, On-Device TTS
Posted 3 months ago · Active 3 months ago
github.com · Tech story
Sentiment: excited, positive
Debate: 20/100
Key topics
TTS
Open-Source
Artificial Intelligence
Machine Learning
NeuTTS Air, an open-source on-device TTS model, has been released on GitHub, sparking excitement and discussion among HN users about its potential applications and limitations.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: N/A
Peak period: 18 comments in 84-96h
Avg / period: 4.7
Comment distribution: 28 data points
Based on 28 loaded comments
Key moments
1. Story posted: Oct 6, 2025 at 5:06 AM EDT (3 months ago)
2. First comment: Oct 6, 2025 at 5:06 AM EDT (0s after posting)
3. Peak activity: 18 comments in 84-96h (hottest window of the conversation)
4. Latest activity: Oct 11, 2025 at 11:01 PM EDT (3 months ago)
ID: 45489311 · Type: story · Last synced: 11/20/2025, 6:42:50 PM
Want the full context?
Read the primary article or dive into the live Hacker News thread when you're ready.
(Genuine question; I haven't seen any other than this one.)
0: https://github.com/microsoft/VibeVoice/issues/26#issuecommen...
On F-Droid
You could try out Nabu and let me know; I'm working on adding more TTS models in the future. It features all the Kokoro voices, style mixing to create your own blend from their voices, basic Kitten TTS support, an audiobook/screen reader, LLMs, and more :)
My cynical side thinks people just take the state-of-the-art open-source model, use an LLM to alter the source, do minimal fine-tuning to change the weights, and then claim “we built our own state-of-the-art TTS”.
I know it’s open source, so I can dig into the details myself, but are there any good high-level overviews of modern TTS comparing/contrasting the top models?
Architecturally it's similar to other LLM-based TTS models (like OuteTTS), but the underlying LLM allows them to release it under an Apache 2 license.
I've found some of them to be surprisingly good. I keep a list of them, as I have future project ideas that might need a good one, and each has its own merits.
I've yet to find one that does good spoken informal Chinese. I'd appreciate it if anyone can suggest one!
This means using this TTS in a commercial project is very dicey due to GPL3.
Watermarking is usually very fragile and generally relies on an adversary not knowing about it. I honestly don't know why anyone bothers with it.
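The fragility claim is easy to demonstrate with a toy example. The sketch below embeds a watermark in the least significant bit of each audio sample (a deliberately naive illustrative scheme, not what any real TTS vendor uses): extraction from the untouched signal is perfect, but a single lossy round-trip drops agreement to chance level.

```python
import random

random.seed(0)

# Toy 16-bit-style PCM samples and a random one-bit-per-sample watermark
audio = [random.randint(-2000, 2000) for _ in range(1000)]
watermark = [random.randint(0, 1) for _ in range(1000)]

# Embed: overwrite the least significant bit of each sample
marked = [(a & ~1) | w for a, w in zip(audio, watermark)]

# Extraction from the untouched signal is perfect
assert [m & 1 for m in marked] == watermark

# A lossy round-trip (here: crude re-quantization that zeroes the LSB)
# destroys the mark; extraction agreement falls to roughly 50%, i.e. chance
requantized = [(m >> 1) << 1 for m in marked]
hit_rate = sum((r & 1) == w for r, w in zip(requantized, watermark)) / len(watermark)
print(f"agreement after re-quantization: {hit_rate:.2f}")
```

Real schemes spread the mark across perceptually robust features rather than raw bits, but the comment's broader point stands: an adversary who knows the scheme can usually strip it.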
But the current one seems really good; I tested it for quite a bit with multiple kinds of inputs.
> Context Window: 2048 tokens, enough for processing ~30 seconds of audio (including prompt duration)
But it's cutting off for me before even that point. I fed it a paragraph of text and it gets part of the way through it before skipping a few words ahead, saying a few words more, then cutting off at 17 seconds. Another test just cut off after 21 seconds (no skipping).
Lastly, I'm on a MBP M3 Max with 128GB running Sequoia. I'm following all the "Guidelines for minimizing Latency" but generating a 4.16 second clip takes 16.51s for me. Not sure what I'm doing wrong or how you would use this in practice since it's not realtime and the limit is so low (and unclear). Maybe you are supposed to cut your text into smaller chunks and run them in parallel/sequence to get around the limit?
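The chunking workaround the comment speculates about can be sketched with plain Python: split the input at sentence boundaries, pack sentences into chunks under a character budget, then synthesize each chunk and concatenate the audio. The 200-character default below is a guessed proxy for the ~30-second context window, not a documented limit of NeuTTS Air.

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-aligned chunks no longer than max_chars,
    so each TTS request stays well within the model's context window."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

paragraph = ("NeuTTS Air is an on-device TTS model. "
             "Long inputs can exceed the context window. "
             "Splitting on sentence boundaries keeps each request short.")
for chunk in chunk_text(paragraph, max_chars=80):
    print(chunk)  # each chunk would be fed to the model separately
```

Sentence-aligned splits also help avoid the mid-word cutoffs described above, since the model never runs out of context partway through a clause.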
The demo is impressive. It uses reference audio at inference time, and it looks like the training code is mostly available [2][3] with a reference dataset [4] as well.
From the README:
> NeuTTS Air is built off Qwen 0.5B
1. https://huggingface.co/neuphonic/neutts-air/tree/main
2. https://github.com/neuphonic/neutts-air/issues/7
3. https://github.com/neuphonic/neutts-air/blob/feat/example-fi...
4. https://huggingface.co/datasets/neuphonic/emilia-yodas-engli...