NeuTTS Air – Open-Source, On-Device TTS
Posted 3 months ago · Active 3 months ago
github.com · Tech story
Sentiment: excited, positive
Debate: 20/100
Key topics
TTS
Open-Source
Artificial Intelligence
Machine Learning
NeuTTS Air, an open-source on-device TTS model, has been released on GitHub, sparking excitement and discussion among HN users about its potential applications and limitations.
Snapshot generated from the HN discussion
Discussion Activity
Active discussion
First comment: N/A
Peak period: 18 comments in 84-96h
Avg / period: 4.7
Comment distribution: 28 data points
Based on 28 loaded comments
Key moments
1. Story posted: Oct 6, 2025 at 5:06 AM EDT (3 months ago)
2. First comment: Oct 6, 2025 at 5:06 AM EDT (0s after posting)
3. Peak activity: 18 comments in 84-96h (hottest window of the conversation)
4. Latest activity: Oct 11, 2025 at 11:01 PM EDT (3 months ago)
ID: 45489311 · Type: story · Last synced: 11/20/2025, 6:42:50 PM
Want the full context?
Read the primary article or dive into the live Hacker News thread when you're ready.
(Genuine question; I haven't seen any other than this one.)
0: https://github.com/microsoft/VibeVoice/issues/26#issuecommen...
On F-Droid
You could try out Nabu and let me know; I'm working on adding more TTS models in the future. It features all the Kokoro voices, style mixing to create your own blend from their voices, basic Kitten TTS support, an audiobook/screen reader, LLMs, and more :)
My cynical side thinks people just take the state-of-the-art open-source model, use an LLM to alter the source, do minimal fine-tuning to change the weights, and then claim “we built our own state-of-the-art TTS”.
I know it’s open source, so I can dig into the details myself, but are there any good high-level overviews of modern TTS comparing/contrasting the top models?
Architecturally it's similar to other LLM-based TTS models (like OuteTTS), but the underlying LLM allows them to release it under an Apache 2 license.
I've found some of them to be surprisingly good. I keep a list of them, as I have future project ideas that might need a good one, and each has its own merits.
I've yet to find one that does good spoken informal Chinese. I'd appreciate it if anyone can suggest one!
This means using this TTS in a commercial project is very dicey due to GPL3.
Watermarking is usually very fragile and generally relies on an adversary not knowing about it. I honestly don't know why anyone bothers with it.
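The fragility claim is easy to demonstrate with a toy example. The sketch below embeds a watermark in the least significant bit of each audio sample (a deliberately naive illustrative scheme, not what any real TTS vendor uses): extraction from the untouched signal is perfect, but a single lossy round-trip drops agreement to chance level.

```python
import random

random.seed(0)

# Toy 16-bit-style PCM samples and a random one-bit-per-sample watermark
audio = [random.randint(-2000, 2000) for _ in range(1000)]
watermark = [random.randint(0, 1) for _ in range(1000)]

# Embed: overwrite the least significant bit of each sample
marked = [(a & ~1) | w for a, w in zip(audio, watermark)]

# Extraction from the untouched signal is perfect
assert [m & 1 for m in marked] == watermark

# A lossy round-trip (here: crude re-quantization that zeroes the LSB)
# destroys the mark; extraction agreement falls to roughly 50%, i.e. chance
requantized = [(m >> 1) << 1 for m in marked]
hit_rate = sum((r & 1) == w for r, w in zip(requantized, watermark)) / len(watermark)
print(f"agreement after re-quantization: {hit_rate:.2f}")
```

Real schemes spread the mark across perceptually robust features rather than raw bits, but the comment's broader point stands: an adversary who knows the scheme can usually strip it.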
But the current one seems really good; I tested it for quite a bit with multiple kinds of inputs.
> Context Window: 2048 tokens, enough for processing ~30 seconds of audio (including prompt duration)
But it's cutting off for me before even that point. I fed it a paragraph of text and it gets part of the way through it before skipping a few words ahead, saying a few words more, then cutting off at 17 seconds. Another test just cut off after 21 seconds (no skipping).
Lastly, I'm on a MBP M3 Max with 128GB running Sequoia. I'm following all the "Guidelines for minimizing Latency" but generating a 4.16 second clip takes 16.51s for me. Not sure what I'm doing wrong or how you would use this in practice since it's not realtime and the limit is so low (and unclear). Maybe you are supposed to cut your text into smaller chunks and run them in parallel/sequence to get around the limit?
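The chunking workaround the comment speculates about can be sketched with plain Python: split the input at sentence boundaries, pack sentences into chunks under a character budget, then synthesize each chunk and concatenate the audio. The 200-character default below is a guessed proxy for the ~30-second context window, not a documented limit of NeuTTS Air.

```python
import re

def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Split text into sentence-aligned chunks no longer than max_chars,
    so each TTS request stays well within the model's context window."""
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

paragraph = ("NeuTTS Air is an on-device TTS model. "
             "Long inputs can exceed the context window. "
             "Splitting on sentence boundaries keeps each request short.")
for chunk in chunk_text(paragraph, max_chars=80):
    print(chunk)  # each chunk would be fed to the model separately
```

Sentence-aligned splits also help avoid the mid-word cutoffs described above, since the model never runs out of context partway through a clause.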
The demo is impressive. It uses reference audio at inference time, and it looks like the training code is mostly available [2][3] with a reference dataset [4] as well.
From the README:
> NeuTTS Air is built off Qwen 0.5B
1. https://huggingface.co/neuphonic/neutts-air/tree/main
2. https://github.com/neuphonic/neutts-air/issues/7
3. https://github.com/neuphonic/neutts-air/blob/feat/example-fi...
4. https://huggingface.co/datasets/neuphonic/emilia-yodas-engli...