Open Source Speech Foundation Model That Runs Locally on CPU in Real-Time
Posted 3 months ago · Active 3 months ago
huggingface.co · Tech · story
Sentiment: excited, positive
Debate: 20/100
Key topics
Artificial Intelligence
Speech Synthesis
Open Source
The post shares an open-source speech foundation model that runs in real time on CPU, sparking discussion of its potential applications and comparisons with other models.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: N/A
Peak period: 7 comments (0-12h)
Avg / period: 3.3
Key moments
- Story posted: Oct 2, 2025 at 10:47 AM EDT (3 months ago)
- First comment: Oct 2, 2025 at 10:47 AM EDT (0s after posting)
- Peak activity: 7 comments in 0-12h (hottest window of the conversation)
- Latest activity: Oct 8, 2025 at 5:27 AM EDT (3 months ago)
ID: 45450363 · Type: story · Last synced: 11/20/2025, 4:02:13 PM
The main idea: frontier-quality text-to-speech, but small enough to run in realtime on CPU. No GPUs, no cloud APIs, no rate limits.
Why we built this:
- Most speech models today live behind paid APIs → privacy tradeoffs, recurring costs, and external dependencies.
- With Air, you get full control, privacy, and zero marginal cost.
- It enables new use cases where running speech models on-device matters (edge compute, accessibility tools, offline apps).
Repo: https://github.com/neuphonic/neutts-air
Would love feedback from HN on performance, applications, and contributions.
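For readers who want a feel for what "run it locally" might look like in practice, here is a minimal sketch. The import path, class name, constructor arguments, and `infer` call are assumptions for illustration, not the project's documented API; the repo's README is the authoritative reference.
```python
# Illustrative sketch only -- names and signatures are assumptions,
# not the actual neutts-air API; see the repo README for real usage.
import soundfile as sf

from neuttsair.neutts import NeuTTSAir  # hypothetical import path

# Load everything on CPU -- the point of the post: no GPU, no cloud API.
tts = NeuTTSAir(
    backbone_repo="neuphonic/neutts-air",  # model weights on Hugging Face
    backbone_device="cpu",
    codec_repo="neuphonic/neucodec",       # companion neural audio codec
    codec_device="cpu",
)

# Synthesize speech locally -- no network call, no rate limit.
wav = tts.infer("Hello from a speech model running on your own CPU.")
sf.write("hello.wav", wav, 24_000)         # sample rate is an assumption
```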
https://github.com/neuphonic/neutts-air/issues/15#issuecomme...
So "no rate limits", while true, is kind of setting different expectations.
Appears to use a proprietary codec as well.
You can listen to the model in this video: https://www.youtube.com/watch?v=YAB3hCtu5wE
The codec is open source: https://huggingface.co/neuphonic/neucodec
> Audio Codec: NeuCodec - our proprietary neural audio codec that achieves exceptional audio quality at low bitrates using a single codebook
( https://huggingface.co/neuphonic/neutts-air#model-details )
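For anyone unsure what "a single codebook" buys you: the encoder's continuous features are snapped to the nearest entry in one learned table of vectors, and only the entry indices need to be stored or transmitted, which is what keeps the bitrate low. A generic PyTorch sketch of that lookup step (not NeuCodec's actual implementation):
```python
import torch

def quantize_single_codebook(features, codebook):
    """Map each frame's feature vector to its nearest codebook entry.

    features: (num_frames, dim) encoder outputs
    codebook: (codebook_size, dim) learned vectors
    Returns (indices, reconstructed); only the indices get stored.
    """
    # Pairwise distances between every frame and every codebook entry.
    dists = torch.cdist(features, codebook)   # (num_frames, codebook_size)
    indices = dists.argmin(dim=1)             # one small integer per frame
    reconstructed = codebook[indices]         # what the decoder works from
    return indices, reconstructed

# Toy example: 100 frames of 256-dim features, a 1024-entry codebook.
feats = torch.randn(100, 256)
book = torch.randn(1024, 256)
idx, recon = quantize_single_codebook(feats, book)
print(idx.shape)  # torch.Size([100]) -- ~10 bits per frame before entropy coding
```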
This says it was trained on proprietary data.
Sorry if I'm missing the point.
Demo sounds great.
Can it run on a GPU? Would it be faster?
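For what it's worth, if the backbone is a standard PyTorch module, moving it to a GPU usually looks like the generic pattern below (not the project's documented flow, and whether it is actually faster depends on the model size and how much you batch):
```python
import torch

# Pick the GPU when one is available, otherwise stay on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(512, 512)  # stand-in for the TTS backbone
model = model.to(device).eval()

with torch.inference_mode():
    x = torch.randn(1, 512, device=device)
    y = model(x)  # same call either way; only the device changes

print(device, y.shape)
```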