TTS Still Sucks
Key topics
The author of the blog post 'TTS still sucks' criticizes the current state of open-source Text-to-Speech (TTS) technology, sparking a discussion on the limitations and potential of TTS models, as well as the importance of voice cloning and model quality.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 26m after posting
- Peak period: 42 comments in 0-6h
- Avg / period: 10.6
- Based on 53 loaded comments
Key moments
- Story posted: Nov 10, 2025 at 4:37 PM EST (2 months ago)
- First comment: Nov 10, 2025 at 5:03 PM EST (26m after posting)
- Peak activity: 42 comments in 0-6h (hottest window of the conversation)
- Latest activity: Nov 13, 2025 at 12:14 AM EST (about 2 months ago)
Sources
1. https://github.com/user-attachments/assets/0fd73fad-097f-48a...
I've heard a lot of Substacks voiced by Eleven Labs models and they seem fine (with the occasional weirdness around a proper noun). Not a bad article, but I think more examples of TTS usage would have been useful.
I guess the outcome is, open weight TTS models are only okay and could be a lot better?
Even with a local model and hardware you already own, you're not beating that on electricity costs.
On Android, I'd imagine (big emphasis on "imagine" since I don't use it) you could probably script something up and use a phone with an audio jack to record it, theoretically hitting that maximum of 720 hours of content per month. But I'd imagine at some point they'd find it peculiar that you're listening to content 24/7.
I believe its power usage is negligible in comparison to, for example, screen or maybe even Bluetooth audio.
When it's a foreign web novel with no English translation, I first translate the web novel with Claude Sonnet.
[1]: https://github.com/resemble-ai/chatterbox
Also, I suspect these AI-podcast blogs are probably just generated with AI too, so it's likely safe to skip the whole mess.
But he not only reads it himself; he also has someone else narrate quotes, and he uses chapter art that goes along with the article.
So just feed it batches smaller than 1000 characters? It's not like TTS requires maintaining large contexts at a time.
The simplest examples are punctuation marks, which change your speech before you reach the mark, but the problem extends past sentence boundaries.
For example:
"He didn't steal the green car. He borrowed it."
vs
"He didn't steal the green car. He stole the red one."
A natural speaker would slightly emphasize "steal" and "borrowed" in the first example, but emphasize "green" and "red" in the second.
Or like when you're building a set:
"Peter called Mary."
vs
"John called Mary. Peter called Mary. Who didn't call Mary?"
-
These all sound like small nits, but for naively stitched-together TTS, at best they nudge the narration towards the uncanny valley (which may be acceptable for some use cases)... but at worst they make the model sound broken.
I agree, but it seems unusual for this to matter past paragraph boundaries, and it sounds like there should be enough room for a full paragraph of context.
And the current SOTA for TTS includes breathing too, so you can't just put a fixed empty pause between your paragraphs.
People are chunking by paragraphs anyways (or even sentences) and it works, but the top commercial models support maintaining a context or passing in the most recently generated text for that reason.
I do a lot of desktop screen-reader and pdf/doc/epub/etc. text-to-speech every single day. It's been 20 years and I still use Festival 1.96 TTS with the voice_nitech_us_slt_arctic_hts voice because it's so computationally cheap and just enough of a step above normal festival/espeak/MBROLA-type TTS quality to be clear and tolerable. For this local "do screenreader stuff really fast" use case I've tried modern TTS like VibeVoice, Kokoro TTS, sherpa-onnx, Piper TTS, Orpheus TTS, etc. They all have consistency issues, many are way too slow even with a $300 GPU dedicated to them, and most output weird garbled noises at unpredictable times along with the good output.
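For readers curious about this workflow: Festival ships with a `text2wave` script that renders a text file to audio, and a voice can be selected with an `-eval` expression. Below is a hedged sketch of driving it from Python; the exact flags and voice name mirror the comment above, but treat the invocation details as an assumption to verify against your Festival install.

```python
import subprocess

def festival_cmd(text_file, wav_out, voice="voice_nitech_us_slt_arctic_hts"):
    """Build a text2wave invocation selecting an HTS voice at startup.

    text2wave ships with Festival; -eval evaluates a Scheme expression
    (here, activating the named voice) before synthesis.
    """
    return [
        "text2wave", text_file,
        "-o", wav_out,
        "-eval", f"({voice})",
    ]

def speak_file(text_file, wav_out):
    # Requires Festival (and the nitech HTS voice) to be installed.
    subprocess.run(festival_cmd(text_file, wav_out), check=True)
```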
I've been working on a product called WithAudio (https://with.audio). Are you open to me reaching out and giving you a free license so you can use it and let me know what you think? I should say it only supports Windows and Mac (ARM). I'm looking for people who have used similar products to get their feedback.
Because of the potential for abuse, nobody wants to release a truly good, general model; it makes lawyers lose sleep. A few more generations of hardware, though, and there will be enough open data and DIY scaffolding out there to produce a superhuman model, and someone will release it.
Deepfake video is already indistinguishable from real video (not one-shot prompt video generation, but deliberate, skilled craft using AI tools).
Higgsfield and other tools allow for spectacular voice results, but it takes craft and care. The one-shot stuff is deliberately underpowered. OpenAI doesn't want to be responsible for a viral pitch-perfect campaign ad, or a fake scandal video sinking a politician, for example.
Once the lawyers calm down, or we get a decent digital bill of rights that establishes clear accountability on the user of the tool, and not the toolmaker, things should get better. Until then, look for the rogue YOLO boutique services or the ambitious open source crew to be the first to superhuman, widely available TTS.
I think a lot of younger people are also not mentally equipped to handle it. Outside of the Hacker News sphere of influence, people are really bad at spotting AI slop (and also really bad at caring about it).
They recognize AI slop easily and definitely do care enough to avoid it. As do their friends.
AI-generated content has near-zero commercial value long-term.
Heck, even humans subtly trying to sell something give off a vibe you can pick up quickly. But now and then they're entertaining or subliminal enough that they get through.
Erm, guilty as charged? Although, I don't think you can blame people for that.
There was a video recently comparing a bunch of "influencer girls" that had signs of "This is AI" and "This is Real". They could all have been AI or could all have been real. I have zero confidence that I could actually spot the difference.
This is doubly true as an "online video persona" has a bunch of idiosyncrasies that make them slightly ... off ... even if they're real (example: YouTube Thumbnail Face, face filters, etc.). AI is really good at twigging to those idiosyncrasies, and it serves as nice camouflage for AI weirdness.
I'm glad you mentioned this because the "Grandma - I was arrested and you need to send bail" scams are already ridiculously effective to run. Better TTS will make voice communication without some additional verification completely untrustworthy.
But, also, I don't want better TTS. I can understand the words current robotic TTS is saying so it's doing the job it needs to do. Right now there are useful ways to use TTS that provide real value to society - better TTS would just enable better cloaking of TTS and allow actors to more effectively waste human time. I would be perfectly happy if TTS remained at the level it is today.
I'm not sure what the bottleneck is right now. Either this idea isn't as fun as I think, or we can't do it in real time on consumer hardware yet.
There are flashes of brilliance, but most of it is noticeably computer-generated.
That's a good rule.
> You must enable DRM to play some audio or video on this page.
Looks like `embed.podcasts.apple.com` isn't in the same spirit.
Maybe that's not so important?
That's such a strange requirement. A TTS is just that: it takes text and speaks it out loud. The user generally doesn't care whose voice it is, and personally I think TTS systems sharing the same voice is a good thing for authenticity, since it lets users know that it's a TTS reading the script and not a real person.
You want your voice to be reading the script, but you don't want to personally record yourself reading the text? As far as I'm concerned, that's an edge case. No wonder TTS systems can't do that properly, since most people don't need it in the first place.
Totally agree on the pain points - I covered similar thoughts in my post: https://lielvilla.com/blog/death-of-demo/