Omnilingual ASR: Advancing Automatic Speech Recognition for 1600 Languages
Posted about 2 months ago · Active about 2 months ago
ai.meta.com · Tech · story
Key topics
Automatic Speech Recognition
Multilingual AI
Language Preservation
Meta AI releases Omnilingual ASR, a groundbreaking automatic speech recognition model supporting over 1600 languages, sparking excitement and discussion about its potential applications and implications.
Snapshot generated from the HN discussion
Discussion Activity
Moderate engagement
First comment: 12m after posting
Peak period: 10 comments in 3-6h
Avg / period: 4.3
Comment distribution: 43 data points
Based on 43 loaded comments
Key moments
- 01 Story posted: Nov 10, 2025 at 1:10 PM EST (about 2 months ago)
- 02 First comment: Nov 10, 2025 at 1:22 PM EST (12m after posting)
- 03 Peak activity: 10 comments in 3-6h (hottest window of the conversation)
- 04 Latest activity: Nov 12, 2025 at 4:18 AM EST (about 2 months ago)
ID: 45878826 · Type: story · Last synced: 11/20/2025, 8:47:02 PM
Want the full context?
Jump to the original sources
Read the primary article or dive into the live Hacker News thread when you're ready.
GitHub: https://github.com/facebookresearch/omnilingual-asr
Original post:
You can use the OmniASR SSL models instead of their older MMS models to create TTS models: https://github.com/ylacombe/finetune-hf-vits
What might be interesting is the newly released OmniASR data, because the MMS data, which was used for the MMS TTS, was never released.
Also, the OmniASR can be used to transcribe some untranscribed speech to train a TTS on it.
[1] MMS paper: https://arxiv.org/pdf/2305.13516
As for "very clean data," see section 5.7.4: "Omnilingual + OMSF ASR was intentionally curated to represent naturalistic (i.e., often noisy) audio conditions, diverse speaker identities, and spontaneous, expressive speech."
Freely downloadable and usable by anyone for almost anything.
We truly live in the future.
Music is the universal language, but one day soon it will be replaced by Chinese.
Half joking - hopefully, we can still contribute something to this field. Looking forward to doing some tests with this.
Also, 1.6k < 6k, and I highly doubt this model is anywhere near as good as it is on EU languages for most of them.
As a linguist, I would like to know more about the kinds of languages this works well with, or does not work well with. For example, half the world's languages are tone languages, and the way tones work varies greatly among these. Some just have high and low tones, while others are considerably more complicated; Thai has high, mid, low, rising and falling. Also, tone is relative, e.g. a man's high tone might be a woman's low tone. And some African languages have tones whose absolute frequencies vary across an utterance. So transcribing tone is a quite different problem from transcribing phonemes--and yet for many tone languages, the tone is crucial.
There are also rare(r) phonemes, like the clicks in many languages of southern Africa. Of course maybe they've already trained on some of these languages.
The HuggingFace demo says "Supported Languages[:] For this public demo, we've restricted transcription to low-resource languages with error rates below 10%." That's unclear: 10% word error rate, or character/phoneme error rate? The meta.com page refers to character error rate (CER); a 10% character error rate can imply a much higher word error rate (WER), since most words contain several characters/phonemes. That said, there are ways to get around that, like using a dictionary to select among different paths through possible character sequences so you only get known words, and adding to that a morphological parser for languages that have lots of affixes (meaning not all the word forms will be in the dictionary--think walk, walks, walked, walking--only the first will be in most dictionaries).
Enquiring minds want to know!
https://xkcd.com/1838/
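The CER-versus-WER gap mentioned above can be made concrete with a small stdlib sketch. The sentence and error pattern below are invented for illustration; this is not Meta's evaluation code, just the standard edit-distance definitions of both metrics:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance over two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution / match
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    # Character error rate: edits per reference character.
    return levenshtein(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    # Word error rate: edits per reference word.
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())

ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quik brown fox jumps ovur the lazy dag"  # 3 single-character errors
print(f"CER: {cer(ref, hyp):.2f}")  # 0.07: only a few characters are wrong
print(f"WER: {wer(ref, hyp):.2f}")  # 0.33: but a third of the words are wrong
```

Three stray characters put CER comfortably under 10% while a third of the words come out wrong, which is why the two rates shouldn't be conflated when reading a "below 10%" claim.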
What would it take to start working on them?
"You haven't experienced Shakespeare until you've read him in the original Bonobo". :-)
https://voice-ai.knowii.net
Can't say for sure, but a lot of the UI (and text) is quite familiar. The history page is a near rip-off, which is a giveaway.
I believe the MIT license should be distributed, since it's almost certainly a derivative work:
"The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."
I can't confirm whether the license is in fact distributed, since I would have to pay $50, which quite frankly I'm not going to do.
A bit sad to see a UI reskin claimed as original work. The reskin is totally fine, but I believe the license must be distributed. I believe in the proliferation of this software, so I'm happy to see this overall (it's good enough that someone wants to charge for it! that's a big win!), but it's a bit of a shame how this project has gone about it, imo.
Status: Endangered
"The child-bearing generation can use the language among themselves, but it is seldom being transmitted to children."
What!? A lot must have changed in one generation...
South Estonian: "vulnerable" – sure, yeah
Karelian: "endangered" – seems correct
Swedish: also "endangered" – wat
Ghari (12k speakers): "safe" – :facepalm:
Are these really language-vulnerability ratings or did they just make a mapping from Trump's tariff rates?
https://aidemos.atmeta.com/omnilingualasr/language-globe
- we are getting closer to BabelFish.. at least for the Earth!
> Omnilingual ASR was designed as a community-driven framework. People around the world can extend Omnilingual ASR to new languages by using just a few of their own samples.
The world just got smaller
Few-shot new languages is going to be a game changer for linguists
And their use of LLMs as part of the transcription process makes it likely that they trained the model to correct mispronunciations by the speaker. This lowers CER because the human transcription often corrects for mispronunciations as well, but it reduces the ability of the model to actually transcribe what was said.
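A toy illustration of that effect, using an invented example word (not from the paper): if both the model and the human reference "fix" a mispronunciation, the measured error drops to zero even though the literal speech was never transcribed.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

spoken = "nucular"         # what the speaker actually said
verbatim_ref = "nucular"   # a reference that transcribes the speech literally
corrected_ref = "nuclear"  # a reference where the transcriber "fixed" the word

model_out = "nuclear"      # a model trained to normalize emits the fixed form
print(levenshtein(model_out, corrected_ref))  # 0 edits: scoring rewards the fix
print(levenshtein(model_out, verbatim_ref))   # 2 edits: the literal speech is lost
```

Against the corrected reference the normalized output looks perfect, so the benchmark number improves precisely when the model stops transcribing what was said.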
2 more comments available on Hacker News