Meta Segment Anything Model Audio
Key topics
The Meta Segment Anything Model Audio has sparked a lively debate about the impact of AI on the music industry, with some commenters marveling at the demo's ability to separate music, voice, and background noise. While some worry that AI will render skilled labor obsolete, others point out that technological advancements have always disrupted traditional industries, citing the introduction of synthesizers, music videos, and digital audio workstations as examples. The discussion also touches on the potential benefits of AI, such as improving the listening experience for the hearing impaired, and the creative possibilities of isolating individual tracks. Amidst the discussion, a consensus emerges that AI is not a replacement for human creativity, but rather a tool that can augment and transform the music-making process.
Snapshot generated from the HN discussion
Discussion Activity
Very active discussion
- First comment: 4h after posting
- Peak period: 25 comments in the 48-60h window
- Avg / period: 4.6 comments
- Based on 37 loaded comments
Key moments
- 01 Story posted: Dec 16, 2025 at 1:26 PM EST (20 days ago)
- 02 First comment: Dec 16, 2025 at 5:46 PM EST (4h after posting)
- 03 Peak activity: 25 comments in the 48-60h window, the hottest window of the conversation
- 04 Latest activity: Dec 21, 2025 at 6:56 AM EST (15 days ago)
This one rankles me because of a) the benefits piracy has (third-world consumers can now discover you, for starters) and b) the absolute bad-faith way the industry acts: screwing over artists, and unethically going after Pirate Bay by turning it into a trade war with Sweden (I think).
"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."
https://news.ycombinator.com/newsguidelines.html
I wonder if you could assemble a big corpus of individual solo instruments, then permute them into cacophonous mixes. IIRC the main training dataset is composed of a limited number of real songs, and I think a model trained only on real songs might struggle with more "out there" harmonies and mixes.
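A rough sketch of what that permutation idea could look like, for illustration only: the stem folder, clip length, and gain range below are assumptions, not details from the SAM Audio training setup.

```python
# Illustrative only: build training mixtures by randomly combining
# solo-instrument recordings. Stems are assumed to share a sample rate;
# resampling is omitted for brevity.
import random
from pathlib import Path

import numpy as np
import soundfile as sf

STEM_DIR = Path("solo_stems")   # hypothetical folder of solo recordings
SAMPLE_RATE = 44100
CLIP_SECONDS = 10

def load_clip(path: Path) -> np.ndarray:
    """Load a clip, downmix to mono, and trim/pad to a fixed length."""
    audio, _sr = sf.read(path, dtype="float32", always_2d=False)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    target = SAMPLE_RATE * CLIP_SECONDS
    if len(audio) >= target:
        start = random.randint(0, len(audio) - target)
        return audio[start:start + target]
    return np.pad(audio, (0, target - len(audio)))

def random_mixture(n_sources: int = 4) -> tuple[np.ndarray, list[np.ndarray]]:
    """Pick random stems, apply random gains, and return (mixture, targets)."""
    paths = random.sample(sorted(STEM_DIR.glob("*.wav")), n_sources)
    sources = [load_clip(p) * random.uniform(0.3, 1.0) for p in paths]
    mix = np.sum(sources, axis=0)
    peak = max(np.abs(mix).max(), 1e-8)
    return mix / peak, [s / peak for s in sources]   # keep mix and targets aligned
```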
A few prompts failed almost entirely, though: "train noises", "background noise", and "clatter"... so it's definitely sensitive to either the prompting or the kind of noise being extracted.
It's not clear from the blog post, the GitHub page, or most other places whether this will run, even at an order-of-magnitude level, on:
* CPU
* 16GB GPU
* 240GB server (of the type most business can afford)
* Meta/Google/OpenAI/Anthropic-style data center
Environments might make the difference between, e.g., 16GB and 24GB, but not between 16GB and 160GB.
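One way to get a first-order answer yourself, assuming the released weights are a standard PyTorch checkpoint (the file name below is a placeholder): count the parameters and multiply by bytes per parameter. Activations add more on top, so this is only a lower bound on VRAM.

```python
# Back-of-the-envelope VRAM estimate from a checkpoint's parameter count.
# The checkpoint path is a placeholder; treat the printed numbers as a lower bound.
import torch

state = torch.load("sam_audio_checkpoint.pt", map_location="cpu")
if isinstance(state, dict) and "state_dict" in state:   # some checkpoints nest the weights
    state = state["state_dict"]

n_params = sum(t.numel() for t in state.values() if torch.is_tensor(t))
print(f"{n_params / 1e9:.2f}B parameters")
print(f"weights only: ~{n_params * 2 / 2**30:.1f} GiB in fp16, "
      f"~{n_params * 4 / 2**30:.1f} GiB in fp32")
```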
It seems you need lots of RAM and VRAM. Reading the issues on GitHub[1], it does not seem many others have had success using this effectively:
- someone with a 96 GB VRAM RTX 6000 Pro had CUDA OOM issues
- someone made it work on an RTX 4090 somehow, but the RTF processing time was 12...
- someone with an RTX 5090 managed to use it, but only with clips no longer than 20s
It seems the utility of the model for hobbyists with consumer-grade cards will be low.
[1]: https://github.com/facebookresearch/sam-audio/issues/24
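Given the ~20-second ceiling reported above, the usual workaround for this class of model is to separate long recordings in short overlapping windows and crossfade the results. This is a generic sketch, not the sam-audio API; `separate_fn` stands in for whatever call the model actually exposes and is assumed to map a mono chunk to a same-length mono stem.

```python
# Generic chunk-and-crossfade workaround for the OOM reports above.
import numpy as np

def separate_in_chunks(audio: np.ndarray, sr: int, separate_fn,
                       window_s: float = 20.0, overlap_s: float = 1.0) -> np.ndarray:
    """Run `separate_fn` over overlapping windows and crossfade the outputs."""
    win = int(window_s * sr)
    hop = win - int(overlap_s * sr)
    out = np.zeros(len(audio), dtype=np.float32)
    weight = np.zeros(len(audio), dtype=np.float32)

    fade = np.ones(win, dtype=np.float32)
    ramp = np.linspace(0.0, 1.0, int(overlap_s * sr), dtype=np.float32)
    if len(ramp):
        fade[:len(ramp)] = ramp          # fade-in at the chunk start
        fade[-len(ramp):] = ramp[::-1]   # fade-out at the chunk end

    for start in range(0, len(audio), hop):
        chunk = audio[start:start + win]
        w = fade[:len(chunk)]
        out[start:start + len(chunk)] += separate_fn(chunk) * w
        weight[start:start + len(chunk)] += w
    return out / np.maximum(weight, 1e-8)
```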
- This feature is awesome for sample-based music
- Sample-based music is not what it once was, due to difficulties with legal rights
- This model was probably created without giving a damn about said rights
The reason I'm interested in this is that recording with multiple microphones (one on the guitar, one on the vocal) has its own set of problems with phase relationships and bleed between the microphones, which cause issues when mixing.
Being able to capture a singing guitarist with a single microphone placed in just the right spot, while still being able to process the tracks individually (with EQ, compression, reverb, etc.), could be really helpful.
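For what it's worth, a minimal sketch of that workflow once separation has produced per-source stems: the file names are placeholders, and the simple high-pass filter stands in for a real EQ/compression/reverb chain.

```python
# After separation, each stem gets its own processing before remixing.
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

vocal, sr = sf.read("vocal_stem.wav", dtype="float32")
guitar, _ = sf.read("guitar_stem.wav", dtype="float32")
n = min(len(vocal), len(guitar))            # guard against small length mismatches
vocal, guitar = vocal[:n], guitar[:n]

# "EQ" the vocal only: roll off rumble below 100 Hz.
sos = butter(4, 100, btype="highpass", fs=sr, output="sos")
vocal = sosfilt(sos, vocal, axis=0)

# Independent level adjustments, then remix and avoid clipping.
mix = 0.9 * vocal + 0.8 * guitar
mix = mix / max(np.abs(mix).max(), 1.0)
sf.write("remix.wav", mix, sr)
```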