Sam Audio (about.fb.com)
At the time the first SAM was created, Meta was already spending over $2B/year on human labelers. Surely that number is higher now, and research like this can dramatically increase data-labeling volume.
How is creating 3D objects and characters (and something resembling bones/armature, but not quite) supposed to help with data labeling? As synthetic data for training other models, maybe, but it seems like this new release is aimed at improving their own tooling for content creators, which is hard to deny considering their demos.
For the original SAM releases, I agree, that was probably the purpose. But these new ones that generate stuff, do effects, and whatnot clearly go beyond that initial scope.
How does that work? Correlating sound with movement?
Could you watch a music video and say "that's the snare drum, that's the lead singer, keyboard, bass, that's the truck that's making the engine noise, that's the crowd that's cheering"? So can AI.
Could you point out who is lead guitar and who is rhythm guitar? So can AI.
And in those situations it won't work. Is any of this really a surprise?
That doesn't seem any better than typing "rhythm guitar". In fact, it seems worse, with extra steps. Sometimes the thing making the sound isn't even pictured. This thing is going to make me scrub through the video until the bass player is in frame instead of just typing "bass guitar". Then it will burn some power inferring that the thing I clicked on was a bass.
If I had to do it synthetically, I'd take single subjects, each making a single sound, and combine them. Then train a model to separate them again.
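A minimal sketch of that mix-and-separate idea, assuming two clean single-source recordings on disk (the file names are placeholders, and matching sample rates/channel layouts are assumed); the mixture becomes the model input and the originals become the separation targets:

```python
import numpy as np
import soundfile as sf

# Placeholder files: any two clean, single-source recordings with the
# same sample rate and channel layout.
voice, sr_v = sf.read("solo_voice.wav")
guitar, sr_g = sf.read("solo_guitar.wav")
assert sr_v == sr_g, "resample first if the sample rates differ"

# Trim to a common length and sum; the mix is the training input,
# the two originals are the ground-truth separation targets.
n = min(len(voice), len(guitar))
mixture = voice[:n] + guitar[:n]

# Normalize so the summed signal doesn't clip when written out.
peak = np.max(np.abs(mixture))
if peak > 1.0:
    mixture = mixture / peak

sf.write("mixture.wav", mixture, sr_v)
```

Repeat over many random pairings and gain levels and you get an endless supply of (mixture, stems) training pairs without any human labeling.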
For mash-ups specifically, using yt-dlp to download music and splitting it into stems with Demucs (via the UVR frontend) before importing into a DAW is effortless. The catch is that you can't expect OK-ish separation on anything other than vocals and "other", which really isn't a problem for mash-ups.
While it's convenient not having to split the stems into separate files beforehand by using a VST, you usually end up doing so anyway while editing and arranging.
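For what it's worth, the yt-dlp + Demucs part of that workflow is easy to script without the UVR frontend. A rough sketch (the URL is a placeholder, and it assumes both CLIs are installed and on PATH):

```python
import subprocess

url = "https://example.com/some-track"  # placeholder

# Grab audio only and convert it to WAV with yt-dlp.
subprocess.run(
    ["yt-dlp", "-x", "--audio-format", "wav", "-o", "track.%(ext)s", url],
    check=True,
)

# Split into vocals + everything else with Demucs; by default the stems
# land under ./separated/<model>/track/ as vocals.wav and no_vocals.wav.
subprocess.run(["demucs", "--two-stems", "vocals", "track.wav"], check=True)
```

From there it's just dragging the resulting WAVs into the DAW.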
https://www.reddit.com/r/LocalLLaMA/comments/1pp9w31/ama_wit...
> If you are interested in how well we do compared to demucs in particular, we can use the MUSDB18 dataset since that is the domain that demucs is trained to work well on. There our net win rate against demucs is ~17%, meaning we do perform better on the MUSDB18 test set. There are actually stronger competitors on both this domain and our "in-the-wild" instrument stem separation domain that we built for SAM Audio Bench, but we either match or beat all of the ones we tested (AudioShake, LalalAI, MoisesAI, etc.)
So a ~17% net win rate against Demucs, and better than the competitors they tested, but they acknowledge there are stronger models out there even today. So I'm not sure "competes against SOTA models" is right; "getting close to competing with SOTA models" might be more accurate.
What did you compare it to? Ableton recently launched an audio separation feature too, and it's probably the highest ROI on simple/useful/accurate I've tried so far; other solutions have been lacking in at least one of those points.
This is much easier than source separation. It would be different if I were asking to isolate a violin from a viola or another violin; you'd have to get much more specific about the timbre of each instrument and potentially understand what each instrument's part was.
But a vibration made by a string produces a distinctive waveform that is easy to pick out in a file.
String instruments produce harmonic series similar to horns, winds, and voice (because everything is a string in some dimension), and the major differences are in the spectral envelope, something that STFT tools are only OK at approximating because of the time/frequency tradeoff (aka the uncertainty principle).
This is a very hard problem in theory, at least to me, and I'm only a bit more than casually versed in it.
You can look up HPSS and python libraries like Essentia and Librosa.
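As a concrete example, harmonic/percussive separation (HPSS) in Librosa is only a few lines; it's a much blunter tool than learned stem separation, but it shows the kind of STFT-domain trick mentioned above (the input path is a placeholder):

```python
import librosa
import soundfile as sf

# Load any audio file at its native sample rate (placeholder path).
y, sr = librosa.load("song.wav", sr=None)

# Median-filtering HPSS on the STFT: split into a harmonic part
# (sustained, pitched content) and a percussive part (transients).
y_harmonic, y_percussive = librosa.effects.hpss(y)

sf.write("harmonic.wav", y_harmonic, sr)
sf.write("percussive.wav", y_percussive, sr)
```

That split works because sustained harmonics look like horizontal ridges in the spectrogram and transients look like vertical ones; separating two similarly sustained instruments is exactly where it falls apart.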
It’s because of this that you can have a relatively inexpensive synthesizer (not sample or PCM based) that does a crude job of mimicking these different instruments by just changing the harmonics.
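A toy illustration of that "just change the harmonics" point, using plain additive synthesis (the harmonic weights are rough caricatures I made up, not measured instrument spectra):

```python
import numpy as np
import soundfile as sf

SR = 44100
t = np.arange(SR) / SR   # one second of samples
f0 = 220.0               # A3

def additive(weights):
    """Sum sine partials at integer multiples of f0 with the given amplitudes."""
    tone = sum(w * np.sin(2 * np.pi * f0 * (k + 1) * t) for k, w in enumerate(weights))
    return tone / np.max(np.abs(tone))

# Crude caricatures: a sawtooth-ish "string" (all harmonics, 1/n decay)
# and a clarinet-ish tone (mostly odd harmonics).
string_like   = additive([1 / n for n in range(1, 11)])
clarinet_like = additive([1 / n if n % 2 else 0.05 / n for n in range(1, 11)])

sf.write("string_like.wav", string_like, SR)
sf.write("clarinet_like.wav", clarinet_like, SR)
```

Same pitch, same level, and they still sound like different instruments purely because of the relative harmonic amplitudes.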
Humans are amazing at it. You can discern the different instruments way better than any stem-separating AI.
Also https://zanshin.sh, if you'd like speaker diarization when watching YouTube videos
Yeah, unfortunately, since the diarization is based on acoustic features, it really does require high recorded-voice fidelity/quality to get the best results. However, I just added another knob to the Diarizer class called mer_cos, which controls the speaker-merging threshold. The default is 0.875, so perhaps try lowering it to 0.8. That should help.
I'll also get around to adding an oracle/min/max speakers feature at some point, for cases where you know the exact number of speakers ahead of time or want to set upper/lower bounds. I've gotten busy with another project, so I haven't done it yet. PRs welcome though! haha
Senko has two clustering types: (1) spectral, for audio < 20 mins in length, and (2) UMAP+HDBSCAN, for >= 20 mins. In the clustering code, spectral actually already supports oracle/min/max speakers, but UMAP+HDBSCAN doesn't. However, someone forked Senko and added min/max speakers to it here (for oracle, I guess min = max): https://github.com/DedZago/senko/commit/c33812ae185a5cd420f2...
So I think all that's required is basically just testing this thoroughly to make sure it doesn't introduce any regressions in clustering quality. And then just wiring the oracle/min/max parameters to the Diarizer class, or diarize() func.
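For anyone following along, roughly what the mer_cos suggestion looks like in code. The import path and the shape of the results are my assumptions, not taken from Senko's docs; only the Diarizer class, the diarize() function, the mer_cos knob, and its 0.875 default are confirmed above:

```python
# Hedged sketch: the import path and return format are assumptions.
from senko import Diarizer  # assumed import path

# Lower the speaker-merging threshold from its 0.875 default, per the
# suggestion above for lower-fidelity recordings.
diarizer = Diarizer(mer_cos=0.8)

# Assumed: diarize() takes an audio file path and returns per-segment
# speaker labels with timestamps.
segments = diarizer.diarize("podcast_episode.wav")
for seg in segments:
    print(seg)
```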
* remove the background noise of tech products, but keep the nature sounds
* isolate the voice of a single person and feed it into an STT model to improve accuracy (see the sketch below)
* isolate the sound of in-game events, and many more
GitHub: https://github.com/facebookresearch/sam-audio
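As a concrete version of the second bullet above, here's a hedged sketch: it assumes a voice-only file has already been produced by the separation step (sam-audio's own API is omitted because I haven't verified it) and simply runs that file through openai-whisper:

```python
import whisper  # pip install openai-whisper

# Placeholder: a voice-only file produced by whatever separation step you
# use (e.g. SAM Audio); that call is omitted here since I haven't checked
# that library's API.
isolated_voice_path = "isolated_voice.wav"

model = whisper.load_model("base")           # small, CPU-friendly model
result = model.transcribe(isolated_voice_path)
print(result["text"])
```

Stripping music and background noise before transcription tends to cut down on garbled or hallucinated words in the transcript.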
I quite like adding effects such as making the isolated speech studio-quality or broadcast-ready.