Launch HN: Mosaic (YC W25) – Agentic Video Editing
mosaic.so
Some initial feedback on the landing page: it looks great, but for me there's too much motion going on on the homepage and the use cases page. May be an unpopular opinion!
very valid point though — I think a BEFORE vs AFTER demo clip somewhere in the hero, or right below it, would be helpful
thanks for the feedback
This is a long-winded way of saying that I think creators need what you're making! People who have hours of awesome footage but have to spend dozens of hours cutting it down need this. The same goes for people who have awesome footage but aren't good at editing or at hiring an editor. I'd love to see someone solve this so that 90th-percentile editing is available to all, and then it can be more about who has the interesting content, rather than who has the interesting content and editing skills.
soon, we also plan to incorporate style transfer, so you could even give it a video from the channel you enjoy watching + your raw footage, and have the agent edit your footage in the same style as the reference video.
In relation to the demo requests below, I think this would be a good example of how an average person might use your platform.
https://edit.mosaic.so/links/c51c0555-3114-45f4-ab8f-c25f172...
I've been waiting for a tool that does stuff along those lines for a long time. Apps like DJI's kinda do it, but they have generic music and the cuts don't fit the tune at all and are rather random. Doing it myself with little effort in DaVinci or Premiere takes ~30 minutes, but the results are 5 times better.
Was hoping this app would do it for me. And even if it did, if it cost more than $X to create a video like that, I'd probably still do it myself.
Would have been nice if there was a killer demo on your landing page of a video made with Mosaic.
a lot of tooling is being built around generative AI in particular, but there's still a big gap for people who want to share their own stories / experiences / footage but aren't well-versed in pro tools.
valid feedback on the landing page — something we'll add in.
Hidden behind a UI? Most of the major tools like blade, trim, etc. are right there on the toolbars.
> We recorded hours of cars driving by, but got stuck on how to scrub through all this raw footage to edit it down to just the Cybertrucks.
Scrubbing is the easiest part. Mouse over the clip, it starts scrubbing!
I'm being a bit tongue-in-cheek, and I totally agree there's a learning curve to NLEs, but those complaints were also a bit striking to me.
Scrubbing is easy enough when you have short footage, but imagine scrubbing through the footage we had of 5 hours of cars driving by, or maybe a bunch of assets. This quickly becomes very tedious.
Good luck out there!
Do you think this is the next Dropbox?
Good luck with it, sincerely.
I'm really tired of editing videos in the cloud. I'm also tired of all these AI image and video tools that make you work through a browser. Your workflow seems so second-class buried among all the other browser tabs.
I understand that this is how to deploy quickly to customers, but it feels so gross working on "heavy" media in a browser.
if our goal is to bring more people into the fold, minimizing the steps for them to start editing is something we want to optimize for
that being said, being on the browser presents its own set of challenges, many of which are rightfully mentioned in this thread
it does present its own set of challenges, but it's something we've thought about
I will be checking this out!
they're very powerful; when you put them together, it almost feels like Cursor for video editing
Multimodal models are good at frame-level recognition, but editing requires understanding relationships between scenes. Have you found any methods that work reliably there?
Node-based workflows are typical in NLE software. See the Fusion (compositing) and Color (grading) panels in DaVinci Resolve, etc. Industry folks will take to this node-based canvas with ease.
Great question @danishSuri1994
we've actually found that multimodal models are surprisingly good at maintaining temporal context as well
that being said, there's also a bunch of additional processing we do using more traditional CV / audio analysis to extract this information (both frame-level and temporal) as part of our video understanding
for example, mean-motion analysis shows how subjects move over a period of time, which helps determine where the important things are happening in the video and ultimately leads to better placement of edits.
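For illustration, a minimal sketch of what a mean-motion pass like that could look like (my own assumptions: OpenCV frame differencing and fixed 2-second windows, not Mosaic's actual pipeline):

    # Sketch of a mean-motion pass: average frame-to-frame pixel change over a
    # sliding window to flag segments where "something is happening".
    # Assumptions (not Mosaic's code): OpenCV for decoding, simple absdiff as
    # the motion signal, 2-second windows, placeholder file name.
    import cv2
    import numpy as np

    def mean_motion_segments(path, window_s=2.0, threshold=8.0):
        cap = cv2.VideoCapture(path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        prev, scores = None, []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(cv2.resize(frame, (320, 180)), cv2.COLOR_BGR2GRAY)
            if prev is not None:
                scores.append(float(np.mean(cv2.absdiff(gray, prev))))
            prev = gray
        cap.release()

        win = max(1, int(window_s * fps))
        segments = []
        for i in range(0, len(scores), win):
            chunk = scores[i:i + win]
            if chunk and np.mean(chunk) > threshold:
                segments.append((i / fps, min((i + win) / fps, len(scores) / fps)))
        return segments  # list of (start_s, end_s) candidate "interesting" moments

    print(mean_motion_segments("raw_footage.mp4"))  # hypothetical input file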
I'm building something very similar and couldn't believe my eyes when I saw the HN post. What I'm building (chatoctopus.com) is more of a chat-first agent for video editing, only at the prototype stage. But what you guys have achieved is insane. Wishing you lots of success.
to healthy competition!
how did you find the chat-first interface to work out for video? what we found is that the response times can be so long that the chat UX breaks down a bit. how are you thinking about this?
If the LLM needs to place captions, it calls one of these expert discrete-algorithm tools to determine the best place to put the captions -- you aren't just asking the LLM to do it on its own.
If I'm correct about that, then I absolutely applaud you -- it feels like THIS is a fantastic model for how agentic tools should be built, and this is absolutely the opposite of AI slop.
Kudos!
we're using a mix of out-of-the-box multimodal AI capability + traditional audio / video analysis techniques as part of our video understanding pipeline, all of which become context for the agent to use during its editing process
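Reading between the lines, the "expert tool" pattern described above might be structured roughly like this (a sketch; the tool name, transcript schema, and dispatch loop are all hypothetical, not Mosaic's API):

    # The LLM never places captions by pixel-pushing; it requests a
    # deterministic tool and gets back structured results to reason over.
    # Tool names and schemas here are invented for illustration.
    from typing import Callable

    def find_caption_slots(transcript: list[dict], min_gap_s: float = 0.6) -> list[dict]:
        """Deterministic helper: propose caption placements at word boundaries
        that sit next to pauses, so captions don't fight the speech rhythm."""
        slots = []
        for prev, cur in zip(transcript, transcript[1:]):
            if cur["start"] - prev["end"] >= min_gap_s:
                slots.append({"at_s": prev["end"], "text_hint": prev["word"]})
        return slots

    TOOLS: dict[str, Callable] = {"find_caption_slots": find_caption_slots}

    def run_tool_call(call: dict) -> list[dict]:
        # In a real agent the `call` dict would come from the model's tool-use
        # output; here we just dispatch it locally.
        return TOOLS[call["name"]](**call["arguments"])

    transcript = [
        {"word": "hello", "start": 0.0, "end": 0.4},
        {"word": "everyone", "start": 1.3, "end": 1.8},  # 0.9s pause before this word
        {"word": "welcome", "start": 1.9, "end": 2.3},
    ]
    print(run_tool_call({"name": "find_caption_slots",
                         "arguments": {"transcript": transcript}}))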
A browser that was discontinued 30 years ago.
Not really relevant anymore, though? As long as it's not called "Project: Prometheus" I think we count it as a win.
our original name was Frame, only to realize that frame.io existed already.
we brainstormed names for a while and had several notes full of possible names
mosaic is one that stood out to us because it not only represents artwork, but the tiles (nodes) in the canvas also come together to form your mosaic — we thought that was a fitting name
I'm going to move the overly sus ones to a collapsed stub now. (https://news.ycombinator.com/item?id=45988584)
These seem like problems that LLMs are especially well-suited for. I might have spent a fraction of the time if there was some system that could "index" my content library, and intelligently pull relevant clips into a cohesive storyline.
I also spent an ungodly amount of time on animations - it felt like "1 hour of work for 1 minute of animation". I would gladly pay for a tool which reduces the time investment required to be a citizen documentarian.
we don't yet support that volume of footage (1TB). However, if you'd like to try this at a smaller scale, you can already do this today with the Rough Cut tile — simply prompt it for the moments you're interested in (it can take visual cues, auditory cues, timestamp cues, script cues) and it will create an initial rough cut or assembly edit for you.
I'd also recommend checking out the new Motion Graphics tile we added for animations. You can also single-point generate motion graphics using the utility on the bottom right of the timeline. Let me know if you have any questions on that.
- Batch transcode your videos to smaller proxy files, preserving the same file names (to allow easy re-linking to full-quality media later)
- Upload proxies to Mosaic
- Do your agentic rough cut with Mosaic
- Export an EDL or NLE project file
- In the NLE, re-link proxy media to full-quality video & render locally.
To Mosaic:
I need to look deeper at your project, but support for EDL export (Avid, Premiere, Final Cut compatible, as well as commercial grading and conform software workflows) and upload/management of proxy media could be helpful additional features.
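For the proxy-transcode step in the workflow above, a minimal batch sketch (assumes ffmpeg is on PATH; the folder names, 540p resolution, and codec settings are placeholders):

    # Batch-transcode originals to small proxies with the SAME file names, so
    # the rough cut can later be relinked to full-quality media in the NLE.
    import subprocess
    from pathlib import Path

    SRC, DST = Path("originals"), Path("proxies")  # hypothetical folder names
    DST.mkdir(exist_ok=True)

    for clip in SRC.glob("*.mp4"):
        out = DST / clip.name  # same name -> easy relink later
        subprocess.run([
            "ffmpeg", "-y", "-i", str(clip),
            "-vf", "scale=-2:540",            # 540p proxy
            "-c:v", "libx264", "-crf", "28",  # small, fast-to-scrub files
            "-c:a", "aac", "-b:a", "96k",
            str(out),
        ], check=True)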
we have a couple of investigative journalists and lawyers using us for a similar use case.
Video is hard, but it's also a fun modality that presents some interesting challenges. And it's where content is converging.
[see https://news.ycombinator.com/item?id=45988611 for explanation]
I play back parts of the cinematic edit I made of the conversation between Dwarkesh Patel and Satya Nadella (e.g. added cinematic captions, motion graphics)
i can post the full edit as well if you're interested
My end goal was to let an agent make semantic changes (e.g., "remove the parts where the guy in the blue dress is seen") by simply grepping the context spec for the relevant timestamps and using ffmpeg to cut them out.
How are you extracting context from videos?
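A minimal sketch of that grep-the-spec-and-cut idea (the spec format, matching rule, and ffmpeg select-filter approach are my assumptions, not the commenter's actual code):

    # The spec maps time ranges to plain-text descriptions; we drop ranges
    # matching a query and re-encode only the parts we keep.
    import subprocess

    SPEC = [  # (start_s, end_s, description) produced earlier by video understanding
        (0.0, 12.0, "host introduces the guests"),
        (12.0, 20.5, "guy in the blue dress walks through the frame"),
        (20.5, 60.0, "interview continues at the table"),
    ]

    def keep_ranges(spec, drop_query):
        return [(s, e) for s, e, desc in spec if drop_query not in desc.lower()]

    def cut(src, dst, ranges):
        expr = "+".join(f"between(t,{s},{e})" for s, e in ranges)
        subprocess.run([
            "ffmpeg", "-y", "-i", src,
            "-vf", f"select='{expr}',setpts=N/FRAME_RATE/TB",    # keep + re-time video
            "-af", f"aselect='{expr}',asetpts=N/SR/TB",          # keep + re-time audio
            dst,
        ], check=True)

    cut("talk.mp4", "talk_cut.mp4", keep_ranges(SPEC, "blue dress"))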
Will be using this a ton in the future
I didn’t expect great video editing to become democratized so quickly. Kudos to the team!!
- a happy customer
same with returning that to the user as manipulated output (text / code generation is much more rapid than rendering a video)
for example, if you have a workflow setup to create 5 clips from a podcast and add b-rolls and captions and reframe to a few different aspect ratios, any time you invoke this workflow (regardless of which podcast episode you're providing as input), you'll get 5 clips back that have b-rolls, captions, and are reframed to a few different aspect ratios
however, which clips are selected, what b-rolls are generated, where they're placed — this is all non-deterministic
you can guide the agent via prompting the tiles individually, but that's still just an input into a non-deterministic machine
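To make the deterministic-structure / non-deterministic-content split concrete, a hypothetical workflow definition might look like this (field and tile names invented for illustration, not Mosaic's actual schema):

    # The STRUCTURE (which tiles run, in what order, how many outputs) is fixed;
    # the agent's CHOICES inside each tile (which moments, which b-roll, where
    # it lands) vary run to run.
    PODCAST_CLIPS_WORKFLOW = {
        "input": "podcast_episode",  # any episode can be dropped in here
        "tiles": [
            {"tile": "rough_cut", "prompt": "pick 5 standalone highlight moments",
             "outputs": 5},                                   # always 5 clips back
            {"tile": "broll", "prompt": "add supporting b-roll"},
            {"tile": "captions", "style": "word-by-word"},
            {"tile": "reframe", "aspect_ratios": ["9:16", "1:1", "16:9"]},
        ],
    }
    # Deterministic: 5 clips, each with b-roll, captions, and 3 aspect ratios.
    # Non-deterministic: which 5 moments, which b-roll, where it sits on the timeline.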
Also, do you have an API available to trigger workflows programmatically?