Why You Care
Ever wished you could conjure the sound for your video, just by describing it or showing a clip? Imagine creating the precise audio for a silent film, or adding sound effects to a text-based story. This is no longer science fiction. A team of researchers has introduced Omni2Sound, an AI model that generates audio from video, text, or both. How will this change your creative workflow?
This development is a big deal for content creators, game developers, and anyone working with digital media. It promises to simplify complex audio production, making high-quality sound design accessible. You can now generate remarkably specific audio and enhance your projects like never before.
What Actually Happened
Researchers unveiled Omni2Sound, a unified diffusion model designed for video-text-to-audio (VT2A) generation, according to the announcement. This model supports flexible input modalities, meaning it can take video, text, or both to create audio. The team behind Omni2Sound identified two core challenges in this field. First, there’s a scarcity of high-quality audio captions with tight alignment between audio, visual, and text data. Second, previous models faced competition between different generation tasks, like video-to-audio (V2A) and text-to-audio (T2A), leading to performance trade-offs.
To tackle the data scarcity, the team introduced SoundAtlas, a large-scale dataset of 470,000 meticulously aligned audio-video-text pairs. According to the blog post, its captions significantly outperform those of existing benchmark datasets and even human experts in quality. In addition, Omni2Sound uses a three-stage multi-task progressive training schedule, which converts cross-task competition into joint optimization and mitigates modality bias in the VT2A task. The result is a single model achieving state-of-the-art performance across all three generation tasks.
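To make the idea of joint optimization more concrete, here is a minimal sketch of how one denoiser could serve all three tasks by dropping the unused modality during training. This is an illustration only, not the paper's method: the announcement describes a three-stage progressive schedule, and every name below (Batch, StubDenoiser, sample_task) is hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class Batch:
    video_feats: list   # per-frame visual embeddings (placeholder)
    text_feats: list    # caption embeddings (placeholder)
    audio_target: list  # ground-truth audio latents (placeholder)

class StubDenoiser:
    def loss(self, target, cond) -> float:
        # A real diffusion model would noise the target, predict that noise
        # under the given conditions, and return a denoising loss.
        return 0.0

def sample_task() -> str:
    """Pick a task per step so V2A, T2A, and VT2A share one set of weights."""
    return random.choice(["v2a", "t2a", "vt2a"])

def build_condition(batch: Batch, task: str) -> dict:
    """Drop the unused modality so the same denoiser serves all three tasks."""
    video = batch.video_feats if task in ("v2a", "vt2a") else None
    text = batch.text_feats if task in ("t2a", "vt2a") else None
    return {"video": video, "text": text}

def training_step(denoiser: StubDenoiser, batch: Batch) -> float:
    task = sample_task()
    cond = build_condition(batch, task)
    # Every task updates the same parameters, so cross-task competition
    # becomes joint optimization rather than a tug-of-war between models.
    return denoiser.loss(batch.audio_target, cond)
```

The key design point is that no task gets its own network; sharing one set of weights is what turns competition into cooperation.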
Why This Matters to You
This system has immense practical implications across industries. For content creators, it means faster and more precise sound design. Imagine you’re editing a short film and need the sound of a specific kind of rain hitting a particular surface. You could describe it, provide a visual reference, or both, and the model would generate that exact sound.
What’s more, Omni2Sound’s ability to handle off-screen audio generation is particularly useful. This means it can create sounds that are implied by the scene but not directly visible. Think of it as generating the sound of a distant siren in a city shot, even if the ambulance isn’t on screen. This adds a layer of realism to your projects. The researchers report that Omni2Sound achieves “unified SOTA performance across all three tasks within a single model.” In other words, you get top-tier results whether you’re using video, text, or both as your input. How might this audio generation capability change your approach to storytelling?
Here’s a breakdown of the supported modalities:
| Input Modality | Description |
|---|---|
| Video-to-Audio (V2A) | Generates audio from video footage alone. |
| Text-to-Audio (T2A) | Creates audio based solely on textual descriptions. |
| Video-Text-to-Audio (VT2A) | Combines video and text inputs for highly specific audio generation. |
This flexibility ensures that you can choose the best input for your specific needs, making the tool highly adaptable.
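To see what that flexibility could look like in practice, here is a hypothetical usage sketch. Omni2Sound does not ship a public API that the announcement documents, so generate_audio, model.generate, and the file names below are all invented for illustration.

```python
from pathlib import Path
from typing import Optional

def generate_audio(model, video: Optional[Path] = None, prompt: Optional[str] = None):
    """Route a request to V2A, T2A, or VT2A based on which inputs are given."""
    if video is None and prompt is None:
        raise ValueError("Provide a video, a text prompt, or both.")
    return model.generate(video=video, text=prompt)

# V2A: sound effects inferred from footage alone
# audio = generate_audio(model, video=Path("rain_scene.mp4"))

# T2A: audio from a description only
# audio = generate_audio(model, prompt="heavy rain drumming on a tin roof")

# VT2A: footage plus a prompt for off-screen detail
# audio = generate_audio(model, video=Path("city.mp4"), prompt="a distant siren")
```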
The Surprising Finding
One of the most surprising findings from this research concerns data quality. Conventional wisdom often suggests that simply having more data is enough. However, the team revealed that the scarcity of high-quality audio captions with tight audio-visual-text alignment was a major bottleneck. They found that this scarcity led to “severe semantic conflict between multimodal conditions.” This means that without precise data, AI models struggled to understand the true relationship between what they saw, what they read, and what they should hear.
Their approach, SoundAtlas, didn’t just increase data volume; it drastically improved data quality. SoundAtlas was created using a novel agentic pipeline that integrates Vision-to-Language Compression and a Junior-Senior Agent Handoff, a process that ensures fidelity and semantic richness. The result was a fivefold cost reduction in data creation while still outperforming existing benchmarks and human experts in quality, according to the paper. This challenges the assumption that quality always comes at a prohibitive cost.
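The announcement doesn’t spell out the handoff mechanics, but the general pattern is easy to picture: a cheap junior agent drafts captions, and a more capable senior agent only reviews the uncertain ones, which is plausibly where the cost saving comes from. The sketch below is an assumption-laden illustration, not the paper’s pipeline; the threshold, the confidence score, and both stub agents are invented, and clip_summary stands in for the output of the Vision-to-Language Compression step.

```python
from typing import Callable, Tuple

def caption_with_handoff(
    clip_summary: str,
    junior: Callable[[str], Tuple[str, float]],
    senior: Callable[[str, str], str],
    confidence_threshold: float = 0.8,
) -> str:
    """A cheap junior agent drafts a caption; a costly senior agent reviews
    only the drafts the junior is unsure about, cutting labeling cost."""
    draft, confidence = junior(clip_summary)
    if confidence >= confidence_threshold:
        return draft  # accept the cheap draft
    return senior(clip_summary, draft)  # escalate hard cases only

# Stub agents for illustration:
def junior_agent(summary: str) -> Tuple[str, float]:
    return f"Ambient sound matching: {summary}", 0.6

def senior_agent(summary: str, draft: str) -> str:
    return draft + " (refined with fine-grained acoustic detail)"

print(caption_with_handoff("rain on a tin roof", junior_agent, senior_agent))
```

Escalating only low-confidence drafts is the standard way an agent handoff trades a small quality risk for a large cost saving.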
What Happens Next
Looking ahead, we could see Omni2Sound’s capabilities integrated into various creative tools within the next 12 to 18 months. Imagine your video editing software offering AI-powered sound design suggestions based on your footage. That could significantly reduce post-production time and costs. For example, a game developer might use Omni2Sound to procedurally generate environmental soundscapes that react dynamically to in-game events and player actions, based on visual cues and textual descriptions of the environment.
Moreover, the research indicates strong generalization across benchmarks, suggesting the model can adapt to diverse and heterogeneous input conditions. In other words, it’s not just a lab curiosity; it’s a tool ready for real-world applications. Expect continued advancements in multimodal AI, with a focus on even more nuanced and context-aware audio generation. The industry implications are vast, from enhancing accessibility features with descriptive audio to creating entirely new forms of interactive media. The team’s work provides a solid foundation for future audio AI developments.
