New AI Speeds Up Speech Enhancement Fivefold

COSE framework offers faster, more efficient audio cleanup without quality loss.

A new AI framework called COSE significantly boosts speech enhancement speed. It achieves up to 5x faster processing and reduces training costs by 40%. This innovation promises clearer audio for various applications.

By Katie Rowan

September 22, 2025

Key Facts

  • COSE is a one-step Flow Matching (FM) framework for speech enhancement.
  • It achieves up to 5x faster sampling compared to previous methods.
  • COSE reduces training costs by 40%.
  • The framework maintains competitive speech enhancement quality.
  • It addresses the high training overhead of the Jacobian-vector product (JVP) computations used by earlier approaches such as MeanFlow.

Why You Care

Ever struggled to understand someone on a video call because of background noise? Or perhaps your podcast recording sounds a bit muffled? What if AI could clean up that audio five times faster than before? This new development in speech enhancement directly affects your daily digital interactions. It promises clearer conversations and better audio quality across many applications.

What Actually Happened

Researchers have unveiled a new AI framework called COSE (Compose Yourself: Average-Velocity Flow Matching for One-Step Speech Enhancement). The framework tackles a long-standing challenge in AI-powered audio cleanup. According to the paper, established methods such as diffusion and flow matching models often need many steps to process audio. This multi-step generation is computationally expensive and vulnerable to discretization errors. The team behind COSE, including Gang Yang and his colleagues, developed a one-step approach. They compute the average velocity efficiently, eliminating costly computations while maintaining theoretical consistency, as detailed in the paper. The result is a significantly streamlined speech enhancement process.
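
To make the multi-step versus one-step contrast concrete, here is a minimal Python sketch. It is not the COSE model: the velocity field dx/dt = -x and every function name are hypothetical stand-ins, chosen only because this toy field has a closed-form average velocity. In a real system, a trained network would supply that average velocity.

```python
import math

def velocity(x, t):
    """Toy instantaneous velocity field dx/dt = -x (an assumption, not COSE's model)."""
    return -x

def average_velocity(x, r, t):
    """Closed-form average velocity of the toy field over [r, t], starting from x at time r.

    The trajectory is x * exp(-(tau - r)), so the average velocity is the
    total displacement divided by the interval length.
    """
    return x * (math.exp(-(t - r)) - 1.0) / (t - r)

def euler_sample(x0, num_steps):
    """Multi-step sampling: integrate from t=0 to t=1 with fixed-size Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt)
    return x

def one_step_sample(x0):
    """One-step sampling: a single jump from t=0 to t=1 along the average velocity."""
    return x0 + 1.0 * average_velocity(x0, 0.0, 1.0)

if __name__ == "__main__":
    x0 = 2.0
    exact = x0 * math.exp(-1.0)
    for n in (5, 50):
        print(f"Euler, {n} steps: error = {abs(euler_sample(x0, n) - exact):.6f}")
    print(f"One step: error = {abs(one_step_sample(x0) - exact):.6f}")
```

In this toy setting, the Euler sampler needs many velocity evaluations to shrink its discretization error, while the single average-velocity jump lands on the exact endpoint with one evaluation. That is the intuition behind one-step generation, shown here under simplified assumptions.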

Why This Matters to You

The COSE framework has practical implications for anyone who works with digital audio. Imagine clearer voice calls, crisper podcast recordings, or better speech recognition in noisy environments. The benefits extend across various industries.

Here are some key benefits of the COSE framework:

  • Up to 5x Faster Sampling: Your audio processing tasks could complete much quicker.
  • 40% Reduced Training Cost: Developing and deploying these AI models becomes more affordable.
  • Preserved Speech Quality: You don’t sacrifice clarity for speed.

The problem COSE targets is well documented. “Diffusion and flow matching (FM) models have achieved remarkable progress in speech enhancement (SE), yet their dependence on multi-step generation is computationally expensive and vulnerable to discretization errors,” the paper states. For example, think about live transcription services. With faster enhancement, these services could clean up speech in real time before transcribing it, even in a bustling coffee shop. That could mean fewer mistakes and a smoother experience for you. How might this speed boost change the way you interact with voice AI every day? Your smart home devices, for instance, could understand your commands more reliably.

The Surprising Finding

Here’s the twist: The researchers achieved this speed increase and cost reduction while keeping speech quality competitive. Historically, making AI models faster often meant accepting a trade-off in performance or accuracy. However, the study finds that COSE delivers its benefits “without compromising speech quality.” This challenges the common assumption that efficiency gains in AI must come at the expense of output quality. The team introduced a “velocity composition identity” to compute average velocity efficiently, according to the paper. This mathematical approach is what allows them to bypass the expensive computations of previous models like MeanFlow. It’s surprising because it shows that, with careful algorithmic design, you can have both speed and high-quality results in complex AI tasks like speech enhancement.
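
The article does not spell out the velocity composition identity, so the sketch below assumes one natural reading: the displacement over an interval [r, t] equals the sum of the displacements over [r, s] and [s, t]. This follows from the additivity of integrals. The toy field dx/dt = -x and all function names are hypothetical, and the exact form used in the COSE paper may differ.

```python
import math

def average_velocity(x, r, t):
    """Closed-form average velocity over [r, t] for the toy field dx/dt = -x.

    Hypothetical stand-in for a learned average-velocity network.
    """
    return x * (math.exp(-(t - r)) - 1.0) / (t - r)

def composed_average_velocity(x, r, s, t):
    """Build the average velocity over [r, t] from the two sub-intervals.

    Assumed identity:
        (t - r) * u(x_r, r, t) = (s - r) * u(x_r, r, s) + (t - s) * u(x_s, s, t)
    where x_s is the state reached at the intermediate time s.
    """
    u_rs = average_velocity(x, r, s)
    x_s = x + (s - r) * u_rs              # state at the intermediate time s
    u_st = average_velocity(x_s, s, t)
    return ((s - r) * u_rs + (t - s) * u_st) / (t - r)

if __name__ == "__main__":
    direct = average_velocity(2.0, 0.0, 1.0)
    composed = composed_average_velocity(2.0, 0.0, 0.4, 1.0)
    print(f"direct   = {direct:.12f}")
    print(f"composed = {composed:.12f}")
```

Note that the composed value uses only evaluations of the average velocity, with no derivatives. Under the assumptions above, a relation of this kind can serve as a training target without Jacobian-vector products, which matches the saving the article describes.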

What Happens Next

The COSE framework is slated for presentation at ICASSP 2026, which should bring further academic and industry scrutiny. Initial integrations of the technique could appear in specialized audio processing software within the next 12-18 months. For example, professional audio editing suites might incorporate COSE to offer faster noise reduction tools. For you, this means future software updates could bring significant improvements to your audio experience without you even noticing the AI working behind the scenes. The authors report that code is available, which encourages broader experimentation, and developers should consider exploring it to bring these efficiencies into their own projects. The industry implications are broad, potentially leading to wider adoption of AI-powered speech enhancement in consumer electronics and communication platforms.
