Sidon: Open-Source AI Cleans Up Noisy Speech for Better TTS

A new model rapidly transforms 'in-the-wild' audio into studio-quality sound for large-scale AI training.

Researchers have introduced Sidon, an open-source AI model designed to clean up noisy speech recordings. This tool can convert low-quality audio into studio-grade sound across many languages, significantly speeding up data preparation for text-to-speech (TTS) systems. It promises to make high-quality AI speech more accessible.

By Katie Rowan

September 23, 2025

4 min read

Key Facts

  • Sidon is an open-source multilingual speech restoration model.
  • It converts noisy speech into studio-quality audio.
  • Sidon achieves performance comparable to Google's Miipher model.
  • It runs up to 3,390 times faster than real time on a single GPU.
  • Training TTS models with Sidon-cleansed data improves synthetic speech quality.

Why You Care

Ever tried to train an AI model with messy, real-world audio? It’s like trying to build a house with broken bricks. What if there were a tool that could quickly turn those broken bricks into pristine ones, no matter their original state? This is precisely what a new open-source AI model, Sidon, promises for speech data. It transforms noisy, everyday speech into clear, studio-quality audio at high speed. This means better, more natural-sounding AI voices are on the horizon for everyone, including you.

What Actually Happened

Researchers recently unveiled Sidon, a fast, open-source model for multilingual speech restoration, according to the announcement. The model addresses a significant challenge in developing large-scale text-to-speech (TTS) systems: the scarcity of clean, diverse multilingual recordings. Sidon’s core function is to convert ‘noisy in-the-wild speech’ (audio captured in uncontrolled environments) into ‘studio-quality speech.’ The team revealed that Sidon scales to dozens of languages. It works in two stages. First, a feature predictor fine-tuned from w2v-BERT 2.0 cleanses the features extracted from the noisy speech. Second, a vocoder synthesizes the restored speech from these cleansed features.
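
The two-stage design described above can be pictured as a simple composition of functions. What follows is a minimal Python sketch of that data flow only, not Sidon's actual code or API. The names `RestorationPipeline`, `toy_feature_predictor`, and `toy_vocoder` are hypothetical, and the toy stages (clamping samples, then concatenating frames) merely stand in for the fine-tuned w2v-BERT 2.0 feature predictor and the vocoder.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]        # mono audio samples
Features = List[List[float]]  # one feature vector per frame


@dataclass
class RestorationPipeline:
    """Hypothetical wrapper: feature cleansing followed by vocoding."""
    predict_clean_features: Callable[[Waveform], Features]
    vocode: Callable[[Features], Waveform]

    def restore(self, noisy: Waveform) -> Waveform:
        # Stage 1: map noisy audio to cleansed features.
        clean_features = self.predict_clean_features(noisy)
        # Stage 2: synthesize a waveform from the cleansed features.
        return self.vocode(clean_features)


def toy_feature_predictor(wave: Waveform, frame_size: int = 4) -> Features:
    """Toy stand-in (assumption): clamp samples to [-0.5, 0.5], then frame them."""
    clamped = [max(-0.5, min(0.5, s)) for s in wave]
    return [clamped[i:i + frame_size] for i in range(0, len(clamped), frame_size)]


def toy_vocoder(features: Features) -> Waveform:
    """Toy stand-in (assumption): concatenate frames back into one waveform."""
    return [sample for frame in features for sample in frame]


pipeline = RestorationPipeline(toy_feature_predictor, toy_vocoder)
restored = pipeline.restore([0.1, 0.9, -0.8, 0.2, 0.3])
print(restored)
```

The point of the split is that the two stages only meet at the feature representation, so in a sketch like this either stage could be swapped out without touching the other.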

Why This Matters to You

Imagine you’re a content creator or podcaster, or you’re developing a new voice assistant. High-quality audio is crucial for your projects, but obtaining clean, diverse speech data can be expensive and time-consuming. Sidon offers a more efficient route: it lets you take existing noisy recordings and raise their quality dramatically, which can reduce production costs and shorten creation cycles. For example, if you have a vast archive of interviews recorded with varying microphone quality, Sidon could process them to a consistent, high standard, so your AI models learn from better input. How much time and money could you save if your existing audio assets could be made studio-ready in bulk?

As the paper states, “Sidon achieves restoration performance comparable to Miipher: Google’s internal speech restoration model with the aim of dataset cleansing for speech synthesis.” In other words, it opens up a capability that was previously available only inside a large tech company. What’s more, the research shows that training a TTS model on Sidon-cleansed data improves the quality of synthetic speech in a zero-shot setting, that is, when the model synthesizes voices it was not specifically trained on.

Here are some benefits of Sidon:

  • Cost Reduction: Less need for expensive studio recording sessions.
  • Time Savings: Automates the laborious process of manual audio cleaning.
  • Improved AI Quality: Leads to more natural and accurate AI-generated voices.
  • Multilingual Support: Works across many languages, expanding global reach.

The Surprising Finding

Here’s the twist: despite its high performance, Sidon is also highly efficient. The technical report explains that Sidon is “computationally efficient, running up to 3,390 times faster than real time on a single GPU.” At that peak rate, an hour of audio would take roughly one second to process. Restoration models are usually expected to demand significant compute and time on large datasets; Sidon pairs restoration quality comparable to Miipher with very high throughput. This efficiency means that even smaller teams or individual developers can process massive amounts of audio quickly, and it challenges the assumption that high-quality AI tools must be slow or resource-intensive. Think of it as having a super-fast, professional audio engineer working tirelessly on your data for free.
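
To put the 3,390x figure in dataset terms, the arithmetic is simply audio duration divided by the speedup. The short Python sketch below works through it. The function name `processing_seconds` and the dataset sizes are illustrative assumptions, and because the report says "up to", the results are best-case estimates that ignore data loading and other overhead.

```python
def processing_seconds(audio_hours: float, speedup: float = 3390.0) -> float:
    """Seconds needed to process `audio_hours` of audio at `speedup` x real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    # Convert hours of audio to seconds, then divide by the real-time speedup.
    return audio_hours * 3600.0 / speedup


# Illustrative dataset sizes (assumptions, not figures from the report).
for hours in (1, 1_000, 10_000):
    secs = processing_seconds(hours)
    print(f"{hours:>6} h of audio -> {secs:,.0f} s ({secs / 3600:.2f} h)")
```

Under these assumptions, a 10,000-hour corpus would take roughly three hours on one GPU at the peak rate; real-world throughput would likely be lower.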

What Happens Next

Sidon’s release as an open-source tool has significant implications for the AI community. The authors have made the code and model available to facilitate reproducible dataset cleansing for the research community, as mentioned in the release. This means developers and researchers can start integrating Sidon into their workflows immediately. We can expect to see its adoption accelerate over the next 6-12 months. For example, a podcaster could use Sidon to automatically enhance audio quality for their entire back catalog. Your next AI-powered voice assistant might sound more natural because its training data was cleaned by Sidon. The industry implications are clear: higher quality, more accessible speech data will drive advancements in voice AI. My advice to you? Explore the open-source code and consider how Sidon can improve your current or future projects involving speech synthesis. It’s a tool now at your fingertips.
