Why You Care
Ever found yourself effortlessly switching between languages in a single conversation? It’s a common, natural part of communication for many, but for AI speech recognition, it’s been a tough nut to crack. What if AI could understand your multilingual conversations as easily as you do?
New research from Yue Heng Yeo and a team of collaborators is making significant strides in this area. They are using text-to-speech (TTS) systems to generate training data for AI models. This work directly affects how well voice assistants and transcription services handle diverse speech patterns, making them far more useful in your daily life.
What Actually Happened
Automatic speech recognition (ASR) systems often struggle with conversational code-switching speech, according to the announcement. This difficulty stems from a scarcity of realistic, high-quality labeled speech data. Code-switching refers to the practice of alternating between two or more languages or dialects in a single conversation.
The research team explored multilingual text-to-speech (TTS) models as an effective data augmentation technique. Data augmentation means creating more training data from existing data. Specifically, as detailed in the paper, they fine-tuned the multilingual CosyVoice2 TTS model and used the SEAME dataset to generate synthetic conversational Chinese-English code-switching speech. This significantly increased both the quantity and speaker diversity of the training data available for ASR models.
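To make the idea concrete, here is a minimal sketch of how TTS-based augmentation like this typically works: each code-switched transcript is voiced by several different synthetic speakers, growing both the amount of audio and the speaker variety. The function names, `TTSModel.synthesize` call, and data format below are illustrative placeholders, not the paper's actual code or the CosyVoice2 API.

```python
# Illustrative sketch of TTS data augmentation for code-switching ASR.
# The `tts_model.synthesize` call and transcript format are hypothetical.
import random

def generate_synthetic_utterances(transcripts, speaker_prompts, tts_model,
                                  voices_per_text=2, seed=0):
    """Voice each code-switched transcript with several different speaker
    prompts, increasing both data quantity and speaker diversity."""
    rng = random.Random(seed)
    synthetic = []
    for text in transcripts:
        prompts = rng.sample(speaker_prompts,
                             k=min(voices_per_text, len(speaker_prompts)))
        for prompt in prompts:
            audio = tts_model.synthesize(text=text, speaker_prompt=prompt)  # hypothetical API
            synthetic.append({"audio": audio, "text": text})
    return synthetic

# The ASR model is then trained on real and synthetic data together, e.g.:
#   train_set = real_seame_utterances + generate_synthetic_utterances(...)
```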
Why This Matters to You
Think about how often you or someone you know might mix languages. Perhaps you speak ‘Spanglish’ or ‘Singlish.’ Current AI often falters when encountering these natural speech patterns. This new approach directly tackles that problem, making AI more inclusive and accurate for multilingual speakers. Imagine a voice assistant that truly understands your blended language commands.
This improved accuracy means fewer frustrating misunderstandings with your devices. It also opens doors for better accessibility in various applications. For example, transcription services for podcasts or meetings involving code-switching will become far more reliable. How much easier would your digital life be if AI truly understood your unique way of speaking?
Performance Gains from TTS Data Augmentation:

| Dataset | Original Mixed Error Rate (MER) | MER with Synthetic Data |
|---------|--------------------------------|-------------------------|
| DevMan  | 12.1%                          | 10.1%                   |
| DevSGE  | 17.8%                          | 16.0%                   |
These results confirm that multilingual TTS is an effective, practical tool for enhancing ASR robustness in low-resource conversational code-switching scenarios, the team concludes. One of the researchers noted, “augmenting real speech with synthetic speech reduces the mixed error rate (MER) from 12.1 percent to 10.1 percent on DevMan and from 17.8 percent to 16.0 percent on DevSGE, indicating consistent performance gains.”
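A quick note on the metric: mixed error rate (MER) is essentially word error rate adapted to code-switched text, conventionally scoring Mandarin per character and English per word. The sketch below assumes that common convention; the tokenizer is deliberately simplified for illustration and is not the paper's scoring code.

```python
# Minimal MER sketch: Mandarin scored per character, English per word.
import re

def mixed_tokens(text):
    """Split into Mandarin characters and English words (simplified rule)."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def edit_distance(ref, hyp):
    """Levenshtein distance over token lists, single-row DP."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # min of deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def mer(ref_text, hyp_text):
    ref, hyp = mixed_tokens(ref_text), mixed_tokens(hyp_text)
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(mer("我 want to 去 the store", "我 want 去 store"))  # 2 errors / 6 tokens = 0.333...
```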
The Surprising Finding
Here’s the twist: the core challenge for code-switching ASR wasn’t necessarily the complexity of the languages themselves. Instead, it was the sheer lack of diverse, high-quality training data. Many might assume that building more sophisticated algorithms is the only path forward. However, the study finds that simply generating more synthetic data with existing TTS models can yield substantial improvements.
This is surprising because it suggests a more practical approach than previously thought. Instead of waiting for vast amounts of real-world labeled code-switching speech, which is expensive and time-consuming to collect, researchers can now create it. This challenges the common assumption that only ‘real’ data can produce robust AI models. The team demonstrated that carefully generated synthetic data can be just as effective in closing performance gaps.
What Happens Next
This research, accepted at APSIPA 2025, points to a promising future for voice technology. We can expect to see these techniques integrated into commercial ASR systems within the next 12 to 18 months. This means your smart devices could become much better at understanding multilingual commands by late 2026 or early 2027.
For example, imagine a customer service chatbot that seamlessly switches between English and Spanish based on your natural speaking pattern. This approach will empower developers to build more inclusive AI applications. For you, this means more reliable voice interfaces across various platforms. The industry implications are significant, potentially leading to wider adoption of voice AI in diverse global markets as companies serve multilingual users more effectively. The paper states that this method is a “practical tool for enhancing ASR robustness in low-resource conversational code-switching scenarios,” suggesting a clear path for future development and deployment.
