Why You Care
Ever listened to an AI-generated voice suddenly repeat itself or skip words? It’s jarring, isn’t it? This common failure, known in the field as a ‘stability hallucination,’ makes synthesized speech sound unnatural. But what if AI voices could be as smooth and error-free as human speech? New research tackles this head-on, promising a significant leap in AI voice quality. It directly affects your experience with virtual assistants, audiobooks, and even personalized content. Are you ready for AI voices that sound truly natural?
What Actually Happened
Researchers have published a new paper addressing a fundamental problem in large language model (LLM) based Text-to-Speech (TTS) models. This problem, termed ‘stability hallucinations,’ involves AI voices repeating words or omitting them entirely. The team, including ShiMing Wang and eight other authors, aims to resolve these issues by improving the attention mechanism within these models. They first analyzed how text tokens (individual words or parts of words) align with speech tokens (the smallest units of sound). This analysis led to a new metric called the Optimal Alignment Score (OAS), which uses the Viterbi algorithm to measure text-speech alignment quality. The team integrated OAS into the training of CosyVoice2, a specific TTS model, to help it learn more continuous and stable alignment. They also used pre-trained attention values to guide the student CosyVoice2 model through a ‘chain-of-thought’ (CoT) process, which further reduces stability hallucinations in the synthesized speech, as detailed in the paper.
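The paper doesn’t include reference code in this summary, but the core idea behind OAS is straightforward to sketch: treat the attention matrix between speech and text tokens as alignment probabilities, then use a Viterbi-style dynamic program to find the best strictly monotonic path through it. The snippet below is a minimal illustration assuming a (speech × text) attention matrix and a stay-or-advance alignment model; the paper’s exact formulation may differ.

```python
import numpy as np

def optimal_alignment_score(attn: np.ndarray) -> float:
    """Score the best monotonic text-speech alignment with a Viterbi-style DP.

    attn: (T_speech, T_text) attention weights; each row is (roughly) a
    distribution over text tokens. Returns the length-normalized
    log-probability of the best path in which every speech frame either
    re-attends to the current text token or advances by exactly one.
    """
    log_attn = np.log(attn + 1e-9)       # avoid log(0)
    T_speech, T_text = log_attn.shape
    dp = np.full(T_text, -np.inf)
    dp[0] = log_attn[0, 0]               # paths start at the first text token
    for i in range(1, T_speech):
        stay = dp                                       # keep the same text token
        advance = np.concatenate(([-np.inf], dp[:-1]))  # move one token forward
        dp = np.maximum(stay, advance) + log_attn[i]
    return dp[-1] / T_speech             # paths end at the last text token
```

A clean utterance yields a high score, while attention that loops (repetition) or jumps ahead (omission) drags the score down: exactly the failure modes the paper targets.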
Why This Matters to You
This research has practical implications for anyone interacting with AI-generated audio. Imagine listening to an audiobook where the narrator never stumbles or repeats a phrase. That’s the future this work is building toward. The study finds that these methods effectively reduce stability hallucinations without introducing other negative effects. This means your AI assistants could soon speak with greater clarity and fewer errors. For example, consider a navigation system that flawlessly pronounces street names without getting stuck in a loop. How much more pleasant would your daily interactions with AI become?
Key Improvements for TTS Models:
- Reduced Repetitions: AI voices will be less likely to repeat words or phrases unintentionally.
- Fewer Omissions: Important words or parts of sentences are less likely to be skipped.
- Enhanced Naturalness: The overall flow and rhythm of synthesized speech will improve.
- Improved Alignment: Better synchronization between written text and spoken output.
As the paper notes, the team’s approach delivers these improvements without introducing new problems, which is crucial for widespread adoption. “This paper focuses on resolving stability hallucinations (e.g., repetitive or omitted speech) in LLM-based Text-to-Speech (TTS) models by improving and leveraging the attention mechanism,” the authors state. This directly translates to a more reliable and enjoyable listening experience for you.
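The summary doesn’t spell out how the pre-trained attention values guide the student model, but one common way to express that kind of guidance is a divergence term that pulls the student’s attention distributions toward the teacher’s. The sketch below assumes a per-frame KL formulation in PyTorch; the function name and exact form are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def attention_guidance_loss(student_attn: torch.Tensor,
                            teacher_attn: torch.Tensor) -> torch.Tensor:
    """Pull the student's text-speech attention toward a pre-trained teacher's.

    Both tensors have shape (T_speech, T_text), with each row an attention
    distribution over text tokens. Computes KL(teacher || student) per
    speech frame, averaged over frames.
    """
    return F.kl_div(
        torch.log(student_attn + 1e-9),  # student log-probabilities
        teacher_attn,                    # teacher target distribution
        reduction="batchmean",           # average over speech frames
    )
```

Added to the usual training objective with a small weight, a term like this nudges the student toward the teacher’s stable alignment behavior without dictating its output tokens.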
The Surprising Finding
One of the most interesting aspects of this research is how effectively it tackles a persistent problem. It’s often assumed that improving one aspect of AI performance might degrade another. However, the team showed that their methods reduce stability hallucinations without introducing additional negative effects, challenging the common assumption that fixing one bug creates another. Experiments on the Seed-TTS-Eval and CV3-Eval test sets demonstrate this success. In practice, that means clearer AI voices without new robotic tones or strange pauses. The careful integration of the Optimal Alignment Score (OAS) and attention guidance appears to be the key, and it’s a testament to targeted AI refinement.
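One practical question the summary leaves open is how a path-based score like OAS can be used during training at all, since the hard max in the Viterbi recursion isn’t differentiable. A standard trick is to replace max with logsumexp, turning Viterbi into the forward algorithm so gradients can flow. The sketch below shows that relaxation under the same stay-or-advance assumptions as before, with the loss weighting invented purely for illustration.

```python
import torch

def soft_alignment_score(attn: torch.Tensor) -> torch.Tensor:
    """Differentiable relaxation of the Viterbi alignment score.

    Replaces max with logsumexp (the forward algorithm) so the score can
    serve as an auxiliary training signal. attn: (T_speech, T_text).
    """
    log_attn = torch.log(attn + 1e-9)
    T_speech, T_text = log_attn.shape
    neg_inf = torch.full((1,), float("-inf"), device=attn.device)
    # paths start at the first text token
    dp = torch.cat([log_attn[0, :1], neg_inf.expand(T_text - 1)])
    for i in range(1, T_speech):
        stay = dp                                # re-attend to the same token
        advance = torch.cat([neg_inf, dp[:-1]])  # step one text token forward
        dp = torch.logaddexp(stay, advance) + log_attn[i]
    return dp[-1] / T_speech                     # end at the last text token

# Hypothetical combined objective (the 0.1 weight is invented):
# loss = cross_entropy_loss - 0.1 * soft_alignment_score(attn)
```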
What Happens Next
This research, submitted to ICASSP 2026, suggests that these advancements could become more widely available in the coming years. We might see these improvements integrated into commercial TTS systems by late 2025 or early 2026. For example, your favorite podcast platform could start using these enhanced AI voices for automatically generated summaries, making the content more accessible and enjoyable. The industry implications are significant, pushing the boundaries of what AI voices can achieve. Our advice: keep an eye on updates from major AI voice providers, which will likely adopt similar techniques to enhance their offerings. That shift would set a new standard of quality for AI-generated audio across applications.
