New AI Method Boosts Text-to-Speech Accuracy for LLMs

TKTO improves pronunciation and naturalness in AI-generated speech, especially for complex languages.

A new research paper introduces TKTO, a data-efficient method for optimizing AI text-to-speech (TTS) models. This technique focuses on token-level feedback, significantly enhancing pronunciation accuracy and naturalness without needing extensive paired data. It shows strong results, particularly for Japanese TTS.

By Mark Ellison

October 9, 2025

4 min read

Key Facts

  • TKTO is a new data-efficient method for optimizing LLM-based text-to-speech (TTS).
  • It eliminates the need for paired desirable and undesirable utterance-level samples.
  • TKTO directly targets token-level units for fine-grained pronunciation alignment.
  • The method improved Japanese TTS accuracy by 39% and reduced Character Error Rate (CER) by 54%.
  • Targeted tokens received 12.8 times stronger reward signals automatically.

Why You Care

Ever listened to AI-generated speech and noticed awkward pauses or mispronounced words? It can be frustrating, right? Imagine if that speech sounded perfectly natural, every single time. A new method called TKTO promises to make AI voices sound far better. This development could soon improve how you interact with virtual assistants and audio content.

What Actually Happened

Researchers Rikuto Kotoge and Yuichi Sasaki have introduced a novel approach for enhancing text-to-speech (TTS) systems. As detailed in the paper, their method, dubbed TKTO, addresses key limitations in current AI voice generation. Traditional methods often require extensive paired data, meaning both desirable and undesirable speech examples for training. TKTO eliminates this need, making the training process much more data-efficient. What's more, it directly targets token-level units (individual words or parts of words) to refine pronunciation. This fine-grained optimization is crucial for achieving highly accurate and natural-sounding speech, especially in complex languages.
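To make the idea of token-level optimization concrete, here is a minimal, illustrative sketch. It is not the paper's actual objective (which is not reproduced here); it simply shows the general shape of the technique: each token's log-probability is weighted by its own reward signal, so mispronounced tokens drive the update instead of a single pass/fail label on the whole utterance. The function name, the sigmoid weighting, and the `beta` parameter are all assumptions for illustration.

```python
import math

def token_level_loss(token_logprobs, token_rewards, beta=0.1):
    """Toy token-level preference loss (illustrative, not TKTO's exact math).

    Each token carries its own reward, so a single mispronounced syllable
    can be corrected without penalizing the rest of the utterance.
    """
    loss = 0.0
    for logp, reward in zip(token_logprobs, token_rewards):
        # Sigmoid weighting per token: tokens with strongly negative
        # rewards (likely mispronunciations) get the largest weight.
        weight = 1.0 / (1.0 + math.exp(beta * reward))
        loss += -weight * logp
    return loss / len(token_logprobs)
```

With this shape, a token flagged as mispronounced (negative reward) contributes a larger loss than a well-pronounced one, which is the intuition behind fine-grained pronunciation alignment.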

Why This Matters to You

This development has direct implications for anyone who uses or creates AI-generated audio. Think about the audiobooks you listen to or the voice assistants you speak with daily. TKTO aims to make those experiences smoother and more natural. The research shows that this method significantly improves pronunciation alignment, a common challenge in TTS. How much better could your favorite AI assistant sound with this system?

For example, consider a podcast generated entirely by AI. With TKTO, the AI would likely pronounce names, technical terms, and foreign words with much greater accuracy. This means less jarring audio and a more pleasant listening experience for you. The study highlights its effectiveness in challenging languages, specifically Japanese.

Key Improvements with TKTO:

  • 39% Improvement: Japanese TTS accuracy improved by 39%.
  • 54% Reduction: Character Error Rate (CER) reduced by 54%.
  • 12.8x Stronger Reward: Targeted tokens received 12.8 times stronger reward signals.
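Character Error Rate (CER), the metric behind the 54% reduction above, is a standard measure in speech evaluation: the edit distance between a transcription of the generated speech and the reference text, divided by the reference length. A minimal sketch of how it is typically computed (using the common Levenshtein-distance definition; the paper's exact evaluation pipeline is not reproduced here):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / m
```

For Japanese, where a single wrong character can change a word's meaning entirely, halving CER is a substantial quality gain.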

According to the paper, “Aligning text-to-speech (TTS) system outputs with human feedback through preference optimization has been shown to effectively improve the robustness and naturalness of language model-based TTS models.” TKTO takes this a step further by focusing on the granular details of speech.

The Surprising Finding

Here’s the twist: current preference optimization methods for TTS typically rely on utterance-level feedback. This means judging an entire sentence or phrase as good or bad. However, the paper states that TKTO achieves superior results by focusing on token-level optimization. This is surprising because it suggests that micro-level adjustments are far more impactful than broad, sentence-level corrections. The team revealed that TKTO automatically assigns 12.8 times stronger reward to targeted tokens. This challenges the common assumption that large-scale, utterance-level feedback is always the most effective. Instead, it seems precision at the word or syllable level is key for truly natural speech. It means the AI learns exactly which parts of a word need adjustment, rather than trying to fix an entire sentence at once.

What Happens Next

While specific timelines were not provided, techniques like TKTO could plausibly appear in commercial TTS systems within the next 12 to 18 months, bringing a noticeable jump in the quality of AI voices across platforms. Imagine a language learning app where the AI tutor’s pronunciation is virtually indistinguishable from a native speaker; this method could make that a reality. For content creators, it means higher-quality AI voiceovers for videos and podcasts, reducing the need for costly human narration. Methods like this could set a new standard for natural-sounding synthetic speech. Your next smart speaker update might just sound a whole lot better. To stay ahead, consider experimenting with AI voice tools that prioritize fine-grained control and feedback mechanisms as they emerge.
