AI Synthesizes Creaky Voice, Opening New Speech Horizons

New research explores AI's ability to precisely control perceptual voice qualities like creak.

Researchers have developed a text-to-speech (TTS) system that can synthesize speech with specific perceptual voice qualities, focusing on 'creaky voice.' This system uses normalizing flows to manipulate localized voice characteristics without needing unreliable frame-wise predictors. Subjective tests confirmed successful manipulation, albeit with a slight reduction in overall quality.

By Katie Rowan

November 10, 2025

3 min read

Key Facts

  • A text-to-speech (TTS) system can now synthesize speech with selected perceptual voice qualities.
  • The system uses a global speaker attribute manipulation block based on normalizing flows.
  • It successfully manipulates 'creaky voice,' avoiding unreliable frame-wise predictors.
  • Subjective listening tests confirmed successful creak manipulation.
  • The manipulated speech had a slightly reduced Mean Opinion Score (MOS) compared to original recordings.

Why You Care

Ever noticed how a voice can change its texture, like that gravelly sound sometimes called ‘creaky voice’? What if artificial intelligence could precisely replicate and control such subtle vocal nuances? A new study reveals a text-to-speech (TTS) system capable of synthesizing speech with selected perceptual voice qualities. This development is significant for anyone interested in realistic AI voices or specific voice effects. Imagine the possibilities for your next podcast or virtual assistant.

What Actually Happened

Researchers have successfully augmented a text-to-speech system to manipulate specific perceptual voice qualities, according to the announcement. Their focus was on ‘creaky voice,’ a non-persistent, localized vocal characteristic. The team integrated a global speaker attribute manipulation block into the TTS system. This block utilizes ‘normalizing flows’—a type of generative model—to achieve precise control. This method avoids the need for a ‘frame-wise creak predictor,’ which the study finds is typically unreliable. The system can now manipulate creaky voice without requiring complex, moment-by-moment predictions, as detailed in the blog post.
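The paper's exact architecture isn't reproduced here, but the core idea of a flow-based attribute manipulation block can be illustrated with a toy sketch: an invertible transform maps a speaker embedding into a latent space, the latent is shifted along an axis assumed to encode creak, and the exact inverse maps it back. Every name, shape, and the random "trained" transform below are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of flow-based attribute manipulation.
# A trained normalizing flow is invertible; here a random (but
# well-conditioned) affine map stands in for it.
rng = np.random.default_rng(0)
dim = 8

W = rng.standard_normal((dim, dim)) + dim * np.eye(dim)  # invertible by construction
W_inv = np.linalg.inv(W)
b = rng.standard_normal(dim)

def to_latent(embedding):
    """Forward pass: speaker embedding -> (assumed) disentangled latent."""
    return W @ embedding + b

def to_embedding(latent):
    """Exact inverse pass: latent -> speaker embedding."""
    return W_inv @ (latent - b)

def add_creak(embedding, amount, creak_axis=0):
    """Shift the latent along the axis assumed to encode creak, then invert."""
    z = to_latent(embedding)
    z = z.copy()
    z[creak_axis] += amount
    return to_embedding(z)

speaker = rng.standard_normal(dim)
creaky_speaker = add_creak(speaker, amount=2.0)

# Invertibility is what makes this a "flow": round-tripping is lossless.
assert np.allclose(to_embedding(to_latent(speaker)), speaker)
```

The appeal of this design, as the article notes, is that the attribute is edited globally in latent space, so no frame-by-frame creak predictor is needed.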

Why This Matters to You

This advancement has practical implications for how you interact with AI voices. Think about voice assistants or audiobook narrators. The ability to control specific voice qualities means more expressive and natural-sounding AI. For example, imagine an AI narrator that can subtly alter its voice to convey emotion or emphasize certain words, much like a human actor. This precision in voice synthesis could greatly enhance user experience.

Key Benefits of Perceptual Voice Quality Control:

  1. Enhanced Realism: AI voices sound more human and less robotic.
  2. Improved Expressiveness: AI can convey a wider range of emotions and intentions.
  3. Educational Tools: Illustrate phonetic concepts that are otherwise difficult to grasp.
  4. Customization: Tailor AI voices to specific brand identities or character roles.

“The control of perceptual voice qualities in a text-to-speech (TTS) system is of interest for applications where unmanipulated and manipulated speech probes can serve to illustrate phonetic concepts that are otherwise difficult to grasp,” the paper states. This means you could use AI to demonstrate different speech patterns in language learning. How might more expressive AI voices change your daily interactions with technology?

The Surprising Finding

Here’s the twist: while the system successfully manipulated creaky voice, subjective listening tests showed a slight reduction in MOS. MOS, or Mean Opinion Score, measures the perceived quality of speech. This means that while the AI could create the desired creaky effect, the overall sound quality was a little lower than the original recording. This finding is surprising because you might expect manipulation to preserve or even improve quality. It challenges the assumption that adding complex vocal effects automatically results in a higher-fidelity output. The team reported that despite this slight dip, the manipulation itself was confirmed as successful.
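For readers unfamiliar with the metric, MOS is simply the average of listener ratings on a 1–5 scale. The ratings below are invented for illustration; they are not the study's data.

```python
# Mean Opinion Score (MOS): each listener rates a speech sample from
# 1 (bad) to 5 (excellent); MOS is the average across listeners.
# These ratings are made up purely to show the calculation.

original_ratings = [4, 5, 4, 4, 5, 4]
manipulated_ratings = [4, 4, 3, 4, 4, 3]

def mos(ratings):
    return sum(ratings) / len(ratings)

print(f"Original MOS:    {mos(original_ratings):.2f}")    # 4.33
print(f"Manipulated MOS: {mos(manipulated_ratings):.2f}")  # 3.67
```

A small gap like this is the kind of "slight reduction" the study describes: the manipulated speech is still rated well, just a bit below the unmanipulated recordings.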

What Happens Next

Looking ahead, this research paves the way for more controllable voice synthesis. We can expect further refinements in the next 12-18 months. The goal will be to achieve precise perceptual control without compromising overall audio quality. For instance, future systems might allow you to adjust not just creak, but also breathiness or vocal fry. This could lead to AI voices that are indistinguishable from human speech. Content creators could soon have access to tools that offer fine-grained control over AI voice characteristics. The industry implications are vast, ranging from entertainment to accessibility. The technical report explains that this foundational work will enable new levels of customization for AI-generated audio.
