Why You Care
Have you ever listened to an AI voice and thought, “That sounds a bit robotic”? It’s a common experience. Imagine if those voices could whisper softly, exclaim with joy, or even sound like a grumpy old wizard. That’s precisely what a new dataset for AI speech synthesis aims to enable. It could soon make your favorite digital assistants, audiobooks, and podcasts sound far more lifelike.
What Actually Happened
Researchers Gaspard Michel, Elena V. Epure, and Christophe Cerisara have introduced a significant new resource called LibriQuote. This dataset is a collection of fictional character utterances designed specifically for expressive zero-shot text-to-speech (TTS) systems, according to the announcement. TTS systems convert written text into spoken audio, and while current systems can produce natural-sounding speech, adding emotion and character has remained a challenge. The paper states that LibriQuote addresses this with a vast corpus: 12.7K hours of read, non-expressive speech alongside 5.3K hours of mostly expressive speech drawn from character quotations in audiobooks. Each expressive utterance comes with its original written context and a pseudo-label describing how the quotation was delivered, such as “he whispered softly.” This helps the AI learn the intended emotional nuance.
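To make that structure concrete, here is a minimal sketch of what a single expressive record might look like. The field names (`audio_path`, `pseudo_label`, and so on) and the file paths are assumptions for illustration, not LibriQuote’s actual schema:

```python
# Hypothetical record layout for one LibriQuote expressive utterance.
# Field names are assumptions for illustration, not the dataset's schema.
from dataclasses import dataclass

@dataclass
class QuoteRecord:
    audio_path: str    # path to the character-quotation audio clip
    text: str          # the spoken quotation itself
    context: str       # surrounding written context from the audiobook
    pseudo_label: str  # delivery description, e.g. "he whispered softly"
    speaker_id: str    # identifies the narrator/book the clip came from

example = QuoteRecord(
    audio_path="clips/book_042/quote_0013.flac",  # made-up path
    text="Leave me alone!",
    context='The old man turned away. "Leave me alone!" he whispered softly.',
    pseudo_label="he whispered softly",
    speaker_id="book_042",
)
print(example.pseudo_label)  # -> he whispered softly
```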
Why This Matters to You
This dataset directly affects how you’ll interact with AI voices in the future. Think about the audiobooks you listen to or the voiceovers in your favorite videos: they could soon have a much richer, more engaging quality. The study finds that fine-tuning a baseline TTS system on LibriQuote significantly improves the intelligibility of its synthesized speech. That means not only more expressive voices but also clearer ones. Imagine an AI narrator that can convey the full spectrum of emotions in a story, enhancing immersion and making content far more enjoyable.
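To give a feel for what “fine-tuning on LibriQuote” involves in practice, here is a minimal, generic PyTorch sketch. The `TinyTTS` model and the random tensors are stand-ins for a real pretrained TTS baseline and real LibriQuote batches; this is not the authors’ training code:

```python
# Generic fine-tuning loop sketch (PyTorch). TinyTTS and the random
# tensors are stand-ins for a real pretrained TTS baseline and real
# LibriQuote batches; this is NOT the authors' training code.
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Toy text-to-mel model standing in for a zero-shot TTS baseline."""
    def __init__(self, vocab_size=256, mel_dim=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)
        self.rnn = nn.GRU(128, 128, batch_first=True)
        self.head = nn.Linear(128, mel_dim)

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)  # predicted mel-spectrogram frames

model = TinyTTS()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(3):  # a few toy steps
    tokens = torch.randint(0, 256, (4, 50))  # fake text-token batch
    targets = torch.randn(4, 50, 80)         # fake mel targets from audio
    loss = nn.functional.l1_loss(model(tokens), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: L1 loss = {loss.item():.3f}")
```

In a real setup, the fake batches would come from the expressive quotation clips, with text tokens paired against mel-spectrograms extracted from the audio.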
How will you use these more expressive AI voices in your own projects?
Consider these practical applications:
| Application Area | Current Limitation | LibriQuote’s Impact |
| --- | --- | --- |
| Audiobooks | Often monotone, lacking character nuance | Characters will have distinct, emotional voices |
| Voice Assistants | Robotic, formal tone | More natural, empathetic interactions |
| Podcast Narration | Can sound flat, unengaging | Dynamic, expressive delivery for hosts |
| Video Game Characters | Limited emotional range in dialogue | Richer, more believable character performances |
For example, if you’re a podcaster, you could use an AI voice that genuinely sounds excited when announcing a new segment. As the researchers note, today’s systems still fall short of ground-truth recordings in expressiveness and naturalness; LibriQuote aims to bridge that gap, bringing AI voices much closer to human ones in their emotional range. Your listeners will feel a deeper connection to the content.
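As a purely hypothetical illustration, here is how a podcaster-facing call to an expressive zero-shot TTS model might look once models are fine-tuned on LibriQuote-style data. The `synthesize` function and its parameters are invented for this sketch; no real library is implied:

```python
# Purely hypothetical API sketch: what calling an expressive zero-shot
# TTS model fine-tuned on LibriQuote-style data might look like.
# `synthesize` and its parameters are invented; no real library is implied.
def synthesize(text: str, reference_audio: str, style_hint: str) -> bytes:
    """Stand-in TTS call: clone the voice in `reference_audio` and render
    `text` with the delivery described by `style_hint`."""
    raise NotImplementedError("plug in a real fine-tuned TTS backend here")

try:
    # A podcaster cloning their own voice for an excited segment intro.
    audio = synthesize(
        text="Welcome back! This week's episode is a big one.",
        reference_audio="my_voice_sample.wav",  # made-up file
        style_hint="she announced excitedly",   # pseudo-label-style cue
    )
except NotImplementedError:
    pass  # replace with a real model before use
```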
The Surprising Finding
Here’s an interesting twist: despite rapid advances in text-to-speech, the research shows that even current systems struggle with true expressiveness. The paper states that “recent systems fail to synthesize speech as expressive and natural as the ground-truth utterances.” This is surprising because large-scale speech datasets have made AI voices sound very natural. However, the proportion of expressive speech in these large datasets is often unclear, as detailed in the blog post. This suggests that simply having more data isn’t enough; the data needs to be specifically curated for emotional content. LibriQuote’s focus on character quotations, complete with contextual cues like “he whispered softly,” provides exactly the emotional training signal that general datasets lack. It challenges the assumption that sheer volume of data automatically leads to human-like emotional output.
What Happens Next
LibriQuote is freely available, which means developers and researchers can start using it immediately. That availability should accelerate the development of more expressive text-to-speech systems in the coming months, and noticeable improvements in AI voice quality are plausible by late 2025 or early 2026. Imagine, for example, a new generation of voice assistants that can detect your mood and respond with appropriate emotional intonation, leading to more empathetic AI interactions. The researchers report that the dataset also includes a challenging 7.5-hour test set designed for benchmarking TTS systems, which will help measure progress and push the boundaries of what’s possible. The documentation indicates that the dataset covers a wide range of emotions and various accents, promising diverse and rich expressive capability for future AI voices. This will likely set a new standard for voice synthesis across the industry.
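As one example of how such benchmarking often works in practice, here is a sketch that scores intelligibility by transcribing synthesized test clips with an ASR model and computing word error rate (WER) with the `jiwer` library. The `transcribe` helper and file names are placeholders, and this is not LibriQuote’s official evaluation protocol:

```python
# Sketch of a common intelligibility benchmark: transcribe synthesized
# test clips with an ASR model and compute word error rate (WER).
# `transcribe` and the file names are placeholders, and this is not
# LibriQuote's official evaluation protocol.
from jiwer import wer  # pip install jiwer

def transcribe(wav_path: str) -> str:
    """Placeholder: run any ASR model (e.g. Whisper) on the clip."""
    raise NotImplementedError

# Ground-truth transcripts for the synthesized test clips (made-up data).
references = {
    "synth/quote_0001.wav": "leave me alone he whispered",
    "synth/quote_0002.wav": "welcome back to the show",
}

scores = []
for path, ref_text in references.items():
    try:
        hyp_text = transcribe(path)
    except NotImplementedError:
        break  # no ASR backend wired up in this sketch
    scores.append(wer(ref_text, hyp_text))  # lower WER = more intelligible

if scores:
    print(f"mean WER: {sum(scores) / len(scores):.3f}")
```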
