Why You Care
Ever noticed how some AI voices sound robotic or unnatural? What if artificial intelligence could speak with the same natural rhythm and intonation as a human? A new model, CC-G2PnP, brings this closer to reality for streaming text-to-speech, promising smoother, more engaging AI interactions.
What Actually Happened
Researchers have unveiled CC-G2PnP, a novel streaming grapheme-to-phoneme and prosody (G2PnP) model. This model connects large language models (LLMs) with text-to-speech (TTS) systems in a continuous flow, according to the announcement. It uses a Conformer-CTC architecture. This design processes input grapheme tokens—the individual letters or symbols in written language—in small chunks. This chunk-by-chunk processing allows for real-time inference of phonemic and prosodic (PnP) labels.
What’s more, the model guarantees a minimal look-ahead size for each input token. This feature helps the model consider future context, leading to more stable PnP label prediction, the paper states. Unlike older streaming methods that relied on explicit word boundaries, CC-G2PnP’s CTC decoder learns grapheme-phoneme alignment during training. This makes it highly effective for languages without clear word segmentation.
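The chunk-by-chunk flow described above can be pictured with a small sketch. This is an illustrative assumption, not the authors' code: the chunk size, look-ahead size, and the way context is gathered are all placeholders standing in for what a real Conformer encoder and CTC decoder would do.

```python
# Sketch of chunk-wise streaming with a fixed look-ahead, in the spirit of
# the paper's description. Chunk size and look-ahead size are illustrative.

def stream_chunks(tokens, chunk_size=4, look_ahead=2):
    """Yield (chunk, context) pairs: each chunk of input graphemes is
    processed together with a few future tokens as look-ahead context."""
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        context = tokens[start + chunk_size:start + chunk_size + look_ahead]
        yield chunk, context

graphemes = list("streaming text")
for chunk, ctx in stream_chunks(graphemes):
    # In the real system, the model would consume chunk + look-ahead and
    # emit phonemic and prosodic (PnP) labels for the chunk in real time.
    print("chunk:", "".join(chunk), "| look-ahead:", "".join(ctx))
```

The key property mirrored here is that each chunk sees only a small, bounded window of future tokens, which keeps latency low while still giving the model some future context for stable predictions.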
Why This Matters to You
Imagine listening to an audiobook or podcast generated by AI. With CC-G2PnP, the AI voice won’t just read words; it will speak with appropriate pauses, emphasis, and tone. This makes the listening experience far more pleasant and understandable for you. The system is particularly impactful for content creators working with languages like Japanese.
Here’s how CC-G2PnP brings tangible benefits:
- More Natural AI Voices: AI-generated speech gains human-like rhythm and intonation.
- Improved Accessibility: Better speech quality helps those with visual impairments or reading difficulties.
- Global Language Support: Excels in languages without explicit word boundaries.
- Real-time Performance: Enables responsive, fluid AI-driven conversations.
For example, consider a voice assistant providing live navigation. Instead of a monotone voice, it could deliver directions with urgency or calm, depending on the situation. “The proposed model can consider future context in each token, which leads to stable PnP label prediction,” the team revealed. This ensures the AI’s speech flows naturally. How do you think more natural AI voices could change your daily interactions with technology?
The Surprising Finding
What truly stands out about CC-G2PnP is its unexpected strength in handling languages like Japanese. Many previous streaming methods struggled with languages that lack explicit word boundaries. These older systems often depended on clearly defined words to process speech effectively. However, CC-G2PnP’s CTC decoder sidesteps this limitation entirely. The research shows it effectively learns the alignment between graphemes and phonemes during training.
Experiments on a Japanese dataset demonstrated this capability. The results showed that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction. This finding challenges the assumption that explicit word segmentation is essential for high-quality streaming G2PnP, and it highlights a more flexible approach to speech generation, especially for linguistically diverse applications.
What Happens Next
This work, accepted at ICASSP 2026, suggests a promising future for AI speech technology. We can expect to see CC-G2PnP or similar models integrated into various applications within the next 12-24 months. Think of it as the next step in making AI assistants, virtual characters, and automated customer service sound more human. For instance, a podcast production studio could use this to generate voiceovers in multiple languages, all with natural-sounding prosody.
Developers will likely explore how to further refine the model’s performance across an even wider array of languages. For you, this means anticipating smoother, less jarring AI interactions. Consider experimenting with new voice AI tools as they emerge. The industry implications are clear: a higher bar for natural language generation and a move towards truly conversational AI experiences.
