SoulX-Podcast: AI Creates Realistic Long-form Podcasts

New system generates multi-speaker dialogues with dialectal and paralinguistic diversity.

A new system called SoulX-Podcast can generate realistic, long-form, multi-speaker podcasts. It handles various dialects and paralinguistic cues, offering state-of-the-art performance in speech synthesis. This advancement opens new doors for content creation.

By Katie Rowan

October 28, 2025

4 min read

Key Facts

  • SoulX-Podcast is a new system for generating realistic long-form, multi-speaker podcasts.
  • It supports both Mandarin and English, along with several Chinese dialects (Sichuanese, Henanese, Cantonese).
  • The system can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth transitions.
  • It integrates paralinguistic controls for higher naturalness in multi-turn dialogues.
  • SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.

Why You Care

Ever wished you could produce a professional-sounding podcast without hiring multiple voice actors or spending hours editing? What if AI could generate entire conversations, complete with natural pauses and regional accents? A new system called SoulX-Podcast is making this a reality, and it could change how you create audio content.

This system aims to produce realistic, long-form podcasts. It offers multi-speaker dialogue with impressive dialectal and paralinguistic diversity. This means your future audio projects could sound more authentic and engaging than ever before.

What Actually Happened

A new technical report introduces SoulX-Podcast, a system for generating podcast-style speech. According to the announcement, the system excels at multi-turn, multi-speaker dialogic speech generation and also achieves state-of-the-art performance in conventional text-to-speech (TTS) tasks.

Traditional TTS systems often focus on single-speaker synthesis. However, SoulX-Podcast tackles the challenge of coherent multi-speaker conversational speech. The system integrates various paralinguistic controls (non-verbal communication like tone or pitch). It supports both Mandarin and English, as detailed in the technical report. What’s more, it includes several Chinese dialects, such as Sichuanese, Henanese, and Cantonese. This allows for highly personalized podcast-style speech generation, the team revealed.
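To make the idea of a speaker-tagged, paralinguistically annotated dialogue concrete, here is a minimal sketch of how such a system might be driven. The `generate_podcast` function, the speaker and dialect labels, and the inline markers shown below are hypothetical illustrations based on the capabilities described in the report, not the actual SoulX-Podcast API.

```python
# Hypothetical sketch: driving a podcast-style TTS system with a
# speaker-tagged, paralinguistically annotated dialogue script.
# The function name, tag syntax, and parameters are illustrative
# assumptions, not the published SoulX-Podcast interface.

dialogue_script = [
    # (speaker_id, dialect, text with inline paralinguistic cues)
    ("S1", "mandarin",  "Welcome back to the show! <laugh> We have a lot to cover today."),
    ("S2", "cantonese", "Thanks for having me. <breath> Let's start with the big news."),
    ("S1", "mandarin",  "Right, the headline is that it's finally shipping."),
]

# Each speaker tag maps to a short reference recording used to clone
# that speaker's timbre across the whole conversation.
reference_voices = {"S1": "host_prompt.wav", "S2": "guest_prompt.wav"}

def generate_podcast(script, voices, sample_rate=24000):
    """Placeholder for a multi-speaker dialogue synthesis call.

    A real system would condition each turn on the speaker's reference
    audio (timbre), the chosen dialect, and the dialogue history so that
    prosody adapts as the conversation progresses.
    """
    raise NotImplementedError("Illustrative only; see the official SoulX-Podcast release.")

# audio = generate_podcast(dialogue_script, reference_voices)
```

The key design idea the report highlights is that conditioning on prior turns is what keeps timbre stable and prosody contextually adaptive over long conversations.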

Why This Matters to You

This system has significant implications for content creators, podcasters, and educators. Imagine generating educational dialogues in multiple languages or creating fictional podcasts with a diverse cast of AI voices. SoulX-Podcast offers an exciting new tool for your creative arsenal.

For example, a language learning app could use SoulX-Podcast to create practice conversations. These conversations could feature speakers with different regional accents, making the learning experience more immersive. This capability meets the higher naturalness demands of multi-turn spoken dialogue, as mentioned in the release.
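As a sketch of that use case, the snippet below requests the same practice conversation in several of the dialects the report lists, producing one audio file per regional variant. The `synthesize_dialogue` helper and its arguments are assumptions for illustration; the real interface may differ.

```python
# Hypothetical sketch of a language-learning workflow: render the same
# practice conversation in several Chinese dialects so learners hear
# regional variation. The helper and its arguments are illustrative.

practice_turns = [
    ("Teacher", "How was your weekend?"),
    ("Student", "Pretty good, I visited my grandparents. <laugh>"),
]

def synthesize_dialogue(turns, dialect, output_path):
    """Placeholder for a dialect-conditioned dialogue synthesis call."""
    print(f"[demo] would synthesize {len(turns)} turns in {dialect} -> {output_path}")

for dialect in ["Mandarin", "Sichuanese", "Henanese", "Cantonese"]:
    # One audio file per dialect, e.g. lesson_Cantonese.wav
    synthesize_dialogue(practice_turns, dialect, f"lesson_{dialect}.wav")
```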

Key Capabilities of SoulX-Podcast:

  • Long-form Generation: Produces over 90 minutes of continuous conversation.
  • Stable Timbre: Maintains consistent speaker voice quality throughout.
  • Smooth Transitions: Ensures natural flow between different speakers.
  • Contextual Prosody: Speakers adapt rhythm and intonation naturally.
  • Multilingual Support: Includes Mandarin, English, and Chinese dialects.

How will you use this capability to tell your next story or share your message? As the authors put it, “SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions.” That consistency ensures a high level of realism for your listeners.

The Surprising Finding

What truly stands out about SoulX-Podcast is its ability to maintain realism over extended durations. Many AI speech systems struggle with coherence and naturalness in long-form content. However, experimental results demonstrate a different outcome. SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre. It also ensures smooth speaker transitions, the research shows.

This is surprising because generating such long, natural dialogues with consistent voice characteristics is incredibly complex. Speakers also exhibit contextually adaptive prosody (the rhythm, stress, and intonation of speech). This reflects natural rhythm and intonation changes as dialogues progress, the paper states. This capability challenges the common assumption that AI-generated speech quickly becomes monotonous or robotic in longer formats.

What Happens Next

The future will likely see further refinement and broader application of technologies like SoulX-Podcast. We can expect to see early adopters integrating these tools into their production workflows within the next 12-18 months. This could include podcast networks and educational content providers.

For example, a marketing agency might use SoulX-Podcast to quickly generate diverse voiceovers for explainer videos. These videos could target different regional audiences. Our advice for readers is to start exploring how AI speech synthesis could enhance your current projects. Keep an eye on updates from this research and similar projects.

This advancement will undoubtedly impact the audio content industry. It lowers barriers to entry for high-quality production and allows for more creative freedom. The technical report explains that SoulX-Podcast achieves state-of-the-art performance across multiple evaluation metrics, in both monologue TTS and multi-turn conversational speech synthesis. The possibilities for audio experiences are vast.
