Why You Care
Ever wished you could just tell an AI exactly what kind of audio you want to create? Imagine describing a voice or a song and having it generated on the spot. This is no longer a futuristic dream: a new research system called InstructAudio is making it a reality, offering instruction-based control over AI-generated speech and music. Why should you care? Because this system could fundamentally change how you produce audio content, from podcasts to soundtracks.
What Actually Happened
Researchers have unveiled InstructAudio, a unified framework designed for instruction-based control of both speech and music generation. Previously, text-to-speech (TTS) and text-to-music (TTM) models operated largely independently, each with its own limitations, according to the announcement: TTS systems often needed reference audio to set voice timbre and offered limited control over other attributes, while TTM systems required expert-level knowledge for input conditioning. InstructAudio bridges this gap by letting users describe the desired acoustic attributes in natural language, including gender and age for timbre and emotion and style for paralinguistic features; for music, you can specify genre, instrument, rhythm, and atmosphere. The system supports expressive speech, music, and even dialogue generation in both English and Chinese, as detailed in the blog post.
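To make that concrete, here is a minimal sketch of what such instruction-style requests might look like. The researchers have not published an API, so the function name, field names, and prompt wording below are purely illustrative assumptions; only the controllable attributes (timbre, emotion, style, genre, instrument, rhythm, atmosphere) and the two supported languages come from the announcement.

```python
# Hypothetical sketch only: InstructAudio has no published API, so every name
# here is an illustrative assumption. The attribute vocabulary mirrors the
# announcement: timbre, emotion, and style for speech; genre, instrument,
# rhythm, and atmosphere for music.

speech_request = {
    "task": "text-to-speech",
    "text": "Welcome back to the show.",
    # One natural-language instruction replaces reference audio and per-attribute knobs.
    "instruction": "A calm, middle-aged female voice, warm and slightly amused.",
    "language": "en",  # English and Chinese are both supported
}

music_request = {
    "task": "text-to-music",
    "instruction": "Upbeat indie pop with acoustic guitar and handclaps, "
                   "steady rhythm, sunny morning atmosphere.",
    "duration_seconds": 30,
}

def generate_audio(request: dict) -> bytes:
    """Stand-in for a single unified model serving both request types."""
    raise NotImplementedError("Illustrative interface only; no real backend here.")

for req in (speech_request, music_request):
    print(req["task"], "->", req["instruction"])
```

The point is the shape of the interface: one free-text instruction per request, and one entry point that handles both speech and music.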
Why This Matters to You
This unified approach simplifies the creation of complex audio content: you no longer need separate tools or specialized knowledge for different audio types. Think of it as a single command center for your audio needs. Under the hood, the model uses joint and single diffusion transformer layers and is trained on extensive datasets, including 50,000 hours of speech and 20,000 hours of music, enabling multi-task learning across both domains. For your projects, that should mean better quality and more consistent results. How much easier would your creative process become with such a tool?
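The announcement names joint and single diffusion transformer layers but does not publish the architecture, so the sketch below is only a rough guess at the general shape: a shared (joint) stack that sees both speech and music, plus small task-specific (single) stacks. The layer counts, dimensions, and the omission of diffusion timesteps and text conditioning are all simplifying assumptions.

```python
# Rough illustrative sketch, NOT the paper's architecture. It only shows the
# shared-vs-specialized split implied by "joint and single" layers.
import torch
import torch.nn as nn

class Block(nn.Module):
    """A plain transformer block standing in for one diffusion transformer layer."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layer(x)

class UnifiedAudioModel(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Joint layers: shared by speech and music, so common acoustic
        # structure is learned once from both corpora.
        self.joint = nn.ModuleList([Block(dim) for _ in range(4)])
        # Single layers: a small specialized stack per task.
        self.speech_head = Block(dim)
        self.music_head = Block(dim)

    def forward(self, latents: torch.Tensor, task: str) -> torch.Tensor:
        x = latents
        for block in self.joint:
            x = block(x)
        return self.speech_head(x) if task == "speech" else self.music_head(x)

# Multi-task training mixes batches from the speech and music corpora, so
# gradients from both update the joint layers.
model = UnifiedAudioModel()
speech_latents = torch.randn(2, 100, 256)  # stand-in acoustic latents
print(model(speech_latents, task="speech").shape)  # torch.Size([2, 100, 256])
```

The design point: because the joint layers receive gradients from both corpora, whatever acoustic structure speech and music share has to be learned only once, which is one plausible reason a unified model can stay competitive with dedicated ones.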
Consider these practical implications for your work:
- Podcasting: Easily generate character voices with specific emotions or ages for narrative segments.
- Game development: Create dynamic background music that changes based on player actions or mood descriptions.
- Marketing: Produce voiceovers and jingles perfectly tailored to your brand’s tone and message.
- Education: Develop interactive audio lessons with diverse vocal styles and accompanying musical cues.
“InstructAudio represents the first instruction-controlled structure unifying speech and music generation,” the team revealed. This means you can achieve a level of creative control previously unavailable. Your ability to bring auditory visions to life just got a major upgrade.
The Surprising Finding
Here’s the twist: despite sharing common acoustic modeling characteristics, speech and music generation have long been developed separately, which has made unified modeling difficult to achieve. InstructAudio tackles this head-on. The surprising finding is that a single model can deliver strong results across both modalities: in the reported performance comparisons, InstructAudio achieves the best results on most metrics against mainstream TTS and TTM models. This challenges the assumption that specialized, siloed models are always superior for distinct audio tasks; a unified approach can offer both versatility and high quality. It suggests that the underlying acoustic principles are more closely shared than earlier, separate systems exploited.
What Happens Next
While InstructAudio is currently a research paper, we can expect its principles to influence commercial tools within the next 12 to 18 months, as developers integrate similar capabilities into existing audio production suites. Imagine a scenario where, by late 2026, you can simply type “a calm, female voice explaining quantum physics over ambient piano music” into your favorite audio editor. Work like this could also lead to more accessible tools for indie creators. Actionable advice: keep an eye on updates from major AI audio companies, which will likely be incorporating these unified, instruction-based controls. The implications are broad, potentially streamlining workflows for sound designers, musicians, and content creators alike, and this unified approach could become the new standard for AI-driven audio production, according to the announcement.
