OpenAI Boosts Voice AI with gpt-realtime and API Upgrades

New features enable more natural, reliable, and production-ready voice agents for businesses and developers.

OpenAI has made its Realtime API generally available, introducing the advanced gpt-realtime speech-to-speech model. These updates offer improved naturalness, lower latency, and new capabilities like image input and phone calling support for voice agents. Businesses can now deploy more sophisticated AI-powered conversational experiences.

August 29, 2025

4 min read


Key Facts

  • OpenAI's Realtime API is now generally available.
  • The new `gpt-realtime` model is OpenAI's most advanced speech-to-speech model.
  • The API now supports remote MCP servers, image inputs, and SIP phone calling.
  • The Realtime API processes and generates audio directly through a single model, reducing latency.
  • Two new voices, Cedar and Marin, are available exclusively in the Realtime API.

Why You Care

Ever wished talking to a computer felt as natural as chatting with a friend? What if your voice assistant could understand your complex requests without a hitch? OpenAI has just taken a significant leap forward in making this a reality. They announced major updates to their Realtime API, including a new model called gpt-realtime. This means your future interactions with AI voice agents could be smoother and more intelligent than ever before. For businesses, this opens doors to building more reliable and human-like customer experiences.

What Actually Happened

OpenAI has officially made its Realtime API generally available, according to the announcement. The API now includes several key features designed to help developers and enterprises build production-ready voice agents. It supports remote MCP (Model Context Protocol) servers, accepts image inputs so voice agents can process visual context, and handles phone calls through the Session Initiation Protocol (SIP). Together, these additions make voice agents more capable by giving them access to extra tools and context. The company also revealed its most advanced speech-to-speech model, gpt-realtime. The new model follows complex instructions more reliably, calls tools with greater precision, and produces speech that sounds more natural and expressive.
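To make this concrete, here is a minimal sketch of opening a Realtime session over WebSocket and configuring it in Python. It assumes the `websocket-client` package; the `session.update` event follows OpenAI's published Realtime event scheme, but the exact session fields and the `lookup_order` tool are illustrative assumptions rather than confirmed GA schema.

```python
import json
import os

import websocket  # pip install websocket-client

# Connect to the Realtime API, requesting the new gpt-realtime model.
url = "wss://api.openai.com/v1/realtime?model=gpt-realtime"
ws = websocket.create_connection(
    url,
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
)

# Configure the session: voice, behavior, and one tool the agent may call.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "voice": "marin",  # one of the two new Realtime-only voices
        "instructions": "You are a concise, friendly support agent.",
        "tools": [{
            "type": "function",
            "name": "lookup_order",  # hypothetical tool, for illustration
            "description": "Fetch an order's status by ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }],
    },
}))

# The server streams events (audio deltas, transcripts, tool calls)
# back over the same socket.
print(json.loads(ws.recv())["type"])
```

Remote MCP servers and SIP calling are configured at the session and telephony layers; the API documentation covers the exact fields.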

Why This Matters to You

Imagine a customer support call where the AI understands your nuanced needs. Think of a conversation where the AI agent doesn’t sound robotic. The Realtime API updates, especially gpt-realtime, aim to achieve this. Unlike older systems that chain separate speech-to-text and text-to-speech models, the new API processes audio directly through a single model. This single-model approach reduces latency significantly and preserves the subtle nuances in your speech, resulting in more natural and expressive responses from the AI. For example, a real estate company like Zillow could use this: its AI could guide you through complex home-buying decisions, and it would feel as natural as talking to a human agent. How might this change your daily interactions with technology?

Here are some key improvements with the new Realtime API:

  • Lower Latency: Audio is processed and generated directly, not through chained models.
  • More Natural Speech: gpt-realtime produces speech with better intonation and emotion.
  • Improved Instruction Following: The model handles complex, multi-step requests more accurately.
  • Enhanced Tool Calling: Agents can use external tools with greater precision (see the sketch after this list).
  • New Voices: Two new voices, Cedar and Marin, are available exclusively in the Realtime API.
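To show what tool calling looks like on the wire, here is a sketch that continues the connection example above (`ws` is the open socket, and `lookup_order` is the hypothetical tool defined there). The event types used, `response.function_call_arguments.done`, `conversation.item.create`, and `response.create`, follow the Realtime API's published beta event scheme and are assumptions for the GA version.

```python
import json

def lookup_order(order_id: str) -> dict:
    # Hypothetical business logic stub.
    return {"order_id": order_id, "status": "shipped"}

# `ws` is the open Realtime socket from the earlier sketch.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = lookup_order(**args)
        # Hand the tool result back, then ask the model to keep talking.
        ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        ws.send(json.dumps({"type": "response.create"}))
```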

According to Josh Weisberg, Head of AI at Zillow, “The new speech-to-speech model in OpenAI’s Realtime API shows stronger reasoning and more natural speech—allowing it to handle complex, multi-step requests like narrowing listings by lifestyle needs or guiding affordability discussions with tools like our BuyAbility score.” This highlights the practical benefits for your business and your customers.

The Surprising Finding

One particularly interesting aspect highlighted by the company is the shift in how voice agents process audio. Traditionally, building a voice agent involved chaining together multiple models. You would have a speech-to-text model first, converting your voice to text. Then, a text-to-speech model would convert the AI’s text response back into audio. The technical report explains that the Realtime API now processes and generates audio directly through a single model. This is a significant departure from the traditional pipeline. It directly reduces latency and preserves speech nuance. This means the AI can respond much faster. It also sounds more like a real person. This integrated approach challenges the common assumption that voice AI needs multiple, separate processing steps to function effectively. The team revealed this direct processing method is key to creating truly natural conversations.
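For contrast, here is the traditional chained pipeline the paragraph describes, sketched with the standard OpenAI Python SDK (the model names are illustrative). Each stage is a separate network round trip, which is exactly where the extra latency comes from; a Realtime session replaces all three hops with one bidirectional audio stream.

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the caller's audio.
with open("caller.wav", "rb") as f:
    text_in = client.audio.transcriptions.create(model="whisper-1", file=f).text

# 2. Text reasoning: generate a reply from the transcript.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": text_in}],
).choices[0].message.content

# 3. Text-to-speech: synthesize the reply back into audio.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.write_to_file("reply.mp3")
```

Note that the transcription step also discards tone, pacing, and emotion, which is why the chained approach can sound robotic even when each stage works well.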

What Happens Next

Developers and businesses can start integrating these new capabilities immediately. The Realtime API is generally available as of August 28, 2025, so you can begin experimenting with gpt-realtime and the new API features now. For instance, a multilingual customer support bot that seamlessly switches languages mid-sentence is now far more feasible. Companies should consider piloting these voice agents for customer service or internal tools. The industry implications are vast, suggesting a future where AI voice interactions are commonplace and highly capable. The documentation indicates that continuous improvements are planned, which will further enhance reliability and quality, and your feedback will help shape future iterations. The company reports that thousands of developers have already contributed to shaping these improvements since last October.
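As a starting point for such a pilot, mid-sentence language switching can be encouraged through the session instructions alone. Continuing the socket sketch from earlier, the instruction wording below is an illustrative assumption, not sample text from the documentation.

```python
import json

# `ws` is the open Realtime socket from the first sketch.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "instructions": (
            "Answer in whatever language the caller is speaking, and "
            "follow along if they switch languages mid-sentence."
        ),
    },
}))
```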