Why You Care
Ever listened to an AI-generated voice and felt something was just off? Do you struggle to make your AI-generated audio sound truly human? Natural-sounding speech is crucial for any AI audio product. Now, a new system promises to help AI better understand what ‘natural’ really means, which could dramatically improve the quality of speech synthesis for your projects.
What Actually Happened
Researchers have unveiled SpeechJudge, a new system designed to improve how AI evaluates speech naturalness. According to the announcement, it tackles a major hurdle in speech synthesis: the absence of large-scale human preference data, which the paper states is vital for training models that truly align with human perception. SpeechJudge comprises three key components. First is SpeechJudge-Data, a large dataset of 99,000 human-annotated speech pairs. These pairs cover diverse speech styles and multiple languages, and include human feedback on both intelligibility (how clear the speech is) and naturalness (how human-like it sounds). From this data, the team established SpeechJudge-Eval, a challenging benchmark specifically for judging speech naturalness. Finally, the researchers introduced SpeechJudge-GRM, a generative reward model built on Qwen2.5-Omni-7B, a large language model. It was trained in a two-stage process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) applied to challenging cases.
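To make the dataset description concrete, here is a minimal sketch of what one annotated pair might look like. The field names and values are illustrative assumptions, not the actual SpeechJudge-Data schema; they simply mirror the attributes the paper describes (two speech samples, a language tag, and human feedback on intelligibility and naturalness).

```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One human-annotated comparison between two synthesized utterances.

    Hypothetical structure inspired by the SpeechJudge-Data description;
    the real dataset's schema may differ.
    """
    text: str             # shared input text for both samples
    audio_a: str          # path to the first speech sample
    audio_b: str          # path to the second speech sample
    intelligibility: str  # which sample was clearer: "a", "b", or "tie"
    naturalness: str      # which sample sounded more human-like
    language: str         # pairs span multiple languages

# A toy record in the spirit of the 99,000-pair dataset:
pair = PreferencePair(
    text="The quick brown fox jumps over the lazy dog.",
    audio_a="samples/tts_a.wav",
    audio_b="samples/tts_b.wav",
    intelligibility="tie",
    naturalness="a",
    language="en",
)
```

A reward model like SpeechJudge-GRM would be trained to reproduce the `naturalness` label from the two audio inputs.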
Why This Matters to You
If you work with text-to-speech (TTS) models or create audio content, this work is highly relevant. Current AI models often struggle to accurately assess speech naturalness. The study finds that leading models, like Gemini-2.5-Flash, achieve less than 70% agreement with human judgment, which highlights significant room for improvement, as mentioned in the release. SpeechJudge-GRM aims to bridge this gap. Imagine you’re a podcaster using AI to generate voiceovers and you want the voices to sound as human as possible. This new model helps the AI ‘learn’ what sounds natural, leading to higher quality outputs for your listeners. How much smoother would your workflow be if your AI could consistently produce speech that sounds genuinely human?
Here’s a look at the performance improvements:
| Model | Accuracy (agreement with human judgment) |
|---|---|
| Gemini-2.5-Flash | < 70% |
| Classic Bradley-Terry | 72.7% |
| SpeechJudge-GRM | 77.2% |
| SpeechJudge-GRM (scaled) | 79.4% |
This table, derived from the research, clearly shows SpeechJudge-GRM’s superior performance. The team also revealed that SpeechJudge-GRM can act as a reward function, meaning it can guide the post-training of speech generation models so they better align with human preferences. “Aligning large generative models with human feedback is an essential challenge,” the authors state. This new system directly addresses that challenge.
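One common way a reward model guides generation is best-of-N reranking: sample several candidate utterances, score each with the judge, and keep the highest-scoring one. The sketch below illustrates that pattern only; the `stub_reward` function is a placeholder stand-in for a learned judge like SpeechJudge-GRM, not its actual API.

```python
def pick_best(candidates, reward_fn):
    """Rerank N candidate utterances by a scalar reward and keep the best.

    reward_fn stands in for a judge such as SpeechJudge-GRM scoring
    naturalness; here it is a toy stub for illustration.
    """
    return max(candidates, key=reward_fn)

def stub_reward(utterance):
    # Placeholder heuristic: penalize awkward silence tokens.
    # A real judge would score the audio itself, not a transcript.
    return -utterance.count("<sil>")

samples = [
    "hello <sil> world",
    "hello world",
    "hello <sil> <sil> world",
]
best = pick_best(samples, stub_reward)
# best == "hello world"
```

The same reranking loop works whether the reward comes from a heuristic, a Bradley-Terry model, or a generative reward model; swapping in a stronger judge directly improves which sample survives.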
The Surprising Finding
What truly stands out from this research is the significant gap in existing AI’s ability to judge speech naturalness. You might assume that AudioLLMs would be highly accurate by now. However, the evaluation reveals that even the leading models fall short. For example, Gemini-2.5-Flash achieved less than 70% agreement with human judgment. This is quite surprising, considering the rapid advancements in other AI domains. It challenges the common assumption that general-purpose large language models automatically excel at nuanced human perception tasks. The complexity of human speech naturalness is evidently harder for AI to grasp than previously thought. The fact that a specialized model like SpeechJudge-GRM is needed to push accuracy into the high 70s underscores this point. It shows that dedicated datasets and targeted training are essential for these subjective evaluations.
What Happens Next
The release of SpeechJudge-GRM marks a significant step forward, and we can expect to see this system integrated into future text-to-speech systems. Over the next 6-12 months, developers might begin incorporating SpeechJudge-GRM’s feedback to refine their speech generation models. For example, a company creating AI voice assistants could use SpeechJudge-GRM to help ensure their assistant’s voice sounds more empathetic and natural, leading to more engaging user interactions. For you, this means potentially higher quality AI voices in podcasts, audiobooks, and virtual assistants. The industry implications are vast, pushing speech synthesis closer to human parity. The team’s work also provides actionable takeaways for researchers: it emphasizes the need for large-scale human preference datasets, which will continue to drive advancements in AI speech quality.
