AI Models Are Getting Better at Judging Speech Quality, Thanks to LLMs

New research shows large language models can act as 'pseudo-raters' to improve speech quality assessment tools.

Automatically assessing speech quality in real-time calls is challenging because human-labeled training data is scarce. Researchers are now using large language models (LLMs) to generate massive labeled datasets, significantly improving the accuracy and generalization of non-intrusive speech quality assessment (SQA) systems. This could mean clearer audio for your podcasts and live streams.

August 11, 2025

4 min read


Key Facts

  • LLMs can generate large datasets for speech quality assessment (SQA).
  • The LibriAugmented dataset contains over 100,000 LLM-labeled speech clips.
  • A two-stage training approach (LLM pretraining + human fine-tuning) significantly improves SQA model generalization.
  • This method enhances metrics like DNSMOS Pro on real-world datasets like NISQA_TEST_LIVETALK and Tencent with reverb.
  • The research aims to overcome limitations of scarce human-labeled SQA data.

Why You Care

Ever wonder why some remote interviews sound crystal clear while others are a garbled mess? For content creators, podcasters, and anyone relying on clear audio, the quality of speech is paramount. New research suggests that artificial intelligence, specifically large language models (LLMs), might be the key to consistently excellent audio, even in challenging real-world scenarios.

What Actually Happened

A team of researchers, including Fredrik Cumlin, Xinyu Liang, Anubhab Ghosh, and Saikat Chatterjee, has proposed a novel approach to a long-standing problem in non-intrusive speech quality assessment (SQA): the scarcity of training data and the high cost of human annotations. According to their paper, "Leveraging LLMs for expandable Non-intrusive Speech Quality Assessment," published on arXiv, these limitations often prevent SQA systems from generalizing to diverse, real-time conferencing calls.

Their core idea involves using LLMs as "pseudo-raters" to generate vast quantities of labeled speech data. They constructed a dataset called LibriAugmented, which, according to the announcement, consists of "101,129 speech clips with simulated degradations labeled by a fine-tuned auditory LLM (Vicuna-7b-v1.5)." The researchers then compared three training strategies for SQA models: using only human-labeled data, using only LLM-labeled data, and a two-stage approach. This two-stage method involved pretraining models on the LLM-generated labels, then fine-tuning them with a smaller set of human-labeled data. They validated their methods using established metrics like DNSMOS Pro and DeePMOS across various datasets, languages, and quality degradations.
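To make the two-stage recipe concrete, here is a minimal sketch of pretraining a non-intrusive SQA regressor on LLM-labeled clips and then fine-tuning it on a smaller human-labeled set. The model architecture, feature shapes, and hyperparameters below are illustrative assumptions for the sketch, not the authors' actual implementation or the DNSMOS Pro/DeePMOS architectures.

```python
# Two-stage training sketch: pretrain on LLM ("pseudo-rater") labels,
# then fine-tune on scarce human labels. All shapes and settings are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinySQAModel(nn.Module):
    """Toy non-intrusive SQA model: mel-spectrogram frames -> MOS estimate."""
    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                      # x: (batch, frames, n_mels)
        _, h = self.gru(x)                     # h: (1, batch, hidden)
        return self.head(h[-1]).squeeze(-1)    # predicted MOS per clip

def train(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for feats, mos in loader:
            opt.zero_grad()
            loss_fn(model(feats), mos).backward()
            opt.step()

# Random tensors stand in for the real corpora: a large LLM-labeled set
# (the LibriAugmented role) and a small human-labeled set.
llm_labeled   = TensorDataset(torch.randn(512, 100, 64), torch.rand(512) * 4 + 1)
human_labeled = TensorDataset(torch.randn(64, 100, 64),  torch.rand(64) * 4 + 1)

model = TinySQAModel()
# Stage 1: pretrain on the large pseudo-rated corpus.
train(model, DataLoader(llm_labeled, batch_size=32, shuffle=True), epochs=2, lr=1e-3)
# Stage 2: fine-tune on the scarce human-labeled data at a lower learning rate.
train(model, DataLoader(human_labeled, batch_size=16, shuffle=True), epochs=2, lr=1e-4)
```

The design choice the paper compares is exactly this ordering: the cheap, plentiful LLM labels shape the model first, and the expensive human labels correct it afterward, rather than either source being used alone.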

Why This Matters to You

For anyone producing audio content—from podcasts and live streams to online courses and virtual events—this research has immediate and tangible implications. Current speech quality assessment tools often struggle with the sheer variety of real-world audio challenges, like background noise, echo, and varying internet connections. This often leads to inconsistent audio quality in your final product or during live interactions.

By leveraging LLMs to create massive, diverse datasets, SQA systems can become significantly more reliable. Imagine an AI tool that can accurately predict how a listener will perceive the quality of your audio, even before you publish it, across different environments. This means better automated quality control, more reliable real-time audio processing in communication platforms, and potentially, more effective noise suppression and echo cancellation algorithms. The ability to generate such a large volume of labeled data cheaply and efficiently could accelerate the creation of new audio processing tools that ensure your voice always comes through clearly, regardless of the recording environment or internet connection.

The Surprising Finding

While using LLM-labeled data for direct training yielded "mixed results compared to human-labeled training," the research uncovered a particularly impactful strategy: the two-stage approach. The study provides "empirical evidence that the two-stage approach improves the generalization performance." For instance, the researchers reported that "DNSMOS Pro achieves 0.63 vs. 0.55 PCC on NISQA_TEST_LIVETALK and 0.73 vs. 0.65 PCC on Tencent with reverb" when using this hybrid training method. This finding is significant because it suggests that LLMs aren't just a substitute for human raters, but rather a powerful accelerator for training, enabling SQA models to learn from a vast initial pool of data before being refined by human-validated examples. It highlights the synergistic potential of combining the scale of AI with the nuanced judgment of human perception.
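For readers unfamiliar with the metric, PCC is the Pearson correlation coefficient between a model's predicted scores and human mean opinion scores (MOS); higher values mean the model's judgments track human listeners more closely. A small illustrative computation, using made-up numbers rather than the study's data:

```python
# Pearson correlation between predicted and human MOS, the metric behind
# the 0.63 vs. 0.55 style comparisons above. The arrays are hypothetical.
import numpy as np

def pearson_cc(predicted, human):
    """Pearson correlation coefficient between predicted and human MOS."""
    predicted, human = np.asarray(predicted), np.asarray(human)
    return np.corrcoef(predicted, human)[0, 1]

human_mos     = np.array([4.2, 3.1, 2.5, 4.8, 1.9])   # hypothetical listener ratings
predicted_mos = np.array([4.0, 3.3, 2.8, 4.5, 2.2])   # hypothetical model outputs
print(f"PCC: {pearson_cc(predicted_mos, human_mos):.2f}")
```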

What Happens Next

This research points towards a future where AI-powered audio tools are not just reactive but proactive in ensuring high-quality speech. We can anticipate the integration of these more reliable SQA models into a wide array of applications. For content creators, this could mean more sophisticated audio editing software that automatically flags and even corrects quality issues, or real-time communication platforms that dynamically adapt to maintain optimal speech clarity.

Looking ahead, the creation of even more specialized auditory LLMs could further refine this process, leading to SQA systems that understand subtle nuances of speech perception beyond what current models capture. While the research demonstrates significant progress, widespread adoption will depend on further validation across an even broader range of real-world scenarios and the integration of these findings into commercial products. Expect to see these advancements trickle down into your favorite audio recording, editing, and communication tools over the next few years, making high-quality audio more accessible than ever before.