Why You Care
Ever noticed how often people say “um” or “uh” when they speak? What if your AI assistant or podcast editor could cleanly remove these verbal hiccups, making every interaction crystal clear? A new study introduces DRES (Disfluency Removal Evaluation Suite), an essential disfluency removal benchmark for Large Language Models (LLMs). This research reveals how well, or how poorly, current AI handles the messy reality of human speech. Why should you care? Because clearer AI means better voice assistants, more accurate transcriptions, and smoother content creation for your projects.
What Actually Happened
Researchers have unveiled DRES, a new benchmark designed specifically to test LLMs’ ability to remove disfluencies, according to the announcement. Disfluencies are common verbal interruptions such as “um,” “uh,” interjections, and self-corrections. These elements consistently degrade the accuracy of speech-driven systems, affecting everything from command interpretation to summarization and conversational agents, the paper states. DRES is unique because it uses human-annotated Switchboard transcripts, an approach that isolates the challenge of disfluency removal from confounds like ASR (Automatic Speech Recognition) errors and acoustic variability, as detailed in the blog post. The team systematically evaluated both proprietary and open-source LLMs across various scales, prompting strategies, and architectural designs. This comprehensive testing aims to provide a reproducible, model-agnostic foundation for improving spoken-language systems.
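To make the task concrete, here is a toy Python sketch (not code from the benchmark; the filler list is invented for illustration) showing why naive filtering is not enough: it can strip standalone “um” and “uh”, but it leaves self-corrections behind, which is the harder part of what DRES measures.

```python
# Toy illustration of the task DRES evaluates; this is NOT code from the
# benchmark. Naive filtering drops standalone fillers, but it cannot repair
# self-corrections, which is where LLMs are actually being stressed.

FILLERS = {"um", "uh", "er", "hmm"}  # hypothetical filler list

def strip_fillers(utterance: str) -> str:
    """Remove standalone filler tokens; keep everything else untouched."""
    tokens = utterance.lower().replace(",", " ").split()
    return " ".join(t for t in tokens if t not in FILLERS)

raw = "Turn , uh , turn on the , um , lights"
print(strip_fillers(raw))
# -> "turn turn on the lights": the repeated "turn" is a self-repair that
#    simple filtering leaves behind, which is exactly what the benchmark probes.
```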
Why This Matters to You
Imagine you’re recording a podcast or dictating an important memo. You might naturally use filler words. This new research directly impacts how effectively AI can clean up your spoken words. The study found that simple segmentation consistently improves performance, even for long-context models. What’s more, reasoning-oriented models sometimes over-delete fluent tokens, meaning they may strip out important words along with the fillers. Fine-tuning can achieve high precision and recall, but it can also harm a model’s ability to generalize to new situations, according to the research. Do you want your AI to be surgically precise or broadly adaptable?
Here’s a quick look at some key findings:
| Finding | Impact on You |
| --- | --- |
| Simple segmentation helps | Your audio clean-up might get better with basic processing. |
| Reasoning models over-delete | AI might remove too much, changing your original meaning. |
| Fine-tuning harms generalization | Highly specialized AI might not work well for diverse speaking styles. |
For example, think of a content creator using an AI tool to transcribe an interview. If the AI over-deletes, it could remove crucial context. “The research reveals that simple segmentation consistently improves performance, even for long-context models,” the team stated. This suggests that foundational processing steps are still vital. Your future AI tools for voice will need to balance accuracy with preserving your original intent.
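The article does not spell out the segmentation recipe, so the following is a minimal sketch under the assumption of fixed-size windowing; `segment`, `remove_disfluencies`, `clean_segment`, and `max_len` are hypothetical names, and the stand-in “model” in the demo is just a filler filter.

```python
from typing import Callable, List

def segment(tokens: List[str], max_len: int = 50) -> List[List[str]]:
    """Split a long transcript into short, fixed-size windows.
    The paper's exact segmentation may differ; this is a stand-in."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

def remove_disfluencies(tokens: List[str],
                        clean_segment: Callable[[str], str],
                        max_len: int = 50) -> str:
    """Clean each window independently, then stitch the results together.
    `clean_segment` is a placeholder for whatever LLM call returns the
    fluent version of one window."""
    chunks = segment(tokens, max_len)
    return " ".join(clean_segment(" ".join(chunk)) for chunk in chunks)

# Demo with a trivial stand-in "model" that only drops "um"/"uh":
demo = "so um i think uh we should you know start over".split()
print(remove_disfluencies(
    demo,
    lambda s: " ".join(t for t in s.split() if t not in {"um", "uh"}),
    max_len=4,
))
# -> "so i think we should you know start over"
```

The design point mirrors the finding above: even a model with a long context window appears to do better when handed short, well-bounded chunks rather than an entire transcript at once.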
The Surprising Finding
Here’s a twist: while fine-tuning often leads to better performance in many AI tasks, this study found a drawback for disfluency removal. The research shows that fine-tuning achieves high precision and recall but harms a model’s ability to generalize. This is surprising because we often expect specialized training to deliver superior results across the board. For disfluency removal, however, making an LLM very good at one specific type of speech can make it worse at handling diverse speaking patterns. An LLM perfectly tuned for formal presentations might struggle with a casual, conversational recording. This challenges the assumption that more specific training is always better, especially when dealing with the nuances of human speech.
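To see what precision, recall, and over-deletion mean for this task, here is an illustrative token-level scorer; it is not DRES’s official metric, and the `gold`/`pred` labels in the demo are made up.

```python
def precision_recall(gold, pred):
    """Token-level precision/recall for disfluency deletion, where 1 marks a
    token that should be deleted. Illustrative only, not DRES's scorer."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# An over-deleting model also flags fluent tokens: recall stays perfect,
# but precision drops, which mirrors the failure mode described above.
gold = [0, 1, 0, 0, 1, 0]   # two tokens truly disfluent
pred = [0, 1, 1, 0, 1, 1]   # deletes both, plus two fluent tokens
print(precision_recall(gold, pred))   # -> (0.5, 1.0)
```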
What Happens Next
This research provides nine practical recommendations (R1-R9) for deploying disfluency removal in speech-driven pipelines. These recommendations will likely guide developers in the coming months, perhaps leading to improved AI tools by early 2026. For instance, expect to see voice assistants that are better at understanding your commands, even if you hesitate or self-correct. Imagine a future where your smart home device accurately processes “Turn, uh, turn on the lights” without a glitch. The industry will likely focus on combining pre-processing with more generalized LLM capabilities. The goal is to create systems that are both precise and adaptable. This will ensure that your interactions with AI become smoother and more natural over time.
