Why You Care
Have you ever wondered if the hype around AI truly matches its real-world impact? When it comes to understanding speech from video, we often assume more AI means better visual comprehension. But what if the gains aren’t where you expect them to be? This new research dives into how Large Language Models (LLMs) integrate with Visual Speech Recognition (VSR), revealing some surprising insights that could reshape your understanding of AI’s capabilities.
What Actually Happened
Rishabh Jain and Naomi Harte recently explored the integration of Large Language Models (LLMs) into Visual Speech Recognition (VSR) systems. As detailed in the abstract, their work aimed to clarify whether performance improvements in VSR, when LLMs are used as decoders, come from enhanced visual understanding or stronger language modeling. They systematically evaluated LLM decoders by freezing or selectively updating the visual encoder, scaling decoder size, and comparing various adaptation strategies and architectures. The team also varied training data across LRS2, LRS3, and their combination, according to the announcement. Technical terms like “self-supervised encoders” refer to AI models trained on unlabeled data, learning patterns without explicit human guidance. “LLM decoders” are the parts of the system that use large language models to interpret the visual information into text.
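To make that setup concrete, here is a minimal, hypothetical sketch of the kind of architecture being evaluated: a pretrained visual encoder is frozen, and only a projection layer plus a decoder on the language side are trained. The module names, dimensions, and the stand-in transformer decoder are illustrative assumptions, not the authors' actual code (the paper uses a real LLM such as Llama-2 as the decoder).

```python
# Illustrative sketch only: freeze the visual encoder, train the projector + decoder.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Stand-in for a pretrained self-supervised visual (lip-reading) encoder."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(96 * 96, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, frames):          # frames: (batch, time, 96*96 flattened pixels)
        return self.net(frames)         # (batch, time, dim)

class VSRWithLLMDecoder(nn.Module):
    def __init__(self, visual_dim=256, llm_dim=512, vocab=32000):
        super().__init__()
        self.encoder = VisualEncoder(visual_dim)
        # Freeze the visual encoder so only language-side weights get updated.
        for p in self.encoder.parameters():
            p.requires_grad = False
        # Projector maps visual features into the decoder's embedding space.
        self.projector = nn.Linear(visual_dim, llm_dim)
        # Small transformer decoder standing in for an LLM decoder.
        layer = nn.TransformerDecoderLayer(d_model=llm_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab)

    def forward(self, frames, text_embeds):
        visual = self.projector(self.encoder(frames))
        out = self.decoder(tgt=text_embeds, memory=visual)
        return self.lm_head(out)        # per-token vocabulary logits

model = VSRWithLLMDecoder()
# Only the projector, decoder, and output head reach the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

frames = torch.randn(2, 30, 96 * 96)    # 2 clips, 30 video frames each
text_embeds = torch.randn(2, 12, 512)   # 12 token embeddings per transcript
logits = model(frames, text_embeds)     # shape: (2, 12, 32000)
```

Comparing a frozen encoder against a selectively updated one, as the authors do, isolates whether any accuracy gain comes from the visual side or purely from the language side.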
Why This Matters to You
This research is crucial because it helps us understand the true mechanisms behind AI advancements in VSR. If you’re building AI applications that rely on interpreting visual speech, knowing where the improvements truly lie can guide your development efforts. For example, imagine you are developing an accessibility tool for the hearing impaired. If LLMs are mainly improving language structure rather than visual lip-reading accuracy, your focus should shift to strengthening the visual component.
Key Findings on LLM Integration in VSR:
- Limited Scaling Gains: Scaling LLM decoder size and adaptation strategies yielded only limited improvements, the research shows.
- Dataset Combination Benefits: Combining datasets (LRS2 and LRS3) significantly enhanced generalization, the paper states.
- Lexical vs. Semantic Processing: Gains primarily arose from lexical processing (word choice, grammar) rather than semantic processing (understanding meaning), according to the study.
- SOTA Performance: Their Llama-2-13B model achieved 24.7% Word Error Rate (WER) on LRS3 and 47.0% WER on WildVSR, establishing state-of-the-art results among models trained without additional supervision, the team revealed (a short sketch of how WER is computed follows this list).
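For context on the metric: WER counts the word-level substitutions, deletions, and insertions needed to turn the system's transcript into the reference, divided by the number of reference words. Below is a minimal sketch of that standard computation; it is not the authors' evaluation script.

```python
# Minimal Word Error Rate (WER) via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between first i reference words and first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # substitution, deletion, insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substituted word in a four-word reference gives 25% WER.
print(wer("the cat sat down", "the bat sat down"))  # 0.25
```

On this definition, the reported 24.7% WER on LRS3 means roughly one word in four still needs correcting, which is why the authors argue there is real headroom left on the visual side.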
Do you think focusing on language modeling alone is enough for genuinely visual speech recognition? This finding suggests a need for a more balanced approach. Your current AI projects might benefit from re-evaluating where their ‘intelligence’ truly resides. As Rishabh Jain and Naomi Harte concluded, “Our findings indicate LLM decoders refine contextual reasoning rather than visual features, emphasizing the need for stronger visual encoders to drive meaningful progress.”
The Surprising Finding
Here’s the twist: many might assume that integrating LLMs into VSR would automatically lead to a deeper visual understanding of speech. However, the study found the opposite. The research shows that the gains from LLM decoders primarily stem from lexical rather than semantic processing. This means the LLMs are getting better at predicting the next word based on language patterns, not necessarily at interpreting the visual cues of speech itself. It challenges the common assumption that simply adding a larger, more capable language model will automatically improve how an AI ‘sees’ and interprets lip movements. Instead, it refines the linguistic output, making the sentences sound more natural, but not necessarily making the system a better ‘lip-reader.’
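To picture what that lexical refinement looks like, here is a toy, purely illustrative example (an assumption, not the paper's method): a language-side score can promote the candidate transcript that reads best as English, even when the visual evidence for the competing candidates is equally ambiguous. The bigram table is a made-up stand-in for an LLM's language knowledge.

```python
# Toy illustration: picking between visually confusable hypotheses on lexical grounds.
BIGRAM_LOGP = {
    ("nice", "to"): -0.5, ("to", "meet"): -0.7, ("meet", "you"): -0.4,
    ("nice", "two"): -6.0, ("two", "meat"): -7.0, ("meat", "you"): -6.5,
}

def lm_score(sentence: str) -> float:
    words = sentence.lower().split()
    # Sum bigram log-probabilities; unseen pairs get a heavy penalty.
    return sum(BIGRAM_LOGP.get(pair, -10.0) for pair in zip(words, words[1:]))

# "meet" and "meat" look nearly identical on the lips, so the visual signal is ambiguous.
candidates = ["nice to meet you", "nice two meat you"]
best = max(candidates, key=lm_score)
print(best)  # "nice to meet you" -- chosen by language knowledge, not new visual evidence
```

The point of the toy: the better transcript wins without any improvement in lip-reading itself, which mirrors the paper's conclusion about where LLM decoders actually help.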
What Happens Next
While the paper was withdrawn for further revision, its insights remain highly relevant. This withdrawal, as mentioned in the release, indicates that the authors plan to refine their work, potentially incorporating stronger visual encoders. We might see updated research emerge in the coming months, perhaps by late 2025 or early 2026, building on these initial findings. For example, future applications could involve developing hybrid VSR systems that pair novel visual feature extraction methods with language models. For you, this means staying informed about the evolution of VSR systems. If you are involved in AI development, consider exploring new ways to enhance visual feature extraction in your models. The industry implications are clear: the focus needs to shift towards improving the visual front-end of VSR systems to achieve genuine progress in visual speech understanding, rather than solely relying on the linguistic prowess of LLMs.
