LLMs for Speech: Tackling 'Attention Sinks' for Clearer Audio

New research reveals how to improve audio-visual speech recognition by fixing hidden issues in large language models.

Large language models (LLMs) have become powerful decoders for speech recognition, but new research shows they harbor 'attention sinks' and 'massive activations' that hurt accuracy, and demonstrates how to mitigate both. The fix leads to better word error rates, especially in challenging audio conditions.

By Mark Ellison

October 28, 2025

4 min read

Key Facts

  • LLMs for speech recognition suffer from 'attention sinks' and 'massive activations'.
  • These issues occur in auditory, visual, and audio-visual speech recognition.
  • Attention sinks are found at both beginning-of-sequence (BOS) and intermediate low-semantic tokens.
  • Massive activations originate in MLP layers and correspond to fixed feature indices.
  • A 'decorrelation loss' method effectively mitigates these problems and improves word error rate (WER).

Why You Care

Ever struggle to understand someone speaking in a noisy environment or with poor video quality? Imagine an AI that hears and sees speech almost perfectly, even under those conditions. This new research tackles a hidden problem in how large language models (LLMs) process speech, making these systems much more reliable. Don’t you want your voice assistant to always understand you?

What Actually Happened

Researchers Anand, Umberto Cappellazzo, Stavros Petridis, and Maja Pantic have identified a critical issue within LLMs used for speech recognition. According to the paper, these models can suffer from “attention sinks” and “massive activations.” Attention sinks are specific tokens that draw a disproportionate share of the model’s attention; massive activations occur when a handful of features at those sink tokens take on extremely large values inside the LLM. Both phenomena were previously observed in natural language processing (NLP), but the team revealed that they also affect multimodal speech recognition systems: auditory (ASR), visual (VSR), and audio-visual (AVSR).
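To make the first symptom concrete, here is a minimal diagnostic sketch (not from the paper) that flags tokens receiving an outsized share of attention. It assumes a Hugging Face-style decoder called with `output_attentions=True`; the 0.3 threshold is an illustrative assumption.

```python
def find_attention_sinks(attentions, threshold=0.3):
    """Flag key tokens that absorb an outsized share of attention.

    attentions: tuple of per-layer tensors shaped
    (batch, heads, query_len, key_len), e.g. as returned by a
    Hugging Face decoder called with output_attentions=True.
    threshold: illustrative cutoff, not a value from the paper.
    """
    sinks = []
    for layer_idx, attn in enumerate(attentions):
        # Average attention each key token receives, over heads and queries.
        received = attn.mean(dim=1).mean(dim=1)  # -> (batch, key_len)
        for tok in (received[0] > threshold).nonzero().flatten().tolist():
            sinks.append((layer_idx, tok, round(received[0, tok].item(), 3)))
    return sinks
```

A healthy model spreads attention broadly; a sink shows up as a single token (often position 0, the BOS token) collecting a large fraction of it across many layers.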

The study finds these issues appear not only at the beginning-of-sequence (BOS) token but also at intermediate, low-semantic tokens. The team traced the massive activations to the LLMs’ MLP (multi-layer perceptron) layers, where they correspond to fixed feature indices across all sink tokens, as detailed in the paper. What’s more, intermediate sink tokens show high cosine similarity to the BOS token, which amplifies these problematic attention and activation spikes.
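A companion sketch for the other two observations: measuring each token's cosine similarity to the BOS hidden state, and locating feature dimensions with extreme magnitudes. It assumes per-layer hidden states are available (e.g. via `output_hidden_states=True`); the magnitude cutoff is an illustrative assumption.

```python
import torch.nn.functional as F

def sink_diagnostics(hidden_states, magnitude_cut=100.0):
    """Inspect one layer's hidden states, shaped (seq_len, d_model).

    Returns each token's cosine similarity to the BOS hidden state and
    the feature indices whose magnitude is extreme at any token, i.e.
    candidate "massive activation" dimensions. The 100.0 cutoff is an
    illustrative assumption, not a value from the paper.
    """
    bos = hidden_states[:1]  # (1, d_model), broadcast against all tokens
    cos_to_bos = F.cosine_similarity(hidden_states, bos, dim=-1)  # (seq_len,)
    massive = (hidden_states.abs() > magnitude_cut).any(dim=0)    # (d_model,)
    return cos_to_bos, massive.nonzero().flatten()
```

If the paper's pattern holds, the returned feature indices should stay fixed across sink tokens, and intermediate sinks should score near 1.0 in cosine similarity to BOS.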

Why This Matters to You

This research directly impacts the accuracy and robustness of voice AI. Think about your smart home devices or your car’s voice command system. When they misunderstand you, it’s frustrating. Mitigating these attention sinks means your interactions with AI will become smoother and more reliable. The decorrelation loss method introduced by the researchers significantly improves performance.

For example, imagine you’re trying to dictate a message in a busy coffee shop. Previously, the background noise might have caused your voice assistant to misinterpret words. With these improvements, the system becomes much better at filtering out the noise and focusing on your speech, leading to fewer errors. This is particularly true under high audio-visual feature downsampling, i.e. when the speech features are compressed into very few tokens before the LLM processes them.

Key Improvements with Decorrelation Loss (one plausible formulation is sketched just after this list):

  • Reduces cosine similarity: Lowers the resemblance between BOS and other tokens.
  • Mitigates intermediate sinks: Prevents less important tokens from hogging attention.
  • Improves Word Error Rate (WER): Leads to more accurate speech transcription.
  • Stable at lower downsampling rates: Maintains performance when features are only lightly compressed.
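The paper's exact loss is not reproduced in this article, so here is a minimal sketch of one plausible formulation, assuming the idea is to penalize the squared cosine similarity between the BOS hidden state and every later token's hidden state at a chosen decoder layer:

```python
import torch.nn.functional as F

def decorrelation_loss(hidden_states):
    """Mean squared cosine similarity between BOS and every later token.

    hidden_states: (batch, seq_len, d_model) from a chosen decoder layer.
    Driving this toward zero discourages later tokens from mimicking the
    BOS token, which is what creates intermediate attention sinks.
    """
    bos = hidden_states[:, :1, :]  # (batch, 1, d_model)
    cos = F.cosine_similarity(hidden_states[:, 1:, :], bos, dim=-1)
    return (cos ** 2).mean()
```

Whatever the precise form used in the paper, the design intuition is the same: break the resemblance between BOS and other tokens, and the attention and activation spikes lose their anchor.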

How much better could your daily life be if every voice interaction with a system were seamless? The research shows that this method improves word error rate (WER) under high audio-visual feature downsampling, and it remains stable at lower downsampling rates, according to the paper.
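For readers unfamiliar with the metric: WER counts word-level substitutions, deletions, and insertions against a reference transcript, divided by the reference length. A self-contained implementation of the standard definition:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed as a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

print(word_error_rate("turn on the lights", "turn on de light"))  # 0.5
```

A WER of 0.5 means half the reference words were transcribed incorrectly; lower is better.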

The Surprising Finding

The most surprising finding is how pervasive these attention sinks are. While previously noted in general NLP, the research shows they are present across every modality of speech recognition: auditory (ASR), visual (VSR), and audio-visual (AVSR). Nor are they only an initial problem: they also appear at “intermediate low-semantic tokens across ASR, VSR, and AVSR.” This challenges the assumption that such issues are confined to the very beginning of a processing sequence. The unexpected twist is that these intermediate tokens mimic the BOS token, exhibiting “high cosine similarity to the BOS token,” which amplifies the negative effects and points to a deeper, more systemic issue within LLMs than previously understood.

What Happens Next

This work paves the way for more robust and accurate speech recognition systems. We can expect to see these mitigation techniques integrated into commercial LLMs over the next 12-18 months. Companies developing voice assistants, transcription services, and accessibility tools will likely adopt these methods. For example, future versions of virtual meeting platforms could offer much more accurate live captioning, a huge benefit for all users.

For content creators and podcasters, this means more reliable automatic transcription, reducing the need for extensive manual editing. Developers should consider implementing similar decorrelation losses in their custom LLM fine-tuning processes to ensure their models are not plagued by these hidden attention issues. The paper notes that the code is available, suggesting a quick path to adoption. This points toward a new generation of LLMs that are not just intelligent but also remarkably precise in understanding human speech.
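As a hedged illustration of what that integration might look like, the snippet below folds the decorrelation penalty sketched earlier into a standard fine-tuning step. The weight `lambda_decorr` and the choice of the final hidden layer are assumptions for illustration, not values from the paper.

```python
def training_step(model, batch, optimizer, lambda_decorr=0.1):
    """One fine-tuning step: cross-entropy plus the decorrelation penalty
    sketched above, applied here to the final hidden layer (an assumed
    choice; the 0.1 weight would also need tuning)."""
    outputs = model(batch["input_ids"], labels=batch["labels"],
                    output_hidden_states=True)
    penalty = decorrelation_loss(outputs.hidden_states[-1])
    loss = outputs.loss + lambda_decorr * penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```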
