Do Audio LLMs Truly 'Hear' Music? New Research Reveals Surprising Truth

A new study investigates how Audio Large Language Models process musical information, challenging previous assumptions.

New research by Giovana Morais and Magdalena Fuentes explores how Audio Large Language Models (Audio LLMs) understand music. They used the MM-SHAP framework to quantify the contribution of audio versus text. The findings suggest these models rely heavily on text, even when seeming to 'listen' to music.

By Sarah Kline

September 27, 2025

4 min read

Key Facts

  • Audio Large Language Models (Audio LLMs) are being investigated for their modality contribution in music understanding.
  • The MM-SHAP framework, based on Shapley values, was adapted to quantify the contribution of audio versus text.
  • Models with higher accuracy in the study relied more on text to answer questions.
  • Despite text reliance, models successfully localized key sound events, indicating some audio processing.
  • This study is the first application of MM-SHAP to Audio LLMs.

Why You Care

Ever wonder whether your favorite music AI truly understands the song you’re playing, or whether it’s just guessing based on lyrics and metadata? This isn’t just a fun thought experiment. It directly impacts how useful and intelligent our music AI tools become. A new study delves into this very question, investigating how Audio Large Language Models (Audio LLMs) process music. Understanding this could redefine how you interact with AI-powered music experiences.

What Actually Happened

Giovana Morais and Magdalena Fuentes recently published research investigating modality contribution in Audio LLMs for music. The study, titled “Investigating Modality Contribution in Audio LLMs for Music,” focuses on an essential question: do these models genuinely ‘listen’ to audio, or do they primarily use textual reasoning? According to the announcement, previous benchmarks hinted at this ambiguity. The researchers adapted the MM-SHAP framework, which uses Shapley values to quantify the relative contribution of each modality, and evaluated two models on the MuChoMusic benchmark to understand how each one processes information. The team revealed their findings in a recent paper.
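
In concrete terms, MM-SHAP assigns each input token, audio patches as well as text tokens, a Shapley value and then reports what share of the total attribution each modality receives. The short sketch below illustrates that final ratio step only; the function name and the toy Shapley values are illustrative assumptions, not the authors’ implementation.

```python
# Minimal sketch of an MM-SHAP-style modality share (illustrative, not the paper's code).
# Assumption: per-token Shapley values have already been estimated, e.g. by masking
# subsets of audio patches and text tokens and measuring how the model's output changes.

def modality_shares(audio_shap, text_shap):
    """Fraction of total absolute Shapley mass attributed to audio vs. text."""
    audio_mass = sum(abs(v) for v in audio_shap)
    text_mass = sum(abs(v) for v in text_shap)
    total = audio_mass + text_mass
    if total == 0:
        return 0.5, 0.5  # degenerate case: no attribution signal at all
    return audio_mass / total, text_mass / total

# Hypothetical values for 4 audio patches and 3 text tokens.
a_share, t_share = modality_shares([0.02, -0.01, 0.03, 0.01], [0.10, -0.05, 0.08])
print(f"A-SHAP: {a_share:.2f}, T-SHAP: {t_share:.2f}")  # here the text share dominates
```

A text share near 1.0 would mean the model’s answer is driven almost entirely by the prompt, while a more balanced split would indicate that the audio genuinely shapes the response.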

Why This Matters to You

This research has practical implications for anyone using or developing Audio LLMs. It sheds light on how these systems arrive at their conclusions. For example, if you’re using an AI to generate music descriptions, knowing its reliance on text versus actual audio is crucial. It impacts the quality and authenticity of the output. “Audio Large Language Models enable human-like conversation about music,” the paper states. “Yet it is unclear if they are truly listening to the audio or just using textual reasoning.” This uncertainty directly affects your trust in these AI systems. How much do you trust an AI’s musical taste if it’s not truly ‘hearing’ the music?

Here are some key insights from the study:

  • Higher Accuracy, More Text Reliance: Models with better overall performance tended to lean more on text. This suggests that text data might be a shortcut for answering questions.
  • Audio’s Role in Specific Tasks: Even with low overall audio contribution, models successfully localized key sound events. This means audio isn’t entirely ignored.
  • MM-SHAP Application: This study marks the first application of the MM-SHAP framework to Audio LLMs, providing a new tool for future explainable AI research.

Imagine you ask an Audio LLM to describe the mood of a song. If it’s primarily using text tags like ‘upbeat’ or ‘sad,’ your experience might be less nuanced. However, if it can pinpoint a specific guitar riff or drum pattern, that’s a different story. This study helps us understand the difference. It also informs how developers can build more capable, genuinely ‘listening’ AI for music.

The Surprising Finding

Here’s the twist: of the two models evaluated, the one with higher overall accuracy actually relied more on text to answer questions. This challenges the intuitive idea that a better music AI would be more attuned to the audio itself; you might expect a ‘smarter’ model to lean harder on what it hears. Instead, current models seem to find textual patterns more efficient for general questions. Yet even when the overall audio contribution is low, the models can still localize key sound events, so the audio is not ignored entirely. The picture is a complex interplay between modalities rather than a simple case of one or the other: a selective ‘listening’ capability rather than a comprehensive understanding of the music.

What Happens Next

This research serves as a foundational step for future work in explainable AI and audio. We can expect to see more studies utilizing the MM-SHAP framework in the coming months. For example, developers might use these insights to train Audio LLMs to prioritize audio signals for certain tasks. This could lead to more musically intelligent AI by late 2026 or early 2027. Your future music AI might offer more detailed analyses of specific instruments or vocal techniques. Think of it as an AI becoming a better music critic. The industry implications are significant. Companies developing music recommendation systems or AI-powered music creation tools will need to consider these findings and aim to enhance the audio processing capabilities of their models. The paper states, “We hope it will serve as a foundational step for future research in explainable AI and audio.” For you, this means potentially richer and more accurate AI interactions with music. Look for AI systems that can explain why they made a certain musical judgment, not just what the judgment is.
