Multimodal LLMs Struggle with Cross-Modal Skill Composition

New research reveals a significant gap in how AI combines skills across different data types.

A recent paper highlights that Multimodal Large Language Models (MLLMs) do not effectively combine skills learned from different modalities like text and images. This 'skill composition gap' persists even with advanced prompting and fine-tuning techniques, suggesting a fundamental challenge for advanced AI development.


By Sarah Kline

November 22, 2025

5 min read


Key Facts

  • Multimodal Large Language Models (MLLMs) struggle to optimally combine skills across different data modalities.
  • Researchers designed three evaluation tasks requiring sequential composition of two modality-dependent skills.
  • All evaluated MLLMs showed a significant 'cross-modality skill composition gap'.
  • Chain-of-thought prompting and specific fine-tuning improved performance but did not eliminate the gap.
  • The research indicates a need for more advanced techniques to enhance cross-modal skill composition in MLLMs.

Why You Care

Ever wonder why your super-smart AI assistant can describe an image perfectly but struggles to follow a complex visual instruction? It’s not just you. New research suggests that even the most advanced Multimodal Large Language Models (MLLMs) face significant hurdles when asked to combine skills from different data types, like vision and language. This is a big deal because it affects how well AI can truly understand and interact with our multi-sensory world, touching everything from smarter virtual assistants to more intuitive creative AI tools. Your experience with AI could be much smoother if this challenge is overcome.

What Actually Happened

A team of researchers, including Paula Ontalvilla, Aitor Ormazabal, and Gorka Azkune, recently published a paper titled “Multimodal LLMs Do Not Compose Skills Optimally Across Modalities.” Their study investigated how well MLLMs can perform skill composition: the ability to combine previously learned skills to solve new, more complex tasks. The research focused specifically on combining skills across different modalities, meaning different types of data, such as images and text. The team designed three evaluation tasks, each requiring MLLMs to sequentially compose two modality-dependent skills. They evaluated several open MLLMs under two primary settings: one directly prompted the model to solve the full task, while the other used a two-step cascaded inference approach that manually enforced the composition of the two skills. The findings revealed a substantial “cross-modality skill composition gap” across all evaluated MLLMs, according to the paper.
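To make the two evaluation settings concrete, here is a minimal Python sketch. The `query_mllm` helper and the composed task (transcribing text from an image, then identifying its language) are illustrative assumptions, not the paper’s actual code or benchmark tasks.

```python
# Minimal sketch of the two evaluation settings, under assumptions:
# `query_mllm` is a hypothetical stand-in for an open MLLM's inference
# API, and the composed task (transcribe text, then identify its
# language) is illustrative, not taken from the paper.

def query_mllm(prompt, image_path=None):
    """Hypothetical MLLM call; swap in a real inference client."""
    return "<model output>"

def direct_prompting(image_path):
    """Setting 1: the model must compose both skills on its own."""
    return query_mllm(
        "What language is the text in this image written in?",
        image_path,
    )

def cascaded_inference(image_path):
    """Setting 2: the composition is enforced manually in two steps."""
    # Skill 1 (visual): extract the text from the image.
    text = query_mllm("Transcribe the text in this image.", image_path)
    # Skill 2 (textual): identify the language of the extracted text.
    return query_mllm(f"What language is this text written in? Text: {text}")

if __name__ == "__main__":
    print(direct_prompting("sample.jpg"))
    print(cascaded_inference("sample.jpg"))
```

Roughly, the skill composition gap is how much worse the direct setting performs than the manually enforced cascade, even though both rely on the same underlying skills.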

Why This Matters to You

This skill composition gap means that even if an MLLM is excellent at recognizing objects in an image and great at generating descriptive text, it might struggle to follow an instruction like “describe the object in the top left corner that is red.” The model has the individual skills, but combining them for a specific, multi-modal command proves difficult. Imagine you’re using an AI art generator. You might want it to “create an image of a cat wearing a hat, with the hat being a specific shade of blue from a provided color palette.” The AI might understand ‘cat’ and ‘hat’ and ‘blue,’ but linking the specific visual input of the color palette to the ‘hat’ element can be problematic. This directly impacts the precision and creativity you can expect from such tools. The study finds that even with straightforward compositions, MLLMs exhibit this significant gap. What kinds of multi-modal tasks do you wish AI could handle more seamlessly in your daily life?

Key Findings on MLLM Performance:

  • Direct Prompting: MLLMs showed a significant skill composition gap when asked to solve tasks directly.
  • Cascaded Inference: Even when skills were manually enforced in a two-step process, the gap persisted.
  • Mitigation Strategies: Chain-of-thought prompting and specific fine-tuning improved performance but did not close the gap entirely.

According to the paper, “Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap.” This suggests that the problem isn’t just about how we ask the AI, but something deeper in its architecture or training. Your ability to create complex prompts and get accurate, multi-faceted responses from AI depends on addressing this issue.

The Surprising Finding

Here’s the twist: the researchers explored ways to mitigate this skill composition gap. They tried two main strategies. One was chain-of-thought prompting that explicitly instructs MLLMs to compose the required skills. The other was a specific fine-tuning recipe designed to promote skill composition. You might expect these strategies to largely solve the problem. However, the study finds that while both strategies improve model performance, the models still exhibit significant skill composition gaps. This is surprising because chain-of-thought prompting is often touted as a reliable technique for improving AI reasoning. The fact that it only partially helps suggests that the issue isn’t just about better instructions. It points to a more fundamental challenge in how these models integrate and apply knowledge across different data types, and it challenges the assumption that simply giving AI more context or specialized training will automatically lead to robust multimodal reasoning.
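For illustration, explicit chain-of-thought prompting for skill composition might look like the sketch below. The prompt wording is invented for this example, not quoted from the paper, and `query_mllm` is the same hypothetical helper as in the earlier sketch.

```python
# Illustrative chain-of-thought prompt that spells out the two-skill
# composition explicitly. The wording is an assumption, not the paper's.

def query_mllm(prompt, image_path=None):
    """Hypothetical MLLM call, as in the earlier sketch."""
    return "<model output>"

COT_PROMPT = (
    "Solve this task in two explicit steps.\n"
    "Step 1: Transcribe the text that appears in the image.\n"
    "Step 2: Using only your Step 1 transcription, determine what "
    "language the text is written in.\n"
    "Show your work for both steps, then state the final answer."
)

print(query_mllm(COT_PROMPT, "sample.jpg"))
```

Per the paper’s findings, prompts like this raise scores, yet the single composed call still falls short of the manually enforced two-call cascade.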

What Happens Next

This research clearly indicates that more work is needed to improve cross-modal skill composition in MLLMs. Over the next 12 to 18 months, we can expect researchers to delve deeper into novel architectural designs and new training methodologies. For example, future MLLMs might incorporate explicit modules for skill coordination, rather than relying on implicit learning. For content creators and developers, this means being aware of these limitations and designing AI applications with them in mind. For instance, if you’re building an AI tool that needs to understand both visual and textual cues, consider breaking complex tasks down into simpler, modality-specific steps, as in the sketch below. The authors conclude that “more research is needed to improve cross-modal skill composition in MLLMs.” This is not a quick fix but an active, ongoing area of research. Expect to see new papers and models emerging that specifically target this crucial aspect of AI intelligence.
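As one practical pattern, here is a way to enforce that decomposition at the application level. The `Stage` abstraction, the prompts, and the `query_mllm` helper are all hypothetical sketches under the same assumptions as before, not a published recipe.

```python
# Sketch of the workaround suggested above: chain modality-specific
# stages explicitly instead of issuing one composed instruction.
# `Stage`, the prompts, and `query_mllm` are illustrative assumptions.

from dataclasses import dataclass

def query_mllm(prompt, image_path=None):
    """Hypothetical MLLM call, as in the earlier sketches."""
    return "<model output>"

@dataclass
class Stage:
    prompt: str        # template; "{prev}" receives the prior stage's output
    needs_image: bool  # whether this stage should see the image

def run_pipeline(stages, image_path):
    """Run each stage in order, feeding every output into the next prompt."""
    output = ""
    for stage in stages:
        prompt = stage.prompt.format(prev=output)
        output = query_mllm(prompt, image_path if stage.needs_image else None)
    return output

# Example: a visual extraction stage, then a text-only reasoning stage.
pipeline = [
    Stage("List every object visible in this image.", needs_image=True),
    Stage("Which of these objects are edible? Objects: {prev}", needs_image=False),
]
print(run_pipeline(pipeline, "sample.jpg"))
```

The design choice here mirrors the paper’s cascaded setting: by keeping each stage single-skill and single-modality, the application avoids depending on composition the models have not yet mastered.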
