VLMs Struggle with Human-Like Common Ground, New Study Finds

Research reveals AI models fall short in interactive communication despite task success.

A new study introduces a four-metric suite to evaluate how Vision Language Models (VLMs) build common ground. It found that current VLMs diverge significantly from human communication patterns, even when achieving task success. This research highlights limitations in AI's interactive reasoning abilities.


By Mark Ellison

September 18, 2025

4 min read


Key Facts

  • A new four-metric suite evaluates VLM performance in interactive grounding contexts.
  • The study involved 150 self-play sessions with three proprietary VLMs.
  • All three tested VLMs diverged from human communication patterns on at least three metrics.
  • GPT-4o-mini was the closest VLM to human performance, but still showed differences.
  • Task success scores do not reliably indicate successful grounding in VLMs.

Why You Care

Ever wonder if an AI truly understands you, or just gives the right answer? What if an AI completes a task but doesn’t really ‘get’ the conversation? This new research challenges our assumptions about AI understanding. It reveals that even Vision Language Models (VLMs) struggle with something fundamental to human interaction: building common ground. This matters because it impacts how effectively you can communicate with AI in the future.

What Actually Happened

A recent paper, “Measuring How (Not Just Whether) VLMs Build Common Ground,” introduces a novel way to assess AI communication. According to the announcement, researchers developed a four-metric suite that evaluates VLM performance in interactive grounding contexts. Grounding is the process by which people develop shared understanding through ongoing communication. The study deployed this suite on 150 self-play sessions involving interactive referential games. Three proprietary VLMs were tested, and their performance was compared to human dyads (pairs). The technical report explains that current benchmarks often evaluate VLMs in single-turn settings, whereas real communication is interactive. This new approach offers a more comprehensive view of AI’s interactive capabilities.
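To make the setup concrete, here is a minimal Python sketch of what one self-play referential game could look like: a “director” agent describes a target image and a “matcher” agent tries to pick it out of a set. This is a hypothetical illustration only; the query_vlm stub, the prompts, and the stopping rule are placeholders, not the researchers’ actual protocol.

    # Hypothetical sketch of a self-play referential game; not the paper's protocol.
    import random

    def query_vlm(role, images, history):
        # Stub standing in for a real VLM API call. A real implementation would
        # send the images and dialogue history to the model and return its reply.
        if role == "director":
            return "The target shows a red mug on a wooden table."
        return f"I pick image {random.randrange(len(images))}."

    def self_play_session(images, target_index, max_turns=6):
        # The director describes the target image; the matcher tries to identify it.
        history = []
        for _ in range(max_turns):
            clue = query_vlm("director", [images[target_index]], history)
            history.append(("director", clue))
            guess = query_vlm("matcher", images, history)
            history.append(("matcher", guess))
            if "I pick image" in guess:
                chosen = int(guess.rstrip(".").split()[-1])
                return history, chosen == target_index
        return history, False

    # One toy session over four candidate images.
    transcript, solved = self_play_session(["img0", "img1", "img2", "img3"], target_index=2)
    print(solved, len(transcript))

Recording the full transcript alongside the binary outcome is what makes it possible to score how the agents communicated, not just whether they succeeded.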

Why This Matters to You

This research has direct implications for your interactions with AI. Imagine trying to explain a complex idea to an AI assistant. If it lacks true common ground, your conversation might feel disjointed. The study found that all three models diverged from human patterns on at least three of the four metrics. GPT-4o-mini was identified as the closest to human performance overall, but it still showed significant differences. This points to a gap in how current VLMs process information and adapt to their conversation partner. Do you ever feel like you’re talking at an AI, rather than with it? This study helps explain why.

For example, think about collaboratively designing a room with an AI. You might say, “Let’s put the ‘cozy’ chair by the window.” A VLM might identify a chair and a window, but it might not grasp the nuanced concept of ‘cozy’ in your shared context. This is where common ground becomes crucial. As detailed in the blog post, the study provides a framework for future VLM research.

Key Metrics for VLM Grounding (illustrated in the sketch after this list):

  • Grounding Efficiency: How quickly a shared understanding is achieved.
  • Content Alignment: How well the communicated utterances align with the visual content being described.
  • Lexical Adaptation: How the VLM adjusts its language to its partner.
  • Human-Likeness: How similar its interactive patterns are to humans.
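To give a rough sense of how metrics like these could be operationalized, the following Python sketch computes two simplified stand-ins: an efficiency score that discounts task success by the words spent, and a lexical adaptation score based on vocabulary reuse. These formulas are illustrative assumptions, not the definitions used in the paper.

    # Simplified, illustrative proxies for two of the metrics; not the paper's definitions.

    def grounding_efficiency(turns, task_solved):
        # Rough proxy: task success discounted by the total words spent reaching it.
        total_words = sum(len(utterance.split()) for _, utterance in turns)
        return (1.0 if task_solved else 0.0) / max(total_words, 1)

    def lexical_adaptation(turns, speaker="model"):
        # Rough proxy: fraction of the speaker's words that reuse vocabulary
        # the partner introduced earlier in the dialogue.
        partner_vocab, reused, produced = set(), 0, 0
        for who, utterance in turns:
            words = [w.strip(".,?!") for w in utterance.lower().split()]
            if who == speaker:
                produced += len(words)
                reused += sum(1 for w in words if w in partner_vocab)
            else:
                partner_vocab.update(words)
        return reused / max(produced, 1)

    # Toy dialogue in (speaker, utterance) form, plus a task outcome.
    session = [
        ("human", "I see a wooden chair with a striped cushion."),
        ("model", "Is the striped cushion chair next to the window?"),
        ("human", "Yes, that one."),
    ]
    print(grounding_efficiency(session, task_solved=True))
    print(lexical_adaptation(session, speaker="model"))

Measured this way, a model could score well on adaptation while still failing the task, or vice versa, which is exactly the kind of gap the suite is designed to expose.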

The Surprising Finding

Here’s the twist: simply succeeding at a task doesn’t mean an AI truly understands. The research shows that task success scores do not reliably indicate successful grounding. What’s more, high image-utterance alignment does not necessarily predict task success. This challenges a common assumption in AI development: that if an AI completes a task, it must have understood the underlying concepts. This study reveals a deeper issue. An AI might correctly identify objects in an image and respond appropriately, yet fail to build a shared communicative foundation with its user. This means your VLM could be ‘getting it right’ without truly ‘getting it.’ The team revealed that this disconnect is a significant hurdle for more natural AI interaction.

What Happens Next

This research provides a clear roadmap for improving Vision Language Models. Over the next 12-18 months, we can expect developers to focus on these new metrics as they aim to enhance VLM interactive capabilities. For example, future AI assistants might undergo training specifically designed to improve lexical adaptation, making their responses feel more natural and responsive to your unique communication style. The industry implications are significant: we could see a shift in how VLMs are benchmarked and developed. The paper states that its metric suite offers a framework for future research, which will likely lead to more human-centric AI designs. To make the most of future VLMs, pay attention to how they adapt to your language. Look for models that demonstrate better interactive grounding, not just task completion. This will be key for more intuitive AI experiences.
