AI's 'Telephone Game': How Unified Models Lose Meaning

New research reveals how AI models' understanding can drift when generating images from text and back again.

A new study introduces the Unified Consistency Framework for Unified Models (UCF-UM). It evaluates 'semantic drift' in AI models that handle both image understanding and generation. This framework uses a cyclic evaluation to show how meaning can be lost over multiple transformations.

By Mark Ellison

September 19, 2025

4 min read

Key Facts

  • The research introduces the Unified Consistency Framework for Unified Models (UCF-UM).
  • UCF-UM evaluates 'semantic drift' in AI models that perform both image-to-text (I2T) and text-to-image (T2I) tasks.
  • It uses a cyclic evaluation protocol, alternating between I2T and T2I over multiple generations.
  • Three new metrics are formulated: Mean Cumulative Drift (MCD), Semantic Drift Rate (SDR), and Multi-Generation GenEval (MGG).
  • Some models, like BAGEL, maintain semantics well, while others, like Vila-u, drift quickly despite strong single-pass scores.

Why You Care

Ever played the ‘telephone game’ where a message gets distorted with each retelling? Imagine that happening with AI. What if the AI you rely on to create images from text, or describe images, slowly loses the original meaning? A new study, “The Telephone Game: Evaluating Semantic Drift in Unified Models,” reveals this exact problem. This research introduces an essential new way to measure how accurately AI models maintain meaning. Understanding this ‘semantic drift’ is vital for anyone using or building AI. Your AI tools could be subtly changing your instructions without you even knowing it.

What Actually Happened

Researchers Sabbir Mollah, Rohit Gupta, Sirnam Swetha, Qingyang Liu, Ahnaf Munir, and Mubarak Shah have developed a new evaluation method. According to the announcement, this method assesses ‘semantic drift’ in unified models (UMs). These UMs are AI systems that can both understand images (image-to-text, or I2T) and generate images (text-to-image, or T2I). The team revealed that current evaluations often test these capabilities in isolation. They don’t check whether a model can consistently cycle between understanding and generating. The new method, called the Unified Consistency Framework for Unified Models (UCF-UM), uses a cyclic evaluation protocol. It alternates between I2T and T2I over multiple ‘generations.’ This process quantifies how much meaning is lost, or ‘semantic drift,’ during these transformations.
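To make the protocol concrete, here is a minimal sketch of what such a cyclic loop might look like. The `generate_image` and `describe_image` callables are hypothetical stand-ins for a unified model’s T2I and I2T interfaces; this illustrates the idea only and is not the authors’ released code.

```python
def telephone_game(initial_prompt, generate_image, describe_image, num_generations=5):
    """Alternate T2I and I2T, keeping every intermediate caption and image for scoring.

    generate_image: callable taking a caption and returning an image (T2I step).
    describe_image: callable taking an image and returning a caption (I2T step).
    Both are hypothetical stand-ins for a unified model's two interfaces.
    """
    captions = [initial_prompt]
    images = []
    caption = initial_prompt
    for _ in range(num_generations):
        image = generate_image(caption)   # text -> image
        caption = describe_image(image)   # image -> text
        images.append(image)
        captions.append(caption)
    return captions, images
```

Each pass through the loop is one ‘generation’; comparing later captions and images against the original prompt is what exposes the drift.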

Why This Matters to You

This research directly impacts the reliability of visual language models (VLMs). Think about how you use AI. If you ask an AI to generate an image, then ask another AI to describe that image, you expect consistency. The study finds that this consistency is not always maintained. The UCF-UM framework introduces three key metrics to measure this drift (a simplified scoring sketch follows the list):

  • Mean Cumulative Drift (MCD): This metric measures the overall semantic loss using embedding-based comparisons.
  • Semantic Drift Rate (SDR): This metric summarizes how quickly meaning decays over successive cycles.
  • Multi-Generation GenEval (MGG): This score assesses object-level compliance, extending existing evaluation methods.
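The paper defines the exact formulas for these metrics; the sketch below only illustrates the embedding-based idea behind the first two. It assumes a hypothetical `embed` function (for example, a CLIP or sentence-embedding text encoder) that maps a caption to a vector.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cumulative_drift(captions, embed):
    """Average semantic distance of each generation's caption from the original prompt.

    Rough stand-in for an MCD-style score; the paper's formula may differ.
    """
    ref = embed(captions[0])
    drifts = [1.0 - cosine(ref, embed(c)) for c in captions[1:]]
    return float(np.mean(drifts))

def drift_rate(captions, embed):
    """Per-generation slope of drift: how quickly meaning decays across cycles.

    Crude stand-in for an SDR-style rate, using a linear fit over generations.
    """
    ref = embed(captions[0])
    drifts = [1.0 - cosine(ref, embed(c)) for c in captions[1:]]
    gens = np.arange(1, len(drifts) + 1)
    slope = np.polyfit(gens, drifts, 1)[0]
    return float(slope)
```

In this sketch, a higher cumulative drift means more of the original meaning has been lost, and a steeper rate means it is lost faster from one generation to the next.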

Imagine you’re a content creator using AI to generate variations of an image. You start with a text prompt, get an image, then feed that image’s description back into the AI. Will the new image still match your original intent? This study suggests it might not. The team revealed that some models, like BAGEL, maintain semantics well. However, others, such as Vila-u, drift quickly. This happens despite Vila-u showing strong performance in traditional single-pass evaluations. How confident are you that your AI-generated content truly reflects your original vision after a few iterations?

The Surprising Finding

The most surprising finding from this research challenges common assumptions about AI model performance. The study finds that models performing well on individual tasks (like generating an image from text, or describing an image) might still fail at maintaining meaning over cycles. As the study notes, “Existing evaluations consider these capabilities in isolation.” This means a model could get high scores for image generation and image understanding separately. However, it could still perform poorly when asked to repeatedly convert text to image and back to text. For example, a model might perfectly generate ‘a red car on a bridge’ from text. But if you then ask it to describe that image, and then generate a new image from that description, the ‘red car’ might become ‘an orange vehicle’ or even disappear entirely. This highlights that strong single-pass scores do not guarantee ‘cross-modal stability.’

What Happens Next

This new evaluation framework, UCF-UM, provides a crucial tool for developers and users. The research indicates that future AI models will need to prioritize ‘cyclic consistency.’ We can expect model developers to focus on improving this aspect in the coming months and quarters. For example, a unified model designed for creative workflows might integrate UCF-UM metrics into its training. This would ensure that iterative design processes maintain semantic integrity. For you, this means demanding more rigorous evaluations from your AI tools. Look for assurances that models can maintain meaning across multiple transformations. This research will push the industry towards more reliable and consistent visual language models. The team revealed that their work provides “practical metrics to consistently assess unified model’s cross-modal stability and strength of their shared representations.”
