Why You Care
Ever watched an AI describe a video, only for it to get basic facts wrong? It’s frustrating, right? This common issue, known as ‘hallucination,’ plagues even advanced AI models. What if your AI assistant could accurately describe complex actions in a video without making things up?
New research introduces a structure that directly addresses this problem, which means more reliable AI descriptions for your video content. It promises to make multimodal LLMs (large language models that process various data types, such as text and video) far more trustworthy.
What Actually Happened
Researchers have developed a new structure designed to combat factual inaccuracies in multimodal LLMs (MLLMs). These MLLMs are models that generate descriptive captions for input videos, according to the announcement. However, they often ‘hallucinate,’ meaning they produce descriptions that are factually incorrect. This can lead to significant errors in understanding video content.
The new structure, named Self-Augmented Contrastive Alignment (SANTA), aims to mitigate both object and action hallucinations. Object hallucinations involve misidentifying items in a video. Action hallucinations mean misinterpreting what is happening. The team revealed that while previous efforts focused on static images, SANTA specifically tackles dynamic videos. It ensures faithfulness to visual facts by avoiding spurious correlations, as detailed in the blog post.
Why This Matters to You
Imagine you’re a content creator relying on AI to generate summaries for your video uploads. You need those summaries to be accurate. SANTA directly improves this accuracy, reducing the need for manual corrections. This structure helps MLLMs emphasize real visual facts.
SANTA employs a hallucinative self-augmentation scheme, according to the paper. This scheme identifies potential hallucinations within the MLLM itself, then transforms original captions into ‘contrasted negatives,’ helping the model learn what not to say. What’s more, the team developed a tracklet-phrase contrastive alignment, which matches regional objects and relation-guided actions in the video with their corresponding phrases in the caption. This ensures that the AI’s descriptions align precisely with the video’s content.
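To make the idea concrete, here is a minimal sketch of what a contrastive alignment objective of this kind can look like in PyTorch. This is not the authors’ code: the `tracklet_embs` and `phrase_embs` inputs and the temperature value are illustrative assumptions, and SANTA’s actual loss may differ in its details.

```python
# Minimal sketch of a tracklet-phrase contrastive alignment step.
# NOT the authors' implementation: the embedding inputs and the
# temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(tracklet_embs: torch.Tensor,
                               phrase_embs: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each visual tracklet should be closest to
    its matching caption phrase and far from all other phrases."""
    # Normalize so the dot product equals cosine similarity.
    v = F.normalize(tracklet_embs, dim=-1)   # (N, D) visual tracklets
    t = F.normalize(phrase_embs, dim=-1)     # (N, D) matched phrases
    logits = v @ t.T / temperature           # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: align tracklets->phrases and phrases->tracklets.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

The key property is that each matched tracklet-phrase pair is pulled together while every mismatched pair is pushed apart, which is what discourages the model from describing objects or actions that aren’t in the video.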
Key Benefits of SANTA for MLLMs
| Feature | Description |
| --- | --- |
| Object Faithfulness | Accurately identifies objects in video. |
| Action Faithfulness | Correctly describes actions and events. |
| Reduced Hallucinations | Significantly lowers factual inaccuracies in captions. |
| Dynamic Video Focus | Specifically designed for complex, moving imagery. |
One of the researchers stated, “We propose a Self-Augmented Contrastive Alignment (SANTA) structure for enabling object and action faithfulness by exempting the spurious correlations and enforcing the emphasis on visual facts.” This means your AI will be less likely to invent details. How much time could you save if your AI descriptions were consistently accurate?
The Surprising Finding
What’s particularly interesting is how SANTA tackles the problem. Rather than just trying to prevent hallucinations, it actively creates ‘negative’ examples, a bit like teaching someone what not to do. The structure uses its hallucinative self-augmentation scheme to identify potential errors, then transforms original captions into contrasted negatives. Showing the AI these incorrect possibilities helps it learn what a faithful description is not. The study finds that this method significantly outperforms existing approaches on hallucination evaluation benchmarks. Generating bad examples to teach good behavior is a counterintuitive but clever twist.
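As a toy illustration of the idea, the sketch below corrupts a caption by swapping a single object or action word, producing a fluent but factually wrong ‘contrasted negative.’ The word lists and swap rule here are hypothetical; the paper derives its negatives from the MLLM’s own hallucination tendencies rather than a fixed vocabulary.

```python
# Toy illustration of turning a caption into a "contrasted negative".
# The swap tables are hypothetical; SANTA mines negatives from the
# model's own hallucinations, not from a fixed word list.
import random

OBJECT_SWAPS = {"dog": "cat", "guitar": "violin", "ball": "frisbee"}
ACTION_SWAPS = {"running": "walking", "throwing": "catching"}

def make_contrasted_negative(caption: str) -> str:
    """Replace one object or action word so the caption stays fluent
    but becomes factually wrong, i.e., a hard negative for training."""
    words = caption.split()
    candidates = [i for i, w in enumerate(words)
                  if w.lower() in OBJECT_SWAPS or w.lower() in ACTION_SWAPS]
    if not candidates:
        return caption  # nothing to corrupt in this toy example
    i = random.choice(candidates)
    table = OBJECT_SWAPS if words[i].lower() in OBJECT_SWAPS else ACTION_SWAPS
    words[i] = table[words[i].lower()]
    return " ".join(words)

print(make_contrasted_negative("a dog is running after a ball"))
# e.g. -> "a dog is walking after a ball"
```

Training against such negatives gives the model an explicit signal for which details in a caption are wrong, rather than only rewarding captions that happen to be right.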
What Happens Next
This research, presented at the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2026, suggests a promising future. If adoption follows, techniques like SANTA could appear in commercial MLLMs within the next 12-18 months, bringing noticeable improvements to AI video analysis tools. For example, imagine a security system that uses MLLMs to detect unusual activity. With SANTA, it could provide much more reliable alerts, meaning fewer false alarms and more accurate incident reporting.
For content creators and AI developers, the actionable advice is to monitor the adoption of contrastive alignment techniques. These methods will likely become standard in future MLLM development. The industry implications are vast, promising more dependable AI assistants and automated content analysis. The team revealed that this work is a significant step towards more trustworthy AI systems. Your future interactions with AI-generated video content will likely be much more accurate and less prone to errors.
