New AI Benchmark MMDocIR Boosts Multimodal Document Search

Researchers introduce MMDocIR, a benchmark designed to enhance how AI systems find information within complex, long documents.

A new benchmark called MMDocIR has been developed to improve multimodal document retrieval. It helps AI systems better understand and extract information from long documents, including figures, tables, and charts. This tool aims to make document search more accurate and efficient.

By Mark Ellison

November 10, 2025

4 min read

Key Facts

  • MMDocIR is a new benchmark for multimodal document retrieval.
  • It evaluates AI performance in page-level and layout-level retrieval tasks.
  • The dataset includes 1,685 expert-annotated questions and 173,843 bootstrapped questions.
  • Visual retrievers significantly outperform text-only retrievers.
  • Text retrievers using VLM-text are superior to those using OCR-text.

Why You Care

Ever struggled to find that one specific chart or table buried deep within a lengthy PDF report? Imagine an AI that could pinpoint exactly what you need, not just based on text, but on visuals too. How much time would that save you every day?

Researchers have unveiled MMDocIR, a new benchmark for multimodal document retrieval. It promises to significantly improve how AI systems understand and extract information from long, complex documents, directly impacting your ability to quickly find crucial data in reports, manuals, and academic papers.

What Actually Happened

Researchers have introduced MMDocIR, a new benchmark for evaluating multimodal document retrieval, according to the announcement. This benchmark addresses a significant gap in current AI capabilities: multimodal document retrieval aims to identify and retrieve various forms of content, including figures, tables, charts, and layout information, from extensive documents.

MMDocIR encompasses two distinct tasks. Page-level retrieval evaluates how well a system identifies the most relevant pages within a long document. Layout-level retrieval assesses the ability to detect specific layout elements, such as textual paragraphs, equations, figures, tables, or charts. The benchmark also includes a rich dataset: 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource for both training and evaluating AI models.
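To make the two tasks concrete, here is a minimal sketch of what a benchmark record and a page-level metric might look like. The field names and the recall@k metric below are illustrative assumptions, not MMDocIR's actual schema or official evaluation protocol.

```python
# Hypothetical MMDocIR-style record; the real schema may use different fields.
example = {
    "question": "What was Q3 revenue growth?",
    "doc_id": "report_042",
    "evidence_pages": [7, 8],   # ground truth for page-level retrieval
    "evidence_layouts": [31],   # ground truth for layout-level retrieval
}

def recall_at_k(retrieved: list[int], relevant: set[int], k: int) -> float:
    """Fraction of ground-truth items found among the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

# Suppose a retriever ranked pages [8, 2, 7, 5] for the question above:
print(recall_at_k([8, 2, 7, 5], set(example["evidence_pages"]), k=3))  # -> 1.0
```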

Why This Matters to You

This new benchmark has practical implications for anyone working with digital documents. Think of the frustration of sifting through dozens of pages to find a single data point. MMDocIR aims to make that a thing of the past. It helps AI systems understand the visual context of information, not just the words.

For example, imagine you are a financial analyst. You need to quickly find all the revenue projection charts across 50 different company reports. With improved multimodal retrieval, an AI could instantly pull up those specific charts for you. This saves hours of manual searching. What if your AI could not only find the right page but also highlight the exact table you need?

MMDocIR’s focus on both page-level and layout-level retrieval means more precise results for you. “Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents,” the paper states. This means your search queries will yield more accurate and targeted information.

MMDocIR’s Two Key Retrieval Tasks

  • Page-level: identifies the most relevant pages within a long document. Benefit for you: quickly narrows your search to specific sections.
  • Layout-level: detects specific elements, like tables, figures, or equations, on a page. Benefit for you: pinpoints exact data points, saving time and effort.
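The two tasks compose naturally into a coarse-to-fine pipeline: first narrow a long document to candidate pages, then pinpoint layout elements on those pages. Below is a minimal sketch of that flow; the `embed` placeholder is a hypothetical stand-in for a real visual or text retriever, not anything from the paper.

```python
import hashlib
import numpy as np

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Placeholder encoder: deterministic random vectors stand in for a
    real visual or text embedding model."""
    seed = int(hashlib.md5(text.encode()).hexdigest()[:8], 16)
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

def top_k(query: str, items: list[str], k: int) -> list[int]:
    """Indices of the k items most similar to the query (cosine similarity)."""
    q = embed(query)
    scores = [float(embed(item) @ q) for item in items]
    return sorted(range(len(items)), key=lambda i: -scores[i])[:k]

# Stage 1 (page-level): narrow a long document to candidate pages.
pages = ["intro and methodology ...", "revenue tables and charts ...", "appendix ..."]
candidates = top_k("revenue projection chart", pages, k=2)

# Stage 2 (layout-level): pinpoint elements on the shortlisted pages.
layouts = {
    1: ["paragraph: outlook summary", "chart: revenue projections by quarter"],
}
for page in candidates:
    for element in layouts.get(page, []):
        print(f"page {page}: {element}")
```

Running retrieval in two stages keeps the finer-grained layout-level matching confined to a handful of shortlisted pages, which is what makes search over very long documents tractable.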

The Surprising Finding

The research also revealed some unexpected insights into how different AI retrievers perform. The study demonstrates that visual retrievers significantly outperform their text counterparts, challenging the common assumption that text-based search is always superior for document analysis. What’s more, text retrievers leveraging VLM-text (text generated by a vision-language model) significantly outperform retrievers relying on OCR-text (optical character recognition output), according to the announcement. This suggests that AI models which combine visual understanding with language processing are far more effective: visual context helps AI better interpret the text it reads.
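A rough way to see why VLM-text can beat OCR-text: OCR emits raw tokens with little structure, while a vision-language model describes what the page actually shows, giving a retriever more to match against. The toy scoring function and page texts below are illustrative assumptions, not the paper's method.

```python
import re

def lexical_overlap(query: str, doc: str) -> float:
    """Crude relevance score: fraction of query tokens found in the document."""
    def tokens(s: str) -> set[str]:
        return set(re.findall(r"[a-z0-9.]+", s.lower()))
    q = tokens(query)
    return len(q & tokens(doc)) / len(q)

query = "revenue comparison chart 2023 vs 2024"

# Raw OCR output: the characters are correct, but there is little context.
ocr_text = "rev 2023 2024 12.1 14.8 fig3"
# VLM description of the same page region: names the chart and its contents.
vlm_text = "Figure 3: bar chart comparing revenue, 12.1M in 2023 vs 14.8M in 2024"

print(lexical_overlap(query, ocr_text))  # ~0.33
print(lexical_overlap(query, vlm_text))  # ~0.83
```

The toy example only checks word overlap, but the same intuition carries over to embedding-based retrievers: richer page descriptions give the model more signal to match against.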

What Happens Next

This benchmark, accepted to EMNLP-2025, points to a future where document search is far more intuitive. We can expect AI tools to integrate MMDocIR’s findings within the next 12-18 months. Future versions of document management software, for example, could offer visual search capabilities, letting you search for specific chart types or data layouts. Keep an eye on AI-powered research tools: they will likely adopt these multimodal retrieval techniques. The industry implications are vast. We could see a new standard for information retrieval that moves beyond simple keyword search, ultimately making your digital document interactions much more efficient.
