Why You Care
Ever rely on an AI to summarize a lengthy report or article for you? Do you trust that summary to be completely accurate? A new study reveals that current methods for checking the factual accuracy of AI summaries fall short, especially for longer documents. This directly impacts how much you can trust AI-generated content.
What Actually Happened
Researchers Zain Muhammad Mujahid, Dustin Wright, and Isabelle Augenstein conducted a systematic evaluation of six widely used reference-free factuality metrics, all originally developed for short-form summarization, according to the announcement. The team applied these metrics to long documents to test their reliability, probing robustness with seven factuality-preserving perturbations applied to summaries, including paraphrasing, simplification, and synonym replacement, as detailed in the paper. The study aimed to understand how well these metrics perform on complex, extended texts, and it found significant inconsistencies in their performance.
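To make the setup concrete, here is a minimal sketch of what such a perturbation probe might look like. Everything here is illustrative, not the authors' code: `score_factuality` is a crude word-overlap stand-in for the real metrics the study evaluates, and the synonym table is a toy example of one perturbation type.

```python
# Minimal sketch of a perturbation probe (illustrative; not the authors' code).
# score_factuality is a placeholder for any reference-free factuality metric.

SYNONYMS = {"rose": "increased", "profits": "earnings", "sharply": "steeply"}

def synonym_replace(summary: str) -> str:
    """One of the seven perturbation types: meaning-preserving word swaps."""
    return " ".join(SYNONYMS.get(word, word) for word in summary.split())

def score_factuality(document: str, summary: str) -> float:
    """Placeholder metric: fraction of summary words found in the source.
    A real metric would use entailment or question answering instead."""
    doc_words = set(document.lower().split())
    words = summary.lower().split()
    return sum(w in doc_words for w in words) / len(words)

document = "The company reported that profits rose sharply last quarter."
summary = "profits rose sharply last quarter."
perturbed = synonym_replace(summary)  # same meaning, different wording

# A robust metric should score both versions almost identically.
print(score_factuality(document, summary))    # 1.0
print(score_factuality(document, perturbed))  # 0.4 -- the drop exposes fragility
```

If a metric's score swings this much on a meaning-preserving edit, it is reacting to surface wording rather than to facts, which is exactly the failure mode the study probes.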
Why This Matters to You
If you use AI tools for summarizing, this research bears directly on the trustworthiness of the information you receive. Imagine you’re a content creator using AI to distill research papers. If the AI’s summary is factually inconsistent, your own content could be misleading. The study highlights that current metrics struggle with long-range dependencies and input length limitations.
Key Findings on Metric Reliability:
- Inconsistent Scores: Existing short-form metrics produce inconsistent scores for semantically equivalent summaries.
- Declining Reliability: Scores become less reliable for information-dense claims, especially those resembling many parts of the source.
- Context Sensitivity: Metrics are sensitive to retrieval context and claim information density.
- No Consistent Alignment: No metric consistently maintains factual alignment under long-context conditions.
For example, think of an AI summarizing a legal brief. A slight factual error could have serious implications. “Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies,” the paper states. How confident are you now in the AI summaries you use daily?
The Surprising Finding
Here’s the twist: the research shows that even when summaries are semantically equivalent—meaning they convey the same core information—existing short-form metrics still produce inconsistent scores. This is surprising because you would expect metrics designed to check facts to recognize similar content as equally factual. What’s more, the study revealed declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. This challenges the common assumption that more context always leads to better factual checks; too much similar information can actually confuse the metrics, making them less effective at verifying factual consistency. That means summaries of complex topics are especially vulnerable.
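A toy example can show why density hurts. Assume (as an illustration, not the paper's implementation) a retrieve-then-verify metric that checks each claim against its single best-matching source span: an information-dense claim draws its facts from several spans, so no single span supports it fully, and the near-tied retrieval scores make the final number unstable.

```python
# Toy illustration of the information-density effect, under an assumed
# retrieve-then-verify scorer (not the paper's implementation).

def support(span: str, claim: str) -> float:
    """Fraction of claim words found in one span (crude stand-in for NLI)."""
    span_words = set(span.lower().split())
    claim_words = claim.lower().split()
    return sum(w in span_words for w in claim_words) / len(claim_words)

spans = [
    "revenue rose 8 percent in Europe this year",
    "revenue rose 9 percent in Asia this year",
    "revenue rose 12 percent overall this year",
]
claim = "revenue rose 12 percent overall led by 9 percent growth in Asia"

for span in spans:
    print(round(support(span, claim), 2), "|", span)
# Scores cluster around 0.4-0.6 and none reaches 1.0, even though the claim
# is fully backed by the last two spans taken together.
```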
What Happens Next
The research points to concrete directions for improving factuality evaluation in AI summarization. Future work will likely focus on multi-span reasoning, according to the announcement; this involves checking facts across different parts of the source document. Context-aware calibration is another key area, meaning metrics would adapt to the specific content being summarized. Training on meaning-preserving variations should also improve robustness in long-form summarization: for example, AI tools might get better at treating paraphrased sentences as factually equivalent. You could see these improvements reach AI summarization tools over the next 12-24 months. For content creators, this means more reliable AI summaries are on the horizon. The team revealed they are releasing all code and data to help advance this research.
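As one illustration of what multi-span reasoning could look like, the sketch below pools evidence from several source spans instead of trusting only the best match. The helper names and the union-of-evidence rule are assumptions made for this example; the paper's proposed methods may differ.

```python
# Hypothetical multi-span verifier: pool evidence across the top-k spans
# instead of judging a claim by its single best match. Names and the
# pooling rule are illustrative assumptions, not the paper's method.

def covered_words(span: str, claim: str) -> set[str]:
    """Claim words that this span accounts for (stand-in for entailment)."""
    span_words = set(span.lower().split())
    return {w for w in claim.lower().split() if w in span_words}

def verify_multi_span(spans: list[str], claim: str, top_k: int = 2) -> float:
    """Union the evidence from the k most helpful spans, then score coverage."""
    claim_words = claim.lower().split()
    per_span = sorted((covered_words(s, claim) for s in spans),
                      key=len, reverse=True)
    evidence = set().union(*per_span[:top_k])
    return sum(w in evidence for w in claim_words) / len(claim_words)

spans = [
    "revenue rose 8 percent in Europe this year",
    "revenue rose 9 percent in Asia this year",
    "revenue rose 12 percent overall this year",
]
claim = "revenue rose 12 percent overall led by 9 percent growth in Asia"

print(verify_multi_span(spans, claim))  # ~0.83, above any single-span score
```

The design point is simple: a claim that is only partially supported by each individual span can still be fully grounded in the document as a whole, so an evaluator that reasons across spans recovers credit that a best-span evaluator loses.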
