New Benchmark Reveals AI Research Agents Struggle with Depth

ResearchRubrics highlights current limitations in deep research capabilities of leading AI models.

A new benchmark called ResearchRubrics, developed with over 2,800 hours of human effort, reveals that even advanced AI research agents such as Google's Gemini Deep Research and OpenAI's Deep Research struggle with complex, open-ended queries. The benchmark evaluates factual grounding, reasoning, and clarity, showing significant room for improvement in AI's ability to conduct deep research.

By Katie Rowan

November 16, 2025

4 min read

Key Facts

  • ResearchRubrics is a new benchmark for evaluating Deep Research (DR) agents.
  • It includes over 2,800 hours of human labor and 2,500 expert-written rubrics.
  • Leading AI agents like Gemini's DR and OpenAI's DR achieved under 68% average compliance with the rubrics.
  • AI agents struggled primarily with missed implicit context and inadequate reasoning.
  • The benchmark introduces a complexity framework with conceptual breadth, logical nesting, and exploration axes.

Why You Care

Have you ever wondered if an AI could truly conduct complex research like a human expert? A new benchmark suggests the answer is: not quite yet. Researchers have introduced “ResearchRubrics,” a comprehensive tool designed to evaluate how well AI models handle deep research tasks. This matters to you because it sheds light on the actual capabilities of AI assistants we increasingly rely on, from crafting reports to synthesizing information.

What Actually Happened

A team of researchers, including Manasi Sharma and 15 other authors, has unveiled ResearchRubrics, as detailed in the paper. This new benchmark aims to standardize the evaluation of Deep Research (DR) agents. DR agents are AI applications that use large language models (LLMs) to answer open-ended questions, according to the announcement. These agents need to perform multi-step reasoning, combine information from various documents, and create long, evidence-based answers. The challenge in evaluating DR has been the lengthy and diverse nature of AI responses. Many valid solutions can exist, and information sources often change. ResearchRubrics addresses this by providing realistic, domain-diverse prompts. It also includes over 2,500 expert-written, fine-grained rubrics to assess performance.
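
To make the rubric idea concrete, here is a minimal Python sketch of what a rubric-driven evaluation record could look like. The class names, fields, and sample criteria are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    description: str  # e.g. "Cites sources for every quantitative claim"
    category: str     # e.g. "factual grounding", "reasoning", "clarity"

@dataclass
class TaskEvaluation:
    prompt: str
    criteria: list[RubricCriterion]
    satisfied: list[bool] = field(default_factory=list)  # one judgment per criterion

    def compliance(self) -> float:
        """Fraction of rubric criteria the agent's response satisfied."""
        if not self.satisfied:
            return 0.0
        return sum(self.satisfied) / len(self.satisfied)

# Hypothetical task with three fine-grained criteria, two of them met.
task = TaskEvaluation(
    prompt="Compare three approaches to grid-scale energy storage.",
    criteria=[
        RubricCriterion("Covers at least three distinct technologies", "conceptual breadth"),
        RubricCriterion("Cites sources for every quantitative claim", "factual grounding"),
        RubricCriterion("States trade-offs rather than just listing facts", "reasoning"),
    ],
    satisfied=[True, True, False],
)
print(f"{task.compliance():.0%}")  # -> 67%
```

Fine-grained yes/no criteria like these make long, open-ended answers comparable: instead of grading a whole report at once, a judge checks each criterion independently.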

Why This Matters to You

This new benchmark isn’t just for academics; it has real implications for anyone using or developing AI. Imagine you’re a content creator relying on AI for factual research. This benchmark helps ensure the AI tools you use are truly reliable. The research shows that evaluating DR agents effectively requires specific criteria. ResearchRubrics assesses factual grounding, reasoning soundness, and clarity, according to the paper. This means your AI assistant should not only find facts but also explain them logically.

What’s more, the team developed a new complexity framework. This framework categorizes DR tasks along three axes (see the code sketch after the list):

  • Conceptual Breadth: How many different topics does the AI need to cover?
  • Logical Nesting: How many layers of reasoning are required?
  • Exploration: How much new information does the AI need to discover?
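
For illustration, here is a minimal Python sketch of how a task could be tagged along these three axes. The integer scales and the difficulty thresholds are assumptions made for demonstration; the paper's actual encoding may differ.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskComplexity:
    conceptual_breadth: int  # number of distinct topics the task spans
    logical_nesting: int     # depth of dependent reasoning steps
    exploration: int         # how much unseen information must be discovered

    def label(self) -> str:
        """Collapse the three axes into a rough difficulty label (assumed thresholds)."""
        score = self.conceptual_breadth + self.logical_nesting + self.exploration
        return "hard" if score >= 7 else "moderate" if score >= 4 else "easy"

# Example: a broad, moderately nested task requiring substantial discovery.
task = TaskComplexity(conceptual_breadth=3, logical_nesting=2, exploration=3)
print(task.label())  # -> "hard"
```

Separating the axes matters because a task can be hard in different ways: a survey-style prompt may be broad but shallow, while a forecasting question may be narrow but deeply nested.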

This structure helps us understand the nuances of AI research capabilities. “Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources,” the authors state. This highlights why a structured evaluation is so crucial. Do you ever find yourself double-checking AI-generated content for accuracy? This benchmark aims to reduce that need by pushing AI to higher standards.

The Surprising Finding

Here’s the twist: even the most advanced AI research agents are not as proficient as you might think. The study finds that leading DR systems, such as Gemini’s DR and OpenAI’s DR, achieved under 68% average compliance with the ResearchRubrics criteria. That figure is surprisingly low for systems of this prominence. The primary reasons for this underperformance, the team revealed, were missed implicit context and inadequate reasoning about retrieved information. This challenges the common assumption that LLMs inherently understand complex queries deeply. It shows that while they can retrieve information, synthesizing it with sound reasoning remains a significant hurdle.
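
For a sense of what “under 68% average compliance” could mean in practice, here is a back-of-the-envelope sketch, assuming compliance is the per-task fraction of satisfied rubric criteria, averaged over tasks. The sample numbers are invented for illustration.

```python
def average_compliance(per_task_satisfied: list[tuple[int, int]]) -> float:
    """per_task_satisfied: (criteria met, criteria total) for each task."""
    fractions = [met / total for met, total in per_task_satisfied if total > 0]
    return sum(fractions) / len(fractions)

# Three hypothetical tasks: an agent hovering around the reported ceiling.
print(average_compliance([(7, 10), (5, 8), (9, 12)]))  # ≈ 0.69
```

Under this reading, an agent at 68% is failing roughly one in three expert-written criteria per task, which is consistent with systematic gaps like missed implicit context rather than occasional slips.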

What Happens Next

Looking ahead, the release of ResearchRubrics (including all prompts, rubrics, and evaluation code) aims to accelerate progress in AI research assistants. We can expect developers to use this benchmark to refine their models over the next 12-18 months. For example, future AI research agents might be specifically trained to improve their contextual understanding. This could lead to AI tools that provide more accurate and well-reasoned reports for your projects. The industry implications are significant, pushing AI developers to focus on deeper reasoning rather than just information retrieval. The paper indicates that this will facilitate progress toward well-justified research assistants. Your feedback on AI tool performance will also become more valuable as these benchmarks become standard.
