New Benchmark Reveals AI Research Agents Underperform

DeepResearch Bench II exposes significant gap between AI and human expert reporting.

A new benchmark, DeepResearch Bench II, has been introduced to rigorously evaluate Deep Research Systems (DRS). The findings reveal that even top AI models satisfy less than 50% of expert-derived evaluation criteria, highlighting a substantial performance gap.

By Katie Rowan

January 28, 2026

5 min read

Key Facts

  • DeepResearch Bench II is a new benchmark for evaluating Deep Research Systems (DRS).
  • It contains 132 grounded research tasks across 22 domains.
  • Reports are evaluated using 9,430 fine-grained binary rubrics across three dimensions: information recall, analysis, and presentation.
  • Rubrics were developed using an LLM+human pipeline with over 400 human-hours of expert review.
  • Even the strongest AI models satisfy fewer than 50% of the rubrics, showing a significant gap from human experts.

Why You Care

Ever wondered if the AI tools you rely on for research are truly accurate? Can they really deliver comprehensive, expert-level reports? A new evaluation tool, DeepResearch Bench II, suggests your favorite AI might not be as smart as you think. This benchmark reveals a significant gap between AI capabilities and human expertise in detailed research. What does this mean for your daily work and the future of AI-assisted knowledge?

What Actually Happened

Researchers have introduced DeepResearch Bench II, a new benchmark designed to evaluate Deep Research Systems (DRS). These systems aim to help users search the web and synthesize information into comprehensive investigative reports, according to the announcement. However, rigorously evaluating these AI systems has been a challenge. Existing benchmarks often failed to adequately test a system’s ability to analyze evidence, the paper states. They also relied on evaluation criteria that were either too broad or defined by other large language models (LLMs), which led to biased scores that were hard to verify, the team revealed. DeepResearch Bench II addresses these issues with a more rigorous evaluation method.

This new benchmark includes 132 grounded research tasks across 22 different domains. For each task, a DRS must produce a long-form research report. These reports are then evaluated using 9,430 fine-grained binary rubrics. These rubrics cover three essential dimensions: information recall, analysis, and presentation, as detailed in the blog post. All rubrics are derived from carefully selected expert-written investigative articles. They were constructed using a four-stage LLM-plus-human pipeline. This process combined automatic extraction with over 400 human-hours of expert review. This ensures the criteria are atomic, verifiable, and aligned with human expert judgment, the technical report explains.
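
To make the rubric-based scoring concrete, here is a minimal sketch of how binary rubrics might be represented and aggregated in Python. The paper’s actual schema is not published in this article, so the class names, field names, and dimension labels below are illustrative assumptions, not the benchmark’s real data model.

```python
from dataclasses import dataclass

# Illustrative data model only; DeepResearch Bench II's actual schema,
# field names, and dimension labels are assumptions for this sketch.

@dataclass
class Rubric:
    dimension: str   # assumed labels: "recall", "analysis", "presentation"
    criterion: str   # one atomic, verifiable statement about the report
    satisfied: bool  # binary judgment for a given system's report

@dataclass
class Task:
    domain: str            # one of the benchmark's 22 domains
    prompt: str            # the grounded research task
    rubrics: list[Rubric]  # fine-grained binary checks for this task

def satisfaction_rate(tasks: list[Task]) -> float:
    """Overall fraction of binary rubrics that a system's reports satisfy."""
    judgments = [r.satisfied for t in tasks for r in t.rubrics]
    return sum(judgments) / len(judgments)
```

Under this framing, the headline result is simply that the overall satisfaction rate stays below 0.5 even for the strongest systems tested.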

Why This Matters to You

Imagine you’re a content creator relying on AI to research complex topics. Or perhaps you’re a podcaster needing accurate, synthesized information quickly. This new benchmark directly impacts the quality of the AI-generated content you might use. The study finds that even the strongest deep research systems satisfy fewer than 50% of the rubrics. This reveals a substantial gap between current DRSs and human experts, according to the research. This means your AI assistant might be missing crucial details or making inaccurate connections. Do you really want to base your next project on incomplete or flawed information?

Consider these key evaluation dimensions (a scoring sketch follows the list):

  • Information Recall: How well does the AI remember and present relevant facts?
  • Analysis: Can the AI critically evaluate information and draw sound conclusions?
  • Presentation: Is the report clear, coherent, and well-structured, like an expert would write?
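
Building on the data-model sketch above, a per-dimension breakdown would show whether a system fails mostly on recall, analysis, or presentation. The aggregation below is a hypothetical illustration, not the benchmark’s published scoring code.

```python
from collections import defaultdict

def per_dimension_rates(tasks: list[Task]) -> dict[str, float]:
    """Satisfaction rate broken out by evaluation dimension (hypothetical)."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for task in tasks:
        for rubric in task.rubrics:
            totals[rubric.dimension] += 1
            hits[rubric.dimension] += int(rubric.satisfied)
    return {dim: hits[dim] / totals[dim] for dim in totals}
```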

For example, if you ask an AI to research the latest trends in sustainable energy, it might gather many facts. However, its ability to analyze conflicting data or present a nuanced perspective could be severely limited. “Even the strongest models satisfy fewer than 50% of the rubrics, revealing a substantial gap between current DRSs and human experts,” the authors state. This finding should make you re-evaluate your reliance on AI for deep, critical research tasks. It highlights the continuing need for human oversight and expertise in complex information synthesis.

The Surprising Finding

Here’s the twist: despite rapid advancements in AI, the performance gap between AI research agents and human experts remains surprisingly wide. Many might assume that with modern LLMs, AI could easily mimic human research capabilities. However, the evaluation on DeepResearch Bench II tells a different story. The team revealed that even the most capable deep research systems met fewer than 50% of the expert-derived evaluation criteria. This suggests that while AI can retrieve information, its ability to perform high-level analysis and present findings like a human expert is still quite limited. This finding challenges the common assumption that AI is on the verge of fully automating complex research tasks. It indicates that human critical thinking and nuanced understanding are still irreplaceable in many areas. The detailed, fine-grained rubrics, aligned with human expert judgment, are crucial for exposing this reality.

What Happens Next

This benchmark provides a clear roadmap for improving Deep Research Systems. Developers will likely focus on enhancing AI’s analytical and presentation capabilities over the next 12-18 months. Expect to see new models specifically trained to address the weaknesses identified by DeepResearch Bench II. For example, future AI research agents might incorporate more sophisticated reasoning modules or improved report generation algorithms that better mimic human expert writing styles. Our actionable advice for readers is to continue using AI as a tool for initial information gathering, but always apply your own critical analysis and verification for comprehensive reports. The industry implications are significant: this benchmark could become a standard for evaluating AI research tools, pushing developers to create more reliable and accurate systems. “All rubrics are derived from carefully selected expert-written investigative articles,” the documentation indicates. This ensures future AI development is guided by real-world expert standards.
