Why You Care
Ever wonder if the AI judging your content is truly fair? Imagine relying on an AI to evaluate complex creative work or intricate code. A new paper, titled “Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges,” reveals a surprising truth about how Large Language Models (LLMs) perform as evaluators. This research could fundamentally change how we trust AI in essential assessment roles, impacting your projects and workflows.
What Actually Happened
Researchers have uncovered significant biases in Large Language Models (LLMs) when they act as judges, especially in complex evaluation scenarios. According to the paper, prior work focused mainly on simple evaluation settings, leaving the reliability of LLMs in more intricate tasks largely unstudied. These complex tasks involve multi-faceted rubrics, unstructured reference answers, and nuanced criteria. To address this, the team constructed ComplexEval, a new challenge benchmark that systematically exposes and quantifies biases caused by auxiliary information (extra details provided during evaluation). The study investigated six previously unexplored biases across fifteen different scenarios, the paper states. This work provides crucial insights for improving the accuracy of evaluation signals.
Why This Matters to You
This research directly impacts anyone using or developing AI for evaluation. If your AI judge is biased, your outcomes will be too. The study finds that all evaluated models are significantly susceptible to these biases, and that the bias magnitude scales directly with task complexity. In other words, the more challenging the evaluation, the more likely the AI is to be swayed by irrelevant information. Think of it as a human judge being influenced by a lawyer’s charisma rather than the facts. How can you ensure your AI evaluations are truly objective?
Here’s a breakdown of the study’s key findings regarding LLM judges:
| Finding Category | Implication for You |
|---|---|
| Significant Susceptibility | All models showed bias, affecting reliability. |
| Bias Scales with Complexity | Harder tasks mean more pronounced AI evaluation errors. |
| Large Reasoning Models (LRMs) Are Vulnerable | Even reasoning-focused models aren’t immune to these biases. |
For example, imagine you’re using an LLM to grade student essays. If the prompt includes extra, potentially misleading context, the AI might unconsciously favor certain responses. This could lead to unfair grading or inaccurate feedback for your students. As the research shows, “all evaluated models exhibit significant susceptibility to these biases, with bias magnitude scaling with task complexity.” Understanding these vulnerabilities is crucial for building trustworthy AI systems.
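To make that concrete, here is a minimal sketch of how you might measure this effect yourself. It is not the paper’s ComplexEval code: it assumes a hypothetical `call_judge` function that wraps whatever LLM API you use and returns a numeric rubric score, and it simply scores the same essay with and without auxiliary context.

```python
# Minimal sketch (not the paper's ComplexEval code): quantify how much
# auxiliary context shifts an LLM judge's score on the same essay.
# `call_judge` is a hypothetical wrapper around your LLM API that takes a
# prompt string and returns a numeric rubric score.

RUBRIC = "Score the essay from 1 to 10 for argument quality, evidence, and clarity."

def build_prompt(essay: str, auxiliary: str | None = None) -> str:
    """Assemble the judge prompt, optionally injecting auxiliary context."""
    parts = [RUBRIC, f"Essay:\n{essay}"]
    if auxiliary:
        # This is the extra, potentially misleading information under test.
        parts.append(f"Additional context:\n{auxiliary}")
    return "\n\n".join(parts)

def bias_delta(essay: str, auxiliary: str, call_judge) -> float:
    """Score the same essay with and without the auxiliary info; the gap is the bias signal."""
    clean_score = call_judge(build_prompt(essay))
    biased_score = call_judge(build_prompt(essay, auxiliary))
    return biased_score - clean_score
```

Averaging this delta over many essays gives a rough bias magnitude: a mean that stays consistently above or below zero suggests the auxiliary information is systematically swaying the judge rather than adding noise.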
The Surprising Finding
Here’s the twist: the research uncovered a paradoxical vulnerability in advanced models. While you might expect more capable AI to be less prone to such errors, the study found the opposite. Notably, Large Reasoning Models (LRMs), which are designed for complex problem-solving, showed unexpected susceptibility to these biases, according to the paper. This challenges the common assumption that more capable models automatically make more reliable evaluators. The team found that these LRMs were particularly vulnerable despite their stronger reasoning abilities, which suggests that simply increasing model size or reasoning ability doesn’t eliminate these evaluation pitfalls. It’s like giving a brilliant detective too much irrelevant information; they might still get distracted. The study highlights that even the most capable AI can be led astray by the ‘curse of knowledge’ when evaluating complex tasks.
What Happens Next
This research opens new avenues for developing more robust AI evaluation models. In the coming months, expect to see further work on mitigating these identified biases. The insights gained are crucial for improving the accuracy and verifiability of evaluation signals, the paper states. For example, future applications might involve training AI judges specifically to ignore biasing auxiliary information, which could lead to more reliable automated grading systems and content moderation tools. As a reader, you should consider these findings when designing your own AI evaluation pipelines: always question the context you provide to your LLM judges, as in the sketch below. The industry implications are significant, pushing developers to build evaluation models that generalize better and resist these biases. The team’s work is paving the way for a future where AI evaluations are not just powerful, but also demonstrably fair and accurate.
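One practical way to question that context is to whitelist the fields your judge actually needs before building the prompt. The sketch below assumes a simple dict-based pipeline with illustrative field names; adapt it to whatever structure your evaluation records use.

```python
# Minimal sketch, assuming a dict-based evaluation pipeline with illustrative
# field names: only whitelisted fields ever reach the judge prompt, so
# auxiliary metadata (author reputation, model name, stated confidence, ...)
# cannot bias the evaluation.

JUDGE_FIELDS = {"task", "rubric", "candidate_answer", "reference_answer"}

def sanitize_for_judge(record: dict) -> dict:
    """Drop every field the rubric does not require before prompting the judge."""
    dropped = sorted(set(record) - JUDGE_FIELDS)
    if dropped:
        print(f"Withholding auxiliary fields from judge: {dropped}")
    return {key: value for key, value in record.items() if key in JUDGE_FIELDS}

record = {
    "task": "Grade the essay.",
    "rubric": "Score 1 to 10 for argument quality.",
    "candidate_answer": "The essay text goes here.",
    "reference_answer": "A model answer goes here.",
    "author_note": "Written by an award-winning student.",  # auxiliary info that could bias the judge
}
clean_record = sanitize_for_judge(record)  # only the four whitelisted fields remain
```

Logging what gets withheld also gives you an audit trail, so you can later test whether any of those dropped fields would have shifted the judge’s scores.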
