AI's Hidden Biases: How LLMs Cheat on Multiple-Choice Tests

New research reveals large language models exploit subtle cues in evaluations, not just true understanding.

A recent paper titled “ABCD: All Biases Come Disguised” uncovers how large language models (LLMs) exhibit ‘label-position-few-shot-prompt bias’ on multiple-choice questions: the models often rely on answer position, option labels, or prompt patterns instead of actual knowledge. The researchers propose a new evaluation method that reduces these biases, leading to more robust and accurate assessments of AI capabilities.

By Mark Ellison

February 26, 2026

4 min read

Key Facts

  • LLMs exhibit ‘label-position-few-shot-prompt bias’ on multiple-choice questions.
  • Biased models lean on answer position, option labels, or patterns in the few-shot prompt rather than knowledge alone.
  • A new bias-reduced evaluation protocol replaces option labels with uniform, unordered markers.
  • The protocol improves robustness and lowers the standard deviation across answer permutations.
  • It reduces mean accuracy variance threefold across benchmarks and models.

Why You Care

Ever wonder if your AI assistant truly understands your questions, or if it’s just really good at guessing? What if the AI models we rely on are acing tests not by intelligence, but by spotting hidden patterns in the questions themselves? This new research from Mateusz Nowak and his team reveals a surprising truth about how large language models (LLMs) are evaluated. Understanding these biases can help you better interpret AI performance and build more reliable AI applications.

What Actually Happened

Researchers Mateusz Nowak, Xavier Cadet, and Peter Chin have published a paper titled “ABCD: All Biases Come Disguised,” which examines how large language models (LLMs) are evaluated with multiple-choice question (MCQ) benchmarks. The team found that LLMs often display a specific type of bias, which the paper calls ‘label-position-few-shot-prompt bias’: a model may use the answer’s position, the label in front of it, or patterns from the few-shot examples in the prompt, combining these cues to answer MCQs rather than relying solely on its learned knowledge. To counter this, the researchers propose a bias-reduced evaluation protocol that replaces the standard labels with uniform, unordered ones and prompts the LLM to consider the entire answer text it selects.
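To make the contrast concrete, here is a minimal sketch of the two prompt styles. The paper’s exact templates are not reproduced here, so the formats and helper names below are assumptions for illustration.

```python
# Hypothetical sketch of the two prompt styles; formats are assumptions.

def standard_mcq_prompt(question: str, options: list[str]) -> str:
    """Classic format: ordered 'A) ...' labels hand the model
    positional and label cues it can latch onto."""
    labels = ["A", "B", "C", "D"]
    body = "\n".join(f"{lab}) {opt}" for lab, opt in zip(labels, options))
    return f"{question}\n{body}\nAnswer with a letter:"

def bias_reduced_prompt(question: str, options: list[str]) -> str:
    """Bias-reduced format: every option carries the same uniform,
    unordered marker, and the model must repeat the full answer
    text instead of naming a label."""
    body = "\n".join(f"- {opt}" for opt in options)
    return f"{question}\n{body}\nRespond with the complete text of the correct answer:"
```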

Why This Matters to You

This finding has practical implications for anyone developing or using AI. If an LLM performs well on a benchmark, you might assume it truly understands the subject; this research suggests it may instead be exploiting evaluation artifacts. Imagine you’re using an LLM for critical decision-making: you need to know its answers rest on genuine understanding, not superficial cues. The proposed evaluation method, according to the research, improves robustness and reduces variability in results.

Here’s how the new protocol changes things:

  • Standard Evaluation: Options carry fixed, ordered labels (‘A’, ‘B’, ‘C’, ‘D’), giving models positional and label cues.
  • Biased LLM Behavior: Models may pick answers by position (e.g., always choosing ‘B’), by label, or by mimicking patterns in the example questions.
  • New Protocol: Replaces labels with neutral, unordered markers, forcing the LLM to process the full answer content.
  • Improved Robustness: Results stay consistent when answer options are rearranged.
  • Lower Variance: Accuracy varies far less across different answer permutations (see the sketch below).
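A minimal sketch of that permutation check follows. The dataset layout and the build_prompt and answer_question callables are placeholders, not the authors’ code.

```python
# Sketch of a permutation-robustness check for a four-option benchmark.
from itertools import permutations
from statistics import mean, stdev

def accuracy_by_permutation(dataset, build_prompt, answer_question):
    """Score the whole dataset once per fixed ordering of the four
    options, then summarize accuracy across orderings."""
    per_perm_accuracy = []
    for order in permutations(range(4)):  # all 24 orderings of 4 options
        correct = 0
        for question, options, gold in dataset:
            reordered = [options[i] for i in order]
            prediction = answer_question(build_prompt(question, reordered))
            correct += prediction.strip() == gold
        per_perm_accuracy.append(correct / len(dataset))
    # A robust protocol keeps the mean roughly flat while the standard
    # deviation across permutations shrinks.
    return mean(per_perm_accuracy), stdev(per_perm_accuracy)
```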

“We demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM’s performance,” the authors state. In other words, the protocol reveals the LLM’s capabilities once evaluation artifacts are stripped away. How much can you trust an AI’s performance if it’s unintentionally ‘cheating’ on its tests?

The Surprising Finding

Here’s the twist: even with only a minimal drop in raw performance, the bias-reduced protocol significantly improved evaluation robustness, reducing mean accuracy variance threefold. This is surprising because one might expect scores to collapse once LLMs can no longer lean on superficial cues. Instead, the models still perform well, and their answers become much more consistent when options are shuffled, “without any help from the prompt examples or the option labels,” as the team puts it. This challenges the assumption that high scores on standard benchmarks always reflect deep understanding; some of that perceived performance may be an artifact of the test design itself.
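To put the threefold figure in perspective: standard deviation is the square root of variance, so cutting variance by three shrinks the spread between permutations by a factor of √3 ≈ 1.7. The toy calculation below uses an assumed starting value, not a number from the paper.

```python
# Illustrative arithmetic only; the 6-point baseline is assumed.
import math

baseline_std = 6.0                    # hypothetical spread, in accuracy points
baseline_var = baseline_std ** 2      # variance = 36.0
reduced_var = baseline_var / 3        # the reported 3x reduction -> 12.0
reduced_std = math.sqrt(reduced_var)  # ~3.46 points, i.e. 6.0 / sqrt(3)
print(f"spread drops from {baseline_std:.1f} to {reduced_std:.2f} points")
```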

What Happens Next

This research points to a clear path forward for AI evaluation. Expect new benchmark designs to incorporate these bias-reduction techniques within the next 6–12 months, and AI developers will likely begin integrating such evaluation methods into their testing pipelines. For example, a company developing a medical diagnostic AI might use this protocol to ensure the AI’s recommendations are based on patient data, not on the order in which symptoms are listed. As an AI user or developer, you should scrutinize evaluation metrics more closely and look for assurances that LLMs are evaluated under conditions that minimize these hidden biases. The industry implications are significant, pushing toward more transparent, trustworthy AI systems and more reliable AI applications across fields. The team’s work, according to the paper, provides a concrete step toward that goal.
