Why You Care
Ever wonder if those impressive AI scores are truly earned? What if the AI models you rely on are actually ‘cheating’ on their tests? New research suggests that large language models (LLMs) might be exploiting the format of multiple-choice questions (MCQA) rather than genuinely reasoning. This impacts how we understand AI capabilities and how you should interpret their performance.
What Actually Happened
Researchers Narun Raman, Taylor Lundy, and Kevin Leyton-Brown published a paper titled “Reasoning Models are Test Exploiters: Rethinking Multiple-Choice.” According to the announcement, the study investigates how reasoning models perform on multiple-choice question-answering (MCQA) benchmarks. MCQA is a common evaluation method for LLMs because it allows for easy automatic grading, as detailed in the blog post. The team systematically evaluated 15 different question-answering benchmarks, including MMLU and GSM8K, across 27 different LLMs, ranging from smaller models like Qwen-2.5 7B to large reasoning models such as OpenAI’s o3. The research explored five different ways of presenting questions to the models, varying whether explicit choices were provided and whether a ‘none of the above’ option was included. It also examined the timing of chain-of-thought (CoT) reasoning, the process in which a model works through its steps, relative to when the answer choices were presented.
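To make the setup concrete, here is a minimal sketch of what a few of these presentation variants could look like as prompt templates. The question, options, and wording below are illustrative placeholders, not the authors’ actual templates.

```python
# Illustrative prompt templates for three MCQA presentation variants.
# The question, options, and phrasing are hypothetical examples.

QUESTION = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"
OPTIONS = ["60", "70", "80", "90"]


def free_text_prompt(question: str) -> str:
    """No options shown: the model must produce an answer from scratch."""
    return f"{question}\nThink step by step, then state your final answer."


def mcqa_prompt(question: str, options: list[str]) -> str:
    """Options shown up front: the model can reason about the choices themselves."""
    labeled = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (f"{question}\n{labeled}\n"
            "Think step by step, then answer with the letter of the correct option.")


def mcqa_none_of_the_above_prompt(question: str, options: list[str]) -> str:
    """Variant that appends a 'none of the above' escape option."""
    labeled = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    nota = f"({chr(65 + len(options))}) None of the above"
    return f"{question}\n{labeled}\n{nota}\nAnswer with the letter of the correct option."


print(mcqa_prompt(QUESTION, OPTIONS))
```

The key knob being varied here is not the wording but what information the model sees before it starts reasoning.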
Why This Matters to You
This research has significant implications for anyone using or developing AI. If LLMs are exploiting test formats, their reported performance might not reflect their true reasoning capabilities. This means you might be overestimating what an AI can actually do. The study found that MCQA remained a good indicator of downstream performance only under specific conditions. “MCQA remained a good proxy for the downstream performance of models as long as they were allowed to perform chain-of-thought reasoning only before being presented with the options among which they had to select,” the paper states. This highlights a crucial distinction in how models process information.
Imagine you’re using an AI to help with complex decision-making. If its benchmark scores were inflated by test exploitation, its real-world performance could disappoint you. What does this mean for the reliability of AI systems in critical applications?
Consider these key findings:
- Exploitation: Large models often exploit information in options when reasoning after seeing choices.
- Genuine Reasoning: Chain-of-thought reasoning before seeing options better reflects true capabilities.
- Evaluation Impact: Current MCQA methods may not accurately assess LLM reasoning.
For example, if an AI is asked to solve a math problem and is given the answer choices immediately, it might work backward from the options. This is different from solving the problem from scratch and then selecting the correct answer. This distinction is vital for understanding the true intelligence of these large language models.
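A rough sketch of that “reason first, then select” ordering could look like the following. The ask_model helper is hypothetical, standing in for whatever LLM API you use; the two-step structure, not the exact wording, is the point.

```python
# Sketch of a two-step "reason first, then select" evaluation flow.
# ask_model is a hypothetical stand-in for a call to an LLM API.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM of choice.")


def reason_then_select(question: str, options: list[str]) -> str:
    # Step 1: elicit chain-of-thought reasoning with no options visible.
    reasoning = ask_model(
        f"{question}\nThink step by step and state your final answer."
    )
    # Step 2: only now reveal the options and ask the model to match its own answer.
    labeled = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return ask_model(
        "You previously answered:\n"
        f"{reasoning}\n\n"
        f"Which option matches that answer?\n{labeled}\n"
        "Reply with a single letter."
    )
```

Scoring the selected letter against the answer key still gives you automatic grading, while withholding the options during reasoning removes the work-backward-from-the-choices shortcut.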
The Surprising Finding
Here’s the twist: the study revealed that large models tend to significantly outperform their own free-text performance when they are allowed to reason after being shown a set of options, because they exploit the information those options contain. This finding challenges the common assumption that higher scores on multiple-choice tests always indicate superior reasoning. It suggests that models aren’t necessarily ‘smarter’ but are better at using contextual clues from the test format itself. Think of it as a student who can guess the answer by eliminating unlikely options, even if they don’t fully understand the underlying concept. This exploitation of test structure means that some impressive benchmark results might not truly reflect an LLM’s genuine reasoning capabilities in real-world scenarios where choices aren’t pre-provided.
What Happens Next
This research provides practical guidelines for analyzing MCQA results. The authors identify and quantify the signals models use during MCQA, according to the announcement. We can expect to see new evaluation methods emerge, perhaps within the next 6-12 months, that better reflect LLMs’ genuine reasoning capabilities. For example, future benchmarks might prioritize open-ended questions or structured reasoning prompts that force models to generate answers without explicit choices. Developers will likely adjust their testing protocols to ensure more accurate assessments. “We identify and quantify the signals models are using when answering MCQA questions, and offer practical guidelines when analyzing results from MCQA that better reflect LLMs’ genuine reasoning capabilities,” the paper states. This implies a shift towards more rigorous and nuanced testing. For you, this means a future where AI performance metrics are more trustworthy, leading to more reliable AI tools and applications across various industries.
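If benchmarks do move toward open-ended questions, grading becomes harder than matching a letter. Here is a deliberately crude, hypothetical sketch of free-text grading by normalized string comparison; real benchmarks would need more careful answer extraction and matching.

```python
import re


def normalize(answer: str) -> str:
    """Lowercase and strip everything except letters, digits, and decimal points."""
    return re.sub(r"[^a-z0-9.]", "", answer.lower())


def grade_free_text(model_answer: str, gold_answer: str) -> bool:
    """Crude exact-match grading of an open-ended answer against a reference."""
    return normalize(model_answer) == normalize(gold_answer)


print(grade_free_text("80 km/h", "80 KM/H"))  # True
```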
