LLM Evaluations Flawed by Simple Word Changes, Study Finds

New research reveals large language models struggle with minor lexical and syntactic variations.

A recent study highlights a critical vulnerability in how large language models (LLMs) are evaluated. Researchers found that small changes in wording or sentence structure can drastically alter an LLM's performance, calling into question the reliability of current benchmarks.

By Sarah Kline

February 27, 2026

4 min read

Key Facts

  • Lexical (word choice) and syntactic (sentence structure) changes significantly impact LLM performance.
  • The study examined 23 contemporary LLMs across MMLU, SQuAD, and AMEGA benchmarks.
  • Lexical perturbations consistently caused substantial performance degradation.
  • Model robustness did not consistently scale with model size, showing strong task dependence.
  • The findings suggest LLMs rely more on surface-level patterns than abstract linguistic competence.

Why You Care

Ever wonder whether your favorite AI chatbot truly understands you, or just pattern-matches the words you use? A new study suggests that even minor tweaks to a prompt can significantly change a large language model’s (LLM) performance, challenging the reliability of current evaluation methods. Why should this concern you? Because the scores you see for LLMs might not reflect their true capabilities. Are we accurately measuring AI intelligence?

What Actually Happened

Researchers Bogdan Kostić, Conor Fallon, Julian Risch, and Alexander Löser investigated how LLMs react to subtle changes in language. As detailed in the paper, they examined the “lexical and syntactic sensitivity” of 23 contemporary LLMs: how the models performed when words were swapped for synonyms (lexical changes) or sentence structures were slightly rearranged (syntactic changes). To do this, the team created meaning-preserving variations of prompts using two linguistic pipelines: one performed synonym substitution for the lexical alterations, while the other used dependency parsing to determine applicable syntactic transformations, according to the announcement. The study spanned three major benchmarks: MMLU, SQuAD, and AMEGA.
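
The paper’s exact pipelines aren’t reproduced in this article, but the lexical side of such an approach can be sketched in a few lines. The following is a minimal illustration assuming NLTK’s WordNet interface; the function name and swap probability are assumptions, not the authors’ implementation.

```python
# Minimal sketch of a meaning-preserving lexical perturbation via
# WordNet synonym substitution. Illustrative only -- this approximates
# the idea described in the paper, not its actual pipeline.
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time corpus fetch

def lexical_perturb(prompt: str, swap_prob: float = 0.3) -> str:
    """Swap some words for WordNet synonyms, leaving the rest untouched."""
    out = []
    for token in prompt.split():
        synsets = wordnet.synsets(token)
        if synsets and random.random() < swap_prob:
            # Collect alternative lemmas that differ from the original token.
            lemmas = {lemma.name().replace("_", " ")
                      for s in synsets for lemma in s.lemmas()} - {token}
            if lemmas:
                out.append(random.choice(sorted(lemmas)))
                continue
        out.append(token)
    return " ".join(out)

print(lexical_perturb("Which planet is closest to the sun?"))
```

A real pipeline would also need part-of-speech filtering and a semantic-similarity check to keep perturbations genuinely meaning-preserving; naive synonym swaps can drift (WordNet, for instance, links “sun” to “Sunday” through an unrelated synset).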

Why This Matters to You

This research has significant implications for anyone relying on LLMs, from developers to everyday users. The study found that “lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks.” This means simply changing a word for its synonym often made LLMs perform much worse. Syntactic changes had more varied effects, sometimes even improving results, the paper states. Both types of changes, however, “destabilize model leaderboards on complex tasks.” This suggests that the top-ranked LLM today might not be the top-ranked LLM tomorrow if the evaluation questions are rephrased. What does this mean for your AI applications?

Imagine you’re using an AI for customer service. If a customer phrases their question slightly differently than expected, your AI might fail to understand it. This is a real-world scenario where lexical sensitivity could cause problems. The researchers report that model robustness did not consistently scale with model size and instead showed strong task dependence, meaning bigger models aren’t always more robust. This finding challenges the common assumption that larger models are inherently better.

Here’s a breakdown of the impact:

  • For Developers: Current benchmarks might not accurately reflect your model’s real-world robustness.
  • For Businesses: Relying on LLMs for essential tasks requires more rigorous testing beyond standard benchmarks (see the sketch after this list).
  • For Users: Be aware that slight rephrasing of your prompts can yield different, sometimes worse, results.
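
One lightweight robustness check, referenced in the list above, is to send several meaning-preserving phrasings of the same question to a model and flag disagreement. Below is a hedged sketch; `ask_model` is a hypothetical stand-in for whatever LLM client your stack uses, and the variants are illustrative.

```python
# Hedged sketch of a prompt-sensitivity smoke test. `ask_model` is a
# placeholder, not an API from the study or any particular library.
from collections import Counter

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM client")

VARIANTS = [
    "What is the capital of Australia?",
    "Which city serves as Australia's capital?",
    "Name the capital city of Australia.",
]

def consistency(prompts: list[str]) -> float:
    """Fraction of responses agreeing with the most common answer."""
    answers = [ask_model(p).strip().lower() for p in prompts]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)

# A consistency score well below 1.0 on paraphrased inputs is exactly
# the lexical/syntactic sensitivity the study describes.
```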

The Surprising Finding

Perhaps the most surprising finding from this research is the disconnect between model size and robustness. You might assume that larger, more capable LLMs would be less susceptible to minor linguistic changes. However, the study finds that “model robustness did not consistently scale with model size.” A massive LLM isn’t necessarily more resilient to a simple synonym swap than a smaller one. The team also found that robustness showed “strong task dependence”: an LLM might be robust on one type of task but highly sensitive to language variations on another. This challenges the idea that increasing model parameters automatically leads to more human-like understanding, and suggests LLMs may rely more on surface-level patterns than deep linguistic comprehension, according to the announcement.

What Happens Next

This study underscores the pressing need for new evaluation standards for large language models. Experts will likely push for “robustness testing as a standard component of LLM evaluation.” Future benchmarks, perhaps within the next 6-12 months, will need to include variations in phrasing and syntax. For example, imagine a benchmark where each question has five slightly rephrased versions; scoring across all of them would give a more accurate picture of an LLM’s true understanding. Your LLM development strategy should consider integrating adversarial testing: intentionally crafting varied prompts to probe your model’s limits. The industry implications are clear: a shift toward more nuanced and challenging evaluation metrics is inevitable, and it should lead to more robust and reliable large language models in the long run.
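
To make the “five rephrased versions” idea concrete, here is a hedged sketch of variant-aware scoring. The `grade` function, the data shape, and the injected `ask_model` callable are assumptions for illustration, not the study’s evaluation protocol.

```python
# Variant-aware benchmark scoring: report worst-case accuracy across
# rephrasings alongside the usual mean. Illustrative sketch only.

def grade(response: str, gold: str) -> bool:
    """Toy exact-match grader; real benchmarks use task-specific metrics."""
    return response.strip().lower() == gold.strip().lower()

def variant_scores(items, ask_model):
    """items: list of (variants, gold) pairs; returns (mean, worst-case) accuracy."""
    mean_acc = worst_acc = 0.0
    for variants, gold in items:
        hits = [grade(ask_model(v), gold) for v in variants]
        mean_acc += sum(hits) / len(hits)  # average over rephrasings
        worst_acc += float(all(hits))      # credit only if every variant passes
    n = len(items)
    return mean_acc / n, worst_acc / n
```

The gap between the two numbers is a direct measure of sensitivity: a model that truly understands a question should score the same however it is phrased.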
