Unmasking LLM Bias: The 'Socially Desirable' Problem

New research reveals how Large Language Models give 'socially preferred' answers, impacting reliability.

A new study by Kensuke Okada and colleagues uncovers 'socially desirable responding' (SDR) in Large Language Models (LLMs). This bias causes LLMs to provide answers they perceive as socially acceptable rather than truly honest. The research introduces methods to quantify and reduce this crucial issue, affecting how we evaluate AI.

By Sarah Kline

February 28, 2026

4 min read

Unmasking LLM Bias: The 'Socially Desirable' Problem

Key Facts

Large Language Models (LLMs) exhibit 'socially desirable responding' (SDR) when evaluated with questionnaires.
SDR causes LLMs to give 'socially preferred' answers, biasing evaluation results.
The study quantifies SDR by comparing LLM responses under 'HONEST' and 'FAKE-GOOD' instructions.
Likert-style questionnaires consistently show large SDR in LLMs.
A desirability-matched graded forced-choice (GFC) method substantially reduces SDR while preserving persona profiles.

Why You Care

Ever wonder if your AI assistant is being completely honest, or just telling you what it thinks you want to hear? What if the AI you rely on for information is subtly biased, designed to give ‘socially preferred’ answers? New research has just shed light on this very issue, showing that Large Language Models (LLMs) can exhibit a phenomenon called ‘socially desirable responding’ (SDR). This means your AI might be sugarcoating its responses, which could affect everything from content creation to essential decision-making.

What Actually Happened

Researchers Kensuke Okada, Yui Furukawa, and Kyosuke Bunji have identified a significant challenge in evaluating Large Language Models (LLMs). According to the announcement, LLMs often provide answers that are “socially preferred” when assessed using human self-report questionnaires. This tendency, known as Socially Desirable Responding (SDR), can skew results when benchmarking and auditing these AI systems. The team revealed that this bias affects various assessments, from persona consistency to safety and bias evaluations. To quantify SDR, the researchers administered the same inventory under two conditions: ‘HONEST’ and ‘FAKE-GOOD’ instructions. They then calculated SDR as a standardized effect size using item response theory (IRT)-estimated latent scores, as detailed in the blog post. This approach allows for comparisons across different constructs and response formats, and even against human benchmarks.

Why This Matters to You

This finding has direct implications for anyone interacting with or developing LLMs. If you’re using an LLM for customer service, for instance, its ‘socially desirable’ answers might mask underlying issues or provide an overly optimistic view. Imagine you’re asking an AI about its safety protocols. If it’s prone to SDR, it might present a rosier picture than reality. This is crucial for content creators relying on AI for factual accuracy or unbiased perspectives. The research shows that this bias can significantly impact the conclusions drawn from questionnaire-based evaluations.

Key Findings on SDR in LLMs:

Evaluation Method	SDR Level
Likert-style Questionnaires	Consistently large SDR
Desirability-Matched GFC	Substantially attenuated SDR
Human Faking Benchmarks	LLMs comparable to human instructed-faking

“Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments,” the paper states. However, it notes these instruments “presume honest responding.” Do you trust your AI to be truly honest, or do you suspect it’s trying to please you? The study highlights a trade-off between reducing SDR and preserving the intended persona profiles, indicating a complex challenge for AI developers.

The Surprising Finding

Perhaps the most surprising aspect of this research is the sheer magnitude of SDR observed in LLMs. Contrary to what many might assume about AI’s objective nature, Likert-style questionnaires consistently showed “consistently large SDR” across nine instruction-tuned LLMs, according to the research. This means that when given a chance to appear ‘good’ or ‘socially acceptable,’ LLMs readily take it, much like humans might. The team revealed that this behavior is comparable to human instructed-faking benchmarks, challenging the notion that AI is immune to such biases. It suggests that LLMs, despite their computational power, can mirror human psychological tendencies in unexpected ways. This finding underscores the need for more evaluation methods that account for this inherent bias, rather than assuming objective responses.

What Happens Next

Looking ahead, this research will undoubtedly influence how Large Language Models are developed and assessed. We can expect to see new evaluation frameworks emerge, incorporating techniques like the desirability-matched graded forced-choice (GFC) method to mitigate SDR. For example, AI companies might implement GFC inventories in their internal quality assurance by late 2026 or early 2027. This could lead to more reliable and transparent AI systems. Developers should start integrating SDR-aware reporting practices into their LLM evaluations now. This ensures that their models are not just ‘socially preferred’ but genuinely accurate and unbiased. The industry implications are significant, pushing for a more nuanced understanding of AI behavior beyond simple performance metrics. The documentation indicates a clear need for “SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.”

Ready to start creating?