AI's Medical Mystery: LLMs Struggle with Rare Disease Diagnosis

New research reveals significant challenges for large language models in identifying uncommon illnesses from patient narratives.

A recent study evaluated top large language models (LLMs) on rare disease diagnosis using cases from 'House M.D.' The findings indicate current LLMs face substantial hurdles, with the best model reaching only 38.64% accuracy, despite clear gains in newer generations.

By Katie Rowan

November 29, 2025

4 min read


Key Facts

  • Four state-of-the-art LLMs were evaluated on rare disease diagnosis.
  • A novel dataset of 176 symptom-diagnosis pairs from 'House M.D.' was used.
  • LLM accuracy ranged from 16.48% to 38.64%.
  • Newer model generations showed a 2.3 times improvement in performance.
  • The benchmark dataset is publicly accessible for further AI research.

Why You Care

Imagine you or a loved one is facing a baffling illness. Doctors are stumped, and time is running out. Could artificial intelligence, specifically large language models (LLMs), offer a lifeline? A new study suggests that while LLMs show promise, they are still far from being diagnostic wizards for rare diseases. This research directly impacts the future of AI in healthcare and your potential medical journey.

What Actually Happened

Researchers recently put four large language models to the test: GPT 4o mini, GPT 5 mini, Gemini 2.5 Flash, and Gemini 2.5 Pro. Their task was to diagnose rare diseases from narrative medical cases, an area of AI performance that has been largely unexplored until now.

The team developed a unique dataset for this evaluation: 176 symptom-diagnosis pairs carefully extracted from the medical television series 'House M.D.', a show recognized for its value in teaching rare disease recognition in medical education. The study aimed to establish baseline performance metrics for how well LLMs handle complex medical reasoning tasks.
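
The paper's exact schema isn't reproduced in this article, so the record layout below is an assumption: a minimal Python sketch of what one of the 176 symptom-diagnosis pairs might look like, with illustrative field names (episode, symptoms, diagnosis) and made-up values.

```python
# Hypothetical record layout for one symptom-diagnosis pair.
# Field names and values are illustrative assumptions, not the
# benchmark's actual schema.
import json

record = {
    "episode": "S02E08",  # hypothetical source-episode tag
    "symptoms": "Young adult male presenting with episodic chest pain, "
                "transient limb weakness, and recurrent unexplained fever.",
    "diagnosis": "example rare disease",  # gold label the model must match
}

print(json.dumps(record, indent=2))
```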

Why This Matters to You

This study highlights a critical gap in current AI capabilities. While LLMs excel in many areas, accurately diagnosing rare diseases from patient stories remains a significant hurdle. What if an AI could quickly identify a rare condition that human doctors might miss for years? This could dramatically shorten diagnostic odysseys for countless patients.

LLM Performance on Rare Disease Diagnosis

Model               Accuracy   Improvement factor (vs. older models)
GPT 4o mini         16.48%     N/A
Gemini 2.5 Flash    20.00%     N/A
Gemini 2.5 Pro      30.00%     N/A
GPT 5 mini          38.64%     2.3x

As the table shows, performance varied widely. The researchers report that newer model generations showed a 2.3 times improvement over their predecessors. However, even the best performer, GPT 5 mini, achieved only 38.64% accuracy; more than 60% of the time, it could not correctly identify the rare disease. “While all models face substantial challenges with rare disease diagnosis, the observed improvement across architectures suggests promising directions for future development,” the paper states. How much more accurate do these models need to be before you would trust them with your diagnosis?
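
The 2.3x figure follows directly from the published accuracies. Here is the quick check, using only the numbers reported in the table above:

```python
# Sanity-check the reported improvement factor using the study's numbers.
baseline_accuracy = 16.48  # GPT 4o mini, in percent
best_accuracy = 38.64      # GPT 5 mini, in percent

factor = best_accuracy / baseline_accuracy
print(f"{factor:.2f}x")  # 2.34x, which the study rounds to ~2.3x
```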

The Surprising Finding

Here’s the twist: despite the low overall accuracy, the research shows that newer large language models are significantly better. The study found that newer model generations demonstrated a 2.3 times improvement in diagnostic accuracy. This is surprising because it indicates a rapid learning curve within the AI development cycle. It challenges the assumption that LLMs might quickly hit a performance ceiling in specialized, complex domains like rare disease diagnosis. Instead, it suggests that architectural advancements are making a tangible difference, even if absolute accuracy remains low. This improvement signals that the current limitations are not inherent to the approach itself, but rather a reflection of the technology's current stage.

What Happens Next

This research provides a crucial benchmark for future AI development in healthcare. The team revealed that their educationally validated benchmark is publicly accessible, which means other researchers can use it to test and improve their own large language models (a minimal evaluation loop is sketched below). We might see significant advancements within the next 12 to 18 months as developers refine their models against this framework.
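
Because the benchmark is public, a scoring harness can be small. The sketch below assumes exact-match scoring and uses a placeholder query_model() function; the paper's actual matching rule and API calls may well differ.

```python
# Minimal evaluation-loop sketch for a symptom-diagnosis benchmark.
# query_model() is a placeholder for your LLM provider's API call;
# exact string matching is an assumption -- real evaluations often
# need fuzzier matching to handle disease-name synonyms.
def query_model(symptom_narrative: str) -> str:
    """Stand-in: send the narrative to an LLM, return its diagnosis."""
    raise NotImplementedError

def evaluate(pairs: list[dict]) -> float:
    """Fraction of cases where the model's diagnosis matches the gold label."""
    correct = sum(
        query_model(p["symptoms"]).strip().lower() == p["diagnosis"].strip().lower()
        for p in pairs
    )
    return correct / len(pairs)

# e.g. evaluate(benchmark_pairs) -> 0.3864 for the study's best model
```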

For example, imagine a future where your primary care physician uses an AI assistant. It could analyze your symptoms and medical history, then flag potential rare diseases that might otherwise be overlooked. This wouldn’t replace human doctors but would augment their capabilities, offering a second opinion. The industry implications are clear: continued investment in AI for medical diagnosis, especially for rare conditions, is essential. “Our educationally validated benchmark establishes baseline performance metrics for narrative medical reasoning and provides a publicly accessible evaluation framework for advancing AI-assisted diagnosis research,” the technical report explains. This should accelerate the pace of innovation.
