AI Boosts Rare Disease Diagnosis from Doctor's Notes

New methods combine AI reasoning and data retrieval to improve accuracy in identifying rare conditions.

Large language models (LLMs) often struggle with rare disease diagnosis from unstructured clinical notes. New research introduces RAG-driven CoT and CoT-driven RAG, which significantly enhance diagnostic accuracy. These methods integrate Chain-of-Thought (CoT) reasoning with Retrieval Augmented Generation (RAG) to mimic expert medical analysis.

By Katie Rowan

March 3, 2026

4 min read


Key Facts

  • Large language models (LLMs) struggle with rare disease diagnosis from unstructured clinical notes.
  • RAG-driven CoT and CoT-driven RAG are new methods combining Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG).
  • These methods use a five-question CoT protocol and retrieve data from HPO and OMIM databases.
  • Both methods achieved over 40% top-10 gene accuracy on Phenopacket-derived clinical notes using a DeepSeek backbone.
  • RAG-driven CoT is better for high-quality notes, while CoT-driven RAG excels with lengthy, noisy notes.

Why You Care

Imagine a world where a doctor’s notes, full of complex medical jargon, could instantly help diagnose a rare disease. What if AI could unlock essential insights from your medical records, faster and more accurately than ever before? This isn’t science fiction anymore. New research is making significant strides in using AI to enhance rare disease diagnosis, directly impacting how quickly and effectively these conditions can be identified. This could mean a faster path to treatment for you or your loved ones.

What Actually Happened

Researchers have developed two novel methods, RAG-driven CoT and CoT-driven RAG, to tackle the challenge of diagnosing rare diseases from clinical notes, according to the announcement. These approaches combine Chain-of-Thought (CoT) and Retrieval Augmented Generation (RAG) to improve how large language models (LLMs) process complex medical information. LLMs like GPT and LLaMA often struggle with domain-specific tasks such as clinical diagnosis, especially when inputs are unstructured clinical notes rather than standardized terms, the study finds. The new methods employ a five-question CoT protocol, which mimics how medical experts reason through a case. Meanwhile, RAG retrieves relevant data from medical databases like Human Phenotype Ontology (HPO) and Online Mendelian Inheritance in Man (OMIM). This dual approach allows the AI to both reason and access external knowledge, leading to more precise diagnostic predictions.
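The difference between the two methods is the ordering of retrieval and reasoning. The following is a toy sketch of that ordering, not the paper's actual prompts, protocol, or databases: the phenotype-to-gene entries, the `retrieve` and `reason` helpers, and the example note are all invented stand-ins for HPO/OMIM lookups and the five-question CoT protocol.

```python
# Illustrative stand-in for an HPO/OMIM-style phenotype -> gene mapping.
# These associations are invented for demonstration only.
TOY_DB = {
    "seizures": ["SCN1A", "KCNQ2"],
    "macrocephaly": ["PTEN", "NSD1"],
}

def retrieve(text: str) -> dict:
    """Mimic the RAG step: pull matching phenotype->gene entries from the toy DB."""
    return {term: genes for term, genes in TOY_DB.items() if term in text.lower()}

def reason(evidence: dict) -> list[str]:
    """Mimic the CoT step: rank candidate genes from the available evidence.
    (The paper's five-question protocol is far richer; this just deduplicates
    genes in evidence order.)"""
    ranked = []
    for genes in evidence.values():
        for gene in genes:
            if gene not in ranked:
                ranked.append(gene)
    return ranked

def rag_driven_cot(note: str) -> list[str]:
    """Retrieve first, then reason: early retrieval anchors the reasoning."""
    return reason(retrieve(note))

def cot_driven_rag(note: str) -> list[str]:
    """Reason first to distill key phenotypes from a noisy note, then retrieve."""
    distilled = " ".join(term for term in TOY_DB if term in note.lower())
    return reason(retrieve(distilled))

note = "Patient presents with recurrent seizures and progressive macrocephaly."
print(rag_driven_cot(note))   # candidate genes, retrieval-anchored ordering
print(cot_driven_rag(note))   # candidate genes via distilled phenotype text
```

In the real systems, `retrieve` queries HPO and OMIM and `reason` is an LLM following the five-question protocol; the sketch only shows how the two stages can be composed in either order.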

Why This Matters to You

Your medical journey often starts with a doctor’s visit and detailed notes. These notes are goldmines of information, but they are also complex. This new research directly addresses how AI can make sense of these intricate details for rare disease diagnosis. For example, imagine you have a child with unusual symptoms. The doctor’s notes might contain subtle clues that, when analyzed by these AI methods, could lead to a diagnosis much faster. This reduces the agonizing wait for answers. The team revealed that these combined methods significantly outperform standalone foundation models in prioritizing candidate genes from clinical notes. “RAG-driven CoT works better for high-quality notes, where early retrieval can anchor the subsequent reasoning steps in domain-specific evidence,” the paper states. This means clearer notes get even better AI analysis. Do you think AI could shorten the diagnostic odyssey for rare diseases?

Here’s how the methods performed on different note types:

| Method Type | Note Quality | Advantage |
| --- | --- | --- |
| RAG-driven CoT | High-quality | Anchors reasoning with early retrieval |
| CoT-driven RAG | Lengthy & noisy | Better for complex, less structured notes |

The Surprising Finding

Here’s an interesting twist: while newer, larger foundation models like Llama 3.3-70B-Instruct and DeepSeek-R1-Distill-Llama-70B showed improved performance over older versions like Llama 2 and GPT-3.5, the true leap came from combining CoT and RAG. The research shows that both RAG-driven CoT and CoT-driven RAG significantly outperform these foundation models on their own. Specifically, both methods, when using a DeepSeek backbone, achieved a top-10 gene accuracy of over 40% on Phenopacket-derived clinical notes. This is surprising because one might assume that simply using a more powerful LLM would solve the problem. Instead, the structured integration of reasoning and retrieval proved to be the key differentiator. It challenges the common assumption that bigger models automatically mean better results, especially in highly specialized fields like medicine.
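Top-10 gene accuracy, the metric quoted above, simply asks how often the causal gene appears among a method's ten highest-ranked candidates. A minimal sketch of that calculation (the example predictions and gene names are invented):

```python
def top_k_accuracy(ranked_lists: list[list[str]], true_genes: list[str], k: int = 10) -> float:
    """Fraction of cases where the true gene appears in the top-k ranked candidates."""
    hits = sum(1 for ranked, true in zip(ranked_lists, true_genes) if true in ranked[:k])
    return hits / len(true_genes)

# Two toy cases: the true gene is found in the first list but not the second.
predictions = [["SCN1A", "PTEN", "NSD1"], ["KCNQ2", "SCN1A"]]
truths = ["PTEN", "FBN1"]
print(top_k_accuracy(predictions, truths, k=10))  # 0.5
```

A top-10 accuracy above 40% therefore means that, for more than four in ten cases, the correct gene was somewhere in the method's first ten suggestions.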

What Happens Next

This research paves the way for more AI tools in clinical settings. We can expect to see further refinement of these RAG-driven CoT and CoT-driven RAG techniques over the next 12-18 months. Future applications could include AI assistants that help geneticists pinpoint potential rare disease genes from patient histories. For example, a hospital might integrate such an AI system to flag patients whose clinical notes suggest a rare genetic condition, prompting earlier specialized testing. The documentation indicates that these methods could reduce the diagnostic burden on medical professionals. For you, this could mean faster, more accurate diagnoses, potentially leading to earlier intervention and better health outcomes. The team revealed that future work will likely focus on expanding these methods to a wider range of clinical data and integrating them into existing electronic health record systems.
