AI System Outperforms GPT-5 on Medical Board Exam

January Mirror, an evidence-grounded AI, excels in endocrinology reasoning, surpassing frontier LLMs.

A specialized AI system called January Mirror achieved an 87.5% accuracy rate on a challenging endocrinology board-style examination. This performance significantly outperformed leading large language models such as GPT-5 and Gemini-3-Pro, highlighting the power of curated, evidence-based AI in complex medical subspecialties.

By Sarah Kline

February 19, 2026

4 min read

Key Facts

  • January Mirror achieved 87.5% accuracy on a 120-question endocrinology board-style examination.
  • Mirror outperformed GPT-5 (74.0%), GPT-5.2 (74.6%), and Gemini-3-Pro (69.8%) on the exam.
  • The system also surpassed a human reference accuracy of 62.3%.
  • Mirror operated under a closed-evidence constraint, unlike comparator LLMs with web access.
  • 74.2% of Mirror's outputs cited at least one guideline-tier source, with 100% citation accuracy.

Why You Care

Imagine an AI that can diagnose complex medical conditions better than leading general AI models. What if this AI could even outperform human experts on tough medical exams? How might this change your future healthcare experiences?

Recent findings reveal a specialized AI system, January Mirror, has significantly outperformed top large language models (LLMs) like GPT-5 on a challenging endocrinology board-style examination. This isn’t just about a high score; it signals a potential shift in how artificial intelligence supports complex medical decision-making, directly impacting your health and the future of medicine.

What Actually Happened

Researchers evaluated January Mirror, an evidence-grounded clinical reasoning system, against several frontier LLMs. These included GPT-5, GPT-5.2, and Gemini-3-Pro, as mentioned in the release. The evaluation focused on a 120-question endocrinology board-style examination. Endocrinology is a medical subspecialty dealing with hormones and glands.

January Mirror integrates a curated endocrinology and cardiometabolic evidence corpus. This means it uses a carefully selected body of medical knowledge. It also employs a structured reasoning architecture to generate evidence-linked outputs, according to the announcement. Unlike the comparator LLMs, which had real-time web access, Mirror operated under a closed-evidence constraint without external retrieval. The team revealed this essential difference in their methodology.
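The closed-evidence design described above can be illustrated with a minimal sketch. Everything here is hypothetical: the names (`Evidence`, `answer_question`), the keyword-overlap retrieval, and the toy corpus are illustrative stand-ins, not Mirror's actual architecture, which the release does not detail.

```python
# Hypothetical sketch of a closed-evidence answering loop: the system may
# only consult a fixed, curated corpus, and every answer carries provenance.
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str      # e.g. a guideline identifier
    passage: str     # supporting text from the curated corpus

@dataclass
class Answer:
    choice: str
    citations: list  # every claim traces back to the closed corpus

def answer_question(question: str, corpus: dict) -> Answer:
    """Answer using only the curated corpus -- no external web retrieval."""
    # 1. Retrieve candidate passages from the fixed corpus only.
    hits = [Evidence(src, text) for src, text in corpus.items()
            if any(term in text.lower() for term in question.lower().split())]
    # 2. Reason over the retrieved evidence (placeholder: take the first hit).
    choice = hits[0].source if hits else "abstain"
    # 3. Return the answer together with its explicit provenance.
    return Answer(choice=choice, citations=hits)

corpus = {"ADA-2025": "metformin is first-line therapy for type 2 diabetes"}
result = answer_question("What is first-line therapy for type 2 diabetes?",
                         corpus)
print(result.choice, len(result.citations))  # ADA-2025 1
```

The design choice the sketch highlights: by restricting retrieval to a fixed corpus, every output can cite a verifiable source, which is what makes the 100% citation accuracy reported later in the article checkable at all.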

Why This Matters to You

This development holds significant implications for healthcare and AI’s role within it. The ability of a specialized AI to excel where general models struggle points to a future of highly accurate, domain-specific AI assistants. Think of it as having a super-specialist AI consultant available to your doctor.

For example, imagine your doctor is considering a complex treatment plan for a rare endocrine disorder. An AI like January Mirror could quickly cross-reference vast amounts of curated, evidence-based research. This would ensure the recommended approach is backed by the latest, most reliable medical guidelines. This level of precision could lead to better outcomes for you.

January Mirror’s Performance Highlights:

  • Overall Accuracy: 87.5% (105/120 questions)
  • Human Reference: 62.3%
  • GPT-5.2 Accuracy: 74.6%
  • GPT-5 Accuracy: 74.0%
  • Gemini-3-Pro Accuracy: 69.8%
  • Accuracy on 30 Most Difficult Questions: 76.7%
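The headline figures above follow from simple arithmetic. Note that only Mirror's raw count (105/120) appears in the release; the margin below is computed from the reported percentages:

```python
# Mirror's reported score on the 120-question exam.
correct, total = 105, 120
accuracy = correct / total
print(f"{accuracy:.1%}")             # 87.5%

# Margin over the strongest comparator, GPT-5.2 (74.6%).
margin = accuracy - 0.746
print(f"{margin * 100:.1f} points")  # 12.9 points
```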

How much more confident would you feel knowing your medical team has access to such an evidence-driven tool? The study finds that Mirror provided evidence traceability: “74.2% of outputs cited at least one guideline-tier source, with 100% citation accuracy on manual verification.” This means the information it provides can be traced back to its sources.

The Surprising Finding

Here’s the twist: January Mirror achieved its superior performance despite a significant handicap. While frontier LLMs like GPT-5 and Gemini-3-Pro had real-time web access, Mirror operated under a closed-evidence constraint. This means it couldn’t browse the internet for answers. This challenges the common assumption that more access to information always leads to better AI performance.

The team revealed that “curated evidence with explicit provenance can outperform unconstrained web retrieval for subspecialty clinical reasoning.” This is particularly surprising because general LLMs are often praised for their vast knowledge base. For complex, nuanced fields like endocrinology, however, the quality and structure of information proved more essential than sheer volume. A focused, high-quality data approach appears to be key.

What Happens Next

This success indicates a clear path forward for AI in specialized medical fields. We can expect to see more targeted AI systems developed and refined over the next 12-24 months. These systems will focus on other medical subspecialties, such as cardiology or oncology, according to the announcement.

For example, imagine a similar AI being developed to assist in cancer treatment planning. It could analyze a patient’s specific tumor markers and genetic profile, then suggest highly personalized treatment protocols based on the latest research and clinical trials.

This trend suggests a future where AI acts as a highly specialized co-pilot for medical professionals. The company reports that Mirror also supports auditability for clinical deployment. This means its reasoning process can be checked and verified, which is crucial for real-world medical applications. The industry implications are vast, promising more precise and evidence-based patient care.
