Why You Care
Ever wondered if AI can truly understand your doctor’s notes? Can large language models (LLMs) accurately interpret the complex language of your medical records? A new benchmark, BRIDGE, has emerged to answer these crucial questions. This benchmark is vital because it directly impacts the reliability of AI in healthcare, which could affect your future medical care. Your health data is incredibly sensitive and nuanced. Therefore, ensuring AI understands it correctly is paramount for patient safety and effective treatment.
What Actually Happened
Researchers have introduced BRIDGE, a comprehensive multilingual benchmark designed to evaluate large language models (LLMs) on real-world clinical practice text. This new benchmark addresses a significant gap in current LLM evaluation methods, according to the announcement. Most existing evaluations rely on simplified medical exam questions or text from PubMed. These methods often fail to capture the true complexity of data found in electronic health records (EHRs). BRIDGE, in contrast, features 87 tasks sourced directly from real-world clinical data across nine languages. It covers eight major task types spanning the entire patient care continuum. This includes stages like triage, consultation, diagnosis, and even billing coding. The team systematically evaluated 95 different LLMs, including major players like GPT-4o and the Gemini series, as detailed in the blog post.
Why This Matters to You
This benchmark is a big deal for anyone interested in AI’s role in healthcare. It moves beyond theoretical tests to practical applications. Imagine your doctor using an AI assistant that can accurately summarize your complex medical history. That’s the kind of reliability BRIDGE aims to foster. The research shows substantial performance variation across different models, languages, and clinical specialties. This means not all LLMs are equally good at understanding your medical information. For example, an LLM might excel at diagnosing a specific condition but struggle with interpreting a patient’s emotional state from their notes.
Key Findings from BRIDGE Evaluation:
- 95 LLMs evaluated: A broad range of models, both proprietary and open-source.
- 87 tasks: Sourced from real-world clinical data.
- 9 languages covered: Testing multilingual understanding in healthcare.
- 8 major task types: Spanning the full patient care journey.
- 14 clinical specialties: From cardiology to dermatology.
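To make the structure above concrete, here is a minimal sketch of how scores from a benchmark like BRIDGE might be aggregated into a per-model leaderboard across tasks and languages. All model names, task types, languages, and scores below are invented for illustration; this is not BRIDGE's actual data format or API.

```python
# Hypothetical sketch: averaging benchmark scores per model across
# (task type, language) pairs, in the spirit of a BRIDGE-style leaderboard.
# Every name and number here is made up for illustration only.
from collections import defaultdict

results = [
    # (model, task_type, language, score out of 100)
    ("open-model-a",   "triage",    "en", 78.0),
    ("open-model-a",   "diagnosis", "es", 71.5),
    ("closed-model-b", "triage",    "en", 80.0),
    ("closed-model-b", "diagnosis", "es", 69.0),
]

def leaderboard(rows):
    """Average each model's score over all of its (task, language) results."""
    per_model = defaultdict(list)
    for model, _task, _lang, score in rows:
        per_model[model].append(score)
    return sorted(
        ((model, sum(scores) / len(scores)) for model, scores in per_model.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

for model, avg in leaderboard(results):
    print(f"{model}: {avg:.2f}")
```

Even in this toy example, the open model edges out the closed one on average, echoing the study's headline finding that aggregate rankings can favor open-source models on some task mixes.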
How confident are you that an AI could accurately interpret your personal health journey? The study finds that open-source LLMs can rival their proprietary counterparts, a significant result for accessibility and innovation in AI healthcare tools. What’s more, medically fine-tuned LLMs based on older architectures sometimes underperform newer general-purpose models. This suggests that continuous updates are crucial for effective clinical AI.
The Surprising Finding
Here’s an interesting twist: the evaluation revealed that open-source LLMs can perform just as well as, or even better than, some proprietary models. This challenges the common assumption that only large, closed-source models from tech giants can handle complex tasks. The team revealed that “open-source LLMs can achieve performance comparable to proprietary models.” This is surprising because proprietary models often have massive development budgets and vast training data. It suggests that specialized, focused development, even in open-source communities, can yield impressive results in clinical text understanding. This finding could democratize access to AI tools in healthcare. It means smaller institutions or researchers might not need to rely solely on expensive commercial solutions.
What Happens Next
The introduction of BRIDGE provides a crucial resource for future LLM development in healthcare. We can expect to see more targeted improvements in models over the next 12-18 months. Developers will likely use the BRIDGE leaderboard to identify weaknesses and refine their models. For example, an LLM struggling with oncology notes might receive additional training data in that specific area. The documentation indicates that BRIDGE and its leaderboard serve as a unique reference. This will guide the development and evaluation of new LLMs in understanding real-world clinical text. For you, this means potentially more accurate and reliable AI tools in your doctor’s office. You might see AI assisting with faster diagnoses or more personalized treatment plans. The industry implications are clear: a higher standard for clinical AI performance. The paper states that this benchmark is a “foundational resource.” This means it will likely influence how all future medical LLMs are built and evaluated.
