HARE Framework Boosts AI Accuracy in Cancer Pathology Reports

New evaluation framework measures the clinical relevance of AI-generated histopathology reports far more reliably than existing metrics.

Researchers have developed HARE, a new framework for evaluating AI-generated histopathology reports. It uses a specialized benchmark, domain-tuned models, and a new metric to prioritize clinically relevant information, giving medical professionals a more trustworthy way to assess AI outputs.


By Katie Rowan

September 23, 2025

4 min read


Key Facts

  • HARE is a novel entity- and relation-centric evaluation framework for histopathology reports.
  • It comprises a benchmark dataset, a named entity recognition (NER) model, a relation extraction (RE) model, and a new metric.
  • The HARE benchmark was created by annotating 813 clinical and 652 TCGA histopathology reports.
  • HARE-NER and HARE-RE models, fine-tuned from GatorTronS, achieved an overall F1-score of 0.915.
  • The HARE metric significantly outperformed traditional and radiology-specific metrics, including GREEN.

Why You Care

Ever wondered whether AI can truly understand complex medical reports? Imagine relying on AI for essential health insights. A new framework promises to make AI-generated medical texts much more reliable, which directly affects how medical AI can assist doctors. Do you trust AI with your health data?

This work addresses a crucial challenge in AI development: evaluating the clinical quality of AI-generated reports. That is especially important in specialized fields like histopathology. Your future medical care could benefit from this advancement.

What Actually Happened

Researchers recently introduced HARE (Histopathology Automated Report Evaluation), a novel framework designed to assess AI-generated medical reports, according to the announcement. HARE includes a benchmark dataset, a named entity recognition (NER) model, a relation extraction (RE) model, and a new metric. The metric prioritizes clinically relevant content by aligning essential histopathology entities and relations between reference and generated reports.
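The announcement does not spell out the scoring formula, but the description suggests an F1-style overlap computed over the entities and relations extracted from each report pair. Below is a minimal, hypothetical Python sketch of that idea; the function names, the exact-match alignment, and the equal entity/relation weighting are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of an entity/relation-centric report metric in the
# spirit of HARE. The sets would come from NER/RE models such as HARE-NER
# and HARE-RE; exact-match alignment and 50/50 weighting are assumptions.

def f1_overlap(reference: set, generated: set) -> float:
    """F1 over two sets of extracted items (entities or relations)."""
    if not reference and not generated:
        return 1.0  # nothing to find, nothing hallucinated
    matched = len(reference & generated)
    precision = matched / len(generated) if generated else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def report_score(ref_entities, gen_entities, ref_relations, gen_relations,
                 entity_weight: float = 0.5) -> float:
    """Blend entity overlap and relation overlap into one report-level score."""
    entity_f1 = f1_overlap(set(ref_entities), set(gen_entities))
    relation_f1 = f1_overlap(set(ref_relations), set(gen_relations))
    return entity_weight * entity_f1 + (1.0 - entity_weight) * relation_f1

# Toy example: the generated report recovers the diagnosis but drops the grade.
score = report_score(
    ref_entities={("invasive ductal carcinoma", "Diagnosis"), ("grade 2", "Grade")},
    gen_entities={("invasive ductal carcinoma", "Diagnosis")},
    ref_relations={("invasive ductal carcinoma", "has_grade", "grade 2")},
    gen_relations=set(),
)
print(f"report-level score: {score:.2f}")  # penalized for the missing grade
```

The key design point, as described, is that omissions of clinical entities and relations drag the score down, while fluent but clinically empty text earns no credit.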

To build the HARE benchmark, the team annotated 813 de-identified clinical diagnostic histopathology reports and 652 histopathology reports from The Cancer Genome Atlas (TCGA) with domain-specific entities and relations. They then fine-tuned GatorTronS, a clinical-domain language model, to create HARE-NER and HARE-RE. These models achieved a high overall F1-score of 0.915, the research shows.
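For a sense of what "fine-tuned from GatorTronS" involves in practice, the sketch below loads a GatorTron-family checkpoint with a token-classification head using Hugging Face Transformers. The checkpoint id (UFNLP/gatortronS) and the toy BIO label schema are assumptions for illustration; the paper's actual label set, data pipeline, and hyperparameters are not given in the announcement.

```python
# Minimal sketch: a GatorTron-family encoder with a token-classification head,
# the architecture one would fine-tune on the annotated reports to get a model
# like HARE-NER. Checkpoint id and label schema are assumed, not from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-DIAGNOSIS", "I-DIAGNOSIS", "B-GRADE", "I-GRADE"]  # toy schema
model_name = "UFNLP/gatortronS"  # assumed Hugging Face checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels)
)  # the classification head is randomly initialized until fine-tuned

text = "Invasive ductal carcinoma, histologic grade 2."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Untrained predictions are noise; after fine-tuning on the 1,465 annotated
# reports, the argmax over logits would yield per-token BIO tags.
for token, label_id in zip(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
    logits.argmax(dim=-1)[0].tolist(),
):
    print(f"{token:20s} {labels[label_id]}")
```

Fine-tuning itself would typically run through the Transformers Trainer (or a plain PyTorch loop) over the annotated entity and relation spans.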

Why This Matters to You

This new HARE framework has practical implications. It checks that AI-generated medical reports are not just grammatically correct but clinically accurate. Think of it as a quality-control system for AI in medicine. This means doctors can potentially trust AI summaries more, freeing up their time for direct patient care.

For example, imagine a pathologist reviewing hundreds of complex cancer reports daily. An AI system could generate initial summaries. However, if these summaries miss crucial details, they are useless. HARE helps ensure the AI focuses on what truly matters. This includes specific cancer characteristics or gene mutations. The team revealed that the HARE metric significantly outperformed traditional evaluation methods.

HARE’s Performance Against Other Metrics

Metric (type)              Performance against expert evaluations
HARE (proposed)            Highest correlation and best regression fit
GREEN (LLM-based)          Second best (Pearson r 0.168 lower than HARE's)
ROUGE (traditional)        Outperformed by HARE
Meteor (traditional)       Outperformed by HARE
RadGraph-XL (radiology)    Outperformed by HARE

As mentioned in the release, the HARE metric showed the highest correlation with, and the best regression fit to, expert evaluations. This is crucial for real-world medical applications. “Evaluating the clinical quality of generated reports remains a challenge, especially in instances where domain-specific metrics are lacking,” the team revealed. This framework directly addresses that gap. How might more reliable AI reports change your experience with healthcare?

The Surprising Finding

Here’s the twist: the HARE metric didn’t just perform well; it significantly surpassed existing, widely used evaluation tools. That is surprising given how entrenched those metrics are. HARE outperformed traditional metrics including ROUGE and Meteor, the study finds, and beat radiology-specific metrics such as RadGraph-XL. What’s more, it even surpassed GREEN, a large language model-based radiology report evaluator, achieving a Pearson r (a statistical measure of correlation) 0.168 higher than GREEN’s, the paper states.
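For context, agreement between an automated metric and expert raters is usually quantified with Pearson's r, exactly the statistic cited above. A quick sketch with made-up toy numbers (not the study's data):

```python
# How metric-vs-expert agreement is typically measured: Pearson correlation
# between per-report metric scores and expert ratings. Toy values only;
# these are not numbers from the HARE study.
from scipy.stats import pearsonr

expert_ratings = [4.5, 2.0, 3.5, 5.0, 1.5]       # expert quality ratings (toy)
metric_scores  = [0.90, 0.35, 0.70, 0.95, 0.20]  # candidate metric scores (toy)

r, p_value = pearsonr(expert_ratings, metric_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```

On this -1 to 1 scale, a metric whose r is 0.168 higher than a competitor's tracks expert judgment substantially more closely.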

This challenges the common assumption that general-purpose AI evaluation metrics are sufficient for specialized medical fields. It highlights that domain-specific knowledge is paramount. Simply using a large language model (LLM) isn’t enough. You need to embed clinical relevance directly into the evaluation process. This finding underscores the need for tailored solutions in medical AI.

What Happens Next

The paper’s acceptance to EMNLP 2025 Findings suggests its significance. We can expect further research and development in this area. Similar specialized evaluation frameworks may emerge for other medical domains within the next 12 to 18 months, which could set a new standard for medical AI validation.

For example, imagine a future where AI assists in diagnosing rare diseases. The HARE framework provides a blueprint for ensuring that the AI’s output is not only accurate but clinically meaningful, which means medical professionals can confidently integrate AI into their workflows. The industry implications are broad: it could accelerate AI adoption in diagnostics and improve patient outcomes. The documentation indicates that this will foster greater trust in AI-generated content within healthcare. This work is a step toward more reliable medical AI tools.
