Why You Care
Imagine you’re a doctor in Italy, relying on AI to quickly understand complex patient records. Would you trust it completely? A new study challenges our assumptions about how well Large Language Models (LLMs) truly grasp languages beyond English, especially in high-stakes fields like healthcare. This directly affects how you might use AI in multilingual environments.
What Actually Happened
A recent paper, authored by Vignesh Kumar Kembu and his team, explored the multilingual capabilities of open-source LLMs. Specifically, they focused on extracting information from Italian Electronic Health Records (EHRs). The research aimed to determine whether these AI models could effectively understand and process clinical data in a non-English language without task-specific training, while running entirely on local infrastructure — what the paper calls a zero-shot, on-premises setting. The team conducted a detailed experimental campaign, primarily focusing on comorbidity extraction — identifying co-occurring medical conditions — from these records, a crucial task in digital healthcare. The findings indicate a mixed bag of results for the LLMs.
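To make the zero-shot, on-premises setup concrete, here is a minimal sketch of how such an extraction prompt might look. The prompt wording and the example record are assumptions for illustration; the paper does not publish its exact prompts, and in practice the prompt would be sent to a locally hosted open-source model so that no patient data leaves the institution.

```python
def build_zero_shot_prompt(record_text: str) -> str:
    """Build a zero-shot comorbidity-extraction prompt (hypothetical wording)."""
    return (
        "You are a clinical information extraction system.\n"
        "List every comorbidity (co-occurring medical condition) mentioned in\n"
        "the following Italian electronic health record. Answer with one\n"
        "condition per line, in English, and nothing else.\n\n"
        f"Record:\n{record_text}\n\n"
        "Comorbidities:"
    )

# Illustrative Italian record snippet (invented for this example)
prompt = build_zero_shot_prompt(
    "Paziente di 74 anni con diabete mellito di tipo 2, ipertensione "
    "arteriosa e pregresso infarto del miocardio."
)
print(prompt)
```

"Zero-shot" here means the prompt contains no worked examples and the model receives no Italian medical fine-tuning; "on-premises" means the model runs locally rather than through a cloud API.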
Why This Matters to You
This research is particularly relevant if you work with AI in international or specialized contexts. Large Language Models (LLMs) are often lauded for their versatility, but their performance in specific, non-English domains like Italian healthcare needs closer scrutiny. The study’s zero-shot setup means the models were evaluated without any Italian medical fine-tuning — exactly the out-of-the-box use that general-purpose LLMs are often expected to handle. The study found that some LLMs struggled significantly under these conditions, which directly affects the reliability you can expect from these tools.
Consider this: if an LLM cannot reliably extract comorbidities from an Italian EHR, what does that mean for its ability to summarize legal documents in German or analyze financial reports in Japanese? How much can you truly depend on a single LLM for diverse linguistic tasks?
As the team revealed, “some LLMs struggle in zero-shot, on-premises settings, and others show significant variation in performance, struggling to generalize across various diseases when compared to native pattern matching and manual annotations.” This highlights a critical limitation. For example, imagine a system designed to flag potential drug interactions based on patient conditions. If the underlying LLM misses a key comorbidity due to language or generalization issues, patient safety could be compromised. Your trust in such systems depends on their accuracy.
Here are some key findings from the study:
- Some LLMs struggled significantly in zero-shot Italian EHR extraction.
- Performance varied widely among different LLMs.
- LLMs struggled to generalize across various diseases.
- Native pattern matching often outperformed LLMs in specific tasks.
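The "native pattern matching" baseline that often beat the LLMs is, at its core, dictionary-and-regex lookup over the original Italian text. The sketch below is an assumption about how such a baseline could work — the terms in the mini-lexicon are invented for illustration and are not the paper's actual term lists.

```python
import re

# Hypothetical mini-lexicon mapping Italian disease terms to normalized
# English labels (illustrative only; not the study's real lexicon).
COMORBIDITY_LEXICON = {
    "diabete": "diabetes mellitus",
    "ipertensione": "hypertension",
    "insufficienza renale": "renal failure",
    "fibrillazione atriale": "atrial fibrillation",
}

def extract_comorbidities(note: str) -> set[str]:
    """Return normalized comorbidity labels found in an Italian clinical note."""
    lowered = note.lower()
    found = set()
    for term, label in COMORBIDITY_LEXICON.items():
        # \b word boundaries stop 'diabete' from matching inside longer words
        if re.search(rf"\b{re.escape(term)}\b", lowered):
            found.add(label)
    return found

note = "Paziente con diabete mellito di tipo 2 e ipertensione arteriosa."
print(sorted(extract_comorbidities(note)))
# → ['diabetes mellitus', 'hypertension']
```

A matcher like this is brittle (it misses misspellings and paraphrases) but deterministic and auditable, which helps explain why, per the study, it can still outperform a general-purpose LLM on a narrow, well-defined task.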
The Surprising Finding
Here’s the twist: despite the general hype around LLMs as universal language processors, the study found that native pattern matching and manual annotations often outperformed these AI models. This might seem counterintuitive given the sophistication of LLMs. The research shows that while LLMs are versatile, their ability to generalize across different diseases in a zero-shot, non-English medical context is still a significant challenge. This undercuts the common assumption that LLMs inherently possess multilingual capabilities across all domains. For instance, you might expect an LLM to seamlessly switch between languages and specialized jargon, but the paper states this isn’t always the case. The complexity and variability of clinical language, coupled with its high inner semantics, present unique hurdles for these models, as the team revealed. This means that for highly specialized tasks, a simpler, more tailored approach might still be more effective than a general-purpose LLM.
What Happens Next
Looking ahead, this research suggests that developers and healthcare providers need to be cautious. We might see a push for more specialized fine-tuning of LLMs for specific languages and domains; for example, LLMs fine-tuned on Italian medical texts could emerge in the next 12-18 months and improve accuracy significantly. Actionable advice for you: always test LLMs rigorously in your target language and domain before full deployment, and do not assume universal proficiency. The industry implications are clear: the dream of a single, all-encompassing multilingual LLM for critical applications like healthcare is still some way off. Instead, expect a future where specialized, localized AI solutions become more prevalent. The study finds that while LLMs have become capable tools for processing human-like text, their application in complex, multilingual healthcare scenarios requires more targeted development.
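"Test rigorously before deployment" can be made concrete with a tiny evaluation harness: score a model's extracted comorbidities against manual annotations, per record, before trusting it. The labels below are hypothetical, and this is a minimal sketch of the standard precision/recall comparison, not the paper's evaluation code.

```python
def precision_recall(predicted: set[str], gold: set[str]) -> tuple[float, float]:
    """Score extracted labels against a manually annotated gold set."""
    tp = len(predicted & gold)  # true positives: labels both sets agree on
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical example: gold labels from manual annotation vs. LLM output
gold = {"diabetes mellitus", "hypertension", "atrial fibrillation"}
pred = {"diabetes mellitus", "hypertension", "asthma"}  # one miss, one spurious

p, r = precision_recall(pred, gold)
print(f"precision={p:.2f} recall={r:.2f}")
# → precision=0.67 recall=0.67
```

Running this over a held-out sample of annotated records in your own language and domain — exactly the comparison the study performed against manual annotations — tells you whether the model's generalization gap is acceptable before it goes anywhere near production.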
