New Benchmark Reveals LLMs Struggle with Sinhala Language

SinhalaMMLU highlights significant gaps in AI understanding of low-resource languages.

A new benchmark, SinhalaMMLU, evaluates Large Language Models (LLMs) on the Sinhala language. The research reveals that even top models like GPT-4o struggle with culturally specific content. This underscores a critical need for better AI development for non-English languages.

By Sarah Kline

September 14, 2025

4 min read

Key Facts

  • SinhalaMMLU is the first multiple-choice question answering benchmark specifically for Sinhala.
  • The benchmark contains over 7,000 questions aligned with the Sri Lankan national curriculum.
  • It covers six domains and 30 subjects, including culturally grounded knowledge.
  • Claude 3.5 Sonnet achieved the highest accuracy at 67%, followed by GPT-4o at 62%.
  • Models struggled most in culturally rich domains such as the Humanities.

Why You Care

Ever wonder if your favorite AI assistant truly understands everyone, everywhere, or whether it is mostly built for English? A new benchmark, SinhalaMMLU, just revealed a significant gap: even leading Large Language Models (LLMs) struggle with languages like Sinhala. This limits how useful these tools are for a vast portion of the global population. What does this mean for the future of truly global AI?

What Actually Happened

Researchers have introduced SinhalaMMLU, a comprehensive benchmark for evaluating multitask language understanding in Sinhala. The dataset is specifically designed for a low-resource language, according to the announcement, and addresses a critical oversight in current AI evaluation methods. Many existing multilingual benchmarks rely on automatic translation, which can introduce errors, the research shows. SinhalaMMLU avoids those issues by being built directly in Sinhala. It includes over 7,000 multiple-choice questions spanning secondary to collegiate education levels, aligned with the Sri Lankan national curriculum. The benchmark covers six domains and 30 subjects, including both general academic topics and culturally grounded knowledge.
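
To make the benchmark's structure concrete, here is a minimal sketch of how one such multiple-choice item and its metadata might be represented in Python. The schema and field names are illustrative assumptions, not the dataset's actual format.

```python
from dataclasses import dataclass

@dataclass
class MCQItem:
    # Hypothetical schema for one SinhalaMMLU-style item;
    # field names are assumptions, not the published format.
    question: str        # question text, written natively in Sinhala
    choices: list[str]   # the multiple-choice options
    answer: int          # index of the correct option
    domain: str          # one of the six domains, e.g. "Humanities"
    subject: str         # one of the 30 subjects
    level: str           # "secondary" through "collegiate"

# An illustrative item (placeholder text, not drawn from the dataset).
item = MCQItem(
    question="...",  # a Sinhala question would appear here
    choices=["...", "...", "...", "..."],
    answer=2,
    domain="Humanities",
    subject="History",
    level="secondary",
)
```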

Why This Matters to You

This new benchmark directly impacts the accessibility and utility of AI for diverse linguistic groups. If you speak a language other than English, AI tools may serve you less effectively. Imagine trying to use an AI for complex tasks in your native tongue: the accuracy might be far lower than for English speakers. This is precisely the gap SinhalaMMLU aims to highlight. The study finds that current LLMs, even top performers, have significant limitations in non-English contexts.

For example, consider a student in Sri Lanka trying to use an AI tutor. If the AI cannot grasp the nuances of Sinhala history or literature, its help will be limited. The models struggle particularly in culturally rich domains, the team revealed. This directly affects your ability to access information and services through AI in your preferred language.

Top LLM Performance on SinhalaMMLU:

  • Claude 3.5 Sonnet: 67% accuracy
  • GPT-4o: 62% accuracy
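
For readers curious how these headline numbers arise: accuracy on a multiple-choice benchmark is simply the share of questions answered correctly, and it can be broken down per domain to expose weak spots like the Humanities. Below is a minimal sketch assuming items shaped like the `MCQItem` above and a `predict` callable standing in for an LLM; both are illustrative, not the authors' actual evaluation harness.

```python
from collections import defaultdict

def evaluate(items, predict):
    # Overall and per-domain accuracy for a list of MCQItem objects.
    # `predict` maps an item to a chosen option index; in practice it
    # would wrap a call to an LLM such as GPT-4o or Claude 3.5 Sonnet.
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        total[it.domain] += 1
        if predict(it) == it.answer:
            correct[it.domain] += 1
    overall = sum(correct.values()) / sum(total.values())
    per_domain = {d: correct[d] / total[d] for d in total}
    return overall, per_domain
```

At this scale, a 67% overall accuracy corresponds to roughly 4,700 of the benchmark's 7,000-plus questions answered correctly; the per-domain breakdown is what reveals the weaker performance on culturally rich subjects.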

“While recent multilingual benchmarks attempt to bridge this gap, many rely on automatic translation, which can introduce errors and misrepresent the original cultural context,” the release states. This underscores the importance of benchmarks like SinhalaMMLU, which provides a more authentic evaluation. How much more useful would AI be if it truly understood every language and culture?

The Surprising Finding

Here’s the twist: even the most advanced LLMs, like Claude 3.5 Sonnet and GPT-4o, showed surprisingly limited performance. While they achieved the highest average accuracies, their scores were still relatively low: Claude 3.5 Sonnet reached 67% accuracy, and GPT-4o achieved 62%. This is striking given their impressive capabilities in English, and it challenges the common assumption that these models possess universal understanding. The models particularly struggle in culturally rich domains, according to the paper, including subjects in the Humanities. This reveals substantial room for improvement in adapting LLMs to low-resource and culturally specific contexts.

What Happens Next

The introduction of SinhalaMMLU sets a new standard for evaluating LLMs in low-resource languages. Over the next 12-18 months, we can expect developers to focus more on improving AI performance in these areas, likely by creating more diverse training data. Imagine future AI models that can provide nuanced, culturally appropriate responses in Sinhala; this could revolutionize education and information access in regions like Sri Lanka. Developers should also build benchmarks for other low-resource languages to ensure more equitable AI development. The industry implications are clear: AI models must become truly multilingual and multicultural, and user feedback will be crucial in shaping their future development. Researchers hope this benchmark will drive significant advancements, leading to more inclusive and globally relevant AI technologies.
