Why You Care
Ever wondered why even AI sometimes stumbles over simple grammar? Or why your favorite language model might sound a bit off in a non-English language? A new study reveals that current Large Language Models (LLMs) struggle significantly more with Danish linguistic acceptability than previously thought. This directly impacts how well AI can communicate in Danish, affecting everything from customer service bots to translation tools. Do you trust AI to speak your language flawlessly?
What Actually Happened
Researchers have unveiled a new benchmark called DaLA, which stands for Danish Linguistic Acceptability. This benchmark aims to provide a more realistic evaluation of LLMs, according to the announcement. The team first analyzed common errors found in everyday written Danish. Based on this analysis, they developed fourteen specific “corruption functions.” These functions systematically introduce errors into otherwise correct Danish sentences. The goal is to generate incorrect sentences that mirror real-world mistakes. The accuracy of these corruptions was then validated using both manual and automatic methods, as mentioned in the release. The resulting dataset serves as a benchmark for evaluating how well LLMs can judge linguistic acceptability in Danish.
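To make the idea concrete, here is a minimal sketch of what one such corruption function could look like. The paper’s actual fourteen functions are not reproduced here; this hypothetical example simulates one common Danish error type, adjective agreement, by adding a spurious neuter “-t” ending to an adjective modifying a common-gender noun.

```python
# Hypothetical sketch of a DaLA-style "corruption function" (an assumption,
# not the paper's actual implementation). It corrupts a correct Danish
# sentence by introducing an adjective-agreement error.

def corrupt_adjective_agreement(tokens, adjective_index):
    """Return a corrupted copy of `tokens` with a spurious -t on one adjective."""
    corrupted = tokens.copy()
    corrupted[adjective_index] = corrupted[adjective_index] + "t"
    return corrupted

# Correct: "Jeg har en stor hund" ("I have a big dog"; "hund" is common gender,
# so "stor" should not take the neuter -t ending).
correct = ["Jeg", "har", "en", "stor", "hund"]
corrupted = corrupt_adjective_agreement(correct, 3)
# The original and corrupted sentences form an (acceptable, unacceptable) pair.
```

Applied at scale, functions like this turn a corpus of correct sentences into labeled pairs of acceptable and unacceptable text, which is what lets the benchmark test whether a model can tell the two apart.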
Why This Matters to You
This development has direct implications for anyone using or building AI applications in Danish. If an LLM cannot reliably distinguish between correct and incorrect Danish, its utility in real-world scenarios is limited. Imagine trying to use an AI assistant that constantly makes grammatical errors; it would be frustrating, right? The new benchmark, DaLA, provides a clearer picture of these limitations. The research shows that current LLMs perform significantly worse on this new, more challenging benchmark compared to older ones. This suggests a need for better training data and models for specific languages.
Here’s a snapshot of what DaLA offers:
- Broader Scope: Incorporates a wider variety of real-world error types.
- Increased Difficulty: Makes the task of judging linguistic acceptability much harder for LLMs.
- Higher Discriminatory Power: Better at identifying the differences between high- and low-performing models.
One of the authors, Gianluca Barmina, stated, “Our findings demonstrate that this extension is both broader and more comprehensive than the current state of the art.” This means your AI tools might need significant improvements to truly master Danish. How much do you rely on AI for essential communication in languages other than English?
The Surprising Finding
The most striking revelation from this research is the significant drop in LLM performance. Despite advancements, LLMs scored lower on DaLA than on existing benchmarks. This indicates that previous evaluations might have overestimated the linguistic capabilities of these models in Danish. The paper states that the new benchmark increases task difficulty, leading to lower performance. This is surprising because it challenges the assumption that LLMs are rapidly achieving near-human proficiency across all languages. The team revealed that their benchmark possesses a “higher discriminatory power.” This allows for a much clearer distinction between truly capable models and those merely performing adequately on simpler tests. It suggests that many LLMs still have a long way to go before truly mastering the nuances of languages like Danish, especially when confronted with common, real-world errors.
What Happens Next
Looking ahead, this new Danish Linguistic Acceptability benchmark will likely become a crucial tool for developers. We can expect to see new LLMs specifically trained or fine-tuned to perform better on DaLA in the coming months. For example, imagine a Danish customer support chatbot that can now handle complex, grammatically imperfect customer inquiries with greater accuracy. This research provides a clear roadmap for improving multilingual AI. Developers should use DaLA to rigorously test their models, ensuring higher quality outputs for Danish speakers. The industry implications are significant, pushing for more specialized and linguistically aware AI development. As the team revealed, this benchmark will help distinguish truly well-performing models from lower-performing ones, driving innovation in language-specific AI capabilities.
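The testing workflow described above can be sketched in a few lines. This is a hedged illustration, not the paper’s evaluation code: `judge` is a hypothetical stand-in for a real LLM call, and the scoring rule (accept the correct sentence, reject the corrupted one) is an assumption about how such pairwise acceptability benchmarks are typically scored.

```python
# Minimal sketch of benchmarking a model on acceptability pairs.
# `judge` is a hypothetical callable: sentence -> bool (True = acceptable).

def evaluate(judge, pairs):
    """Fraction of (correct, corrupted) pairs where the model accepts
    the correct sentence and rejects the corrupted one."""
    correct_judgements = 0
    for acceptable, unacceptable in pairs:
        if judge(acceptable) and not judge(unacceptable):
            correct_judgements += 1
    return correct_judgements / len(pairs)

# Toy judge for illustration: flags only the misagreed form "stort hund".
toy_judge = lambda sentence: "stort hund" not in sentence

pairs = [("Jeg har en stor hund", "Jeg har en stort hund")]
print(evaluate(toy_judge, pairs))  # → 1.0
```

In practice the judge would be an LLM prompted for an acceptability verdict, and the score across many pairs is what lets a benchmark like DaLA separate high- from low-performing models.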
