Why You Care
Ever wonder if the AI tools you use truly understand everyone, everywhere? Large Language Models (LLMs) are powerful, but do they truly grasp the nuances of less common languages? This new research dives into how well these AI systems handle Greek, a language often overlooked by major developers. Why should you care? Because if AI isn’t culturally and linguistically inclusive, its benefits won’t reach everyone. What does this mean for your future interactions with AI?
What Actually Happened
Researchers Charalampos Mastrokostas, Nikolaos Giarelis, and Nikos Karacapilidis have introduced a significant new resource. They developed DemosQA, a novel dataset for Greek Question Answering (QA) built from real social media user questions and community-reviewed answers, designed to capture the Greek social and cultural context, according to the paper. What’s more, the team created a memory-efficient LLM evaluation framework that is adaptable to other QA datasets and languages. Using it, they conducted an extensive evaluation of 11 large language models, both monolingual and multilingual, across six human-curated Greek QA datasets and three distinct prompting strategies.
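The paper summary doesn’t spell out the framework’s internals, but a common way to make LLM evaluation memory-efficient is to hold only one model in memory at a time, releasing its weights before loading the next. Here is a minimal sketch of that pattern; the `load_model` and model names below are hypothetical placeholders, not the authors’ actual API:

```python
import gc

# Hypothetical stand-in for real model loading; in practice this would wrap a
# library such as Hugging Face transformers. Hypothetical model names below.
def load_model(name):
    return lambda question: f"stub answer from {name}"

MODELS = ["greek-mono-7b", "multilingual-8b"]
QUESTIONS = ["Ποια είναι η πρωτεύουσα της Ελλάδας;"]

def evaluate_all(models, questions):
    """Evaluate models one at a time so only one set of weights is resident."""
    results = {}
    for name in models:
        model = load_model(name)            # load weights for this model only
        results[name] = [model(q) for q in questions]
        del model                           # drop the reference...
        gc.collect()                        # ...and reclaim memory before the next load
    return results

results = evaluate_all(MODELS, QUESTIONS)
```

With real models, the `del` plus collection step is what keeps peak memory near a single model’s footprint rather than the sum of all 11.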
Why This Matters to You
This research directly impacts how effective AI can be for diverse language speakers. If you’re a content creator targeting a specific linguistic group, this study is highly relevant. It highlights the importance of language-specific data for accurate AI performance. Imagine trying to use an AI chatbot for customer service in Greece. If the AI doesn’t understand local idioms or cultural references, it won’t be very helpful. This is precisely the problem DemosQA aims to solve. As the authors note, “research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models.” This means many existing LLMs might not serve your specific language needs adequately.
Consider these key contributions from the research:
- DemosQA Dataset: Built from social media questions and community answers, reflecting authentic Greek cultural context.
- Memory-Efficient Evaluation Framework: A flexible tool for assessing LLMs across different QA datasets and languages.
- Extensive LLM Evaluation: 11 models (monolingual and multilingual) on 6 Greek datasets using 3 prompting strategies.
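The evaluation grid above (models × datasets × prompting strategies) can be pictured as a simple scoring harness. A minimal sketch follows; the three strategy names and the exact-match metric are illustrative assumptions, since the source doesn’t name the actual strategies or metrics used:

```python
# Illustrative prompting strategies (assumed, not the paper's actual three).
STRATEGIES = {
    "zero_shot": lambda q: f"Απάντησε στην ερώτηση: {q}",
    "few_shot": lambda q: f"Παράδειγμα: Ε: ... Α: ...\nΕ: {q}\nΑ:",
    "instruction": lambda q: f"Είσαι βοηθός για ελληνικά. {q}",
}

def exact_match(prediction, gold):
    """Exact-match metric after whitespace and case normalization."""
    return prediction.strip().lower() == gold.strip().lower()

def run_grid(model_fn, dataset, strategies):
    """Score one model on one QA dataset under each prompting strategy."""
    scores = {}
    for name, build_prompt in strategies.items():
        hits = sum(exact_match(model_fn(build_prompt(q)), a) for q, a in dataset)
        scores[name] = hits / len(dataset)
    return scores

# Toy model that echoes the prompt's last word, just to exercise the harness.
toy_model = lambda prompt: prompt.split()[-1]
toy_data = [("Αθήνα;", "Αθήνα;")]
scores = run_grid(toy_model, toy_data, STRATEGIES)
```

Running this harness over 11 models and 6 datasets would yield the kind of per-strategy score table the paper reports.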
How might this influence your choice of AI tools in the future? Do you think specialized datasets like DemosQA are essential for true global AI adoption?
The Surprising Finding
Here’s an interesting twist: while multilingual models are gaining popularity, their effectiveness for under-resourced languages remains less studied than that of monolingual counterparts, the paper states. The research points to a training-data bias in many multilingual models, which tends to favor a small number of popular languages. Alternatively, these models may rely on transfer learning from high-resourced to under-resourced languages, an approach that can misrepresent social, cultural, and historical aspects. This is surprising because we often assume multilingual models are inherently better due to their broader scope. Yet for a specific, less common language, a dedicated monolingual model trained on rich local data like DemosQA could perform better. It challenges the assumption that ‘more languages’ automatically means ‘better understanding’ of every single language.
What Happens Next
The release of DemosQA and its evaluation framework marks a crucial step forward. We can expect more targeted development of LLMs for under-resourced languages in the coming months; for example, developers might use the framework to refine existing models or build new ones specifically for Greek. The research team has released their code and data to facilitate reproducibility, so other researchers can build on the work immediately. This could lead to improved AI applications for specific linguistic communities by late 2026 or early 2027. If you’re an AI developer, consider using DemosQA to test your models’ cultural and linguistic accuracy. The industry implications are clear: a push toward more localized and culturally aware AI systems, ensuring that AI truly serves a global audience, not just a few dominant languages.
