New CUS-QA Dataset Reveals LLMs Struggle with Local, Visual Knowledge

A new benchmark highlights significant limitations in large language models' ability to answer region-specific and visually-grounded questions.

Researchers have introduced CUS-QA, a novel dataset designed to test large language models (LLMs) on open-ended questions requiring local, regional knowledge, often involving visual understanding. Initial findings indicate that even top-tier LLMs perform poorly on these tasks, suggesting a gap in their current capabilities.

August 23, 2025

Why You Care

If you're a content creator, podcaster, or anyone relying on AI for research and content generation, you've likely experienced the frustration of an LLM confidently providing incorrect or generic information when you need something precise and localized. A new dataset, CUS-QA, shows just how deep this problem goes, especially when local knowledge or visual context is essential.

What Actually Happened

Researchers Jindřich Libovický, Jindřich Helcl, Andrei Manea, and Gianluca Vico have introduced CUS-QA, a new benchmark dataset for open-ended regional question answering, as detailed in their paper `arXiv:2507.22752`. This dataset uniquely combines both textual and visual modalities, focusing on 'local knowledge'. According to the abstract, CUS-QA consists of "manually curated questions and answers grounded in Wikipedia, created by native speakers from Czechia, Slovakia, and Ukraine, with accompanying English translations." The questions range from purely text-based to those explicitly requiring visual understanding, a key differentiator.

The team evaluated "current large language models (LLMs)" using prompting techniques and complemented these automated assessments with human judgments of answer correctness. This dual approach allowed them to analyze the reliability of existing automatic evaluation metrics. The results, as reported in the abstract, were striking: "Our baseline results show that even the best open-weight LLMs achieve only around 50% accuracy on textual questions and below 30% on visual questions."

Why This Matters to You

For content creators and podcasters, these findings are a significant reality check. If you're building a podcast series on local history, creating video content about regional landmarks, or writing articles that require nuanced understanding of specific geographic areas, relying solely on current LLMs could lead to factual inaccuracies. The study underscores that while LLMs excel at broad factual recall and creative text generation, their 'local' intelligence is still nascent. This means you can't simply prompt an LLM to generate a script about the historical significance of a specific, lesser-known building in Prague and expect accurate, reliable details without extensive human verification.

The low accuracy on visual questions, specifically "below 30%," is particularly sobering for anyone working with visual content. Imagine trying to use an LLM to describe a specific, culturally significant monument from an image, or to generate a caption that accurately reflects a nuanced visual detail from a regional festival. The CUS-QA results suggest that current models are largely incapable of this without significant human oversight. This implies that tasks requiring visual interpretation for localized content, such as generating descriptions for travel vlogs or creating detailed narratives for historical documentaries based on visual archives, will continue to demand substantial human expertise and verification.

The Surprising Finding

One of the more counterintuitive findings from the CUS-QA research concerns the performance of evaluation metrics. While human judgments were paramount, the researchers discovered that "LLM-based evaluation metrics show strong correlation with human judgment." This suggests that while LLMs struggle to answer these local knowledge questions accurately, they can be surprisingly effective at evaluating the correctness of answers, even their own. This presents an intriguing paradox: the models are poor at generation in this specific domain but potentially strong at assessment. Furthermore, the study noted that "traditional string-overlap metrics perform surprisingly well due to the prevalence of named entities in answers." This indicates that for questions where answers are primarily proper nouns (like names of places or people), simpler evaluation methods can still be effective, even if they don't capture deeper semantic understanding.
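The paper doesn't specify which string-overlap metric it used, but the effect is easy to see with a standard token-level F1 score of the kind common in extractive QA evaluation. This minimal, illustrative sketch shows why answers dominated by named entities score well under simple overlap: the entity tokens either match or they don't.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count tokens shared between prediction and reference
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A named-entity answer still scores highly despite a stray article...
print(token_f1("the Charles Bridge", "Charles Bridge"))  # 0.8
# ...while a semantically related paraphrase scores zero.
print(token_f1("a famous old bridge in Prague", "Charles Bridge"))
```

The second call illustrates the limitation the authors note: overlap metrics reward surface matches, not semantic understanding, which is exactly why they happen to work well when answers are mostly proper nouns.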

This finding has practical implications for developers and researchers working on improving LLMs. It suggests a potential avenue for self-correction or fine-tuning, where an LLM could generate multiple answers and then use another LLM-based evaluation mechanism to select the most probable correct one. For content creators, it means that while the raw output might be flawed, there's a glimmer of hope for AI-assisted fact-checking or quality control, provided the right evaluation frameworks are implemented.
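A minimal sketch of that generate-then-evaluate loop, assuming hypothetical `generate_answers` and `judge_score` stand-ins for real model calls (neither is from the paper; a production version would replace both stubs with actual LLM API calls):

```python
import random

def generate_answers(question: str, n: int = 3) -> list[str]:
    # Hypothetical stand-in for sampling n candidate answers
    # from a generator LLM.
    return [f"candidate answer {i} to: {question}" for i in range(n)]

def judge_score(question: str, answer: str) -> float:
    # Hypothetical stand-in for an LLM-based judge returning
    # a correctness score in [0, 1].
    return random.random()

def best_of_n(question: str, n: int = 3) -> str:
    """Generate n candidates, then keep the one the judge rates highest."""
    candidates = generate_answers(question, n)
    return max(candidates, key=lambda a: judge_score(question, a))

print(best_of_n("Which bridge crosses the Vltava in central Prague?"))
```

The design rests on the paper's observation: even if the generator is unreliable on local-knowledge questions, a judge that correlates strongly with human judgment can rank candidates, turning weak generation into a usable selection signal.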

What Happens Next

The introduction of CUS-QA is a crucial step in pushing the boundaries of LLM research beyond general knowledge. It establishes a clear benchmark for 'local knowledge' and 'visual understanding' in LLMs. Future developments will likely focus on how models can improve their grounding in specific geographic and cultural contexts, perhaps through more targeted fine-tuning on regional datasets or improved multimodal fusion architectures. For content creators, this means that while current LLMs aren't ready to be your sole research assistant for localized content, this new benchmark signals where future improvements will be concentrated.

We can expect to see more specialized models emerging that are designed to handle such nuanced, region-specific information. This could involve LLMs trained on vast archives of local news, historical documents, and geographically tagged visual data. While it's unlikely we'll see near-perfect accuracy on CUS-QA-like tasks in the near future, this dataset provides a roadmap for researchers. Over the next 12-24 months, expect incremental improvements in LLMs' ability to handle localized text-based questions, with visually-grounded regional knowledge likely remaining a significant challenge for a longer period. For now, human expertise and rigorous fact-checking remain indispensable for any content requiring deep local or visual understanding.