New Benchmark Challenges LLMs' Question Interpretation

CompoST benchmark reveals limitations in how large language models understand complex queries.

A new benchmark called CompoST has been introduced to test the compositional interpretation abilities of large language models (LLMs). The research, presented by David Maria Schmidt and colleagues, shows that while LLMs perform well on many language tasks, they struggle to interpret structurally complex questions. This highlights a critical area for future AI development.

By Sarah Kline

October 31, 2025

4 min read

Key Facts

  • CompoST is a new benchmark for analyzing LLMs' compositional interpretation of questions.
  • The benchmark uses three datasets of varying difficulty based on DBpedia graph patterns.
  • LLMs struggle with structurally complex questions, even when understanding atomic parts.
  • Experimental results show macro F1 scores of 0.45, 0.26, and 0.09 on the datasets.
  • The research was presented at the 24th International Semantic Web Conference (ISWC 2025).

Why You Care

Ever asked an AI a seemingly simple question, only for it to completely miss the nuance? What if the AI understood individual words but failed to grasp their combined meaning? A new benchmark, CompoST, is shedding light on this exact challenge for large language models (LLMs). This research directly impacts how well your AI tools can truly comprehend your complex requests.

What Actually Happened

Researchers David Maria Schmidt, Raoul Schubert, and Philipp Cimiano have proposed a new benchmark called CompoST. The benchmark is designed to analyze the ability of LLMs to compositionally interpret questions, according to the announcement. Language interpretation is a compositional process: the meaning of complex linguistic structures is built up from the meaning of their parts. While LLMs excel at many language tasks, how systematic their interpretation process really is has remained an open question. To probe this, the team generated three datasets of varying difficulty based on graph patterns in DBpedia, as detailed in the paper. The datasets were created in a controlled fashion to test how LLMs interpret structurally complex questions even when they “understand” the atomic building blocks, which allows for a deeper evaluation of their comprehension.
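To make this concrete, here is a small illustrative sketch of what question/query pairs built from DBpedia graph patterns could look like, moving from a single atomic relation to a composition of several. The specific entities, predicates, and phrasing below are assumptions for illustration, not examples taken from the CompoST datasets.

```python
# Hypothetical question/SPARQL pairs over DBpedia-style graph patterns of
# increasing structural complexity (illustrative only; the real CompoST
# datasets may use different predicates, phrasing, and formats).

EXAMPLES = [
    {   # one triple pattern: a single "atomic" relation
        "question": "Who is the author of Dune?",
        "sparql": "SELECT ?x WHERE { dbr:Dune dbo:author ?x }",
    },
    {   # three triple patterns sharing variables: a composition of atoms
        "question": "Which books by the author of Dune were published by Chilton Books?",
        "sparql": (
            "SELECT ?b WHERE { "
            "dbr:Dune dbo:author ?a . "
            "?b dbo:author ?a . "
            "?b dbo:publisher dbr:Chilton_Company }"
        ),
    },
]

for example in EXAMPLES:
    print(example["question"], "->", example["sparql"])
```

The point of the controlled construction is that a model which handles the first kind of pattern should, in principle, also handle combinations of them.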

Why This Matters to You

This research has significant implications for anyone using or developing AI. If an LLM can’t grasp the full meaning of a complex question, its utility is limited. Imagine you’re using an AI assistant to plan a multi-stop trip. If you ask, “Find hotels in Paris near the Eiffel Tower that also have a pool and are pet-friendly,” an LLM with poor compositional interpretation might find hotels in Paris, or hotels with pools, but fail to combine all your criteria. Your AI experience directly depends on this capability.

Here’s what the CompoST benchmark evaluates:

  • Atomic Building Blocks: Can the LLM understand individual components of a question?
  • Structural Complexity: Can it combine these components to interpret a complex query?
  • Systematic Interpretation: Is its understanding consistent across similar question structures?

“Language interpretation is a compositional process, in which the meaning of more complex linguistic structures is inferred from the meaning of their parts,” the paper states. This means an AI needs to do more than just keyword matching. It needs to build meaning from the ground up. How often do you find yourself rephrasing questions for an AI because it doesn’t quite get it? This benchmark aims to improve that interaction for you.
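As a rough picture of what “building meaning from the ground up” means, the sketch below treats each clause of the earlier hotel request as an atomic constraint and interprets the full question as their conjunction. The data and field names are invented for illustration; real LLM interpretation is of course not a hand-written filter.

```python
# Illustrative sketch: compositional interpretation as combining atomic
# constraints. Each clause ("in Paris", "has a pool", "pet-friendly") is one
# predicate; the complex request is their conjunction. Toy data only.

hotels = [
    {"name": "Hôtel Lumière", "city": "Paris", "pool": True,  "pets": True},
    {"name": "Hôtel Seine",   "city": "Paris", "pool": True,  "pets": False},
    {"name": "Casa Verde",    "city": "Rome",  "pool": True,  "pets": True},
]

# Atomic building blocks: each one is easy in isolation.
in_paris     = lambda h: h["city"] == "Paris"
has_pool     = lambda h: h["pool"]
pet_friendly = lambda h: h["pets"]

# Compositional interpretation: the answer must satisfy all parts at once.
constraints = [in_paris, has_pool, pet_friendly]
matches = [h["name"] for h in hotels if all(c(h) for c in constraints)]
print(matches)  # ['Hôtel Lumière']
```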

The Surprising Finding

Here’s the twist: despite LLMs’ impressive general language abilities, their performance on compositional interpretation is still quite limited. The study finds that performance in terms of macro F1 score is 0.45, 0.26, and 0.09 across the three datasets. This suggests that even when LLMs have seen the individual components of a question, they struggle to combine them correctly in a novel, complex structure. This challenges the common assumption that simply training on vast amounts of text automatically leads to deep, systematic understanding. It indicates that current LLMs might be more adept at pattern matching than true compositional reasoning.
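For readers unfamiliar with the metric, macro F1 averages per-item F1 scores rather than pooling all predictions, so a few easy cases cannot mask many failures. The sketch below is a minimal, generic version of such a computation; the paper’s exact evaluation protocol (what counts as a predicted versus gold element) may differ.

```python
# Minimal sketch of a macro-averaged F1 score: compute F1 per item
# (predicted vs. gold sets of query elements), then average the F1 values.
# This only illustrates why a macro F1 of 0.09 signals very low accuracy.

def f1(predicted: set, gold: set) -> float:
    if not predicted or not gold:
        return 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(pairs) -> float:
    scores = [f1(predicted, gold) for predicted, gold in pairs]
    return sum(scores) / len(scores)

# Toy example: one perfect prediction, one partial, one complete miss.
pairs = [
    ({"a", "b"}, {"a", "b"}),       # F1 = 1.0
    ({"a"},      {"a", "b", "c"}),  # F1 = 0.5
    ({"x"},      {"a", "b"}),       # F1 = 0.0
]
print(round(macro_f1(pairs), 2))  # 0.5
```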

What Happens Next

This research, presented at the 24th International Semantic Web Conference (ISWC 2025) in November 2025, points to clear directions for future AI development. Developers will likely focus on improving LLMs’ ability to handle structured queries and complex logical relationships. For example, future AI assistants might incorporate more explicit reasoning modules to better combine the different criteria in your requests. For AI users, the actionable advice is to be aware of these limitations and to formulate questions as clearly and simply as possible. For industry, the results push toward models that can demonstrate stronger compositional understanding, moving beyond superficial language processing. The team reports that they experimented with models of different sizes, using various prompt and few-shot optimization techniques as well as fine-tuning, indicating a broad approach to testing these capabilities.
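As an illustration of what a few-shot probe of compositional interpretation might look like, the hypothetical prompt below pairs two atomic question/query examples with a question that composes them. The actual prompts, prompt-optimization methods, models, and fine-tuning setup used in the study are not reproduced here.

```python
# Hypothetical few-shot prompt: show atomic question/query pairs, then ask
# for a structurally more complex combination. Illustrative only; not the
# prompts used in the CompoST experiments.

FEW_SHOT_PROMPT = """\
Translate each question into a SPARQL query over DBpedia.

Q: Who directed Inception?
A: SELECT ?d WHERE { dbr:Inception dbo:director ?d }

Q: Where was Christopher Nolan born?
A: SELECT ?p WHERE { dbr:Christopher_Nolan dbo:birthPlace ?p }

Q: Where was the director of Inception born?
A:"""

# A correct compositional answer would chain the two atomic patterns, e.g.:
# SELECT ?p WHERE { dbr:Inception dbo:director ?d . ?d dbo:birthPlace ?p }
print(FEW_SHOT_PROMPT)
```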
