Why You Care
Ever wonder why your favorite AI chatbot sometimes gives you a surprisingly simple answer to a really complex question? Or perhaps it misses crucial details in a long document? A new research paper reveals some significant limitations in even the best Large Language Models (LLMs).
This matters to you because it impacts everything from AI assistants to data analysis tools. Understanding these limitations helps us build better, more reliable AI. Are we expecting too much from current AI models when it comes to deep understanding?
What Actually Happened
Researchers recently introduced KG-MuLQA (Knowledge-Graph-based Multi-Level Question-Answer Extraction), a novel framework designed to rigorously evaluate LLMs. The framework extracts question-answer (QA) pairs at varying levels of complexity, as detailed in the announcement. It focuses on three key dimensions: multi-hop retrieval, set operations, and answer plurality. This approach allows for a precise assessment of how well models perform across controlled difficulty levels, the technical report explains.
The team leveraged knowledge graphs to represent documents. This method creates a structured way to test an LLM’s ability to navigate and understand complex information. Using KG-MuLQA, they built a dataset of 20,139 QA pairs. These pairs were based on financial credit agreements, according to the announcement. They then evaluated 16 different LLMs, including both proprietary and open-weight models, the study finds.
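To make the knowledge-graph idea concrete, here is a minimal sketch (not the authors' code) of representing a document as (subject, relation, object) triples and answering a two-hop question by chaining edges. All entity and relation names below are invented for illustration.

```python
from collections import defaultdict

# Toy triples extracted from a hypothetical credit agreement.
triples = [
    ("LoanA", "borrower", "AcmeCorp"),
    ("AcmeCorp", "guarantor", "AcmeHoldings"),
    ("LoanB", "borrower", "BetaLLC"),
    ("BetaLLC", "guarantor", "BetaParent"),
]

# Index edges by (subject, relation) for fast traversal.
graph = defaultdict(list)
for subj, rel, obj in triples:
    graph[(subj, rel)].append(obj)

def multi_hop(start, relations):
    """Follow a chain of relations from a start node (multi-hop retrieval)."""
    frontier = {start}
    for rel in relations:
        frontier = {obj for node in frontier for obj in graph[(node, rel)]}
    return frontier

# Two-hop question: "Who guarantees the borrower of LoanA?"
print(multi_hop("LoanA", ["borrower", "guarantor"]))  # {'AcmeHoldings'}
```

A QA pair's difficulty can then be controlled by the length of the relation chain: one hop is a simple lookup, while longer chains force the model to combine facts from separate parts of the document.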
Why This Matters to You
The findings from the KG-MuLQA framework offer crucial insights into the current state of LLM capabilities. Even the most capable models face significant hurdles when dealing with intricate information. This directly affects how you might use these tools in your daily work or personal projects. Imagine relying on an AI to summarize a lengthy legal document or analyze complex financial reports.
For example, if you ask an LLM to identify all companies that meet three specific criteria across several pages of a contract, it might struggle. This is due to issues with set-based comparisons and multi-hop reasoning. “Even the best-performing models struggle with set-based comparisons and multi-hop reasoning over long contexts,” the paper states. This means your AI might miss essential connections or fail to aggregate information correctly.
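The "three criteria" task above is, at heart, a set intersection. This hedged sketch shows the operation a model implicitly has to perform; the criteria and company names are invented for illustration.

```python
# Entities satisfying each criterion, gathered from different parts
# of a hypothetical contract (all names are made up).
meets_revenue_floor = {"AcmeCorp", "BetaLLC", "GammaInc"}
has_collateral_pledge = {"AcmeCorp", "GammaInc", "DeltaCo"}
passed_covenant_test = {"GammaInc", "AcmeCorp"}

# "Which companies satisfy all three criteria?" is a set intersection.
all_three = meets_revenue_floor & has_collateral_pledge & passed_covenant_test
print(sorted(all_three))  # ['AcmeCorp', 'GammaInc']

# "Which meet the revenue floor but failed the covenant test?" is a difference.
failed = meets_revenue_floor - passed_covenant_test
print(sorted(failed))  # ['BetaLLC']
```

Symbolically these operations are trivial; the paper's point is that models must first extract each set correctly from long, non-sequential text and then combine them without dropping or inventing members.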
Consider the implications for search or data extraction. How confident are you that an LLM can accurately answer a question requiring it to piece together information from multiple, non-sequential paragraphs? This research highlights that current models often fall short in these scenarios. What specific tasks do you currently rely on AI for that might involve this kind of complex reasoning?
| LLM Evaluation Dimensions |
| --- |
| Multi-hop Retrieval |
| Set Operations |
| Answer Plurality |
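The third dimension, answer plurality, is easy to overlook: the same question template can have one answer or many, and an evaluator should only accept the complete set. A small sketch, with invented lender and loan names:

```python
# Gold answers for a hypothetical "Who are the lenders of X?" question.
lenders = {
    "LoanA": ["FirstBank"],                # singular answer
    "LoanB": ["FirstBank", "SecondBank"],  # plural answer (syndicated loan)
}

def grade(predicted, gold):
    """Exact set match: listing only one of several lenders counts as wrong."""
    return set(predicted) == set(gold)

print(grade(["FirstBank"], lenders["LoanA"]))  # True
print(grade(["FirstBank"], lenders["LoanB"]))  # False: missed a lender
```

Scoring this way separates models that merely find one plausible answer from models that exhaustively aggregate every valid one.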
The Surprising Finding
Here’s the twist: despite rapid advancements in LLM systems, the research shows a surprising weakness. The analysis revealed that even the most capable models struggle significantly with specific types of complex reasoning. They consistently failed at tasks involving set-based comparisons and multi-hop reasoning over long texts, the study finds. This challenges the common assumption that simply increasing model size or training data will solve all understanding problems.
Specifically, the team observed systematic failure modes tied to semantic misinterpretation. They also noted an inability to handle implicit relations, as mentioned in the release. This means LLMs don’t just forget facts; they often misunderstand the meaning behind complex queries. They also struggle to infer relationships not explicitly stated. This is surprising because many believe LLMs excel at language understanding. However, true comprehension, especially of nuanced logical connections, remains a significant challenge for them.
What Happens Next
These findings point to clear directions for future LLM development. Researchers will likely focus on improving models’ abilities in multi-hop reasoning and complex set operations. We might see new architectures or training methodologies emerge over the next 12-18 months. These could specifically target the identified weaknesses, the team revealed.
For example, imagine a future AI that can not only answer a direct question but also synthesize information from an entire legal brief. It would then identify all clauses that contradict each other. This would require multi-hop retrieval and set-based comparison skills.
For you, this means staying informed about these advancements. When choosing or developing AI solutions, prioritize models that demonstrate stronger capabilities in these complex reasoning areas. The industry will likely see a push for more evaluation benchmarks like KG-MuLQA. This will ensure models are truly intelligent, not just good at memorizing patterns. Your awareness of these limitations will help you make better decisions about AI adoption.
