New Benchmark Reveals LLMs Struggle with Structured Data

SKA-Bench uncovers significant challenges for large language models in understanding complex information.

A new benchmark called SKA-Bench evaluates how well large language models (LLMs) understand structured knowledge. The findings indicate that even advanced LLMs face difficulties with noisy or complex data, highlighting areas for future improvement.

August 27, 2025

4 min read

Key Facts

  • SKA-Bench is a new benchmark for evaluating LLM understanding of structured knowledge.
  • It covers four types of structured knowledge: KG, Table, KG+Text, and Table+Text.
  • Evaluations on 8 LLMs, including DeepSeek-R1, show significant challenges in structured knowledge understanding.
  • LLM performance is influenced by noise, knowledge unit order, and hallucination.
  • SKA-Bench includes four ability testbeds: Noise Robustness, Order Insensitivity, Information Integration, and Negative Rejection.

Why You Care

Have you ever wondered if your favorite AI chatbot truly understands complex information, like a detailed company report or a database? A new benchmark, SKA-Bench, has just revealed that even large language models (LLMs) still struggle with structured knowledge. This matters because it affects how reliably you can use AI for tasks requiring precise data interpretation.

This new evaluation tool offers a clearer picture of current AI limitations. It directly impacts the accuracy and trustworthiness of AI applications you might use daily. Understanding these challenges helps you make better decisions about AI adoption.

What Actually Happened

Researchers have introduced SKA-Bench, a new benchmark designed to rigorously evaluate the structured knowledge understanding of LLMs. It addresses the shortcomings of earlier, less comprehensive evaluation methods, as detailed in the blog post. The team aimed to diagnose specific weaknesses in how LLMs process different types of structured information.

SKA-Bench covers four common forms of structured knowledge: Knowledge Graphs (KG), Tables, KG combined with text, and Tables combined with text. The team used a three-stage pipeline to construct evaluation instances, according to the announcement. Each instance includes a question, an answer, and both relevant and irrelevant knowledge units. This setup allows for a fine-grained assessment of various capabilities.
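To make the instance format concrete, here is a minimal sketch of what one SKA-Bench instance could look like, based only on the description above (a question, an answer, and both relevant and irrelevant knowledge units). The field names and example facts are assumptions for illustration, not the benchmark's actual schema.

```python
# Hypothetical SKA-Bench-style instance; field names are assumed.
instance = {
    "question": "Which country is Paris the capital of?",
    "answer": "France",
    "relevant_units": [
        ("Paris", "capital_of", "France"),       # supports the answer
    ],
    "irrelevant_units": [
        ("Berlin", "capital_of", "Germany"),     # noise: true but unrelated
        ("Paris", "located_on", "Seine"),        # noise: related entity, wrong relation
    ],
}

def context_units(inst):
    """Combine relevant and noise units into the context a model would see."""
    return inst["relevant_units"] + inst["irrelevant_units"]
```

Mixing relevant and irrelevant units in one context is what lets the benchmark measure whether a model can pick out the facts that actually answer the question.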

Why This Matters to You

This new benchmark has practical implications for anyone using or developing AI. It highlights that while LLMs are impressive at generating fluent text, their ability to handle structured data, like your company’s sales figures or a detailed medical record, is not yet dependable. Imagine you’re using an AI to summarize a financial spreadsheet. SKA-Bench helps us understand why that summary might miss crucial details or misinterpret certain entries.

What’s more, the research shows that LLM performance is significantly influenced by factors such as noise in the data and the order of information. This means that feeding an LLM messy or unorganized data could lead to less reliable outputs for your tasks. How often do you work with perfectly clean data?

Key Structured Knowledge Forms Evaluated by SKA-Bench:

| Knowledge Form | Description |
| --- | --- |
| Knowledge Graph (KG) | Interconnected facts, like a web of relationships |
| Table | Data organized in rows and columns |
| KG + Text | Knowledge graphs combined with descriptive text |
| Table + Text | Tabular data augmented with narrative text |
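The four forms in the table above can be sketched as simple data structures. These representations are assumptions chosen for illustration (the paper does not prescribe a serialization): a KG as a list of triples, a table as rows under named columns, and the hybrid forms as a structured part paired with free text.

```python
# Illustrative (assumed) representations of the four knowledge forms.
kg = [("Paris", "capital_of", "France")]            # KG: subject-relation-object triples

table = {
    "columns": ["city", "country"],                  # Table: named columns...
    "rows": [["Paris", "France"]],                   # ...plus data rows
}

kg_plus_text = {
    "kg": kg,
    "text": "Paris has been France's capital since 508 AD.",
}

table_plus_text = {
    "table": table,
    "text": "The table lists European cities and their countries.",
}
```

The hybrid forms matter because answering a question may require integrating a fact from the structured part with a detail stated only in the accompanying text.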

One of the researchers stated, “existing LLMs still face significant challenges in understanding structured knowledge, and their performance is influenced by factors such as the amount of noise, the order of knowledge units, and hallucination phenomenon.” This quote from the paper underscores the ongoing need for improvement. Your future AI tools will benefit from these insights.

The Surprising Finding

Here’s the twist: despite the impressive advancements in LLMs, the empirical evaluations on eight representative models, including DeepSeek-R1, reveal a surprising truth. The study finds that these models still struggle significantly with structured knowledge. This challenges the common assumption that modern LLMs can effortlessly digest any form of data.

Specifically, the team revealed that LLMs are sensitive to the “amount of noise” and the “order of knowledge units.” This means that even slight inaccuracies or a jumbled presentation of facts can severely impact an LLM’s understanding. It’s like trying to read a book with missing pages and chapters out of order. This finding is particularly notable because many assume LLMs are robust enough to handle imperfect real-world data.

The research shows that LLMs exhibit vulnerabilities in four fundamental ability testbeds:
* Noise Robustness: Handling irrelevant or misleading information.
* Order Insensitivity: Processing information regardless of its sequence.
* Information Integration: Combining disparate pieces of knowledge.
* Negative Rejection: Disregarding incorrect or false information.
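Two of these testbeds lend themselves to a simple illustration. The sketch below (not the authors' code; function names are my own) shows how a context could be perturbed to probe Order Insensitivity, by reshuffling the same knowledge units, and Noise Robustness, by interleaving distractor units. A model that truly understands the structure should answer the same way before and after either perturbation.

```python
import random

def perturb_for_order(units, seed=0):
    """Return the same knowledge units in a different order (Order Insensitivity)."""
    shuffled = list(units)
    random.Random(seed).shuffle(shuffled)
    return shuffled

def perturb_for_noise(units, distractors):
    """Append irrelevant units to the context (Noise Robustness)."""
    return list(units) + list(distractors)

relevant = ["fact A", "fact B", "fact C"]
noisy = perturb_for_noise(relevant, ["unrelated fact 1", "unrelated fact 2"])
reordered = perturb_for_order(relevant, seed=42)
```

In an actual evaluation harness, each perturbed context would be fed to the model alongside the original question, and the answers compared; a drop in accuracy on the perturbed version indicates a failure of that ability.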

What Happens Next

The introduction of SKA-Bench marks a crucial step for AI development. Researchers now have a more rigorous tool to pinpoint the exact weaknesses of LLMs in understanding structured data. We can expect future LLM iterations to focus on improving these specific areas, potentially leading to more reliable AI applications by late 2025 or early 2026.

For example, imagine a future AI assistant that can accurately interpret complex legal documents or detailed engineering specifications without misinterpreting key facts. This benchmark provides the roadmap for achieving that. Developers will likely use SKA-Bench to refine their models, making them more robust against noise and order variations. For you, this means more dependable AI tools in the coming years. The industry implications are clear: a push towards LLMs that are not just fluent, but also factually precise.