New AI Benchmark Exposes LLM Weakness in Data Structuring

A new benchmark, AOE, reveals even advanced LLMs struggle to organize fragmented information into structured tables.

Despite expectations, large language models (LLMs) often produce disorganized text. Researchers introduced the Arranged and Organized Extraction Benchmark (AOE) to test LLMs' ability to create structured tables from complex documents. The results indicate significant struggles across both open-source and closed-source models.

By Katie Rowan

November 1, 2025

4 min read


Key Facts

  • A new benchmark called Arranged and Organized Extraction Benchmark (AOE) has been introduced.
  • AOE evaluates LLMs' ability to reconstruct fragmented document information into organized tables.
  • The benchmark includes 11 tasks across three diverse domains, requiring context-specific schema generation.
  • Both open-source and closed-source state-of-the-art LLMs struggled significantly on the AOE benchmark.
  • The research highlights a gap in LLMs' capability to extract and structure deep knowledge effectively.

Why You Care

Ever felt overwhelmed by a wall of text, wishing someone would just summarize it into a neat table for you? What if the AI tools you rely on are struggling with this very task? This new research highlights a fundamental limitation in how large language models (LLMs) handle complex information, directly impacting your ability to get clear, structured data from them.

What Actually Happened

Researchers have introduced a new tool called the Arranged and Organized Extraction Benchmark (AOE). This benchmark aims to systematically evaluate how well LLMs can take fragmented information from various documents and reconstruct it into an organized table, according to the announcement. Unlike older text-to-table tasks, which used fixed structures, AOE features 11 unique tasks across three different domains. These tasks require models to create specific table structures based on the input questions, as detailed in the blog post. The team evaluated both publicly available and proprietary LLMs against this new benchmark. The findings show that even the most advanced models encountered significant difficulties.

Why This Matters to You

Imagine you’re a content creator trying to extract key statistics from a lengthy report for your next video. Or perhaps you’re a podcaster needing to quickly compare features of different products mentioned across several articles. When LLMs produce “chaotic, disorganized, and untraceable” answers, as the research shows, your workflow suffers. This new benchmark highlights why your AI tools might not be delivering the structured data you expect.

Here’s what the AOE benchmark represents:

  • Complex Document Comprehension: Understanding scattered information.
  • Reconstruction of Isolated Data: Bringing disparate facts together.
  • Context-Specific Schema Generation: Creating tables tailored to the query.
  • Bilingual Capability: Handling data in multiple languages.
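To make "context-specific schema generation" concrete, here is a minimal, hand-rolled sketch of the underlying task, using hypothetical data and simple regex extraction rather than an LLM. The fragments, product names, and column choices are illustrative assumptions, not part of the AOE benchmark:

```python
import re
from typing import Dict, List

# Hypothetical toy data (not from the AOE benchmark): product facts
# scattered across several short "documents".
fragments = [
    "The Alpha X1 laptop weighs 1.2 kg.",
    "Pricing update: the Beta Pro now costs $999.",
    "The Alpha X1 is priced at $1299.",
    "At 1.8 kg, the Beta Pro is heavier than most ultrabooks.",
]

def extract_table(docs: List[str], products: List[str]) -> List[Dict[str, str]]:
    """Reconstruct scattered facts into one table row per product.

    The column set (price, weight) is chosen to fit the query, loosely
    mirroring AOE's requirement that the schema match the question.
    """
    rows = {p: {"product": p} for p in products}
    for doc in docs:
        for name in products:
            if name not in doc:
                continue
            # Pull out any price ("$999") or weight ("1.2 kg") mentioned
            # alongside this product and file it under the right row.
            price = re.search(r"\$(\d+)", doc)
            if price:
                rows[name]["price_usd"] = price.group(1)
            weight = re.search(r"([\d.]+)\s*kg", doc)
            if weight:
                rows[name]["weight_kg"] = weight.group(1)
    return list(rows.values())

table = extract_table(fragments, ["Alpha X1", "Beta Pro"])
for row in table:
    print(row)
```

In the benchmark proper, an LLM must infer both the rows and the columns from the question itself; here the schema is hard-coded just to keep the example runnable.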

How often do you find yourself sifting through AI-generated text to pull out the exact data points you need? This research suggests that while LLMs are great at generating text, their ability to organize it into a precise, usable format is still developing. The paper reports that even the most advanced models struggled significantly, which is a clear signal that current AI isn't a silver bullet for data structuring.

The Surprising Finding

Here’s the twist: despite the widespread expectation that LLMs are excellent at extracting explicit information, the AOE benchmark reveals a substantial gap. The research shows that even leading LLMs, both open-source and closed-source, performed poorly when asked to construct structured tables from complex, fragmented documents. This challenges the common assumption that these AIs can effortlessly transform unstructured text into perfectly organized data.

The benchmark includes 11 carefully crafted tasks across three diverse domains. This goes beyond simple text-to-table conversion: it demands that models generate context-specific table structures. The team found that this capability is far from perfected, indicating a major area for improvement in LLM development.
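The article doesn't spell out how AOE scores outputs, but benchmarks of this kind often compare a model's predicted table to a gold-standard table cell by cell. The sketch below is a hedged illustration of such a metric, an assumption rather than AOE's documented scoring:

```python
from typing import Dict, List

def cell_f1(pred: List[Dict[str, str]], gold: List[Dict[str, str]]) -> float:
    """Cell-level F1 between a predicted and a gold table.

    Hypothetical scoring sketch; the AOE paper may use different metrics.
    Each (row key, column, value) triple counts as one cell, with rows
    aligned by the 'product' column.
    """
    def cells(table):
        out = set()
        for row in table:
            key = row.get("product", "")
            for col, val in row.items():
                if col != "product":
                    out.add((key, col, val))
        return out

    p, g = cells(pred), cells(gold)
    if not p or not g:
        return 0.0
    tp = len(p & g)  # cells that match exactly
    precision = tp / len(p)
    recall = tp / len(g)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

gold = [{"product": "Alpha X1", "price_usd": "1299", "weight_kg": "1.2"}]
pred = [{"product": "Alpha X1", "price_usd": "1299", "weight_kg": "1.5"}]
print(round(cell_f1(pred, gold), 2))  # one of two cells correct -> 0.5
```

A metric like this rewards getting both the schema and the values right, which is exactly where the article says current models fall short.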

What Happens Next

This research points to a clear direction for AI development over the next 12-18 months. We can expect to see more specialized LLMs or fine-tuning techniques focusing on structured table construction. For example, imagine a future where you can feed an LLM a dozen research papers and it instantly generates a comparison table of methodologies and results, perfectly tailored to your query. Developers will likely work to improve models' ability to understand context and generate flexible table schemas.

For you, this means that while current LLM outputs might require some manual cleanup, future versions will likely be much more adept at organizing data. Keep an eye out for updates and new models specifically designed for deep knowledge extraction; they will make your data analysis tasks much more efficient. The industry implications are significant, pushing LLM research beyond mere text generation toward robust data-structuring capabilities.
