ConDABench: LLMs Struggle with Real Data Analysis

New evaluation framework reveals current AI models falter on interactive, long-form data tasks.

A new framework, ConDABench, has been introduced to better evaluate Large Language Models (LLMs) on real-world data analysis. It exposes a critical weakness: while newer LLMs solve more problems, they struggle with tasks that demand sustained, long-form user interaction. This highlights a gap in AI's ability to handle ambiguous, evolving data analysis needs.

By Sarah Kline

October 17, 2025

4 min read


Key Facts

  • ConDABench is a new framework for evaluating Large Language Models (LLMs) on conversational data analysis tasks.
  • Existing benchmarks for LLMs often do not support interactivity or capture real-world data complexities.
  • ConDABench includes a multi-agent workflow for generating realistic benchmarks and 1,420 conversational data analysis (ConDA) problems.
  • Evaluation using ConDABench shows that newer LLMs solve more problem instances but are not necessarily better at tasks requiring sustained, long-form engagement.
  • The framework aims to help model builders develop truly collaborative AI models for complex interactive tasks.

Why You Care

Ever felt frustrated when an AI chatbot just doesn’t get what you’re asking, especially with complex data? What if the AI tools you rely on for data analysis are missing a crucial piece of the puzzle? A new evaluation framework called ConDABench suggests your instincts might be right. This research shows that while Large Language Models (LLMs) are improving, they still fall short in handling the messy, interactive nature of real-world data analysis.

What Actually Happened

Researchers have unveiled ConDABench, a novel framework designed to evaluate LLMs on conversational data analysis (ConDA) tasks. The framework addresses a significant gap in existing benchmarks, which often overlook the need for user interaction and the complexities of under-specified goals and unclean data, as detailed in the blog post. According to the announcement, real-world data analysis frequently requires back-and-forth communication to clarify user intent, a dynamic that traditional benchmarks simply don’t capture. ConDABench introduces a multi-agent workflow to generate realistic problems, along with an evaluation harness that makes it possible to test conversational data analysis tools systematically. The team revealed that this allows for a more accurate assessment of how LLMs perform when faced with ambiguous data and evolving user needs.
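The article does not spell out how the evaluation harness works internally, but conversational benchmarks of this kind are typically run by pairing the model under test with a simulated user that knows the full, hidden task intent and answers clarifying questions until the model commits to an answer. The sketch below is a minimal, hypothetical illustration of that pattern only; the names (`run_conda_episode`, the `FINAL:` convention, the callables passed in) are assumptions, not ConDABench’s actual API.

```python
# Hypothetical sketch of a conversational data-analysis evaluation loop.
# None of these names come from ConDABench; they only illustrate the pattern
# of pairing an analyst model with a simulated user holding the hidden intent.

from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str      # "analyst" or "user"
    content: str

@dataclass
class Episode:
    task_brief: str            # under-specified request shown to the analyst
    hidden_intent: str         # full goal, known only to the simulated user
    history: list = field(default_factory=list)

def is_final_answer(reply: str) -> bool:
    # Placeholder convention; a real harness would use a structured protocol.
    return reply.startswith("FINAL:")

def run_conda_episode(analyst, simulated_user, grader, episode, max_turns=10):
    """Run one multi-turn episode and return a score in [0, 1]."""
    prompt = episode.task_brief
    for _ in range(max_turns):
        reply = analyst(prompt, episode.history)          # model under test
        episode.history.append(Turn("analyst", reply))
        if is_final_answer(reply):
            return grader(reply, episode.hidden_intent)   # automatic scoring
        # Otherwise treat the reply as a clarifying question for the user agent.
        answer = simulated_user(reply, episode.hidden_intent)
        episode.history.append(Turn("user", answer))
        prompt = answer
    return 0.0  # ran out of turns without a final answer
```

The point of a loop like this is that a model can only score well by managing the whole conversation, not just the first reply, which is exactly the behavior the benchmark is designed to expose.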

Why This Matters to You

This development is essential for anyone using or building AI tools for data insights. Current LLMs, despite their advancements, are not yet truly collaborative partners for complex data analysis. The study finds that while newer models can solve more instances, they are not necessarily better at tasks demanding sustained engagement. This means your data analysis projects might still require significant human oversight, even with AI assistance.

Imagine you’re trying to extract specific trends from a vast, uncleaned dataset. An LLM might give you an initial answer, but what if your goal evolves or the data has unexpected quirks? This is where ConDABench shows current LLMs struggle. According to the research, “Evaluation of LLMs on the benchmarks reveals that while the new generation of models are better at solving more instances, they are not necessarily better at solving tasks that require sustained, long-form engagement.” This suggests a need for AI to learn how to ask clarifying questions and adapt over time.
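Concretely, “sustained engagement” means the model has to decide when to ask rather than answer. The snippet below is a deliberately simple, hypothetical illustration of that decision point; the vague-term heuristic and all function names are ours, not anything published with ConDABench or observed in a specific model.

```python
# Hypothetical illustration of a clarify-before-analyzing policy.
# The vague-term heuristic and all names are illustrative assumptions,
# not part of ConDABench or any particular LLM's behavior.

VAGUE_TERMS = ("trend", "recent", "best", "significant", "clean up")

def needs_clarification(request: str) -> bool:
    """Crude proxy: flag requests that lean on vague analysis language."""
    return any(term in request.lower() for term in VAGUE_TERMS)

def respond(request: str) -> str:
    if needs_clarification(request):
        return ("Before I run this: which time window counts as 'recent', "
                "and should rows with missing values be dropped or imputed?")
    return run_analysis(request)  # single-shot path for well-specified queries

def run_analysis(request: str) -> str:
    # Placeholder for the actual analysis step (e.g. generating pandas code).
    return f"FINAL: analysis for {request!r}"
```

A real system would infer ambiguity from the dataset and conversation history rather than a keyword list, but the control flow of pausing, asking, and then acting is the kind of behavior the benchmark probes for.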

Consider the following implications for your work:

| Aspect | Current LLM Performance (ConDABench) | Impact on You |
| --- | --- | --- |
| Problem Solving | Solves more individual problems | Good for clear, single-shot queries |
| Interactive Tasks | Struggles with sustained engagement | Requires more human intervention for complex goals |
| Data Ambiguity | Limited ability to disambiguate | You’ll spend more time clarifying for the AI |
| Real-world Context | Lacks understanding of evolving goals | Less effective for dynamic projects |

How much time do you currently spend clarifying your intent or refining your data queries for AI tools? This new evaluation highlights why that might be the case.

The Surprising Finding

Here’s the twist: you might assume that as LLMs get ‘smarter,’ they’d naturally become better at complex, interactive tasks. However, ConDABench reveals a surprising disconnect. The technical report explains that while LLMs are improving in solving more data analysis problems, this doesn’t translate to better performance on tasks requiring continuous, long-form interaction. This challenges the common assumption that increased problem-solving capacity automatically leads to superior collaborative intelligence. For example, an LLM might correctly answer 10 distinct, simple data questions. Yet, it could fail at a single, evolving data analysis project that requires several rounds of clarification and adjustment. This indicates a fundamental difference between solving isolated problems and engaging in a sustained, collaborative analytical process. The team revealed that this gap is a key area for future development.

What Happens Next

ConDABench provides a crucial avenue for model builders to measure progress toward truly collaborative models. We can expect to see AI developers focusing on improving LLMs’ ability to handle interactive data analysis over the next 12-18 months. This will likely involve new training methodologies that emphasize conversational context and user intent clarification. For example, future LLMs might proactively ask clarifying questions like, “Are you looking for a correlation between these two variables, or a causal relationship?” rather than just attempting a best guess. Actionable advice for you: stay updated on models specifically touting improved interactive capabilities. The industry implications are significant, pushing AI towards becoming more of a conversational partner in data exploration. The documentation indicates that this framework will foster competition to bridge the gap between current LLM capabilities and the demands of real-world data analysis.
