Why You Care
Ever wonder why talking to a chatbot can sometimes feel like a conversation with a brick wall? What if those frustrating interactions were actually a sign of a deeper problem in AI development? A new benchmark, ECom-Bench, has just dropped, revealing that even the most advanced large language models (LLMs) struggle with the real complexities of e-commerce customer support. This matters because it directly affects your online shopping experience and the efficiency of the businesses you interact with.
What Actually Happened
Researchers have introduced ECom-Bench, a pioneering benchmark designed to evaluate LLM agents. This benchmark focuses on their multimodal capabilities within e-commerce customer support, according to the announcement. It’s the first of its kind to offer a comprehensive testing ground for these AI systems. The framework uses dynamic user simulation, drawing on personas built from actual e-commerce customer interactions. What’s more, it includes a realistic task dataset derived from authentic e-commerce dialogues, as detailed in the blog post. These tasks cover a wide range of business scenarios, reflecting real-world complexities. This makes ECom-Bench particularly challenging for current AI models.
Why This Matters to You
Think about your last online purchase. Did you have a question about shipping, a return, or a product detail? If you ended up talking to a human anyway, this new research explains why. The study finds that even models like GPT-4o achieve only a 10-20% pass rate in the ECom-Bench framework. This low success rate highlights the substantial difficulties posed by complex e-commerce scenarios. It suggests that current AI agents are not yet ready to handle the nuanced, often multimodal (combining text, images, and more) demands of customer service.
For example, imagine you’re trying to return a damaged item. You might need to describe the damage, upload a photo, and then discuss shipping options. An LLM agent needs to understand all these elements and respond appropriately. Current models often fall short here. “These tasks, covering a wide range of business scenarios, are designed to reflect real-world complexities, making ECom-Bench highly challenging,” the paper states. This directly impacts how quickly and effectively your issues are resolved. What kind of customer support experience do you expect from an AI?
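To make the multimodal challenge concrete, here is a minimal sketch of what such a damaged-item request might look like when sent to a vision-capable model through an OpenAI-style chat API. The order number, image URL, and prompt wording are illustrative placeholders, not details from the ECom-Bench paper:

```python
# Hypothetical sketch: a return request that mixes text and an image,
# the kind of multimodal input ECom-Bench-style tasks demand.
# Assumes the OpenAI Python SDK (pip install openai) and a valid API key.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # vision-capable model named in the article
    messages=[
        {
            "role": "system",
            "content": "You are an e-commerce customer support agent.",
        },
        {
            "role": "user",
            "content": [
                # The customer's text description of the problem...
                {"type": "text",
                 "text": "My order #1234 arrived damaged. Can I return it, "
                         "and what are my shipping options?"},
                # ...plus a photo of the damage (URL is a placeholder).
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/damaged-item.jpg"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```

Handling a request like this means grounding the reply in both the text and the photo, which is exactly where the benchmark reports current agents falling short.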
Here’s a breakdown of what makes ECom-Bench so tough:
- Dynamic User Simulation: Based on real customer personas (see the sketch after this list).
- Authentic Dialogue Tasks: Derived from actual e-commerce conversations.
- Multimodal Capabilities: Requires understanding various types of information.
- Complex Scenarios: Designed to mimic real-world problems, not simplified tests.
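The dynamic user simulation deserves a closer look. The loop below is a minimal sketch of how a persona-driven simulated customer could be wired up against an agent under test; the `llm_reply` helper, persona text, and stop token are assumptions for illustration, not code from the ECom-Bench release:

```python
# Minimal sketch of persona-driven user simulation (hypothetical;
# not the ECom-Bench implementation). One LLM plays the customer,
# another plays the support agent, and the loop records the dialogue.

def llm_reply(system_prompt: str, dialogue: list[str]) -> str:
    """Placeholder for a call to any chat-completion LLM."""
    raise NotImplementedError  # wire up your model client of choice here

PERSONA = (
    "You are an impatient customer. You bought a blender that stopped "
    "working after two days. You want a refund, not a replacement, and "
    "you only reveal your order number if the agent asks for it. "
    "Say ###DONE### once your issue is resolved."
)
AGENT = "You are a helpful e-commerce customer support agent."

def run_episode(max_turns: int = 8) -> list[str]:
    dialogue: list[str] = []
    for _ in range(max_turns):
        user_msg = llm_reply(PERSONA, dialogue)   # simulated customer speaks
        dialogue.append(f"USER: {user_msg}")
        if "###DONE###" in user_msg:              # persona signals resolution
            break
        agent_msg = llm_reply(AGENT, dialogue)    # agent under test responds
        dialogue.append(f"AGENT: {agent_msg}")
    return dialogue
```

Because the simulated user withholds information and pushes back, the agent can't succeed with a single canned answer, which is what makes this style of evaluation so much harder than static question-answer tests.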
The Surprising Finding
Here’s the twist: despite all the hype around LLMs, their performance on ECom-Bench was surprisingly low. The research shows that even a model like GPT-4o managed only a 10-20% success rate. This challenges the common assumption that general-purpose LLMs are inherently capable of handling specialized, complex tasks like e-commerce customer support without extensive further development. It reveals a significant gap between current AI capabilities and the demands of real-world applications. The team revealed that the benchmark’s design intentionally pushes the limits of what current AI can do. This low score underscores the need for more specialized training and development for AI agents in this domain.
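For a sense of what a 10-20% score means mechanically, here is a generic sketch of strict benchmark scoring, where a task counts as passed only if every repeated trial succeeds. This is an illustration of how agent benchmarks in this family are often scored, assuming boolean task outcomes; the paper’s exact metric may be defined differently:

```python
# Generic sketch of strict benchmark scoring (illustrative; the paper's
# exact metric may differ). Each task is attempted several times, and a
# task "passes" only if every trial succeeds -- a criterion that drives
# scores down on hard, multi-step tasks.

def pass_rate(results: dict[str, list[bool]]) -> float:
    """results maps task_id -> list of per-trial success flags."""
    passed = sum(all(trials) for trials in results.values())
    return passed / len(results)

# Toy example: 5 tasks, 3 trials each. Flaky successes don't count,
# which is one reason strong models can still land in the 10-20% range.
toy = {
    "refund_damaged_item": [True, True, True],    # passes
    "change_shipping_addr": [True, False, True],  # fails (inconsistent)
    "price_match_request":  [False, False, False],
    "warranty_lookup":      [True, True, False],
    "multi_item_return":    [False, True, False],
}
print(f"pass rate: {pass_rate(toy):.0%}")  # -> 20%
```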
What Happens Next
This new benchmark will likely spur significant advances in AI agent development over the next 12-18 months. Developers now have a clearer target for improving their models. For example, future AI agents might integrate more robust multimodal processing to better interpret visual cues from customers. The code and data for ECom-Bench have been made publicly available, according to the documentation. This means researchers worldwide can use it to refine their LLM agents. Companies relying on AI for customer support should take these findings into account: they may need to adjust their expectations for fully automated solutions and focus instead on hybrid models that combine AI with human oversight. Your future interactions with chatbots will likely improve as a direct result of this kind of rigorous evaluation.
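One practical shape that hybrid pattern can take is an escalation gate: let the AI answer what it can, but route uncertain or sensitive cases to a person. The sketch below shows one simple way such a gate might be structured; the threshold, categories, and field names are assumptions for illustration, not recommendations from the paper:

```python
# Hypothetical escalation gate for a hybrid AI + human support setup.
# Threshold and category list are illustrative assumptions.
from dataclasses import dataclass

SENSITIVE = {"refund_over_limit", "legal_complaint", "account_security"}

@dataclass
class AgentTurn:
    reply: str          # draft reply produced by the LLM agent
    confidence: float   # agent's self-reported confidence, 0.0-1.0
    category: str       # classified intent of the customer request

def route(turn: AgentTurn) -> str:
    """Return 'auto' to send the AI reply, or 'human' to escalate."""
    if turn.category in SENSITIVE:
        return "human"            # never automate sensitive cases
    if turn.confidence < 0.75:    # assumed threshold
        return "human"            # low confidence -> human review
    return "auto"

print(route(AgentTurn("Refund approved.", 0.62, "refund_request")))  # human
```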
