New AI Dataset ProMQA Challenges Multimodal Understanding

Researchers introduce ProMQA, a novel dataset to evaluate AI's grasp of complex procedural activities.

A new dataset, ProMQA, has been developed to push the boundaries of AI's ability to understand multimodal procedural activities. It reveals a significant performance gap between current AI models and human capabilities, especially in tasks like following recipes. This development could reshape how we evaluate and improve AI systems.

By Mark Ellison

November 6, 2025

4 min read

Key Facts

  • ProMQA is a new question-answering dataset for multimodal procedural activity understanding.
  • It contains 401 multimodal procedural QA pairs based on user recordings of cooking activities and recipes.
  • The dataset was created using a human-LLM collaborative approach for annotation.
  • Experiments show a significant performance gap between current AI models and human capabilities on ProMQA.
  • The dataset aims to measure system advancements in application-oriented scenarios, not just traditional classification tasks.

Why You Care

Ever tried following a complex recipe or assembling furniture using only vague instructions? Imagine if your AI assistant struggled with these tasks. What if our AI systems aren’t as smart as we think when it comes to real-world procedural activities? A new dataset, ProMQA, has just revealed an essential gap in AI’s multimodal understanding. This directly impacts how useful AI can be in assisting your daily life, from cooking to complex repairs. Understanding this challenge is key to unlocking truly helpful AI.

What Actually Happened

Researchers have introduced ProMQA, a novel evaluation dataset designed to measure system advancements in application-oriented scenarios, according to the announcement. The dataset focuses on multimodal procedural activities, where people follow instructions to achieve specific goals. Unlike traditional AI evaluations that often rely on classification tasks, such as simply identifying an action, ProMQA probes a deeper level of understanding. It consists of 401 multimodal procedural question-answering (QA) pairs, based on user recordings of activities, primarily cooking, combined with their corresponding instructions or recipes, the paper states. To create the dataset efficiently, the team used a cost-effective human-LLM collaborative approach: existing annotations were augmented with QA pairs generated by a large language model (LLM), which were then human-verified, as detailed in the blog post.
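To make the setup concrete, here is a minimal sketch of what one multimodal procedural QA example and a simple accuracy check might look like. The field names (recording_id, recipe, question, answer) and the predict callback are hypothetical illustrations for this article, not the actual ProMQA schema or evaluation code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProceduralQAExample:
    """Hypothetical structure for one multimodal procedural QA pair."""
    recording_id: str  # identifier for the user's cooking recording
    recipe: str        # the written instructions the user was following
    question: str      # question about how the procedure was carried out
    answer: str        # gold answer (human-verified in ProMQA's annotation process)


def accuracy(examples: list[ProceduralQAExample],
             predict: Callable[[ProceduralQAExample], str]) -> float:
    """Fraction of questions answered correctly (exact match, for simplicity)."""
    correct = sum(
        predict(ex).strip().lower() == ex.answer.strip().lower()
        for ex in examples
    )
    return correct / len(examples)


# Usage: a trivial baseline that always answers "yes".
examples = [
    ProceduralQAExample(
        recording_id="cooking_001",
        recipe="1. Cream butter and sugar. 2. Beat in eggs. 3. Fold in flour.",
        question="Did the user cream the butter and sugar before adding the eggs?",
        answer="yes",
    ),
]
print(f"Baseline accuracy: {accuracy(examples, lambda ex: 'yes'):.2f}")
```

A real evaluation would, of course, pass the recording and recipe to a multimodal model rather than a fixed-answer baseline; the sketch only shows the shape of the task.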

Why This Matters to You

This new dataset isn’t just for academics; it has direct implications for the AI tools you use every day. Think about your smart home devices or digital assistants. How well do they truly understand complex, multi-step instructions? ProMQA aims to answer that. For example, imagine you’re following a video recipe, and the AI needs to tell you if you’ve added the right amount of sugar or if your dough has risen enough. This requires more than just recognizing objects; it demands understanding the procedure.

How much better could your AI assistant be if it truly understood every step of a complex task?

The research shows that current systems, including competitive proprietary multimodal models, fall well short of human performance. “Our experiment reveals a significant gap between human performance and that of current systems, including competitive proprietary multimodal models,” the team revealed. This means there is plenty of room for improvement in how AI combines visual, auditory, and textual information to understand a sequence of actions, which is crucial for developing truly intelligent assistants.

Here are some areas where ProMQA could drive improvements:

| Application Area | Current AI Limitation | ProMQA’s Potential Impact |
| --- | --- | --- |
| Smart Kitchens | Struggles with multi-step recipe verification | Enables AI to guide you through complex cooking processes |
| DIY & Assembly Guides | Fails to identify incorrect assembly steps | AI could provide real-time error correction for your projects |
| Medical Training | Limited understanding of surgical procedures | Improves AI’s ability to assist in procedural training |

The Surprising Finding

Here’s the twist: despite the rapid advancements in AI, especially in large language models and multimodal systems, current models fall significantly short of human capabilities in understanding procedural activities. The study finds a “significant gap between human performance and that of current systems.” This is surprising because many people assume that if an AI can generate text and recognize images, it can seamlessly combine these skills to understand complex procedures. However, the ProMQA dataset highlights that simply processing different types of data isn’t enough. The challenge lies in comprehending the sequence, context, and goal of a series of actions, as well as handling the nuances of real-world interactions. This challenges the common assumption that simply scaling up existing multimodal models will automatically lead to human-level procedural understanding.

What Happens Next

The introduction of ProMQA is expected to spur new research and development in multimodal AI. We can anticipate new models emerging in late 2025 or early 2026, specifically designed to tackle the challenges highlighted by this dataset. For example, future applications could include AI systems that proactively identify when you’ve made a mistake in a DIY project and offer corrective steps. This could lead to more capable and reliable AI assistants that genuinely understand your intentions and actions. The dataset’s availability, along with benchmark results, provides a clear target for researchers, and will likely lead to AI that can truly assist you in complex, real-world tasks, moving beyond simple recognition to genuine procedural comprehension. The team hopes their “dataset sheds light on new aspects of models’ multimodal understanding capabilities.”
