New Research Reveals LMMs Struggle with Human Feedback

A novel framework, InterFeedback, exposes limitations in the interactive intelligence of leading multimodal AI models.

New research introduces InterFeedback, a framework to test how well Large Multimodal Models (LMMs) learn from human feedback. The study found that even advanced LMMs like OpenAI-o1 struggle to refine responses, scoring less than 50%. This highlights a critical area for improving AI assistants.

By Mark Ellison

November 11, 2025

4 min read

Key Facts

  • InterFeedback is a new framework designed to test the interactive intelligence of Large Multimodal Models (LMMs).
  • Existing benchmarks do not adequately assess LMMs' interactive intelligence with human users.
  • InterFeedback-Bench evaluates 10 different open-source LMMs using MMMU-Pro and MathVerse datasets.
  • InterFeedback-Human is a new dataset of 120 cases for manually testing leading models like OpenAI-o1 and Claude-Sonnet-4.
  • OpenAI-o1, a state-of-the-art LMM, scored less than 50% on average when refining responses based on human feedback.

Why You Care

Ever feel like your AI assistant just isn’t getting it, even after you’ve tried to explain things multiple times? What if most AI models are not as ‘smart’ at learning from you as we thought? New research has unveiled a significant challenge for Large Multimodal Models (LMMs): their ability to truly understand and integrate human feedback. This directly affects how useful and intuitive your AI tools can become.

What Actually Happened

Researchers have introduced a new framework called InterFeedback, according to the announcement. The framework is designed to autonomously assess the ‘interactive intelligence’ of LMMs: an AI’s capacity to refine its responses based on user input and feedback. The team also developed InterFeedback-Bench, which uses datasets like MMMU-Pro and MathVerse to evaluate ten different open-source LMMs. What’s more, they created InterFeedback-Human, a dataset of 120 cases specifically for manual testing of top models, as mentioned in the release. This comprehensive approach aims to fill a gap in existing benchmarks, which often overlook this crucial interactive aspect.
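To make the protocol concrete, here is a minimal sketch of what an interactive feedback-evaluation loop of this kind might look like. Every name in it (`Problem`, `evaluate_interactive`, `answer`, `refine`, the feedback provider) is an illustrative assumption for this article, not the paper's actual API; the real benchmark uses LMMs and dataset-specific feedback, while this sketch only shows the general shape: the model answers, receives feedback when wrong, and is scored on how often it manages to correct itself.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """One benchmark item: a question and its reference answer."""
    question: str
    ground_truth: str


def evaluate_interactive(model, problems, feedback_provider, max_rounds=3):
    """Return the fraction of initially wrong answers the model corrects.

    `model` must expose answer(question) and refine(question, answer, hint);
    `feedback_provider(problem, answer)` plays the role of the human giving
    feedback on a wrong answer. Both are assumptions for this sketch.
    """
    corrected = initially_wrong = 0
    for p in problems:
        answer = model.answer(p.question)
        if answer == p.ground_truth:
            continue  # already correct; interactivity is not exercised
        initially_wrong += 1
        for _ in range(max_rounds):
            hint = feedback_provider(p, answer)
            answer = model.refine(p.question, answer, hint)
            if answer == p.ground_truth:
                corrected += 1  # feedback successfully incorporated
                break
    return corrected / initially_wrong if initially_wrong else 1.0
```

Under this framing, the headline result reads as: even a leading model corrects fewer than half of its initially wrong answers, i.e. a score below 0.5 from a loop like the one above.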

Why This Matters to You

This research has practical implications for anyone using or developing AI. Imagine trying to teach an AI a new skill, or refine its output for a creative project. If the AI struggles to interpret your corrections, your experience becomes frustrating. The study highlights that current LMMs, even leading ones, are not yet adept at this crucial interaction.

Think of it as trying to explain a complex recipe to a new chef. If they don’t adjust their technique after your suggestions, the meal won’t improve. This is similar to how LMMs are currently performing with human feedback.

So, what does this mean for your daily interactions with AI? It suggests that while LMMs are impressive, their ability to learn dynamically from your input is still developing. “Our evaluation results indicate that even the LMM, OpenAI-o1, struggles to refine its responses based on human feedback, achieving an average score of less than 50%,” the paper states. This finding underscores the need for better feedback integration mechanisms in AI systems. Do you find yourself repeating instructions to your AI assistant more often than you’d like?

InterFeedback-Bench Evaluation Details

Dataset               Purpose                                Models
MMMU-Pro              General interactive intelligence       10 open-source LMMs
MathVerse             Mathematical reasoning with feedback   10 open-source LMMs
InterFeedback-Human   Manual testing of leading models       OpenAI-o1, Claude-Sonnet-4

The Surprising Finding

Here’s the twist: despite the impressive capabilities of today’s Large Multimodal Models, their interactive intelligence is surprisingly low. The research shows that even models like OpenAI-o1 performed poorly when tasked with refining responses based on human input. Specifically, the team revealed that OpenAI-o1 scored less than 50% on average when trying to incorporate feedback. This challenges the common assumption that AI inherently learns well from user corrections. We might expect a model capable of generating complex text and images to easily adapt to simple feedback. However, the study finds that interpreting and effectively using feedback is a distinct and underdeveloped skill for these models. This suggests a significant gap between an LMM’s generation abilities and its interactive learning capacity.

What Happens Next

This research points to a clear direction for future AI development. The team’s findings “point to the need for methods that can enhance LMMs’ capabilities to interpret and benefit from feedback,” as detailed in the blog post. We can expect more focus on improving how Large Multimodal Models process and integrate user input over the next 12-18 months, with developers likely exploring new training methodologies and architectural changes to address this. Imagine, for example, a design tool where your AI assistant truly learns your aesthetic preferences after just a few corrections, rather than needing constant re-instruction.

For you, this means future AI assistants should become much more intuitive and less frustrating to use. As a user, consider providing specific, clear feedback to your AI tools, even if they don’t seem to ‘get it’ immediately; this helps generate valuable data for developers. The broader industry implication is a shift toward genuinely adaptive, user-centric AI systems, moving beyond impressive output generation to real interactive intelligence.
