New Benchmark Reveals Major Visual Reasoning Gaps in Top AI Models

Despite advancements, leading MLLMs struggle with basic visual tasks humans find trivial, according to new research.

A new benchmark called VisFactor, inspired by human cognitive psychology, exposes significant limitations in how Multimodal Large Language Models (MLLMs) process visual information. Top models from OpenAI, Google, and Anthropic reportedly fail at fundamental visual reasoning tasks, highlighting a critical gap in their understanding.

August 10, 2025

4 min read

Why You Care

If you're a content creator relying on AI for image analysis, video summarization, or even just generating visual ideas, you might assume today's most capable AI models 'see' the world much like we do. New research suggests otherwise: even frontier Multimodal Large Language Models (MLLMs) remain largely blind to basic visual relationships and spatial reasoning that humans grasp effortlessly.

What Actually Happened

A recent paper, "Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs," introduces VisFactor, a benchmark designed to test MLLMs on visual tasks derived from established human cognitive psychology assessments. According to the authors, VisFactor digitizes 20 vision-centric subtests from a well-known cognitive psychology assessment, spanning four core domains of human visual cognition: Visualization and Spatial Processing, Perceptual and Closure, Memory, and Reasoning. The study evaluated 20 frontier MLLMs, including models from the GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. As the paper's abstract states, despite significant progress on popular multimodal benchmarks, current MLLMs still struggle with basic visual reasoning tasks that humans solve trivially, such as recognizing spatial relationships.
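To make the setup concrete, here is a minimal sketch of how a VisFactor-style subtest could be scored. This is not the authors' evaluation harness: the `Item` structure, the `query_model` callable, and the exact-match scoring are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """One subtest question: an image, a prompt, and a single reference answer."""
    image_path: str
    question: str
    answer: str
    domain: str  # e.g. "Spatial", "Closure", "Memory", "Reasoning"

def evaluate(items, query_model):
    """Return per-domain accuracy for one model.

    `query_model(image_path, question)` is a stand-in for whatever MLLM API
    you call; it should return the model's answer as a string.
    """
    correct, total = {}, {}
    for item in items:
        prediction = query_model(item.image_path, item.question)
        total[item.domain] = total.get(item.domain, 0) + 1
        if prediction.strip().lower() == item.answer.strip().lower():
            correct[item.domain] = correct.get(item.domain, 0) + 1
    return {domain: correct.get(domain, 0) / total[domain] for domain in total}
```

The point of grouping scores by domain, as the benchmark does, is that an aggregate accuracy number can hide exactly the kind of foundational gap the researchers report.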

Why This Matters to You

For content creators, podcasters, and AI enthusiasts, this finding has immediate practical implications. If you're using MLLMs to generate descriptions for images, analyze visual trends, or assist in video editing by understanding spatial layouts, these limitations mean you might not be getting the nuanced, accurate results you expect. For example, an MLLM might identify the objects in an image but fail to correctly describe their relative positions or how they interact spatially. According to the research, tasks like understanding 'which object is to the left of another' or 'identifying a complete shape from fragmented parts' remain challenging for these models. This directly impacts the quality and reliability of AI-generated visual content and analyses. Imagine asking an AI to summarize a video: it accurately identifies every character but completely misunderstands their movements and interactions within the scene. This isn't just a theoretical limitation; it translates into inaccurate captions, flawed visual narratives, and a need for significant human oversight in visual AI workflows.
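If you want to spot-check this in your own workflow, a simple probe is to ask a model a direct spatial question about one of your images and compare its answer against what you can see yourself. The sketch below uses the OpenAI Python SDK as one example; the model name and image URL are placeholders, and any other multimodal provider would work the same way in principle.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in your environment

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: substitute whichever multimodal model you use
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Is the mug to the left or the right of the laptop? Answer with one word."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/desk-photo.jpg"}},  # placeholder image
        ],
    }],
)

# Compare the model's one-word answer against what you see in the image.
print(response.choices[0].message.content)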

The Surprising Finding

The most surprising finding, as highlighted by the researchers, is that despite MLLMs achieving impressive scores on conventional multimodal benchmarks, they still exhibit foundational visual gaps when validated against human cognitive standards. The paper's abstract explicitly states that these models struggle with basic visual reasoning tasks that are 'trivially solved by humans.' This suggests that the benchmarks currently used to measure MLLM performance may not adequately reflect real-world visual understanding or human-like cognitive abilities. It implies that while MLLMs might be excellent at pattern matching and superficial recognition, their underlying visual comprehension lacks the intuitive, holistic grasp that even a child possesses. This counterintuitive result challenges the perception that simply scaling up models and data will inevitably lead to human-level visual intelligence; instead, it points to a need for fundamentally different approaches to visual learning in AI.

What Happens Next

This research, submitted on February 23, 2025, and revised on August 7, 2025, underscores a critical area for future AI development. The introduction of benchmarks like VisFactor, grounded in human cognitive psychology, will likely drive a shift in how MLLMs are trained and evaluated. We can expect AI researchers to focus more on developing models that can perform complex spatial and relational reasoning, rather than just object identification. This might involve new architectural designs or different training methodologies that emphasize understanding relationships and context over mere recognition. For content creators, this means that while current MLLMs require careful prompting and human review for visual tasks, future iterations, potentially within the next 2-3 years, could offer much more reliable and nuanced visual understanding. The immediate takeaway is to be aware of these limitations and to keep validating AI-generated visual content critically, but the long-term outlook points toward more genuinely intelligent visual AI systems.