New AI Benchmark Reveals MLLMs Struggle with Scientific Reasoning

Scientists' First Exam (SFE) exposes limitations in multimodal AI's cognitive abilities.

A new benchmark, Scientists' First Exam (SFE), evaluates Multimodal Large Language Models (MLLMs) on scientific perception, understanding, and reasoning. The research shows leading MLLMs like GPT-o3 and InternVL-3 perform poorly, highlighting a significant gap in their scientific cognitive capacities. This suggests much work is needed for AI to truly assist in complex scientific discovery.

By Sarah Kline

November 5, 2025

4 min read


Key Facts

  • The Scientists' First Exam (SFE) is a new benchmark for evaluating Multimodal Large Language Models (MLLMs).
  • SFE assesses scientific signal perception, attribute understanding, and comparative reasoning.
  • The benchmark consists of 830 expert-verified VQA pairs across 66 multimodal tasks.
  • Leading MLLMs, GPT-o3 and InternVL-3, scored 34.08% and 26.52% respectively on SFE.
  • The results indicate significant room for improvement in MLLMs' scientific cognitive capacities.

Why You Care

Ever wonder if AI could truly revolutionize scientific discovery? Imagine an AI that could not only read scientific papers but also understand and reason like a human expert. What if that future is further away than we thought?

A new benchmark, the Scientists’ First Exam (SFE), has just been introduced. It aims to test the true cognitive abilities of Multimodal Large Language Models (MLLMs)—AI systems that can process various data types like text and images. This research directly impacts how you might use AI in fields requiring deep scientific insight.

What Actually Happened

Scientists have unveiled a new benchmark called the Scientists’ First Exam (SFE). This tool evaluates the cognitive abilities of Multimodal Large Language Models (MLLMs). These MLLMs are AI models designed to handle complex information, including both text and images. The SFE focuses on three key areas: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning, as detailed in the blog post.

This new benchmark addresses a critical gap. Current scientific evaluations primarily assess an MLLM’s knowledge understanding, but they often overlook its ability to perceive and reason. The SFE aims to provide a more comprehensive assessment. It includes 830 expert-verified visual question answering (VQA) pairs spanning 66 multimodal tasks across five high-value scientific disciplines. The goal is to see how well MLLMs can truly think like scientists.
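To make the setup concrete, here is a minimal sketch of how an evaluation over expert-verified VQA pairs might be scored. The item fields, example values, and the query_model function are illustrative assumptions, not SFE’s actual data schema or evaluation code; the real benchmark may use different matching rules, such as multiple-choice letter matching or rubric scoring.

```python
# Illustrative item format (assumed, not SFE's actual schema): each record
# pairs a scientific image with a question, an expert-verified answer, and
# one of the three cognitive levels the benchmark targets.
EXAMPLE_ITEMS = [
    {
        "image": "spectra_001.png",
        "question": "Which peak indicates the dominant emission line?",
        "answer": "B",
        "level": "signal_perception",   # or attribute_understanding,
        "discipline": "astronomy",      # or comparative_reasoning
    },
    # ... 830 expert-verified pairs across 66 tasks in the real benchmark
]

def query_model(image_path: str, question: str) -> str:
    """Placeholder for a call to an MLLM (hypothetical interface)."""
    raise NotImplementedError

def evaluate(items):
    """Score simple exact-match accuracy, overall and per cognitive level."""
    correct, per_level = 0, {}
    for item in items:
        prediction = query_model(item["image"], item["question"])
        hit = prediction.strip().lower() == item["answer"].strip().lower()
        correct += hit
        seen, hits = per_level.get(item["level"], (0, 0))
        per_level[item["level"]] = (seen + 1, hits + hit)
    overall = correct / len(items)
    return overall, {level: h / n for level, (n, h) in per_level.items()}
```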

Why This Matters to You

This research has significant implications for anyone interested in the future of AI in science. If you’re a researcher, developer, or just an AI enthusiast, these findings are important. They show where current MLLMs stand and where they need to improve. Imagine you’re developing an AI assistant for medical diagnostics. Its ability to reason about complex patient data, not just recall facts, is crucial. The SFE helps measure that deeper capability.

For example, think about an MLLM analyzing medical images. It needs to perceive subtle anomalies, understand their implications, and then compare them to known conditions. That’s exactly what the SFE tests. The study finds that current models are not yet performing at an expert level. As mentioned in the release, “Extensive experiments reveal that current GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE.” This indicates substantial room for improvement.

How might these limitations affect your work or future AI applications you rely on?

Here’s a snapshot of the SFE’s focus areas:

  • Scientific Signal Perception: Interpreting raw data from images or graphs.
  • Scientific Attribute Understanding: Grasping the meaning of specific features.
  • Scientific Comparative Reasoning: Drawing conclusions by comparing different pieces of scientific information.

The Surprising Finding

The most surprising finding from the SFE benchmark is the low performance of leading MLLMs. You might assume that models like GPT-o3 and InternVL-3 would excel at scientific tasks. However, the research shows they struggled significantly. Specifically, GPT-o3 scored only 34.08%, and InternVL-3 managed just 26.52% on the exam.
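To put those percentages in perspective, here is a rough back-of-the-envelope conversion, assuming the reported scores are simple accuracy over all 830 VQA pairs (the actual scoring may weight tasks or disciplines differently):

```python
TOTAL_ITEMS = 830  # expert-verified VQA pairs in SFE

for model, score in [("GPT-o3", 0.3408), ("InternVL-3", 0.2652)]:
    approx_correct = round(score * TOTAL_ITEMS)
    print(f"{model}: ~{approx_correct} of {TOTAL_ITEMS} items answered correctly")

# Under that simple-accuracy assumption: GPT-o3 ~283 of 830, InternVL-3 ~220 of 830.
```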

This is particularly surprising because these models are considered state-of-the-art. It challenges the common assumption that simply training on vast amounts of data automatically leads to human-like scientific reasoning. The team revealed that these results highlight “significant room for MLLMs to improve in scientific realms.” This suggests that while MLLMs can understand knowledge, their ability to truly perceive and reason scientifically is still rudimentary. It indicates a deeper cognitive gap than previously understood.

What Happens Next

The introduction of the SFE benchmark marks a crucial step forward. We can expect to see more focused research on improving MLLM capabilities in scientific reasoning. Developers will likely use SFE as a target for model training and refinement. Over the next 6-12 months, expect new MLLM architectures specifically designed to tackle these cognitive challenges.

For example, future MLLMs might incorporate more explicit scientific knowledge graphs or reasoning modules. This could help them better understand complex scientific relationships. The industry implications are clear: AI in scientific discovery will evolve more cautiously. It will focus on building genuine cognitive abilities, not just data recall. For you, this means future AI tools in science will be more reliable and intelligent. They will be better equipped to assist with complex tasks. The paper states that this benchmark will push MLLMs “to significantly enhance this discovery process in realistic workflows.”
