GPT-4o Matches Human 'Theory of Mind' in New Study

A recent study reveals advanced AI's surprising ability to infer beliefs and emotions from text.

New research from Anna Babarczy and colleagues investigates whether Large Language Models (LLMs) possess 'Theory of Mind' (ToM). The study, using a 'Strange Stories' paradigm, found that GPT-4o performed comparably to humans in inferring mental states, raising questions about AI's understanding.

By Katie Rowan

March 20, 2026

4 min read

Key Facts

  • The study evaluated Large Language Models (LLMs) for 'Theory of Mind' (ToM) capabilities.
  • Researchers used an adapted 'Strange Stories Paradigm' to test inference of beliefs, intentions, and emotions.
  • Five LLMs were compared against human controls in text-based scenarios.
  • GPT-4o achieved high accuracy and robustness, performing comparably to humans.
  • Earlier and smaller LLMs were affected by inferential cues and distracting information.

Why You Care

Ever wonder if the AI you chat with truly understands your intentions or emotions? This isn’t just a sci-fi fantasy anymore. A new study suggests that some AI models might be closer to this capability than we previously thought. Researchers tested Large Language Models (LLMs) on their ability to infer others’ beliefs, intentions, and emotions—a concept known as ‘Theory of Mind’ (ToM). This matters to you because it could change how you interact with AI, from customer service bots to creative writing assistants. It also raises important questions about the future of human-AI collaboration.

What Actually Happened

Researchers Anna Babarczy, Andras Lukacs, Peter Vedres, and Zeteny Bujka investigated whether current Large Language Models exhibit Theory of Mind (ToM) capabilities—that is, the ability to infer others’ beliefs, intentions, and emotions from text. They compared the performance of five different LLMs with human controls, using an adapted version of a text-based tool called the ‘Strange Stories Paradigm,’ which is widely used in human ToM research. Participants, both human and AI, answered questions about story characters’ mental states. The study aimed to determine whether LLMs achieve genuine mental-state attribution or whether their outputs are merely superficial pattern completion, the paper states. The results revealed a significant performance gap between the models.
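To make the setup concrete, here is a minimal sketch of what a Strange Stories-style evaluation loop might look like. The story item, the `model_answer()` stub, and the keyword rubric are illustrative assumptions on my part, not the authors' actual materials or scoring procedure (the paper's items and grading scheme are not reproduced here):

```python
# Hypothetical sketch of a Strange Stories-style evaluation harness.
# The story, the model stub, and the keyword rubric are assumptions
# for illustration only -- not the study's actual materials.

STORIES = [
    {
        "text": ("Anna tells her friend she loves the scarf he gave her, "
                 "although she plans never to wear it."),
        "question": "Why does Anna say she loves the scarf?",
        # Mental-state concepts a correct answer should mention (assumed rubric)
        "keywords": ["polite", "feelings", "white lie"],
    },
]

def model_answer(story: str, question: str) -> str:
    """Stand-in for an LLM call; a real harness would query a model API here."""
    return "She says it to be polite and spare her friend's feelings."

def score(answer: str, keywords: list[str]) -> bool:
    """Crude rubric: count an answer correct if it mentions any expected
    mental-state concept. The actual study would use human or structured grading."""
    return any(k in answer.lower() for k in keywords)

results = [score(model_answer(s["text"], s["question"]), s["keywords"])
           for s in STORIES]
print(f"accuracy: {sum(results) / len(results):.0%}")
```

In a real harness, `model_answer()` would be replaced with calls to each of the five LLMs, and the same stories and questions would be given to the human control group for comparison.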

Why This Matters to You

This research has practical implications for how you interact with AI every day. If an AI can better understand human intentions, your experiences with it will become much smoother. Imagine a virtual assistant that truly grasps the nuances of your requests, even when you don’t articulate them perfectly. How might this change your daily digital life?

Consider these potential impacts:

  • Enhanced Customer Service: AI agents could better understand your frustration or urgency.
  • Improved Content Creation: AI could generate more emotionally resonant stories or marketing copy.
  • More Intuitive Interfaces: Systems could anticipate your needs based on subtle cues.
  • Better Educational Tools: AI tutors could adapt to your learning style and emotional state.

For example, think of a situation where you’re trying to explain a complex problem to a chatbot. If the AI possesses ToM, it might infer your confusion or impatience. It could then offer clearer explanations or switch to a different approach. This goes beyond simply processing keywords. It involves understanding the underlying human experience. The study finds that “GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions.” This suggests a future where your AI companions are genuinely more empathetic.

The Surprising Finding

Here’s the twist: while earlier and smaller LLMs struggled, one model stood out. Earlier and smaller models were strongly affected by the number of relevant inferential cues available, the study finds, and were vulnerable to irrelevant or distracting information in the texts. GPT-4o, however, maintained high accuracy and robustness, matching human performance even in the hardest conditions. This is surprising because LLMs are trained purely on vast amounts of language data; they lack social embodiment and access to other manifestations of mental representations, as the abstract explains. The result challenges the common assumption that true understanding requires human-like experiences, and suggests that pattern recognition alone might be sufficient for some aspects of ‘Theory of Mind.’

What Happens Next

This study opens new avenues for AI development and understanding. We can expect further research in the next 12-24 months. Researchers will likely explore how these ToM capabilities can be integrated into practical applications. For example, future AI systems might offer more personalized mental health support, or provide more nuanced feedback in creative writing tools. The industry implications are significant, potentially leading to more natural and human-like AI interactions. As the paper states, this work contributes to ongoing debates about the cognitive status of LLMs. It also helps define the boundary between genuine understanding and statistical approximation. Developers should focus on building upon these findings. Your future interactions with AI could become remarkably more intuitive and empathetic.
