AI's Sensory Gap: LLMs vs. Human Perception

New research reveals how large language models interpret the world, and where they still fall short compared to human senses.

A recent study explored how multimodal large language models (LLMs) perceive sensory information. Researchers found that while advanced LLMs can approximate human sensory associations, they still differ significantly from human embodied cognition. This suggests a persistent 'grounding deficit' in AI.

By Mark Ellison

November 11, 2025

4 min read

Key Facts

  • The study evaluated 21 LLMs from GPT, Gemini, LLaMA, and Qwen families.
  • Larger, multimodal, and newer models generally outperformed their counterparts.
  • Top models achieved 85-90% accuracy and 0.58-0.65 correlations with human sensory ratings.
  • Distributional factors like word frequency had minimal impact on model performance.
  • Despite improvements, LLMs still differ from human embodied cognition, showing a 'grounding deficit'.

Why You Care

Ever wonder whether the AI you chat with truly understands the world the way you do? Can it feel the warmth of a fire or smell a fresh-baked cookie? A new study dives deep into how large language models (LLMs) process sensory information, revealing surprising insights into their ‘perception’. This research matters to anyone building with or relying on AI, because it shapes how we design more intuitive, human-like AI experiences for everyday use.

What Actually Happened

Researchers investigated whether multimodal large language models can achieve human-like sensory grounding. They examined the models’ ability to capture perceptual strength ratings across sensory modalities, and how model characteristics like size, multimodal capability, and architectural generation influence performance. The study also analyzed dependencies on distributional factors such as word frequency and embeddings. The team evaluated 21 models from four major families: GPT, Gemini, LLaMA, and Qwen. The assessment covered 3,611 words from the Lancaster Sensorimotor Norms and used correlation, distance metrics, and qualitative analysis.
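To make the comparison concrete, here is a minimal Python sketch, not the study’s code: it compares a hypothetical model-produced perceptual-strength profile for a single word against a hypothetical human profile using Pearson correlation and Euclidean distance, the kinds of measures the study reports. The six modality labels follow the Lancaster Sensorimotor Norms; every numeric value below is an illustrative placeholder, not data from the paper.

```python
# Minimal sketch (not the study's code): comparing an LLM's perceptual-strength
# ratings for one word against human ratings on the six Lancaster modalities.
# All numbers are illustrative placeholders.
import numpy as np
from scipy.stats import pearsonr

MODALITIES = ["auditory", "gustatory", "haptic", "interoceptive", "olfactory", "visual"]

# Hypothetical 0-5 ratings for the word "cinnamon".
human_ratings = np.array([0.8, 4.3, 1.2, 0.9, 4.6, 3.1])  # human norms (made up)
model_ratings = np.array([1.0, 4.0, 1.5, 1.1, 4.4, 3.4])  # elicited from an LLM (made up)

# Correlation captures how well the model reproduces the shape of the sensory profile.
r, _ = pearsonr(human_ratings, model_ratings)

# Distance captures how far the model's absolute ratings sit from the human ones.
dist = np.linalg.norm(human_ratings - model_ratings)

print(f"Pearson r = {r:.2f}, Euclidean distance = {dist:.2f}")
```

In the study itself, comparisons of this kind were aggregated over the full set of 3,611 words and all 21 models.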

Why This Matters to You

This research has significant implications for how we interact with and develop AI. If you’re a content creator using AI to generate descriptions, understanding its sensory limitations is key. For example, an AI might describe a ‘spicy’ dish based on text patterns, but it doesn’t experience the heat. The study found that larger, multimodal, and newer models generally performed better.

Key Findings on Model Performance:

  • Larger Models: Outperformed smaller ones in 6 out of 8 comparisons.
  • Multimodal Models: Showed better results in 5 of 7 comparisons.
  • Newer Models: Surpassed older counterparts in 5 of 8 comparisons.
  • Top Models: Achieved 85-90% accuracy against human ratings.
  • Correlation with Humans: Showed correlations of 0.58-0.65.

“Top models achieved 85-90% accuracy and 0.58-0.65 correlations with human ratings, demonstrating substantial similarity,” the paper states. In other words, LLMs can approximate human associations with sensory words quite well, yet they are still not perfectly aligned with human cognition. What does this mean for your next AI-powered project?
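The paper’s exact accuracy metric isn’t spelled out in this summary, so the sketch below is an assumption offered purely for illustration: it scores agreement on each word’s dominant sensory modality, one plausible way a percentage accuracy against human ratings could be computed. All words and values are made up.

```python
# Illustrative-only sketch: scoring "dominant modality" agreement between
# hypothetical human and model rating profiles (rows = words, columns = the six
# Lancaster modalities). This is an assumed metric, not necessarily the paper's.
import numpy as np

# Columns: auditory, gustatory, haptic, interoceptive, olfactory, visual
human = np.array([
    [4.5, 0.3, 0.8, 0.4, 0.2, 2.1],  # "thunder" -> auditory
    [0.5, 4.1, 1.0, 0.9, 3.8, 2.9],  # "lemon"   -> gustatory
    [0.4, 0.2, 4.2, 0.6, 0.1, 3.0],  # "velvet"  -> haptic
])
model = np.array([
    [4.2, 0.5, 1.0, 0.5, 0.3, 2.4],  # agrees: auditory
    [0.6, 4.2, 1.1, 1.0, 3.9, 3.1],  # agrees: gustatory
    [0.5, 0.3, 2.9, 0.7, 0.2, 3.5],  # disagrees: picks visual over haptic
])

agreement = (human.argmax(axis=1) == model.argmax(axis=1)).mean()
print(f"Dominant-modality agreement: {agreement:.0%}")  # 2 of 3 words -> 67%
```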

The Surprising Finding

Here’s the twist: despite strong alignment, the models were not identical to humans. Even top performers showed differences on distance and correlation measures, and qualitative analysis revealed processing patterns consistent with a lack of sensory grounding. What’s more, the study finds it questionable whether introducing multimodality truly resolves this grounding deficit. Although multimodality improved performance, it seems to supply information similar to what large-scale text exposure already provides rather than qualitatively different data. That is surprising, because one might expect multimodal inputs (like images or sounds) to offer a richer, more human-like understanding. Instead, the benefits appeared even in sensory dimensions unrelated to the added modality, and large text-only models achieved comparable results in some cases, challenging the assumption that multimodality is a silver bullet for sensory grounding.

What Happens Next

This research suggests that while AI is getting better at statistical learning from vast datasets, true embodied cognition (the kind that allows humans to genuinely perceive and understand the world through their senses) remains a frontier. We may see continued efforts to bridge this gap over the next 12-18 months. Future AI development could focus on new architectural approaches that go beyond statistical association. Imagine an AI designed to learn from direct interaction with the physical world, not just from pre-existing data: a robot equipped with sensors, for example, could learn about ‘hot’ by touching a warm surface. That could lead to AI that truly grasps the nuances of sensory input. For you, this means anticipating AI tools that are more contextually aware. However, don’t expect AI to ‘feel’ exactly like you do anytime soon. “Our findings demonstrate that while LLMs can approximate human sensory-linguistic associations through statistical learning, they still differ from human embodied cognition in processing mechanisms, even with multimodal integration,” the team revealed.
