JARVIS AI Boosts Embodied Agents with Commonsense Reasoning

A new neuro-symbolic framework, JARVIS, significantly improves conversational AI agents' ability to understand and act in real-world tasks.

Researchers have introduced JARVIS, a neuro-symbolic AI framework designed for conversational embodied agents. This system combines the strengths of symbolic reasoning and deep learning to enhance agents' ability to perform real-life tasks with better understanding and decision-making. It has shown state-of-the-art results in various dialog-based scenarios.

By Mark Ellison

September 4, 2025

Key Facts

  • JARVIS is a neuro-symbolic commonsense reasoning framework for conversational embodied agents.
  • It combines large language models (LLMs) for understanding and symbolic reasoning for planning.
  • JARVIS achieved state-of-the-art results on three dialog-based embodied tasks (EDH, TfD, TATC).

Why You Care

Ever wish your smart home assistant could truly understand your complex requests, not just simple commands? Imagine an AI that doesn’t just hear you, but comprehends the nuances of a conversation and acts accordingly. This isn’t just a futuristic dream anymore. Researchers have unveiled JARVIS, a new framework designed to make conversational AI agents far more intelligent and capable. Why should you care? Because this development could soon make your interactions with AI much more natural and effective, bridging the gap between basic commands and genuine understanding. What if your AI could anticipate your needs?

What Actually Happened

Researchers have proposed JARVIS, a neuro-symbolic commonsense reasoning framework, according to the paper. The framework aims to build conversational embodied agents that can execute real-life tasks. Building such agents has historically been challenging because it requires effective human-agent communication and multi-modal understanding. Traditional symbolic methods, while precise, struggle with scaling and generalization, the researchers note. End-to-end deep learning models, on the other hand, often suffer from data scarcity and high task complexity, and their decisions can be hard to explain. JARVIS combines the two approaches. It acquires symbolic representations (think of these as structured knowledge) by prompting large language models (LLMs) for language understanding and sub-goal planning. It also constructs semantic maps from visual observations. The symbolic module then reasons over these representations to plan sub-goals and generate actions, drawing on task-level and action-level common sense.
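The division of labor described above can be illustrated with a toy sketch. This is not the JARVIS implementation: the names SubGoal, parse_dialog_to_subgoals, and plan_actions, the canned "mug in the sink" request, and the map coordinates are all hypothetical, and the language model is replaced by a stub so the example runs on its own. It shows only the general shape: a language step produces symbolic sub-goals, a semantic map records where objects were seen, and a symbolic planner applies simple commonsense checks before emitting actions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SubGoal:
    action: str  # e.g. "Pickup" or "Place" (hypothetical action names)
    target: str  # object the action applies to


def parse_dialog_to_subgoals(dialog: str) -> List[SubGoal]:
    """Stand-in for the LLM prompting step (assumption: a real system
    would query a language model here instead of matching keywords)."""
    text = dialog.lower()
    if "mug" in text and "sink" in text:
        return [SubGoal("Pickup", "Mug"), SubGoal("Place", "Sink")]
    return []


def plan_actions(
    subgoals: List[SubGoal],
    semantic_map: Dict[str, Tuple[int, int]],
) -> List[Tuple[str, object]]:
    """Symbolic step: check commonsense preconditions, then emit actions."""
    actions: List[Tuple[str, object]] = []
    holding = None
    for goal in subgoals:
        # The agent can only act on objects it has already observed.
        if goal.target not in semantic_map:
            raise ValueError(f"{goal.target} has not been observed yet")
        # Action-level common sense: nothing can be placed with empty hands.
        if goal.action == "Place" and holding is None:
            raise ValueError("cannot Place while holding nothing")
        actions.append(("Navigate", semantic_map[goal.target]))
        actions.append((goal.action, goal.target))
        if goal.action == "Pickup":
            holding = goal.target
        elif goal.action == "Place":
            holding = None
    return actions


# Hypothetical semantic map: object name -> grid position from vision.
semantic_map = {"Mug": (2, 3), "Sink": (5, 1)}
plan = plan_actions(
    parse_dialog_to_subgoals("Please put the mug in the sink"),
    semantic_map,
)
for step in plan:
    print(step)
```

Because every intermediate result here is a plain data structure, each stage can be inspected or swapped out separately, which is the kind of modularity and interpretability the article attributes to the neuro-symbolic design.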

Why This Matters to You

The JARVIS framework has significant practical implications for conversational AI. It makes agents more modular, generalizable, and interpretable, the paper states. This means AI systems could become easier to develop, adapt to new situations, and even explain their decisions. For example, imagine you’re trying to instruct a robotic vacuum cleaner to clean a specific spill. Instead of just saying “clean,” you could say, “The dog knocked over the plant, please clean up the dirt near the window.” JARVIS-powered agents could understand the context and plan the necessary steps. This is a big step towards AI that can handle complex, multi-step instructions, making your life easier. What if your AI could truly collaborate with you on tasks?

Key Performance Boosts with JARVIS:

Task Category                          Previous Success Rate (Example)   JARVIS Success Rate (Example)
Execution from Dialog History (EDH)    6.1%                              15.8%
Trajectory from Dialog (TfD)           Significant improvement
Two-Agent Task Completion (TATC)       Significant improvement

Extensive experiments on the TEACh dataset validate the efficacy and efficiency of JARVIS, the researchers report. It achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks: Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC). The team specifically highlighted that their method raises the unseen Success Rate on EDH from 6.1% to 15.8%. This is a substantial improvement in an agent’s ability to complete tasks based on past conversations. The JARVIS model also ranked first in the Alexa Prize SimBot Public Benchmark Challenge, according to the paper, which points to strong performance in a competitive setting.
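To put the EDH figures in perspective, the short calculation below separates the absolute gain (in percentage points) from the relative gain (as a multiple). Only the two success rates come from the article; the function name improvement is a hypothetical helper for illustration.

```python
def improvement(before_pct: float, after_pct: float):
    """Return (absolute gain in percentage points, ratio after/before)."""
    absolute = after_pct - before_pct
    ratio = after_pct / before_pct
    return absolute, ratio


# Unseen Success Rate on EDH as reported: 6.1% before, 15.8% with JARVIS.
abs_gain, ratio = improvement(6.1, 15.8)
print(f"{abs_gain:.1f} percentage points, {ratio:.2f}x the previous rate")
```

In other words, the reported gain is 9.7 percentage points, or roughly 2.6 times the earlier success rate, while still leaving most unseen EDH tasks unsolved.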

The Surprising Finding

Here’s a surprising twist: the research systematically analyzed the essential factors affecting task performance. They also demonstrated the superiority of their method in few-shot settings, as mentioned in the release. This is surprising because deep learning models typically require vast amounts of data to learn effectively. The ability of JARVIS to perform well with only a few examples—a
