New AI Training Method Uses Language for Better LLM Agents

Natural Language Actor-Critic (NLAC) promises more stable and efficient learning for AI.

Researchers have introduced Natural Language Actor-Critic (NLAC), a novel method for training large language model (LLM) agents. This approach uses a generative LLM as a 'critic' to provide natural language feedback, improving stability and data efficiency in complex tasks. NLAC aims to overcome the limitations of traditional policy gradient methods.

By Katie Rowan

December 16, 2025

4 min read

Key Facts

  • Natural Language Actor-Critic (NLAC) is a new algorithm for training LLM agents.
  • NLAC uses a generative LLM as a 'critic' to provide natural language feedback.
  • This method improves learning stability and data efficiency compared to traditional policy gradient methods.
  • NLAC helps LLM agents understand why actions are suboptimal, fostering better reasoning.
  • The approach is applicable to tasks involving reasoning, web browsing, and tool-use with dialogue.

Why You Care

Ever wonder why some AI agents struggle with complex tasks, getting stuck in repetitive loops or making illogical choices? Imagine an AI that doesn’t just guess, but truly understands why an action was good or bad. This is precisely what a new research paper from Joey Hong and his team introduces. They’ve unveiled a novel training method called Natural Language Actor-Critic (NLAC) that could significantly improve how large language model (LLM) agents learn. This matters to you because better AI agents mean more capable tools, smarter assistants, and more reliable automation in your daily life.

What Actually Happened

Researchers Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, and Sergey Levine have proposed Natural Language Actor-Critic (NLAC). This new algorithm aims to train LLM policies more effectively, as detailed in the paper. LLM agents are AI systems that interact dynamically with environments over extended periods. They handle tasks like using tools, browsing the web, and engaging in dialogue, according to the announcement. Traditionally, training these agents without expert demonstrations relied on policy gradient methods. However, in long-horizon tasks with sparse rewards, this approach can lead to noisy and unstable training, the research shows. What’s more, exploring better actions in natural language spaces proves difficult. NLAC tackles these issues by employing a generative LLM critic. This critic produces natural language feedback instead of simple numerical values, the team revealed. This richer feedback helps LLM policies understand how to improve their actions without extensive random exploration.

Why This Matters to You

NLAC offers a more data-efficient and stable alternative to existing on-policy methods, as mentioned in the release. This means AI agents could learn faster and with less data. Think of it as getting detailed coaching instead of just a score. For example, instead of an AI agent just knowing it failed to book a flight, it might receive feedback like, “You selected an invalid date format, try ‘YYYY-MM-DD’.” This specific advice is far more useful than a simple ‘fail’ signal. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal. “Particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration,” the paper states. How might more stable and efficient AI training impact your future interactions with smart devices and automated services?

Here’s a look at the potential benefits of NLAC:

  • Improved Learning Stability: Reduces noise in training, leading to more consistent performance.
  • Higher Data Efficiency: Requires less data to achieve effective learning.
  • Better Exploration: Natural language feedback guides agents more directly to optimal actions.
  • Enhanced Reasoning: Agents can better understand why actions are suboptimal.

The Surprising Finding

The most intriguing aspect of NLAC is its departure from traditional scalar reward systems. Instead of receiving a simple number indicating success or failure, the LLM agent gets a natural language explanation. This is a significant shift because it moves beyond abstract numerical signals. The generative LLM critic produces natural language rather than scalar values, the paper states. This allows the agent to understand the reasoning behind a suboptimal action. It challenges the common assumption that numerical rewards are always the most efficient way to guide AI learning. The documentation indicates that this method provides a much richer and more actionable training signal. This means agents can learn from mistakes in a more human-like way. They can internalize complex feedback, making their improvements more targeted.
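The contrast can be made concrete with a toy comparison. Both feedback strings and the `actionable_hint` helper below are illustrative assumptions for this article, not outputs or code from the paper:

```python
# Hypothetical side-by-side of the two feedback signals.

# Traditional scalar critic: one number. It signals failure,
# but carries no explanation of what went wrong.
scalar_feedback = -1.0

# Generative language critic: the same judgment, plus the cause and a fix.
language_feedback = (
    "The booking failed because '03/01/2026' is an ambiguous date; "
    "resubmit it in 'YYYY-MM-DD' format."
)

def actionable_hint(feedback) -> str:
    """Extract a usable correction from the feedback, if one is present."""
    if isinstance(feedback, str) and "YYYY-MM-DD" in feedback:
        return "reformat date as YYYY-MM-DD"
    return ""  # a bare scalar offers nothing concrete to act on

hint_from_scalar = actionable_hint(scalar_feedback)  # empty: no guidance
hint_from_text = actionable_hint(language_feedback)  # concrete correction
```

The scalar tells the policy *that* it erred; the text also says *why* and *how* to fix it, which is what makes the language signal actionable.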

What Happens Next

NLAC shows promise in outperforming existing training approaches, offering a more stable and data-efficient training paradigm for LLM agents, according to the announcement. We can anticipate seeing more research and development in this area over the next 12-18 months. Imagine future AI assistants that not only complete tasks but also explain their process. For example, your smart home assistant might tell you, “I couldn’t dim the lights because the smart bulb is offline; check its power connection.” This level of detail is a direct result of NLAC’s capabilities. Developers might start integrating similar natural language feedback mechanisms into their AI models by late 2026 or early 2027. This will lead to more reliable and user-friendly AI applications. The industry implications are vast, suggesting a future where AI agents are more intuitive and less prone to obscure errors. “What’s more, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods,” the team revealed. Keep an eye out for more intelligent and communicative AI in your everyday tech.
