LLMs Learn Like Humans: 'Reward Is Enough' for Self-Improvement

New research shows large language models can perform in-context reinforcement learning during inference, boosting performance.

A recent paper introduces 'in-context RL' (ICRL), revealing that large language models (LLMs) can self-improve by optimizing numerical feedback during inference. This simple multi-round prompting framework significantly enhances LLM performance across various complex tasks, even with self-generated rewards.

By Katie Rowan

December 25, 2025

4 min read

Key Facts

  • LLMs exhibit 'in-context RL' (ICRL) during inference time.
  • ICRL prompting is a multi-round framework where LLMs receive numerical feedback (rewards) after each response.
  • Response quality consistently improves as the context, including prior responses and rewards, grows.
  • ICRL prompting significantly outperforms baselines like Self-Refine and Reflexion on tasks like Game of 24 and Olympiad-level math.
  • Performance improves even when the reward signals are generated by the same LLM.

Why You Care

Ever wish your AI tools could learn on the fly, getting better with each interaction? Imagine an AI that truly understands and adapts. What if your large language models (LLMs) could improve themselves in real-time, just by getting feedback? New research suggests this is not just possible, but already happening.

A paper titled “Reward Is Enough: LLMs Are In-Context Reinforcement Learners” reveals a fascinating new capability. It shows how LLMs can essentially learn from experience during a single conversation. This means your AI assistants could become much more intelligent and responsive, delivering better results instantly.

What Actually Happened

Researchers have uncovered a surprising phenomenon they call ‘in-context RL’ (ICRL), according to the announcement. This refers to the ability of large language models to perform reinforcement learning during their inference time. Inference time is when the model generates its output based on your input.

The team introduced a straightforward multi-round prompting structure, known as ICRL prompting. This structure guides LLMs to self-improve on a given task, as detailed in the blog post. After an LLM provides a response, it receives a numerical scalar feedback, or ‘reward’. In the subsequent round, the LLM is prompted again, but this time with a context that includes all previous responses and their associated rewards. The research shows that response quality consistently improves as this context grows.
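The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `generate` and `score` are hypothetical stand-ins for an LLM API call and a reward source (a verifier, a human, or the model itself).

```python
# Minimal sketch of the multi-round ICRL prompting loop described above.
# `generate` and `score` are hypothetical placeholders: in practice,
# `generate` would query an LLM and `score` would return the numerical reward.

def generate(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM here.
    return f"response to: {prompt[:30]}..."

def score(response: str) -> float:
    # Placeholder reward; could come from a verifier, a human, or the LLM itself.
    return float(len(response) % 10)

def icrl_prompt(task: str, rounds: int = 3) -> str:
    history = []  # (response, reward) pairs accumulated across rounds
    for _ in range(rounds):
        # Rebuild the prompt so it contains every prior response and its reward.
        context = "\n".join(f"Attempt: {r}\nReward: {rw}" for r, rw in history)
        prompt = f"Task: {task}\n{context}\nGive an improved answer."
        response = generate(prompt)
        history.append((response, score(response)))
    # Return the highest-reward response seen across all rounds.
    return max(history, key=lambda pair: pair[1])[0]
```

The key design point is that each round's prompt carries the full history of attempts and rewards, so the model can condition its next answer on what scored well before.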

Why This Matters to You

This discovery means LLMs can optimize scalar reward signals during inference. They exhibit behavior analogous to reinforcement learning, the paper states. Think of it as an LLM learning from its mistakes and successes within a single conversation. This could dramatically change how you interact with AI.

For example, imagine you’re using an LLM for creative writing. If you give it a ‘reward’ for a particularly good sentence, it can learn to produce more sentences like it. This iterative self-improvement is a powerful tool for enhancing AI performance. How often do you wish your AI could just ‘get it’ faster?

“We consistently observe that response quality improves as the context grows,” the team revealed. This means the longer you interact and provide feedback, the smarter your LLM becomes in that session. This could lead to more personalized and effective AI interactions for you.

Here’s a quick look at the impact of ICRL prompting:

| Task Category | ICRL Performance |
| --- | --- |
| Game of 24 | Significant improvements over baselines |
| Creative Writing | Enhanced quality through iterative feedback |
| ScienceWorld | Demonstrated better problem-solving |
| Olympiad-level Math | Outperformed Self-Refine and Reflexion on AIME and HMMT |

The Surprising Finding

Here’s the twist: ICRL prompting still improves performance even when the reward signals are generated by the same LLM. This challenges a common assumption that external, human-generated feedback is always necessary. The paper indicates that self-generated rewards are sufficient for improvement.

This is surprising because it suggests LLMs possess an intrinsic ability to evaluate and learn from their own outputs. It’s like an artist critiquing their own work and getting better without an external teacher. This highlights a promising new paradigm for test-time scaling, as mentioned in the release. Your AI could become a perpetual student, constantly refining its skills.
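A self-generated reward could be obtained by asking the same model to grade its own answer numerically. The following is a hedged sketch of one way to do that, not the paper's implementation; `generate` is a hypothetical LLM call, stubbed here so the snippet runs standalone.

```python
# Sketch of the self-reward variant: the same model that wrote the answer
# is asked to grade it. `generate` is a hypothetical LLM call, stubbed below.
import re

def generate(prompt: str) -> str:
    # Placeholder: a real implementation would query the same LLM.
    return "7" if "Rate" in prompt else "a draft answer"

def self_reward(task: str, response: str) -> float:
    grade = generate(
        f"Task: {task}\nResponse: {response}\n"
        "Rate this response from 0 to 10. Reply with a number only."
    )
    match = re.search(r"\d+(\.\d+)?", grade)
    # Fall back to 0 when the model does not return a parseable number.
    return float(match.group()) if match else 0.0
```

The returned score can then be fed back as the reward in the multi-round prompting loop, closing the self-improvement cycle without any external feedback.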

What Happens Next

This research points to a future where LLMs are far more adaptive and efficient. We can expect to see ICRL prompting integrated into various AI applications within the next 6-12 months. For example, customer service chatbots could learn to provide better answers based on user satisfaction signals in real-time.

Developers should consider implementing multi-round prompting frameworks in their AI solutions. This will allow LLMs to use in-context reinforcement learning for continuous improvement. The industry implications are vast, suggesting a shift towards more autonomous and self-correcting AI systems.

This capability could lead to more adaptive AI assistants that learn from every interaction. Your AI tools will become more capable, requiring less explicit fine-tuning. The paper states that this work “highlights a promising new paradigm for test-time scaling.”
