New AI Training Fixes Hallucinations in Reasoning Models

Researchers unveil FSPO, an algorithm designed to make large language models more factual.

A new study reveals that standard reinforcement learning for large language models (LLMs) increases hallucinations. Researchers Junyi Li and Hwee Tou Ng propose Factuality-aware Step-wise Policy Optimization (FSPO) to reduce these factual errors. This method significantly improves both accuracy and reliability in AI reasoning.


By Mark Ellison

November 8, 2025

4 min read


Key Facts

  • Reinforcement learning (RL) optimization for reasoning tasks in LLMs significantly increases hallucinations.
  • Researchers Junyi Li and Hwee Tou Ng identified high-variance gradients, entropy-induced randomness, and spurious local optima as causes.
  • They developed Factuality-aware Step-wise Policy Optimization (FSPO) to address this issue.
  • FSPO incorporates explicit factuality verification at each reasoning step.
  • Experiments with Qwen2.5 and Llama models show FSPO reduces hallucinations and enhances reasoning accuracy.

Why You Care

Have you ever asked an AI a complex question, only to receive a confidently wrong answer? It’s frustrating, right? New research sheds light on why your AI tools might be making more factual errors, especially when they try to reason. This discovery could change how we train artificial intelligence, making your future interactions far more reliable.

According to the announcement, a fundamental flaw exists in current AI training methods. This directly impacts the trustworthiness of AI, particularly in areas like factual reporting or complex problem-solving. Understanding this issue is key to building more dependable AI systems for everyone.

What Actually Happened

Researchers Junyi Li and Hwee Tou Ng recently uncovered a significant issue with large language models (LLMs). As detailed in the blog post, optimizing these models for reasoning tasks using reinforcement learning (RL) actually makes them hallucinate more. Hallucinations, in this context, mean the AI generates false or nonsensical information.

The team revealed that traditional RL fine-tuning, while improving reasoning capabilities, inadvertently increases these factual errors. They identified several culprits. These include high-variance gradients, entropy-induced randomness, and susceptibility to spurious local optima. These factors lead to the AI making up facts more often.

To combat this, the researchers developed a new method called Factuality-aware Step-wise Policy Optimization (FSPO). This RL fine-tuning algorithm incorporates explicit factuality verification. It checks information at each step of the reasoning process. This dynamic adjustment of token-level advantage values incentivizes factual correctness throughout the AI’s thought process.
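To make the step-wise idea concrete, here is a minimal Python sketch of what factuality-aware advantage shaping could look like. It is an illustration under assumptions, not the authors' implementation of FSPO: the verifier interface, the function name, and the bonus/penalty values are all hypothetical.

```python
# Minimal sketch (assumption): shift token-level advantages by a per-step
# factuality verdict, so the policy update rewards verified steps and
# discourages unsupported ones. Names and constants are illustrative.
from typing import Callable, List

def factuality_adjusted_advantages(
    step_token_counts: List[int],        # number of tokens in each reasoning step
    base_advantages: List[float],        # per-token advantages from the RL objective
    verify_step: Callable[[int], bool],  # True if step i is judged factually supported
    fact_bonus: float = 0.5,
    hallucination_penalty: float = 1.0,
) -> List[float]:
    """Add a bonus to tokens in verified steps, a penalty otherwise."""
    adjusted = list(base_advantages)
    offset = 0
    for i, n_tokens in enumerate(step_token_counts):
        delta = fact_bonus if verify_step(i) else -hallucination_penalty
        for t in range(offset, offset + n_tokens):
            adjusted[t] += delta
        offset += n_tokens
    return adjusted

# Example: three reasoning steps; the second fails verification, so its
# tokens receive a lowered advantage and are discouraged in the update.
adv = factuality_adjusted_advantages(
    step_token_counts=[4, 3, 5],
    base_advantages=[0.2] * 12,
    verify_step=lambda i: i != 1,
)
print(adv)
```

In practice, the per-step verifier might be a retrieval-based check against a trusted source, and the adjusted advantages would feed into an otherwise standard policy-gradient update.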

Why This Matters to You

Imagine you’re using an AI for essential tasks, like drafting legal documents or summarizing scientific papers. The increased hallucination rate in reasoning models could lead to serious inaccuracies in your work. FSPO directly addresses this by making AI more reliable. Your AI tools could soon become much more trustworthy partners.

Key Benefits of FSPO:

  1. Reduced Hallucinations: AI models generate fewer factual errors.
  2. Enhanced Reasoning Accuracy: Models maintain strong problem-solving skills.
  3. Improved Reliability: Outputs are more consistently correct and dependable.
  4. Better User Experience: Less need for human fact-checking of AI-generated content.

For example, consider a medical AI assisting doctors. A hallucination could have severe consequences. With FSPO, the AI’s recommendations would be grounded in facts, significantly boosting confidence. The research shows that FSPO effectively reduces hallucinations while enhancing reasoning accuracy, substantially improving both reliability and performance, according to the paper. How much time could you save if you didn’t have to constantly fact-check your AI’s outputs?

The Surprising Finding

Here’s the twist: the very method used to make LLMs better at reasoning was also making them worse at being truthful. The study finds that reasoning-oriented RL fine-tuning significantly increases the prevalence of hallucinations. This is counterintuitive because we expect improvements in one area to not degrade another so drastically.

Statistical Highlight:

  • Standard RL fine-tuning for reasoning tasks significantly increases hallucination rates.

This finding challenges the assumption that all forms of reinforcement learning uniformly improve AI performance. It highlights a critical trade-off: improving reasoning without explicit fact-checking can lead to an AI that sounds logical but is factually incorrect. The team revealed that this issue stems from the training dynamics themselves. Factors like high-variance gradients and susceptibility to spurious local optima contribute to this problem.

What Happens Next

This research, accepted by NeurIPS 2025, suggests a new direction for AI development. We can expect to see FSPO, or similar factuality-aware methods, integrated into future LLM training protocols. This could happen within the next 12-18 months. AI developers will likely prioritize reducing hallucinations while maintaining reasoning capabilities.

For instance, imagine your next AI assistant. It won’t just solve complex problems. It will also verify its answers against reliable data in real time, ensuring greater accuracy. The industry implications are vast, impacting everything from customer service chatbots to scientific research tools. The documentation indicates that FSPO was evaluated on Qwen2.5 and Llama models, suggesting its applicability across various popular LLM architectures.

Our advice for you? Keep an eye on updates from major AI developers. Look for announcements about improved factuality and reduced hallucinations. This will be a key indicator of more reliable AI systems becoming available. The push for more factual AI is now a central focus, as the team revealed. This will lead to more trustworthy AI experiences for everyone.
