LLMs Still Struggle with Complex Math Reasoning, Study Finds

New research reveals the limits of fine-tuning for advanced mathematical problem-solving in AI.

Despite significant gains from supervised fine-tuning (SFT), large language models (LLMs) hit a wall with harder math problems. A recent study categorizes problem difficulties and highlights where current AI approaches fall short, especially for 'Extremely Hard' challenges requiring unconventional thinking.

By Katie Rowan

January 12, 2026

4 min read


Key Facts

  • Supervised fine-tuning (SFT) significantly improves LLM performance on mathematical reasoning tasks.
  • The study categorized mathematical problems into four difficulty tiers: Easy, Medium, Hard, and Extremely Hard (Exh).
  • LLMs achieve R1 reasoning for Easy to Medium problems with minimal SFT (500-1K instances).
  • Accuracy on Hard-level questions improves only logarithmically with training data, plateauing at around 65%.
  • Extremely Hard problems require unconventional problem-solving skills that current models struggle with.

Why You Care

Ever wonder why your AI assistant can ace basic algebra but fumbles with a complex word problem? You’re not alone. New research sheds light on exactly what large language models (LLMs) can and cannot solve in mathematical reasoning, even after extensive training. This matters because understanding these limitations helps us build better, more reliable AI. What does this mean for your daily interactions with AI?

What Actually Happened

A recent study titled “Climbing the Ladder of Reasoning: What LLMs Can, and Still Can’t, Solve after SFT?” investigates the performance of LLMs on mathematical reasoning tasks. The research focuses on how supervised fine-tuning (SFT) affects these capabilities, according to the announcement. SFT means further training a pre-trained model on a task-specific dataset to improve its performance on that task. The team conducted a detailed analysis using AIME24, a benchmark known for its challenging competition-level math problems, to understand how reasoning capabilities evolve in these models. The study categorized questions into four difficulty tiers: Easy, Medium, Hard, and Extremely Hard (Exh).
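At its core, SFT is just ordinary supervised training that resumes from pre-trained weights rather than starting from scratch. The toy sketch below illustrates the idea with a simple softmax classifier (a stand-in, not the study's actual models or data): we "pre-train" on broad, noisy labels, then fine-tune the same weights on a smaller, cleaner task-specific set.

```python
import numpy as np

# Toy illustration of supervised fine-tuning (SFT): start from a
# "pre-trained" linear model, then continue training it on a small,
# task-specific dataset. Schematic only; not the paper's setup.

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train(W, X, y, lr=0.5, steps=200):
    """Plain cross-entropy gradient descent. SFT is simply more of
    this, applied to a narrower dataset from pre-trained weights."""
    n, k = X.shape[0], W.shape[1]
    Y = np.eye(k)[y]                      # one-hot targets
    for _ in range(steps):
        P = softmax(X @ W)
        W -= lr * X.T @ (P - Y) / n      # gradient of mean cross-entropy
    return W

def accuracy(W, X, y):
    return float((np.argmax(X @ W, axis=1) == y).mean())

# "Pre-training" data: broad but noisily labeled.
d, k = 8, 3
W_true = rng.normal(size=(d, k))
X_pre = rng.normal(size=(500, d))
y_pre = np.argmax(X_pre @ W_true + rng.normal(scale=2.0, size=(500, k)), axis=1)

# Task-specific "SFT" data: cleaner labels for the target task.
X_task = rng.normal(size=(200, d))
y_task = np.argmax(X_task @ W_true, axis=1)

W = train(np.zeros((d, k)), X_pre, y_pre)   # "pre-training"
acc_before = accuracy(W, X_task, y_task)
W = train(W, X_task, y_task)                # supervised fine-tuning
acc_after = accuracy(W, X_task, y_task)

print(f"task accuracy before SFT: {acc_before:.2f}, after SFT: {acc_after:.2f}")
```

The mechanics are the same for LLMs, just at vastly larger scale and with next-token prediction in place of classification.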

Why This Matters to You

This research provides a clearer picture of AI’s current mathematical prowess and helps us set realistic expectations for the AI tools we use daily. Imagine you’re using an AI for data analysis: knowing its strengths and weaknesses in reasoning is crucial for trusting its output. The study identifies what LLMs need in order to advance between problem tiers. For instance, moving from Easy to Medium problems requires adopting an R1 reasoning style (a long, step-by-step chain of thought), and this takes minimal SFT, specifically 500-1K instances, the paper states. Hard-level questions reveal a different story. Models frequently make errors at each step of the reasoning chain, and their accuracy improves only logarithmically with more training data, plateauing at around 65%. What does this tell you about the future of AI in complex problem-solving?

Key Findings on Problem Tiers:

  • Easy to Medium Tier: Requires R1 reasoning style with minimal SFT (500-1K instances).
  • Hard Tier: Models suffer from frequent errors at each reasoning step; accuracy plateaus at ~65%.
  • Extremely Hard (Exh) Tier: Demands unconventional problem-solving skills, which current models struggle with uniformly.

Study author Yiyou Sun and the team report that “progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances).” This highlights how efficient targeted training can be for certain skills. The plateau at the Hard tier, however, points to a fundamental limitation that persists despite increased training data. Your AI might be good at routine tasks but struggle with novel, multi-step challenges.
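To get an intuition for why such a plateau bites, consider a toy saturating curve (my own illustrative assumption, not the paper's fitted model): each doubling of training data closes only a fixed fraction of the remaining gap to an accuracy ceiling, so gains shrink rapidly as the ceiling is approached.

```python
import math

CEILING = 0.65  # hypothetical accuracy ceiling for Hard-tier problems

def toy_accuracy(n_examples, ceiling=CEILING, per_doubling=0.08):
    """Toy saturating curve: each doubling of the dataset closes a
    fixed fraction of the remaining gap to the ceiling. Illustrative
    only; not the study's fitted scaling law."""
    doublings = math.log2(n_examples)
    return ceiling * (1 - (1 - per_doubling) ** doublings)

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9} examples -> accuracy {toy_accuracy(n):.3f}")
```

Under this toy model, going from 100K to 1M examples buys noticeably less than going from 1K to 10K did, and no amount of data pushes past the ceiling, mirroring the diminishing returns the study describes for Hard-tier questions.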

The Surprising Finding

Here’s the twist: you might assume that a smaller, more carefully selected training set would always be better. However, the study uncovered the opposite. Carefully curated small-scale datasets offer limited advantage, and scaling dataset size proves far more effective for improving LLM performance, as mentioned in the release. This challenges the common assumption that quality always trumps quantity in AI training. For Extremely Hard problems, current models uniformly struggle: these problems require unconventional problem-solving skills, the team revealed. This suggests that simply providing more examples of complex problems isn’t enough; the models need to develop a different kind of intelligence.

What Happens Next

This research offers a clear roadmap for advancing LLM capabilities in mathematical reasoning. Future efforts will likely focus on developing new architectural approaches. These approaches must address the challenges posed by Hard and Extremely Hard problems. We might see new models emerge in the next 12-18 months. These models could potentially break through the 65% accuracy plateau. For example, imagine a specialized AI tutor that can genuinely guide students through calculus. This would require the AI to understand and apply unconventional problem-solving. Developers should prioritize scaling dataset size over overly meticulous small-scale curation, according to the study. As a user, you can expect AI tools to become more reliable for routine calculations. However, for truly novel or complex mathematical tasks, human oversight will remain crucial for the foreseeable future. The industry will continue to explore methods beyond SFT. These methods aim to foster genuine, multi-step reasoning in LLMs.
