New Framework Reveals LLMs Struggle with Financial Reasoning

FinEval-KR uncovers surprising limitations in large language models' financial knowledge and reasoning abilities.

A new evaluation framework, FinEval-KR, has been introduced to independently assess large language models' (LLMs) financial knowledge and reasoning. The research reveals that even top models struggle with applying knowledge, and specialized financial LLMs often underperform general models.

By Sarah Kline

November 8, 2025

4 min read


Key Facts

  • FinEval-KR is a new evaluation framework for Large Language Models (LLMs) in the financial domain.
  • It independently quantifies LLM knowledge and reasoning abilities using distinct scores.
  • A cognitive score based on Bloom's taxonomy analyzes reasoning across different cognitive levels.
  • A new open-source Chinese financial reasoning dataset covering 22 subfields was released.
  • The study found that specialized financial LLMs generally lag behind top general models.

Why You Care

Ever wondered if your AI assistant could truly manage your investments? Or perhaps accurately predict market trends? A new study suggests that large language models (LLMs), despite their impressive capabilities, still face significant hurdles in the complex world of finance. This research directly impacts how you might trust AI with your financial decisions. What if the AI you rely on for financial insights isn’t as smart as you think?

What Actually Happened

Researchers have unveiled FinEval-KR, a novel evaluation framework designed to rigorously test large language models (LLMs) in financial contexts. The framework aims to separate and measure an LLM’s financial knowledge from its reasoning ability, according to the announcement. Current evaluation methods often conflate these capabilities, making it hard to pinpoint weaknesses. FinEval-KR introduces distinct metrics: a knowledge score and a reasoning score. It also uses a cognitive score, based on Bloom’s taxonomy, to analyze reasoning across different cognitive levels. The team additionally released a new open-source Chinese financial reasoning dataset covering 22 subfields, supporting reproducible research and further advances in financial reasoning, as detailed in the blog post.
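To make the decoupling idea concrete, here is a minimal sketch of how separate knowledge, reasoning, and per-level cognitive scores could be computed over a set of graded answers. The data fields and function names are illustrative assumptions, not the paper’s actual implementation or API.

```python
# Hypothetical sketch of FinEval-KR-style decoupled scoring.
# Field and function names are assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class EvalResult:
    knowledge_correct: bool   # did the model recall the right financial fact?
    reasoning_correct: bool   # did it apply that fact through valid steps?
    bloom_level: int          # Bloom's taxonomy level of the task (1-6)

def score(results):
    """Return (knowledge score, reasoning score, cognitive score by Bloom level)."""
    n = len(results)
    knowledge_score = sum(r.knowledge_correct for r in results) / n
    reasoning_score = sum(r.reasoning_correct for r in results) / n
    # Cognitive score: reasoning accuracy broken down by Bloom level.
    cognitive = {}
    for level in sorted({r.bloom_level for r in results}):
        subset = [r for r in results if r.bloom_level == level]
        cognitive[level] = sum(r.reasoning_correct for r in subset) / len(subset)
    return knowledge_score, reasoning_score, cognitive
```

The point of the separation is visible in the output: a model can score high on knowledge while its reasoning score, especially at higher Bloom levels, lags behind.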

Why This Matters to You

This new framework offers crucial insight into the limitations of current LLMs. It helps us understand where these AI tools excel and where they fall short. For example, imagine you’re using an LLM to analyze complex financial reports. You might assume it understands everything. However, FinEval-KR shows that many models struggle to apply their knowledge effectively, which could lead to incorrect financial advice or flawed analyses. How confident are you now in an LLM’s ability to handle your money?

Key Findings from FinEval-KR:

  • LLM reasoning ability is a core factor influencing accuracy.
  • Higher-order cognitive ability also significantly impacts reasoning accuracy.
  • Top models still face a bottleneck with knowledge application.
  • Specialized financial LLMs generally lag behind top general models.

Shaoyu Dou, one of the authors, highlighted a key observation. “Large Language Models (LLMs) demonstrate significant potential but face challenges in complex financial reasoning tasks requiring both domain knowledge and reasoning,” the paper states. In other words, your financial AI might have the data but lack the deeper understanding needed for nuanced decisions. Understanding these limitations is vital for anyone using or developing financial AI applications.

The Surprising Finding

Here’s the twist: you might expect specialized financial LLMs to outperform general-purpose models in their own domain. The study found the opposite. Across multiple metrics, specialized financial LLMs generally lag behind the top general large models. This challenges the common assumption that specialization automatically yields superior performance, and it suggests that breadth of training data and general reasoning capability may matter more. The research indicates that LLM reasoning ability and higher-order cognitive ability are the core factors driving reasoning accuracy. For many in the AI community, this is counterintuitive: simply feeding an LLM more financial data isn’t enough, because the underlying reasoning capability plays the larger role.

What Happens Next

The introduction of FinEval-KR marks a significant step forward in evaluating LLMs for financial tasks. Researchers will likely use the framework to refine existing models and develop new ones, and we can expect improvements in LLM financial reasoning over the next 12-18 months. For instance, developers might focus on enhancing an LLM’s cognitive processing rather than just expanding its financial data, leading to more reliable AI tools for financial analysis and forecasting. If you are developing financial AI, consider integrating FinEval-KR into your testing protocols to identify and address key weaknesses. The industry implications are substantial, pushing toward more rigorous and transparent AI evaluation methods. In the paper’s words, the team proposes “FinEval-KR, a novel evaluation structure for decoupling and quantifying LLMs’ knowledge and reasoning abilities independently, proposing distinct knowledge score and reasoning score metrics.”
