Why You Care
Ever wondered whether an AI could manage your investments better than a human? A new benchmark, QuantEval, puts large language models (LLMs) to a rigorous financial test. Its results could reshape how we view AI's role in finance, and how you interact with future AI-driven financial tools. What if your financial advisor were an LLM? How would you feel about that?
What Actually Happened
Researchers recently unveiled QuantEval, a comprehensive benchmark for evaluating large language models (LLMs) on financial quantitative tasks, according to the announcement. Rather than stopping at traditional knowledge-based question answering, the benchmark covers three key areas of quantitative finance: knowledge-based QA, quantitative mathematical reasoning, and quantitative strategy coding. The team revealed that QuantEval integrates a CTA-style backtesting structure that executes model-generated strategies and then evaluates them using financial performance metrics. This approach provides a more realistic assessment of an LLM's quantitative coding ability, the paper states, and marks a significant step forward in understanding AI's practical financial capabilities.
Why This Matters to You
QuantEval offers a much-needed, realistic assessment of AI in finance. It evaluates LLMs on practical applications, not just theoretical knowledge, so we can better understand how these models would perform in real trading scenarios. Imagine an AI that can not only answer financial questions but also develop and execute trading strategies; this benchmark measures how close we are. Consider, for example, an LLM tasked with identifying profitable trading patterns. QuantEval would not just check whether the model knows the patterns; it would actually run the strategies the model generates, providing a tangible measure of success or failure. "Unlike prior financial benchmarks, QuantEval integrates a CTA-style backtesting structure that executes model-generated strategies and evaluates them using financial performance metrics, enabling a more realistic assessment of quantitative coding ability," the team explained. How might this evaluation affect your trust in AI-driven financial advice?
Here’s a quick look at QuantEval’s core evaluation dimensions:
| Evaluation Dimension | Description |
| --- | --- |
| Knowledge-based QA | Assesses understanding of financial concepts and factual recall. |
| Quantitative Mathematical Reasoning | Tests the ability to apply mathematical principles to financial problems. |
| Quantitative Strategy Coding | Evaluates the LLM’s capacity to generate executable trading strategies, which are then assessed via backtesting. |
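To make the strategy-coding dimension concrete, here is a minimal sketch of what executing a model-generated strategy through a backtest and scoring it with financial performance metrics can look like. This is an illustrative toy, not QuantEval's actual harness: the `backtest` and `momentum` functions, the long/flat position model, and the cost assumption are all hypothetical.

```python
import math

def backtest(prices, signal_fn, cost=0.0005):
    """Run a deterministic long/flat backtest over a closing-price series.

    signal_fn maps the price history seen so far to a position in {0, 1}
    (flat or long); cost is a proportional fee per position change.
    Returns a dict of simple performance metrics.
    """
    position = 0
    equity = [1.0]
    for t in range(1, len(prices)):
        new_position = signal_fn(prices[:t])      # decide using data before bar t
        ret = prices[t] / prices[t - 1] - 1.0     # market return over the bar
        pnl = position * ret - cost * abs(new_position - position)
        equity.append(equity[-1] * (1.0 + pnl))
        position = new_position

    rets = [equity[i] / equity[i - 1] - 1.0 for i in range(1, len(equity))]
    mean = sum(rets) / len(rets)
    var = sum((r - mean) ** 2 for r in rets) / len(rets)
    sharpe = mean / math.sqrt(var) * math.sqrt(252) if var > 0 else 0.0

    peak, max_dd = equity[0], 0.0                 # track worst peak-to-trough loss
    for e in equity:
        peak = max(peak, e)
        max_dd = max(max_dd, 1.0 - e / peak)

    return {"total_return": equity[-1] - 1.0,
            "sharpe": sharpe,
            "max_drawdown": max_dd}

# A trivial stand-in for a "model-generated" strategy: go long when the
# latest price exceeds its 3-bar average, otherwise stay flat.
def momentum(history):
    if len(history) < 3:
        return 0
    return 1 if history[-1] > sum(history[-3:]) / 3 else 0

metrics = backtest([100, 101, 103, 102, 105, 107, 106, 110], momentum)
```

The point of the design is that the strategy is judged by metrics of the equity curve it produces (return, Sharpe ratio, maximum drawdown), not by whether its source code looks plausible, which is what distinguishes this style of evaluation from knowledge-only QA.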
The Surprising Finding
Despite the impressive capabilities of LLMs in many domains, QuantEval revealed a surprising result: substantial gaps between both open-source and proprietary LLMs and human experts, particularly in reasoning and strategy coding, the study finds. Many might assume that LLMs, with their vast training data, would excel across all financial tasks. However, the technical report explains that while LLMs are strong in knowledge-centric question answering, their performance drops significantly on complex quantitative mathematical reasoning and, crucially, on strategy coding. This challenges the common assumption that general-purpose LLMs are inherently proficient in highly specialized, real-world financial applications.
What Happens Next
The introduction of QuantEval is expected to accelerate research into the quantitative finance capabilities of LLMs, and the team hopes it will foster practical adoption in real-world trading workflows, as mentioned in the release. Expect new LLM development specifically targeting the identified weaknesses: over the next 12–18 months, look for models fine-tuned on domain-aligned data, since the current research already demonstrates consistent improvements from supervised fine-tuning and reinforcement learning experiments. A financial institution might, for example, use QuantEval to rigorously test a new AI trading assistant against performance benchmarks before deployment. The release also includes the full deterministic backtesting configuration, which guarantees strict reproducibility and will allow other researchers to build on these findings. What specific financial tasks could you see benefiting most from these improved LLMs in the near future?
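As a rough illustration of why a deterministic backtesting configuration matters for reproducibility, here is the kind of parameter set such a configuration has to pin down so that two labs running the same strategy get identical metrics. The field names and values below are purely hypothetical, not QuantEval's actual schema.

```python
# Illustrative sketch of a deterministic backtest configuration.
# All names and values are assumptions for exposition, not QuantEval's schema.
BACKTEST_CONFIG = {
    "random_seed": 42,             # fixes any stochastic components of the run
    "data_start": "2015-01-01",    # exact evaluation window, no "latest data"
    "data_end": "2023-12-31",
    "bar_frequency": "1d",
    "transaction_cost_bps": 5,     # fee and slippage models held constant
    "slippage_bps": 2,
    "initial_capital": 1_000_000,
    "metrics": ["annualized_return", "sharpe_ratio", "max_drawdown"],
}

def config_fingerprint(config):
    """Stable fingerprint of a config: two runs with the same fingerprint
    should produce the same backtest metrics."""
    return tuple(sorted((k, str(v)) for k, v in config.items()))
```

Freezing the data window, cost model, and seed removes the usual sources of run-to-run variance, which is what makes a reported leaderboard number checkable by outside researchers.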
