Why You Care
What if a decades-old technique could evaluate AI models as well as, and far more cheaply than, the latest, most advanced AI? This isn’t science fiction. New research reveals that an older method, Natural Language Inference (NLI), can surprisingly match the performance of models like GPT-4o for evaluating Large Language Models (LLMs). This discovery could drastically cut costs and speed up development for anyone working with AI. Imagine the resources you could save while maintaining high accuracy.
What Actually Happened
Evaluating the quality of answers from LLMs has been a persistent challenge, according to the announcement. Traditional lexical metrics often miss the subtle semantic nuances in AI-generated text. Meanwhile, the “LLM-as-Judge” approach, where one LLM evaluates another, is computationally expensive. Researchers Sai Shridhar Balamurali and Lu Cheng revisited Natural Language Inference (NLI), a lightweight alternative. NLI is a technique that determines whether one sentence logically entails, contradicts, or is neutral toward another. They augmented this method with a simple lexical-match flag. The study found that this technique, despite its age, matches GPT-4o’s accuracy on long-form question answering while requiring significantly fewer parameters, making it a much more efficient option, the paper states.
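To make the idea concrete, here is a minimal sketch of this style of evaluation: an NLI entailment check combined with a lexical-match flag. The paper does not specify its exact model, normalization, or thresholds, so the `nli_entails` stub, the `judge` logic, and the substring-based flag below are illustrative assumptions; in practice the stub would be replaced by an off-the-shelf entailment classifier.

```python
# Hedged sketch: NLI-based answer scoring augmented with a lexical-match
# flag. The NLI step is a placeholder; the specific matching rule is an
# assumption for illustration, not the paper's exact recipe.

def lexical_match(candidate: str, reference: str) -> bool:
    """Flag fires when the reference answer appears verbatim
    (case- and whitespace-insensitive) inside the candidate text."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(reference) in norm(candidate)

def nli_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for an NLI model returning True when the premise
    entails the hypothesis. Swap in a real entailment classifier."""
    raise NotImplementedError

def judge(candidate: str, reference: str, entails=None) -> bool:
    """Score the candidate correct if the lexical-match flag fires,
    or if the NLI model judges it to entail the reference."""
    entails = entails or nli_entails
    if lexical_match(candidate, reference):
        return True
    return entails(candidate, reference)

# Usage with a trivial stand-in for the NLI model:
always_no = lambda p, h: False
print(judge("The capital of France is Paris.", "Paris", always_no))  # True via lexical flag
print(judge("It is a large city.", "Paris", always_no))              # False
```

The appeal of this design is that the heavy component is a small entailment model rather than a frontier LLM, and the cheap lexical flag catches exact-answer cases the NLI model might miss.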
Why This Matters to You
This research has significant implications for anyone developing or deploying LLMs. You can now achieve high-quality evaluation without the hefty computational price tag. Think of it as getting comparable results from a far cheaper tool. For example, if you’re a small startup building an AI chatbot, you no longer need to dedicate massive computing resources or budget to evaluating your model’s responses. This allows you to iterate faster and more affordably.
What’s more, the team introduced DIVER-QA, a new human-annotated benchmark. This benchmark includes 3000 samples and spans five question-answering datasets and five candidate LLMs. It rigorously tests how well these evaluation metrics align with human judgment. The results highlight that inexpensive NLI-based evaluation remains competitive. The researchers offer DIVER-QA as an open resource for future metric research, as mentioned in the release. What impact could this have on your AI development timeline and budget?
As the abstract states, “evaluating answers from large language models (LLMs) is challenging: lexical metrics miss semantic nuances, whereas ‘LLM-as-Judge’ scoring is computationally expensive.” This new approach directly addresses both challenges, offering a practical option for your projects.
The Surprising Finding
Here’s the twist: a technique that is decades old can perform as well as the most advanced AI models on a specific task. The study finds that NLI-based scoring, augmented with a simple lexical-match flag, achieves 89.9% accuracy on long-form question answering, matching GPT-4o’s performance. The surprising part is that NLI requires orders-of-magnitude fewer parameters. This challenges the common assumption that more complex problems always demand more complex, resource-intensive solutions. We often believe that the latest AI must be the best tool for every task. However, this research shows that sometimes, simpler, older methods can be incredibly effective and efficient. It forces us to reconsider what the best approach to AI evaluation really is.
What Happens Next
The future will likely see increased adoption of NLI-based evaluation methods by developers and researchers, especially those needing cost-effective solutions. We can expect to see new tools and libraries integrating this approach over the next few months. For example, an independent developer building an open-source LLM could use NLI to quickly assess their model’s performance, avoiding the high costs of commercial API calls. The DIVER-QA benchmark, now openly available, will also spur further research into human-aligned metrics. This could lead to more nuanced and reliable ways to evaluate AI. The industry implication is a move toward more accessible and sustainable AI development. This shift could democratize access to AI evaluation. Your next project might just become significantly cheaper to test and refine, thanks to this insight.
