Why You Care
Ever wonder if the ‘smart’ AI you interact with is truly intelligent, or just good at guessing? What if its impressive scores are a bit… inflated? New research reveals a surprising flaw in how we train and evaluate AI, especially vision-language models. This new approach could make your future AI interactions much more reliable. It could mean the difference between an AI truly understanding your complex requests and one simply picking the best-sounding option.
What Actually Happened
Researchers have unveiled a new framework called ReVeL (Rewrite and Verify by LLM), as detailed in their paper. It tackles a core problem in AI development: the limitations of multiple-choice question answering (MCQA). MCQA has been a popular method for both evaluating and fine-tuning multimodal language models, but the team found that its constrained output format can yield unreliable accuracy metrics. The options themselves may inadvertently provide exploitable signals to the AI, encouraging guessing behaviors during reinforcement fine-tuning (RFT), meaning the AI isn't truly learning.
The ReVeL framework rewrites these multiple-choice questions into open-form questions while keeping the answers verifiable whenever possible. It categorizes questions by answer type and then applies different rewriting and verification schemes accordingly, the paper states.
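The categorize-then-rewrite-and-verify flow described above can be sketched roughly in Python. Every function name, category label, and heuristic below is illustrative, not taken from the paper:

```python
# Illustrative ReVeL-style pipeline: categorize each MCQA item by answer
# type, rewrite it as an open-form question, and verify predictions with a
# type-appropriate checker. All names and heuristics here are hypothetical.

def categorize(answer: str) -> str:
    """Guess a coarse answer type so the right verifier can be applied."""
    if answer.replace(".", "", 1).isdigit():
        return "numeric"
    if answer.lower() in {"yes", "no"}:
        return "boolean"
    return "open_text"

def rewrite_question(question: str) -> str:
    """Keep only the question stem, discarding 'A) ... B) ...' options."""
    return question.split("\nA)")[0].strip()

def verify(prediction: str, gold: str, answer_type: str) -> bool:
    """Exact match for short answers; numeric tolerance for numbers.
    Free-form text would need an LLM judge, which is omitted here."""
    if answer_type == "numeric":
        return abs(float(prediction) - float(gold)) < 1e-6
    return prediction.strip().lower() == gold.strip().lower()

item = {"question": "What color is the car?\nA) red\nB) blue", "answer": "red"}
open_question = rewrite_question(item["question"])   # "What color is the car?"
answer_type = categorize(item["answer"])             # "open_text"
```

The key design point is that verification strategy follows answer type: numbers can be checked programmatically, while free text needs a more forgiving (or LLM-based) judge.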
Why This Matters to You
Imagine you’re using an AI assistant to describe an image. Do you want it to pick from a few pre-set options, or genuinely explain what it sees? ReVeL aims for the latter. The researchers report that for RFT they converted 20,000 MCQA examples and used GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA matched MCQA accuracy on multiple-choice benchmarks while improving OpenQA accuracy by about six percentage points, indicating better data efficiency and richer reward signals than MCQA-based training.
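Because the rewritten answers stay verifiable, the reward for RFT can be a simple programmatic check rather than an option-letter match. Here is a minimal sketch of a group-relative reward in the spirit of GRPO; the exact reward shaping and normalization the authors used are not described here, so treat the details as assumptions:

```python
# Sketch of a verifiable, binary answer reward centered within a group of
# sampled completions, as group-relative methods like GRPO do. The exact
# reward and normalization are assumptions, not the paper's recipe.

def reward(completion: str, gold: str) -> float:
    """1.0 if the free-form answer matches the verified gold answer."""
    return 1.0 if completion.strip().lower() == gold.strip().lower() else 0.0

def group_advantages(completions: list[str], gold: str) -> list[float]:
    """Center rewards within one sampled group (mean-subtracted)."""
    rewards = [reward(c, gold) for c in completions]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

advantages = group_advantages(["red", "blue", "red", "green"], "red")
# Correct answers get a positive advantage, wrong ones a negative one.
```

With open-form answers there are no option letters to pattern-match, so a correct reward can only come from actually producing the right answer.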
This means the AI you interact with could become much more capable. It will be able to handle complex, open-ended queries rather than just predefined choices. How much more trustworthy would your AI assistant be if it truly understood your requests?
Here’s how ReVeL improves AI:
| Feature | Traditional MCQA | ReVeL-OpenQA |
|---|---|---|
| Question Format | Constrained, multiple-choice | Open-ended, verifiable |
| Training Signal | Potentially exploitable options, guessing | Verifiable answers, genuine understanding |
| Evaluation | Prone to ‘score inflation’ | More accurate, reveals true capabilities |
| Efficiency | Less data efficient for deep understanding | More data efficient, better reward signals |
The Surprising Finding
Here’s the twist: The research shows that MCQA benchmarks can significantly overstate AI performance. When ReVeL was used for evaluation, it revealed up to 20 percentage points of score inflation in MCQA benchmarks relative to OpenQA. This means many of the impressive AI scores we’ve seen might not reflect true understanding; they might just show an AI’s ability to navigate constrained choices. This finding challenges the common assumption that high MCQA scores directly equate to AI intelligence.
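The inflation figure is simply the gap between the two scores. With hypothetical benchmark numbers (only the 20-point maximum gap comes from the paper, not these specific accuracies):

```python
# Hypothetical accuracies illustrating score inflation: the same model and
# questions, scored as multiple-choice vs. open-form.
mcqa_accuracy = 0.85    # assumed multiple-choice score
openqa_accuracy = 0.65  # assumed open-form score on the rewritten questions
inflation = (mcqa_accuracy - openqa_accuracy) * 100  # in percentage points
print(f"{inflation:.0f} percentage points of inflation")
```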
The team found that ReVeL not only improves judging accuracy but also reduces both cost and latency in the evaluation process, making it a more efficient and honest way to gauge AI capabilities.
What Happens Next
The researchers plan to release the code and data publicly, allowing other developers and researchers to adopt the ReVeL framework. We could see this method integrated into new AI model training within the next 6-12 months. Imagine future AI systems, like those powering your smart home devices or customer service chatbots, offering more nuanced and accurate responses. For example, a vision-language AI could provide detailed medical image analysis without being limited to a set of predefined diagnoses. Developers should consider incorporating open-ended question formats into their training pipelines to ensure their models are genuinely capable rather than simply good at picking options. This shift could lead to a new era of more reliable and truly capable AI applications across various industries.
