Why You Care
Ever wonder if the ‘smart’ AI you interact with is truly intelligent, or just good at guessing? What if its impressive scores are a bit… inflated? New research reveals a surprising flaw in how we train and evaluate AI, especially vision-language models. This new approach could make your future AI interactions much more reliable. It could mean the difference between an AI truly understanding your complex requests and one simply picking the best-sounding option.
What Actually Happened
Researchers have unveiled a new framework called ReVeL (Rewrite and Verify by LLM), as detailed in their paper. It tackles a core problem in AI development: the limitations of multiple-choice question answering (MCQA). MCQA has been a popular method for both evaluating and fine-tuning multimodal language models, but the team found that its constrained output format can yield unreliable accuracy metrics. The options themselves may inadvertently provide exploitable signals to the AI, encouraging guessing behaviors during reinforcement fine-tuning (RFT), meaning the AI isn't truly learning.
The ReVeL framework rewrites these multiple-choice questions into open-form questions while keeping the answers verifiable whenever possible. It categorizes questions by answer type and then applies different rewriting and verification schemes accordingly, the paper states.
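The categorize-then-rewrite-and-verify flow described above can be sketched roughly in Python. Every function name, category label, and heuristic below is illustrative, not taken from the paper:

```python
# Illustrative ReVeL-style pipeline: categorize each MCQA item by answer
# type, rewrite it as an open-form question, and verify predictions with a
# type-appropriate checker. All names and heuristics here are hypothetical.

def categorize(answer: str) -> str:
    """Guess a coarse answer type so the right verifier can be applied."""
    if answer.replace(".", "", 1).isdigit():
        return "numeric"
    if answer.lower() in {"yes", "no"}:
        return "boolean"
    return "open_text"

def rewrite_question(question: str) -> str:
    """Keep only the question stem, discarding 'A) ... B) ...' options."""
    return question.split("\nA)")[0].strip()

def verify(prediction: str, gold: str, answer_type: str) -> bool:
    """Exact match for short answers; numeric tolerance for numbers.
    Free-form text would need an LLM judge, which is omitted here."""
    if answer_type == "numeric":
        return abs(float(prediction) - float(gold)) < 1e-6
    return prediction.strip().lower() == gold.strip().lower()

item = {"question": "What color is the car?\nA) red\nB) blue", "answer": "red"}
open_question = rewrite_question(item["question"])   # "What color is the car?"
answer_type = categorize(item["answer"])             # "open_text"
```

The key design point is that verification strategy follows answer type: numbers can be checked programmatically, while free text needs a more forgiving (or LLM-based) judge.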
Why This Matters to You
Imagine you’re using an AI assistant to describe an image. Do you want it to pick from a few pre-set options, or genuinely explain what it sees? ReVeL aims for the latter. The researchers report that for RFT they converted 20,000 MCQA examples and used GRPO to fine-tune Qwen2.5-VL models. Models trained on ReVeL-OpenQA matched MCQA accuracy on multiple-choice benchmarks while improving OpenQA accuracy by about six percentage points, indicating better data efficiency and richer reward signals than MCQA-based training.
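Because the rewritten answers stay verifiable, the reward for RFT can be a simple programmatic check rather than an option-letter match. Here is a minimal sketch of a group-relative reward in the spirit of GRPO; the exact reward shaping and normalization the authors used are not described here, so treat the details as assumptions:

```python
# Sketch of a verifiable, binary answer reward centered within a group of
# sampled completions, as group-relative methods like GRPO do. The exact
# reward and normalization are assumptions, not the paper's recipe.

def reward(completion: str, gold: str) -> float:
    """1.0 if the free-form answer matches the verified gold answer."""
    return 1.0 if completion.strip().lower() == gold.strip().lower() else 0.0

def group_advantages(completions: list[str], gold: str) -> list[float]:
    """Center rewards within one sampled group (mean-subtracted)."""
    rewards = [reward(c, gold) for c in completions]
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

advantages = group_advantages(["red", "blue", "red", "green"], "red")
# Correct answers get a positive advantage, wrong ones a negative one.
```

With open-form answers there are no option letters to pattern-match, so a correct reward can only come from actually producing the right answer.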
This means the AI you interact with could become much more capable. It will be able to handle complex, open-ended queries rather than just predefined choices. How much more trustworthy would your AI assistant be if it truly understood your requests?
Here’s how ReVeL improves AI:
| Feature | Traditional MCQA | ReVeL-OpenQA |
|---|---|---|
| Question Format | Constrained, multiple-choice | Open-ended, verifiable |
| Training Signal | Potentially exploitable options, guessing | Verifiable answers, genuine understanding |
| Evaluation | Prone to ‘score inflation’ | More accurate, reveals true capabilities |
| Efficiency | Less data efficient for deep understanding | More data efficient, better reward signals |
The Surprising Finding
Here’s the twist: The research shows that MCQA benchmarks can significantly overstate AI performance. When ReVeL was used for evaluation, it revealed up to 20 percentage points of score inflation in MCQA benchmarks relative to OpenQA. This means many of the impressive AI scores we’ve seen might not reflect true understanding; they might just show an AI’s ability to navigate constrained choices. This finding challenges the common assumption that high MCQA scores directly equate to AI intelligence.
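The inflation figure is simply the gap between the two scores. With hypothetical benchmark numbers (only the 20-point maximum gap comes from the paper, not these specific accuracies):

```python
# Hypothetical accuracies illustrating score inflation: the same model and
# questions, scored as multiple-choice vs. open-form.
mcqa_accuracy = 0.85    # assumed multiple-choice score
openqa_accuracy = 0.65  # assumed open-form score on the rewritten questions
inflation = (mcqa_accuracy - openqa_accuracy) * 100  # in percentage points
print(f"{inflation:.0f} percentage points of inflation")
```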
The team found that ReVeL not only improves judging accuracy but also reduces both cost and latency in the evaluation process, making it a more efficient and honest way to gauge AI capabilities.
What Happens Next
The researchers plan to release the code and data publicly, allowing other developers and researchers to adopt the ReVeL framework. We could see this method integrated into new AI model training within the next 6-12 months. Imagine future AI systems, like those powering your smart home devices or customer service chatbots, offering more nuanced and accurate responses. For example, a vision-language AI could provide detailed medical image analysis without being limited to a set of predefined diagnoses. Developers should consider incorporating open-ended question formats into their training pipelines to ensure their models are genuinely capable rather than simply good at picking options. This shift could lead to a new era of more reliable and truly capable AI applications across various industries.
