New AI Model Nears Gemini 2.0 Flash in Exam Accuracy

Data-centric approach fine-tunes Vision Language Models for standardized tests.

New research shows that carefully curated data can significantly boost Vision Language Model performance on standardized exams. A fine-tuned Qwen-2.5VL-32B model achieved 78.6% accuracy on a new exam benchmark, nearly matching Google's Gemini 2.0 Flash.

By Katie Rowan

December 2, 2025

4 min read

Key Facts

  • Researchers developed a 161.4 million token multimodal dataset.
  • They fine-tuned Qwen-2.5VL-32B using this data and an optimized reasoning syntax (QMSA).
  • The model achieved 78.6% accuracy on the YKSUniform benchmark.
  • YKSUniform contains 1,854 multimodal exam questions across 309 curriculum topics.
  • The fine-tuned model's performance was only 1.0% below Gemini 2.0 Flash.

Why You Care

Ever wondered whether an AI could ace a tough exam designed for humans? Imagine a future where AI understands complex visual and textual information as well as, or even better than, a top student. This research marks a significant step toward that reality by showing how data quality drives model performance, and that approach could make AI tools much more reliable and accurate for everyday tasks.

What Actually Happened

Researchers Egemen Sert and Şeyda Ertekin have unveiled a new method for enhancing Vision Language Models (VLMs). According to the announcement, they focused on data-centric fine-tuning to improve how these models handle standardized exam questions: instead of just tweaking algorithms, their work highlights the power of high-quality training data. They compiled a 161.4 million token multimodal dataset of textbook questions, solutions, diagrams, and contextual materials, then fine-tuned the Qwen-2.5VL-32B model on this specialized data. The team reports that the resulting model achieved 78.6% accuracy on a new benchmark called YKSUniform, which contains 1,854 multimodal exam questions across 309 curriculum topics.
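For readers who want a concrete picture of what this kind of data-centric supervised fine-tuning can look like in practice, the sketch below formats one multimodal exam item as a chat-style training example and computes a standard next-token loss against the open-weight Qwen2.5-VL checkpoint on Hugging Face. It is a minimal illustration under assumptions, not the authors' pipeline: the model ID, message layout, and example content are placeholders, and the paper's actual data format, hyperparameters, and training setup are not described in the article.

```python
# Minimal sketch of supervised fine-tuning a VLM on one exam item.
# Assumes a transformers version with Qwen2.5-VL support; not the authors' actual pipeline.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-32B-Instruct"  # assumed open-weight checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# One hypothetical exam item: question text, a diagram, and a worked solution.
diagram = Image.open("triangle_diagram.png")  # placeholder path
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Question: Find the area of the shaded triangle."},
    ]},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Solution: base = 6, height = 4, area = 12. Answer: C"},
    ]},
]

# Render the conversation with the model's chat template, then tokenize text + image together.
text = processor.apply_chat_template(messages, tokenize=False)
inputs = processor(text=[text], images=[diagram], return_tensors="pt").to(model.device)

# Standard causal-LM SFT step: the model learns to predict every next token, including the solution.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # in a real run, this loss feeds an optimizer loop over the full dataset
```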

Why This Matters to You

This research offers practical implications for anyone using or developing AI. It demonstrates that the quality and composition of your training data are crucial. You might think AI progress is all about complex algorithms, but this study points to a different path. It suggests that focusing on data can yield remarkable results. Imagine you’re a content creator needing an AI to generate accurate descriptions for complex diagrams. This data-centric fine-tuning approach means future AI tools could provide much more precise and contextually aware outputs for your needs.

Key Findings on Data-Centric Fine-Tuning:

  • Dataset Size: 161.4 million tokens of multimodal data.
  • Model Used: Qwen-2.5VL-32B.
  • Benchmark: YKSUniform, featuring 1,854 questions.
  • Accuracy Achieved: 78.6% on YKSUniform.
  • Comparison: Only 1.0% below Gemini 2.0 Flash.

How might this approach change the way you interact with AI in education or professional settings? The researchers state, “Our results reveal that data composition and representational syntax play a decisive role in multimodal reasoning.” This means that how data is organized and presented to the AI is just as important as the data itself. This could lead to AI assistants that truly understand your specific domain.
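To make the point about representational syntax more tangible, here is a small, purely illustrative sketch of how the same exam item could be serialized in two different orders before tokenization. The article does not define QMSA, so the field names and orderings below are hypothetical; the point is only that the text a model is trained to emit, and the order of its parts, is itself a design choice.

```python
# Hypothetical illustration of "representational syntax": the same exam item, serialized two ways.
# QMSA's actual definition is not given in the article; these field names are made up.
item = {
    "question": "Find the area of the shaded triangle.",
    "material": "Diagram shows a right triangle with legs 6 and 4.",
    "solution": "Area = (1/2) * 6 * 4 = 12.",
    "answer": "C",
}

def answer_first(it: dict) -> str:
    # Trains the model to state the answer before any reasoning.
    return f"Answer: {it['answer']}\nJustification: {it['solution']}"

def reasoning_first(it: dict) -> str:
    # Trains the model to work through the material step by step, then commit to an answer.
    return (f"Material: {it['material']}\nQuestion: {it['question']}\n"
            f"Solution: {it['solution']}\nFinal answer: {it['answer']}")

print(answer_first(item))
print("---")
print(reasoning_first(item))
```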

The Surprising Finding

Here’s the twist: the research shows that supervised fine-tuning (SFT) with high-quality data can rival proprietary approaches. Many in the AI community assume that top performance comes from complex algorithmic advances, such as reinforcement learning. However, the study finds that a meticulously curated dataset and a focused fine-tuning strategy can achieve nearly equivalent results. Specifically, the model fine-tuned on their specialized dataset performed only 1.0% below Gemini 2.0 Flash. This is surprising because Gemini 2.0 Flash is a proprietary model from a tech giant, while their approach uses an open-weight model. It challenges the assumption that only massive, closed models can achieve top-tier performance on complex tasks, and it highlights the often-underestimated power of data quality over sheer model complexity.
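As a reminder of what supervised fine-tuning means mechanically, the snippet below shows the standard label-masking trick used in SFT: the loss is computed only on the solution tokens, while prompt tokens are masked with the conventional ignore index of -100. This is a generic sketch of the technique, not the paper's code; the token IDs and lengths are made up for illustration.

```python
# Generic SFT label masking: loss on answer tokens only, prompt positions masked with -100.
import torch
import torch.nn.functional as F

vocab_size = 32                      # toy vocabulary for illustration
prompt_len, answer_len = 5, 4
input_ids = torch.randint(0, vocab_size, (1, prompt_len + answer_len))

labels = input_ids.clone()
labels[:, :prompt_len] = -100        # ignore prompt positions in the loss

logits = torch.randn(1, prompt_len + answer_len, vocab_size)  # stand-in for model output

# Shift so position t predicts token t+1, as in causal language modeling.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
print(loss)
```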

What Happens Next

This work establishes a clear path for advancing open-weight Vision Language Models (VLMs). Expect more researchers to focus on creating high-quality, curriculum-grounded multimodal datasets in the coming months. Imagine, for example, educational platforms developing specialized datasets for specific subjects, such as physics or medical imaging; this could lead to highly specialized AI tutors or diagnostic tools by late 2026 or early 2027. Developers might also adopt data-centric fine-tuning as a standard practice for improving their AI applications. The team states that their work demonstrates that “carefully curated and curriculum-grounded multimodal data can elevate supervised fine-tuning to near [proprietary-level] performance.” For you, this means future AI tools will likely be more reliable and accurate, especially in niche domains. Look for more open-weight Vision Language Models to close the performance gap with proprietary solutions in the near future.
