AI Search Gets Smarter with Multi-Level Synthetic Data

New research refines information retrieval by training AI with nuanced, generated data.

A new study introduces a method to train AI search systems using synthetic data that has multiple levels of relevance. This approach moves beyond simple 'relevant' or 'not relevant' labels. It significantly improves search performance, even without using real-world documents for initial training.

By Mark Ellison

November 5, 2025

4 min read


Key Facts

  • New research introduces a method for training AI information retrieval (IR) models using synthetic data with multiple levels of relevance.
  • This approach moves beyond traditional contrastive learning, which uses binary (positive/negative) relevance labels.
  • Large language models (LLMs) are used to generate synthetic documents that answer queries with graduated relevance.
  • The method utilizes Wasserstein distance as a loss function for training transformer-based retrievers.
  • Experiments on MS MARCO and BEIR benchmarks show significant performance improvements over conventional training, even without real documents.

Why You Care

Ever frustrated when a search engine misses the mark, even slightly? What if AI search could understand not just if something is relevant, but how relevant it truly is? This new research on information retrieval aims to do just that, making your future searches much more precise. It could mean finding exactly what you need, faster, and with less effort. Your digital life is about to get a serious upgrade.

What Actually Happened

A team of researchers, including Reza Esfandiarpoor and George Zerveas, has unveiled a novel approach to training AI models for information retrieval (IR). As detailed in the paper, they move past the traditional method of contrastive learning, which typically uses binary relevance labels: documents are either positive or negative. The problem, the team explains, is that this treats all non-positive documents as equally irrelevant, missing important subtleties in how relevant a document actually is. To address this, the researchers used large language models (LLMs) to generate synthetic documents designed to answer specific queries at multiple levels of relevance. What’s more, the paper introduces Wasserstein distance as a more effective loss function for training transformer-based retrievers with these graduated relevance labels. The result is an AI that learns a more nuanced understanding of document relevance.
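To picture the Wasserstein loss described above, consider a toy sketch. For distributions over an ordered list of candidate documents, the one-dimensional earth mover's distance reduces to the summed gap between cumulative distributions. The function name, labels, and scores below are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of a 1-D Wasserstein (earth mover's) distance between a
# model's predicted relevance distribution and a graded target.
# All names and numbers here are invented for illustration.

def wasserstein_1d(p, q):
    """Earth mover's distance between two discrete distributions
    defined on the same ordered support (here, ranked documents).
    For 1-D distributions this is the sum of |CDF_p - CDF_q|."""
    assert len(p) == len(q)
    cdf_p = cdf_q = 0.0
    dist = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        dist += abs(cdf_p - cdf_q)
    return dist

# Graded relevance labels (3 = perfectly relevant ... 0 = irrelevant),
# normalised into a target distribution the retriever should match.
labels = [3, 2, 1, 0]
target = [lab / sum(labels) for lab in labels]

# Model scores after a softmax (toy values).
predicted = [0.40, 0.30, 0.20, 0.10]

loss = wasserstein_1d(predicted, target)
```

Unlike a binary contrastive loss, a mismatch here is penalised in proportion to how far probability mass must move across the relevance grades, which is what lets graduated labels carry signal.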

Why This Matters to You

This isn’t just academic theory; it has direct implications for your everyday digital interactions. Imagine your favorite search engine or AI assistant understanding your query with far greater depth. Think of it as moving from a simple ‘yes/no’ answer to a nuanced scale of ‘perfectly relevant’ to ‘somewhat relevant’ to ‘not relevant at all.’

For example, if you search for “best coffee makers for cold brew,” a traditional AI might show you all coffee makers. This new approach, however, could prioritize those specifically designed for cold brew, then rank the others by how well they suit the task.
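The cold-brew example can be made concrete with a toy comparison of the two labeling schemes. The documents and grades below are invented for illustration; the point is that binary contrastive labels collapse every middle grade to zero, while multi-level labels keep the distinctions.

```python
# Invented graded labels for the "cold brew coffee maker" query:
# 3 = perfectly relevant ... 0 = irrelevant.
docs = {
    "dedicated cold-brew maker review": 3,
    "general coffee maker buying guide": 2,
    "espresso machine comparison": 1,
    "blender product page": 0,
}

# Binary contrastive labeling keeps only the top grade as "positive"
# and flattens everything else to "negative", losing the middle grades.
binary = {doc: int(grade == 3) for doc, grade in docs.items()}
```

Under the binary scheme the buying guide and the blender page receive the same label, even though one is clearly more useful than the other; that is exactly the subtlety the multi-level synthetic data is meant to preserve.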

Key Benefits of Multi-Level Synthetic Data Training:

  • Increased Search Precision: AI understands nuanced relevance, leading to better results.
  • Robustness to Data Shifts: Performance holds up even when query patterns change.
  • Reduced Reliance on Real Data: Synthetic data can jumpstart training without extensive human annotation.
  • Enhanced Integration: Existing real data can be seamlessly incorporated for further improvements.

How much better could your online research or content discovery become with this level of precision? The study finds that the proposed approach significantly outperforms conventional training, even without using any real documents initially.

The Surprising Finding

Here’s the twist: the most unexpected discovery is that this new method achieves superior results without using any real documents for its initial training. The team reports that their approach significantly improves self-supervised retrievers, in contrast to contrastive learning, which relies on real data. This finding challenges the common assumption that real-world data is always essential for initial model training. It suggests that well-designed synthetic data, especially with multi-level relevance, can be remarkably effective. The research also shows that the method is more robust to distribution shift, meaning it handles changes in data patterns better than traditional methods. “Without using any real documents, our method significantly improves self-supervised retrievers and is more robust to distribution shift compared to contrastive learning using real data,” the team writes.

What Happens Next

This research, presented at EMNLP 2025, points to a clear direction for the future of information retrieval. We can expect to see these multi-level relevance training techniques adopted in commercial search engines and AI assistants within the next 12-24 months. Imagine your personal AI assistant becoming much better at understanding the subtle intent behind your requests. For example, if you ask for “healthy dinner ideas,” it might not just list recipes. It could rank them based on specific dietary needs you’ve expressed previously, even if not explicitly stated in the current query. Content creators might find their work more accurately discovered by target audiences. This is because search algorithms will better understand the specific value of their content. The team revealed that generating multi-level ranking contexts is a better approach to synthetic data generation. This is especially true for IR compared to just generating standard positive and negative documents. This suggests a future where AI learns from richer, more human-like relevance signals from the start. Your interactions with AI will become more intuitive and effective.
