New AI Benchmarks Improve Machine Translation

Researchers unveil RAGtrans, a new benchmark and method to enhance LLMs for translation using unstructured knowledge.

A new paper introduces RAGtrans, the first benchmark for retrieval-augmented machine translation (MT) using unstructured documents. This research also proposes a multi-task training method that significantly boosts LLM translation accuracy, addressing a key challenge in global communication.

By Katie Rowan

September 2, 2025

5 min read

Key Facts

  • RAGtrans is the first benchmark for retrieval-augmented machine translation using unstructured documents.
  • RAGtrans contains 169,000 MT samples collected via GPT-4o and human translators.
  • A multi-task training method was proposed to teach LLMs to use multilingual document information.
  • The method improves LLMs by 1.6-3.1 BLEU and 1.0-2.0 COMET scores in En-Zh.
  • It also improves LLMs by 1.7-2.9 BLEU and 2.1-2.7 COMET scores in En-De.

Why You Care

Ever struggled with a language barrier online? Imagine trying to understand a complex technical document or a foreign news article. What if AI could translate it with near-human accuracy, even pulling in context from vast, unstructured information sources? This new research on Retrieval-Augmented Machine Translation (RAMT) directly addresses that challenge. It promises to make your digital interactions across languages much smoother.

What Actually Happened

Researchers Jiaan Wang, Fandong Meng, Yingxue Zhang, and Jie Zhou have unveiled a significant advancement in machine translation. They’ve introduced a new benchmark called RAGtrans, according to the announcement. This benchmark is specifically designed to train and evaluate large language models (LLMs) on their ability to perform retrieval-augmented machine translation (RAMT). In RAMT, LLMs use additional, retrieved information to improve their translations. Previously, this often meant using paired translation examples or structured knowledge graphs. However, a large amount of crucial information exists in unstructured documents, like web pages or PDFs. The paper states that RAGtrans is the first benchmark to tackle this challenge directly. It contains 169,000 machine translation samples, gathered using GPT-4o and human translators. It also includes documents in diverse languages to provide the knowledge needed for these samples. The team further proposed a multi-task training method. This method teaches LLMs to use information from multilingual documents during translation. It uses existing multilingual corpora to create auxiliary training objectives, as detailed in the paper.

Why This Matters to You

This development has practical implications for anyone who interacts with content in multiple languages. Think about your daily browsing or work tasks. You might encounter information that’s only available in a foreign language. This new approach could make those translations much more accurate and contextually rich. For example, imagine you are researching a niche topic. The best information might be scattered across various forums and documents in different languages. Current translation tools might miss subtle nuances. However, an LLM trained with RAGtrans could pull in relevant details from those unstructured sources, providing a far more comprehensive translation.
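To make the retrieval-augmented setup concrete, here is a minimal, illustrative sketch of how retrieved unstructured documents could be folded into a translation prompt. This is not the paper's actual prompt or method; the function name and wording are hypothetical, and it only shows the general pattern of prepending retrieved context before a translation request.

```python
def build_ramt_prompt(source_sentence, retrieved_docs,
                      src_lang="English", tgt_lang="Chinese"):
    """Hypothetical sketch of a retrieval-augmented translation prompt.

    Retrieved unstructured documents (possibly in several languages)
    are prepended as background context before the sentence to translate.
    """
    # Number each retrieved document so the model can reference them.
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(retrieved_docs)
    )
    return (
        f"Use the background documents below (they may be in any language) "
        f"to translate the sentence from {src_lang} to {tgt_lang}.\n\n"
        f"{context}\n\n"
        f"Sentence: {source_sentence}\nTranslation:"
    )


prompt = build_ramt_prompt(
    "The treaty was ratified in 1848.",
    ["Der Vertrag von Guadalupe Hidalgo beendete den Krieg im Jahr 1848."],
    tgt_lang="German",
)
print(prompt)
```

The key design point is that the retrieved documents need not be translations of the source sentence; the model must extract relevant context (names, dates, terminology) from unpaired text.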

The research shows significant improvements in translation quality. The method improves LLMs by 1.6-3.1 BLEU and 1.0-2.0 COMET scores in English-Chinese translation. For English-German, the improvements are 1.7-2.9 BLEU and 2.1-2.7 COMET scores. These scores are standard metrics for evaluating translation quality. BLEU (Bilingual Evaluation Understudy) measures the similarity of a machine-translated text to a set of high-quality reference translations. COMET is a more recent, neural-network-based metric that often correlates better with human judgment. The higher these scores, the better the translation quality. This means you can expect more reliable translations. How much better could your international communication become with these advancements?
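For readers unfamiliar with BLEU, the sketch below shows the core idea behind the metric: modified n-gram precision combined with a brevity penalty. Real evaluations use a standard implementation such as sacrebleu; this simplified single-sentence version (with one reference and no smoothing) is only for illustration.

```python
import math
from collections import Counter


def simple_bleu(hypothesis, reference, max_n=4):
    """Simplified BLEU for one hypothesis/reference pair.

    Computes n-gram precisions for n = 1..max_n, takes their geometric
    mean, and applies a brevity penalty for short hypotheses. Real
    toolkits (e.g. sacrebleu) add smoothing, tokenization rules, and
    corpus-level aggregation.
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # "Modified" precision: each reference n-gram is credited at most once.
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # no smoothing: any zero precision zeroes the score
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * geo_mean
```

A perfect match scores 100, so a reported gain of, say, 1.6-3.1 BLEU means the system's n-gram overlap with human references measurably increased. COMET, by contrast, is a learned metric and cannot be reproduced in a few lines.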

One of the researchers, Jiaan Wang, highlighted a key aspect of their work. They stated, “In machine translation (MT), previous work typically retrieves in-context examples from paired MT corpora, or domain-specific knowledge from knowledge graphs, to enhance MT models.” This new research expands beyond those traditional methods. It embraces the vast amount of ‘world knowledge’ found in unstructured documents. This is a crucial step forward. It allows LLMs to access a much richer pool of information when translating.

The Surprising Finding

Here’s an interesting twist: the researchers also analyzed the essential difficulties that current LLMs face with this task. Despite the significant improvements, challenges remain. You might assume that simply giving an LLM more data automatically solves all translation problems. However, the study finds that handling unstructured knowledge effectively is complex and requires specialized training. The paper explains that unstructured documents might not be fully paired across different languages. This means the AI cannot simply look for a direct translation of a sentence. It must understand the context from the unstructured text in one language and apply that understanding to generate a translation in another. This is more difficult than working with neatly organized, pre-translated data. It challenges the common assumption that more data alone is sufficient for AI performance, and it highlights the need for specialized methods to interpret and apply this knowledge.

What Happens Next

This research, presented at EMNLP 2025 Findings, suggests exciting future developments. We could see these techniques integrated into commercial translation tools within the next 12 to 18 months. Imagine a scenario where your favorite translation app can pull up relevant articles or historical data to provide a more accurate translation of a news report. For example, if you are reading about a historical event, the AI could access historical archives to ensure correct terminology and context. This would be a significant leap from current word-for-word translation. For content creators and podcasters, this means better access to global audiences. Your content could be translated with greater fidelity, preserving nuances that might otherwise be lost. Developers in the AI industry should consider adopting these multi-task training methods, and explore integrating unstructured knowledge retrieval into their LLM-based translation services. The team revealed that their method uses existing multilingual corpora to create auxiliary training objectives without additional labeling requirements, making it a cost-effective approach for further development. This approach could lead to more robust and accurate machine translation systems across various industries.
