New Benchmark Reveals AI's Struggle with Expert Translation

DiscoX exposes significant gaps in large language models for specialized, discourse-level translation.

A new benchmark, DiscoX, highlights major shortcomings in large language models (LLMs) when translating complex, expert-level texts. Developed by a team of researchers, DiscoX evaluates discourse-level coherence and terminological precision in Chinese-English translation, revealing that even advanced LLMs fall short of human expert performance.

By Katie Rowan

November 29, 2025

4 min read

Key Facts

DiscoX is a new benchmark for discourse-level and expert-level Chinese-English translation.
It comprises 200 professionally-curated texts from 7 domains, averaging over 1700 tokens in length.
Metric-S is a new reference-free evaluation system developed for DiscoX, outperforming existing metrics.
Advanced large language models (LLMs) significantly underperform human experts on DiscoX tasks.
The research highlights persistent challenges in achieving professional-grade machine translation for expert domains.

Why You Care

Ever relied on an AI translator for something really important, like a legal document or a medical report? Did you ever wonder if it truly captured every nuance? A new study reveals that even the smartest AI models are struggling with complex, expert-level translations. This could impact how you access essential information across languages. What if your translated content misses crucial details?

This new research introduces DiscoX, a benchmark designed to test AI’s ability to translate specialized texts. The findings indicate a significant gap between AI and human experts in this crucial area, according to the paper. This matters because accurate translation is vital for sharing knowledge globally.

What Actually Happened

Researchers have unveiled DiscoX, a new benchmark specifically for evaluating discourse-level translation tasks in expert domains. This benchmark focuses on Chinese-English translation, according to the announcement. It addresses a critical limitation in current evaluation methods, which often prioritize segment-level accuracy over broader coherence. The team revealed that existing metrics do not adequately assess the complex requirements of specialized translation.

DiscoX includes 200 professionally-curated texts from seven different expert domains. These texts are substantial, with an average length exceeding 1700 tokens. To complement DiscoX, the researchers also developed Metric-S. This is a reference-free system for automatic assessment. Metric-S provides fine-grained evaluations across accuracy, fluency, and appropriateness, as mentioned in the release. It significantly outperforms older metrics in consistency with human judgments.
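To make "consistency with human judgments" concrete, here is a minimal Python sketch of one common way to check it: count how often an automatic metric ranks two translations in the same order as human raters do. This is an illustrative assumption, not the procedure the DiscoX authors describe; the function name pairwise_agreement and all scores below are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(metric_scores, human_scores):
    """Fraction of translation pairs that the metric orders the same way as humans.

    Pairs that are tied in either list are skipped.
    Hypothetical helper for illustration; not part of DiscoX or Metric-S.
    """
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")

    agree = 0
    total = 0
    for i, j in combinations(range(len(metric_scores)), 2):
        metric_diff = metric_scores[i] - metric_scores[j]
        human_diff = human_scores[i] - human_scores[j]
        if metric_diff == 0 or human_diff == 0:
            continue  # a tie gives no ordering to compare
        total += 1
        if (metric_diff > 0) == (human_diff > 0):
            agree += 1
    return agree / total if total else 0.0


# Made-up scores for four translations (illustrative only, not DiscoX data).
human = [90, 70, 50, 30]
metric = [85, 60, 65, 20]
print(pairwise_agreement(metric, human))
```

A value near 1.0 would mean the metric almost always prefers the same translation as the human raters, while a value near 0.5 would be no better than guessing.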

Why This Matters to You

Imagine you’re a scientist collaborating with international colleagues. You need to translate highly technical research papers. If the AI tool you use gets key terminology wrong or loses discourse-level coherence, your research could be misunderstood. This new benchmark directly addresses these challenges. It aims to ensure that AI translation tools can handle the complexity your work demands.

What’s more, the study finds that even the most advanced large language models (LLMs) still trail human experts on these tasks. This means that for critical, specialized content, human oversight remains essential. Do you trust AI with your most important cross-lingual communications yet?

Key Findings from DiscoX:

Evaluation Gap: Current methods neglect discourse-level coherence.
New Benchmark: DiscoX uses 200 expert texts from 7 domains.
Text Length: Average text length exceeds 1700 tokens.
AI Performance: LLMs still trail human experts.

For example, consider a medical diagnosis translated by an LLM. A slight misinterpretation of a medical term or a lack of coherence in a complex sentence could have serious consequences. Your ability to get accurate information depends on translation tools. The researchers state, “The evaluation of discourse-level translation in expert domains remains inadequate, despite its centrality to knowledge dissemination and cross-lingual scholarly communication.”

The Surprising Finding

Here’s the twist: despite all the progress in large language models, they still significantly underperform human experts in this specialized translation. This might seem counterintuitive given how capable LLMs appear to be. We often hear about AI’s capabilities in language. However, the research shows a substantial performance gap. This finding validates the difficulty of DiscoX.

It also underscores the persistent challenges in achieving professional-grade machine translation, and it calls into question the common assumption that general-purpose LLMs can handle any language task. The study finds that while LLMs excel at many things, the nuanced demands of expert domains are still a hurdle. Think of it as the difference between writing a casual email and drafting a patent application. Both are writing, but one requires far more precision.

What Happens Next

This new benchmark and evaluation system provide a framework for more rigorous assessment. We can expect to see AI developers use DiscoX to refine their translation models over the next 12-18 months. The industry implications are clear: AI companies will need to focus more on domain-specific training. This will improve their models’ ability to handle complex, specialized language.

For example, future LLMs might incorporate more expert knowledge during their training phases. This would help them understand intricate terminology and discourse structures. Our advice for you? If you work with highly specialized translations, continue to rely on human experts for final review. However, AI tools can still assist with initial drafts. The proposed benchmark and evaluation system will facilitate future advancements in LLM-based translation, the paper states. Expect to see more targeted improvements in AI translation capabilities in the coming years.
