M-Prometheus: AI Judges Go Multilingual, Boosting Global LLMs

New open-weight LLM judges promise better evaluation and development for non-English AI.

A new suite of open-weight AI judges, M-Prometheus, has been released to improve how large language models (LLMs) are evaluated in multiple languages. This development addresses a significant gap, as most existing LLM judges are optimized only for English. M-Prometheus offers better performance across more than 20 languages, helping to advance global AI capabilities.

By Sarah Kline

November 1, 2025

3 min read

Key Facts

  • M-Prometheus is a suite of open-weight LLM judges ranging from 3B to 14B parameters.
  • These judges provide direct assessment and pairwise comparison feedback on multilingual outputs.
  • M-Prometheus models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning over 20 languages.
  • They also improve literary machine translation evaluation covering 4 language pairs.
  • Key factors for effective multilingual judges include backbone model selection and training on synthetic multilingual feedback data.

Why You Care

Ever wondered if your favorite AI assistant truly understands you, especially if you speak a language other than English? Many AI evaluation tools fall short outside of English, creating a significant quality gap. A new release, M-Prometheus, aims to fix this. This collection of open multilingual LLM judges could dramatically improve how AI understands and responds in diverse languages. Why should you care? Because it means better, more inclusive AI experiences for everyone, including you.

What Actually Happened

Researchers have introduced M-Prometheus, a new collection of open-weight LLM judges. These judges range in size from 3 billion to 14 billion parameters, as detailed in the blog post. Their primary goal is to provide both direct assessment and pairwise comparison feedback on multilingual outputs. This addresses a critical issue: most existing LLM judges are optimized exclusively for English, according to the announcement. That English-only focus has hindered the development of better multilingual capabilities in AI models. M-Prometheus aims to bridge this gap, offering a more equitable evaluation method for non-English languages.
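
For developers who want a concrete picture of what "direct assessment" looks like in practice, here is a minimal sketch using Hugging Face transformers. The repository ID and the prompt fields below are assumptions for illustration, not the official release's exact interface; consult the project's own documentation for the real grading template.

```python
# Minimal sketch: asking an M-Prometheus judge for a direct assessment score
# via Hugging Face transformers. The repository ID and prompt layout are
# assumptions for illustration, not the official interface.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Unbabel/M-Prometheus-7B"  # assumed repo name; check the release

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Prometheus-style judges take an instruction, the response to grade, and a
# scoring rubric, then produce written feedback plus a 1-5 score.
prompt = (
    "###Task Description: Grade the response from 1 to 5 using the rubric.\n"
    "###Instruction: Résume cet article en trois phrases.\n"
    "###Response: <the model output being evaluated>\n"
    "###Rubric: 5 = faithful, fluent French summary; 1 = off-topic or wrong language.\n"
    "###Feedback:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```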

Why This Matters to You

Imagine you’re a content creator trying to reach a global audience. You rely on AI tools for translation or content generation. If those tools are evaluated effectively only in English, their quality in other languages might suffer. M-Prometheus directly tackles this challenge. The research shows these models outperform state-of-the-art open LLM judges on multilingual reward benchmarks spanning more than 20 languages, as mentioned in the release. They also excel in literary machine translation (MT) evaluation across 4 language pairs.

This means the AI tools you use could soon become much more accurate and nuanced, regardless of the language. For example, consider a podcast producer who needs to generate summaries in Spanish, French, and Japanese. With M-Prometheus, the AI judging the quality of these summaries will be far more reliable. This directly leads to higher quality output for your audience. How might more accurate multilingual AI impact your daily digital interactions?
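
The pairwise comparison mode works the same way in spirit: the judge sees one instruction and two candidate outputs and is asked to name the better one. Here is a hedged sketch of building such a prompt; the section headers are a guess at a Prometheus-style layout, not the official M-Prometheus template.

```python
# Illustrative pairwise-comparison prompt builder; the field layout is an
# assumed Prometheus-style format, not the official template.
def pairwise_prompt(instruction: str, response_a: str, response_b: str) -> str:
    return (
        "###Task Description: Decide which response is better (A or B) and explain why.\n"
        f"###Instruction: {instruction}\n"
        f"###Response A: {response_a}\n"
        f"###Response B: {response_b}\n"
        "###Verdict:"
    )

# e.g., comparing two Spanish podcast summaries from different systems
print(pairwise_prompt(
    "Resume este episodio del pódcast en un párrafo.",
    "Resumen candidato A ...",
    "Resumen candidato B ...",
))
```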

Here’s a quick look at the impact:

Benefit Area         | Previous State                        | M-Prometheus Impact
Evaluation Quality   | Mostly English-centric                | Superior performance across 20+ languages
Model Development    | Hindered for non-English LLMs         | Accelerates development of better multilingual models
User Experience      | Inconsistent for non-English speakers | More accurate and nuanced AI interactions
Translation Accuracy | Limited by English-focused judges     | Improved literary machine translation quality

The Surprising Finding

One interesting twist from the research is the identification of key factors for building an effective multilingual judge. The team found that backbone model selection is crucial. What’s more, training on synthetic multilingual feedback data proved more effective than relying on translated data. This challenges the common assumption that simply translating existing English evaluation data would suffice. Instead, creating data specifically designed for multilingual feedback yields better results. This finding could reshape how AI models are trained for global use, moving beyond simple translation to more culturally and linguistically appropriate data generation.
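
To make that distinction concrete, the records below contrast the two data recipes. The schema is invented for illustration; it is not the paper's actual data format.

```python
# Hypothetical training-record shapes contrasting the two data recipes.
# Neither schema comes from the paper; both are invented for illustration.

# Recipe 1: translate an existing English feedback record into the target
# language. Fluency quirks and English-centric critiques carry over.
translated_record = {
    "lang": "de",
    "instruction": "Erkläre Rekursion anhand eines Beispiels.",  # translated seed
    "response": "...",   # machine-translated English response
    "feedback": "...",   # machine-translated English critique
    "score": 4,
}

# Recipe 2: synthesize the record natively in the target language, so the
# response and critique reflect German fluency and conventions directly.
synthetic_record = {
    "lang": "de",
    "instruction": "Erkläre Rekursion anhand eines Beispiels.",
    "response": "...",   # generated directly in German
    "feedback": "...",   # critique written natively in German
    "score": 4,
}
```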

What Happens Next

Looking ahead, the M-Prometheus models offer clear utility for AI developers. They can be leveraged at decoding time to significantly improve generated outputs across all three evaluated languages, according to the announcement. This suggests that we could see improvements in multilingual AI applications in the coming months. Developers can access the models, training dataset, and code, fostering rapid integration and development. For example, a company developing a customer service chatbot could use M-Prometheus to ensure high-quality responses in multiple languages. This would allow them to serve a broader customer base more effectively.
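
One common way to use a judge "at decoding time" is best-of-N reranking: sample several candidate outputs and keep the one the judge scores highest. The helper below is a generic sketch of that pattern under that assumption, not the authors' specific decoding method; generate and judge_score are placeholder callables, with judge_score standing in for a direct-assessment call like the one sketched earlier.

```python
from typing import Callable, List

# Generic best-of-N reranking sketch. `generate` samples one candidate per
# call; `judge_score` wraps a judge's direct-assessment call. Both are
# placeholders for illustration, not a published API.
def best_of_n(
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    prompt: str,
    n: int = 8,
) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [judge_score(prompt, c) for c in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]  # highest-scoring wins
```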

The industry implications are substantial. This open-source release will likely accelerate the development of truly global large language models. It provides a standardized and effective way to measure their performance beyond English. This means you can expect future AI tools to be more capable and reliable across diverse linguistic contexts. The project aims to empower developers to create AI that truly understands and communicates with the world.
