New AI Benchmark Reveals Global Agent Performance Gaps

A new multilingual benchmark, MAPS, highlights significant disparities in AI agent capabilities and security across different languages.

A recent study introduces MAPS, a new benchmark designed to evaluate AI agent performance and security in multilingual environments. The research indicates that AI agents, particularly those built on Large Language Models (LLMs), often perform less reliably and present increased security risks when operating in languages other than English, raising concerns about global accessibility and equitable AI development.

By Sarah Kline

August 16, 2025

5 min read

New AI Benchmark Reveals Global Agent Performance Gaps

Why You Care

If you're a podcaster reaching a global audience, a content creator localizing your material, or an AI enthusiast eager for reliable, universally accessible tools, then new research revealing a significant gap in AI agent performance across languages should grab your attention. This isn't just about translation; it’s about the fundamental reliability and security of AI tools when they step outside their English-centric comfort zone.

What Actually Happened

A recent paper, `arXiv:2505.15935`, introduces MAPS (Multilingual Benchmark for Global Agent Performance and Security), a new evaluation structure designed to rigorously test the capabilities of AI agents in various linguistic settings. The authors, including Omer Hofman and Jonathan Brokman, developed MAPS to address a essential blind spot in current AI creation: the performance and safety of agentic AI systems when interacting in languages other than English. According to the abstract, "Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly complex in capability and scope." However, the research highlights that since "LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety," these agentic systems risk inheriting those limitations. This means that an AI agent that performs flawlessly in English might stumble or even pose security risks when asked to complete the same task in Spanish, Japanese, or Arabic.

The core of the problem, as the study points out, is that many AI models are primarily trained on English datasets. When these models are then used to power AI agents – systems that can plan, use tools, and maintain memory – their inherent biases and limitations from the training data become amplified in multilingual scenarios. The researchers behind MAPS aimed to provide a standardized way to measure these discrepancies, offering a clearer picture of where the challenges lie for truly global AI applications.

Why This Matters to You

For content creators, podcasters, and anyone leveraging AI for global reach, the implications of the MAPS benchmark are prompt and practical. If you're relying on AI tools for transcription, translation, content generation, or even automated customer service in multiple languages, this research suggests you might be getting an inconsistent, and potentially less secure, experience. The study explicitly states that "users interacting in languages other than English may encounter unreliable or security-essential agent behavior." This means that the AI-powered transcription service you use for your podcast might produce less accurate transcripts for your non-English episodes, or an AI content generator might create less coherent or culturally appropriate text when prompted in a language other than English.

Consider a scenario where an AI agent is tasked with summarizing feedback from an international audience. If that feedback is in various languages, the agent's performance could vary wildly, leading to incomplete or skewed insights. For podcasters aiming for global accessibility, this could mean that automated show notes or summaries generated by AI might be significantly less useful for non-English listeners. For AI enthusiasts, it underscores the need for more diverse training data and more reliable multilingual architectures, rather than simply relying on English-centric models with a translation layer. It highlights a essential barrier to truly equitable AI access and performance worldwide.

The Surprising Finding

Perhaps the most surprising finding, though not explicitly detailed in the abstract, is the sheer scale of the performance degradation and security vulnerability when AI agents operate outside of English. While it's generally understood that LLMs can struggle with multilingualism, the research suggests this struggle is not just about minor inaccuracies but can lead to "security-essential agent behavior." This goes beyond mere performance issues and delves into the realm of potential misuse or vulnerabilities. It implies that an AI agent designed to, say, manage personal data or financial transactions might be more susceptible to prompt injection attacks or data leaks when operating in a less-supported language, simply because its underlying language model is less reliable in that context. This elevates the concern from a quality-of-service issue to a potential risk management problem, something that developers and users alike need to take seriously.

What Happens Next

The introduction of the MAPS benchmark marks a crucial step towards more reliable and secure AI agents across all languages. The researchers have provided a standardized tool for evaluation, which is vital for driving progress. Moving forward, we can expect developers of agentic AI systems to increasingly leverage benchmarks like MAPS to identify and mitigate these multilingual performance and security gaps. This will likely lead to more focused research on multilingual LLM architectures and the creation of AI agents specifically designed with global linguistic diversity in mind, rather than as an afterthought. We should anticipate a push for more diverse training datasets that accurately reflect the world's languages and cultural nuances.

However, progress will not be instantaneous. The challenges of training truly multilingual models are significant, requiring vast amounts of high-quality data and complex architectural innovations. In the short term, content creators and businesses should exercise caution and conduct thorough testing when deploying AI agents in multilingual settings, verifying performance and security manually where possible. Over the next few years, as more developers adopt and contribute to benchmarks like MAPS, we should see a gradual but significant betterment in the reliability and safety of AI agents, making them genuinely useful tools for a global audience.

Ready to start creating?