New ASR Leaderboard Tracks Reveal Key AI Voice Trends

Hugging Face expands its Open ASR Leaderboard, adding multilingual and long-form audio analysis.

The Open ASR Leaderboard now includes new tracks for multilingual and long-form audio, offering deeper insights into automatic speech recognition (ASR) model performance. This expansion helps users compare models beyond short-form English transcription. It highlights trade-offs between accuracy, speed, and language support.

By Sarah Kline

December 2, 2025

4 min read

Key Facts

  • The Open ASR Leaderboard now includes new multilingual and long-form transcription tracks.
  • The leaderboard compares over 60 open and closed-source models from 18 organizations across 11 datasets.
  • Conformer encoder + LLM decoders offer the best accuracy, with open-source models performing strongly.
  • CTC / TDT decoders provide the fastest processing speeds.
  • Closed-source systems currently lead in long-form audio transcription.

Why You Care

Ever struggled with an AI assistant misunderstanding your accent or a podcast transcription failing miserably? What if you could easily pick the AI model for your specific audio needs?

The Open ASR Leaderboard, a crucial resource for automatic speech recognition (ASR) models, has just expanded, according to the announcement. This update introduces vital new tracks for multilingual and long-form audio analysis. This means you can now make much more informed decisions about the AI voice system you use.

What Actually Happened

Hugging Face, a prominent platform for machine learning models, has significantly updated its Open ASR Leaderboard, as detailed in the blog post. This leaderboard serves as a benchmark for comparing various ASR models—AI systems that convert spoken language into text. Previously, most benchmarks focused primarily on short-form English transcription, typically audio clips under 30 seconds. This narrow focus often overlooked crucial performance aspects such as multilingual capability and model throughput, which measures how quickly a model processes audio.
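Throughput is commonly reported as an inverse real-time factor (RTFx): how many seconds of audio a model transcribes per second of wall-clock compute. A minimal sketch of the arithmetic, with illustrative numbers rather than actual leaderboard figures:

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio transcribed per
    second of wall-clock compute. Higher means a faster model."""
    if processing_seconds <= 0:
        raise ValueError("processing time must be positive")
    return audio_seconds / processing_seconds

# Illustrative example: a 60-minute podcast transcribed in 90 seconds
# runs 40x faster than real time.
print(rtfx(audio_seconds=3600.0, processing_seconds=90.0))  # 40.0
```

An RTFx below 1.0 would mean the model is slower than real time, which matters if you need live captioning rather than offline transcription.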

The team revealed that new multilingual and long-form transcription tracks have been added. This expansion allows for a more comprehensive evaluation of ASR models. It moves beyond simple accuracy to consider how well models handle different languages and extended audio segments, like meetings or podcasts. The leaderboard now compares over 60 open and closed-source models from 18 organizations across 11 datasets, as of November 21, 2025.

Why This Matters to You

This expansion of the Open ASR Leaderboard directly impacts anyone working with audio. For example, if you’re a content creator producing podcasts in multiple languages, you need an ASR model that performs well across linguistic boundaries. Or, if you’re a journalist transcribing long interviews, you care deeply about a model’s ability to handle extended audio efficiently.

This updated leaderboard provides clearer guidance, helping you select models tailored to your specific requirements. The research shows that while some models excel in specific areas, there are often trade-offs. “Multilingual performance comes at the cost of single-language performance,” the team revealed. This means a model designed for many languages might not be as precise for English as a dedicated English-only model.
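Accuracy in benchmarks like this one is typically measured as word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the model's transcript into the reference, divided by the number of reference words. A self-contained sketch using standard edit distance (illustrative only; real evaluations also apply text normalization before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1] / len(ref)

# 1 substitution ("sat" -> "sit") + 1 deletion ("the"):
# 2 errors over 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better: a WER of 0.05 means roughly one word in twenty is wrong, which is why a multilingual model scoring slightly worse than an English-only one can still be the right choice when language coverage matters more.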

Key findings from the updated leaderboard:

  • Accuracy — best performers: Conformer encoder + LLM decoders; open-source models are highly competitive.
  • Speed — best performers: CTC / TDT decoders, which offer the fastest processing.
  • Multilingual — varies by model, often sacrificing single-language precision; a trade-off for broader language support.
  • Long-form — best performers: closed-source systems, which still lead, though open-source is catching up.

How will this detailed insight change your approach to choosing an ASR system?

The Surprising Finding

Here’s a twist: despite the perception that proprietary systems always lead, the Open ASR Leaderboard indicates otherwise for certain metrics. The research shows that models combining a “Conformer encoder + LLM decoders” offer the best accuracy, and surprisingly, many of these are open-source solutions. This challenges the common assumption that closed-source, commercially developed models are inherently superior in all aspects. The team put it bluntly, declaring “open-source ftw” on best accuracy. This suggests that the collaborative and transparent nature of open-source development is yielding highly competitive results in core ASR accuracy.

However, for long-form audio, the situation is different. The blog post states that “Closed-source systems still lead” for these extended tasks. This highlights a nuanced landscape where no single type of model dominates every category. While open-source excels in raw accuracy, closed-source models currently maintain an edge in handling very long audio segments, possibly due to more specialized engineering or proprietary datasets.
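Part of what makes long-form audio hard is that many ASR models are trained on clips of roughly 30 seconds, so longer recordings are usually split into overlapping windows, transcribed independently, and stitched back together. A sketch of the chunk-boundary arithmetic (the 30 s window and 5 s overlap are illustrative defaults, not figures from the leaderboard):

```python
def chunk_spans(total_seconds: float, window: float = 30.0,
                overlap: float = 5.0) -> list[tuple[float, float]]:
    """Split a long recording into overlapping (start, end) windows so
    a short-form ASR model can transcribe each piece independently.
    The overlap gives the stitching step context at each boundary."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than the window")
    spans, start, step = [], 0.0, window - overlap
    while start < total_seconds:
        spans.append((start, min(start + window, total_seconds)))
        start += step
    return spans

# A 70-second clip -> three windows, each sharing 5 s with its neighbor.
print(chunk_spans(70.0))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

Stitching the per-chunk transcripts back together without duplicating words at the seams is where much of the engineering effort goes, which may be one reason specialized closed-source pipelines still hold an edge here.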

What Happens Next

Looking ahead, the insights from the Open ASR Leaderboard will drive further innovation in AI voice technology. Developers will likely focus on improving open-source models for long-form transcription, aiming to close the gap with closed-source competitors. For example, you might see new fine-tuning guides, similar to those for Whisper, emerging in the coming months to help push performance. The industry implications are significant, fostering a more competitive and dynamic environment for ASR development.

For readers, this means you can expect more capable and versatile ASR tools to become available. Actionable advice: stay updated with the leaderboard’s trends and experiment with different models. The documentation indicates that fine-tuning guides are available to help users continue pushing performance. This continuous innovation promises a future where your AI assistants understand you better, no matter the language or length of your speech.
