New AI Metric SPAM Evaluates TTS Style Adherence

Researchers introduce a novel metric to accurately assess how well AI-generated speech matches stylistic prompts.

A new metric called SPAM (Style Prompt Adherence Metric) has been developed to improve the evaluation of prompt-based text-to-speech (TTS) systems. This metric helps ensure AI-generated voices accurately reflect the intended style, moving beyond subjective human assessments. It promises more reliable development of expressive AI voices.

By Katie Rowan

January 13, 2026

4 min read

Key Facts

  • SPAM (Style Prompt Adherence Metric) is a new metric for evaluating prompt-based Text-to-Speech (TTS).
  • It aims to ensure AI-generated speech adheres to fine-grained style cues in text prompts.
  • SPAM explicitly satisfies both plausibility and faithfulness in its evaluation.
  • The metric achieved a strong correlation with the mean opinion score (MOS) in plausibility experiments.
  • It can discriminate different semantics of a prompt, showing strong grounding to the given style.

Why You Care

Ever listened to an AI voice that just didn’t quite capture the emotion or tone you asked for? It’s a common frustration, right? What if there was a better way to ensure AI-generated speech truly understands and delivers your desired style? This new metric directly addresses that challenge, making AI voices more expressive and accurate for your projects.

What Actually Happened

Researchers Chanhee Cho, Nayeon Kim, and Bugeun Kim have introduced a new evaluation tool for text-to-speech (TTS) systems. The tool is called SPAM, which stands for Style Prompt Adherence Metric, according to the announcement. Prompt-based TTS aims to generate speech that closely follows specific style cues given in a text prompt, but previous methods for evaluating this adherence often lacked reliability. The team notes that these older measures couldn’t guarantee evaluations were grounded in the prompt or aligned with human judgment. SPAM explicitly addresses both plausibility and faithfulness in its measurements: it factorizes speech into acoustic attributes and aligns them with the style prompt. What’s more, the scorer was trained with a supervised contrastive loss, yielding clearer distinctions between different semantics, as detailed in the paper.
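The paper says only that the scorer was trained with a supervised contrastive loss; it does not publish the training code. For readers curious what that objective looks like, here is a minimal PyTorch sketch of the standard supervised contrastive (SupCon) loss from Khosla et al. (2020), where utterances sharing a style label act as positives. The function name, batch shapes, and temperature are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Standard SupCon loss: pull same-label embeddings together and push
    different-label embeddings apart. Illustrative only, not SPAM's code."""
    emb = F.normalize(embeddings, dim=1)             # (N, D) unit vectors
    sim = emb @ emb.T / temperature                  # (N, N) scaled cosine sims
    n = emb.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=emb.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # Positives = other samples in the batch with the same style label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                           # anchors with >= 1 positive
    # Average log-probability over each anchor's positives.
    summed = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return -(summed[valid] / pos_counts[valid]).mean()

# Toy usage: four 128-dim embeddings, two style classes.
emb = torch.randn(4, 128)
labels = torch.tensor([0, 0, 1, 1])
loss = supervised_contrastive_loss(emb, labels)
```

The intuition matches the paper’s claim: training this way sharpens a scorer’s ability to separate close-but-different style semantics.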

Why This Matters to You

Imagine you’re a podcaster trying to generate a voiceover with a specific upbeat and enthusiastic tone. Previously, you might have struggled to objectively know if the AI delivered that exact style. SPAM changes this by offering a more reliable, objective evaluation method. This means developers can build better, more responsive AI voices. Your AI voice assistant could soon sound genuinely empathetic or excited, depending on your prompt.

Here’s how SPAM improves TTS evaluation:

Feature | Old Evaluation Methods | SPAM (Style Prompt Adherence Metric)
--- | --- | ---
Reliability | Often subjective, inconsistent | Explicitly satisfies plausibility and faithfulness
Grounding | Not always tied to the prompt | Successfully grounded to the given style prompt
Discrimination | Limited distinction between semantics | Can discriminate different semantics of the prompt
Correlation | Variable correlation with human scores | Achieved strong correlation with MOS

For example, think of a customer service chatbot. If you prompt it to sound ‘calm and reassuring,’ SPAM helps ensure the generated voice actually embodies those qualities. This leads to a much better user experience. How might more emotionally intelligent AI voices change your daily interactions?

“Most prior works depend on neither plausible nor faithful measures to evaluate prompt adherence,” the paper states. This highlights the essential gap SPAM aims to fill. It ensures that the evaluation is both logical and true to the prompt’s intent.

The Surprising Finding

What’s particularly interesting is how well SPAM correlates with human perception. You might assume that only humans can truly judge the nuance of speech style. However, the plausibility experiment showed that SPAM achieved a strong correlation with the mean opinion score (MOS). The mean opinion score is a widely accepted measure of human judgment for speech quality. This finding challenges the idea that automated metrics can’t capture the subtleties of human-like evaluation. What’s more, the faithfulness experiment demonstrated SPAM’s ability to discriminate different semantics of the prompt. This means it can tell the difference between ‘happy’ and ‘excited,’ for instance, which is crucial for fine-grained control.
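To make the claim concrete: “strong correlation with MOS” is typically checked by computing Pearson or Spearman correlation between the metric’s scores and averaged human ratings over the same utterances. A minimal sketch with SciPy follows; the numbers are invented placeholders, not data from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# Invented placeholder values: per-utterance metric scores and the
# corresponding human mean opinion scores (MOS) for the same clips.
metric_scores = [0.82, 0.41, 0.67, 0.93, 0.55, 0.74]
mos_ratings   = [4.5, 2.8, 3.6, 4.8, 3.1, 4.0]

r, p_r = pearsonr(metric_scores, mos_ratings)       # linear agreement
rho, p_rho = spearmanr(metric_scores, mos_ratings)  # rank agreement
print(f"Pearson r = {r:.3f} (p = {p_r:.3g})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.3g})")
```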

What Happens Next

This new metric could significantly accelerate the development of more expressive AI voices. We can expect prompt-based TTS systems to become more capable over the next 12 to 18 months. Developers will likely integrate SPAM into their testing pipelines. For example, a company creating an audiobook narrator AI could use SPAM to objectively verify whether the narrator consistently maintains a ‘dramatic’ or ‘calm’ tone throughout a long recording, as sketched below. This ensures consistency and quality.
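No public SPAM API has been announced, so the sketch below uses a hypothetical spam_score() placeholder to show the shape of such a pipeline: score each audio chunk against the style prompt and flag anything that drifts below a threshold.

```python
# Hypothetical QA-pipeline sketch. spam_score() is a placeholder:
# no public SPAM API has been confirmed.

def spam_score(audio_path: str, style_prompt: str) -> float:
    """Placeholder for a real style-adherence scorer (returns 0.0-1.0)."""
    raise NotImplementedError("swap in a real scorer here")

def flag_off_style_chunks(chunk_paths, style_prompt, threshold=0.7):
    """Score each audio chunk against the prompt; flag low-adherence chunks."""
    flagged = []
    for path in chunk_paths:
        score = spam_score(path, style_prompt)
        if score < threshold:
            flagged.append((path, score))
    return flagged

# Example: check that an audiobook narrator stays calm across a long recording.
# flagged = flag_off_style_chunks(
#     ["chapter_01.wav", "chapter_02.wav"],
#     "a calm, measured narrator",
# )
```

The 0.7 threshold is arbitrary; in practice a team would calibrate it against a small set of human-rated samples.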

Actionable advice for content creators: keep an eye on TTS providers, and look for those that mention improved style adherence or use objective evaluation metrics. The industry implications are clear: higher-quality, more controllable AI voices are on the horizon. The team believes that “SPAM can provide a viable automatic approach for evaluating style prompt adherence of synthesized speech.” This suggests a future where AI voices are not just clear, but also truly expressive.
