AI Interview Systems: Quality vs. User Satisfaction

New research reveals surprising disconnect in voice-based AI performance.

A recent study evaluated AI interview systems, combining speech-to-text, large language models, and text-to-speech components. Researchers found that while Google's STT, GPT-4.1, and Cartesia's TTS performed best objectively, user satisfaction didn't always align with these technical metrics.

August 25, 2025

3 min read

Key Facts

  • A study evaluated Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS) combinations for AI interview systems.
  • The research analyzed data from over 300,000 AI-conducted job interviews.
  • Google's STT, GPT-4.1, and Cartesia's TTS combination outperformed others in objective quality and user satisfaction.
  • Surprisingly, objective quality metrics showed a weak correlation with user satisfaction scores.
  • The study provides practical guidance for selecting components in multimodal conversations and a validated evaluation methodology.

Why You Care

Ever wonder why some AI voice assistants feel more natural than others? What if the ‘best’ technical performance doesn’t actually deliver the best experience for you? New research is shedding light on this very question within the realm of AI interview systems. The study examines how different AI components work together, and it uncovers a surprising truth about what truly makes a voice AI system feel right. How will this change your interactions with AI in the future?

What Actually Happened

A new paper titled “Evaluating Speech-to-Text x LLM x Text-to-Speech Combinations for AI Interview Systems” has been published. According to the announcement, the research compared combinations of three AI components: speech-to-text (STT), large language models (LLMs), and text-to-speech (TTS). The study analyzed data from over 300,000 AI-conducted job interviews, using an LLM-as-a-Judge framework to evaluate conversational quality and technical accuracy. Five production configurations were assessed, and the findings show which combinations excel in specific areas.
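The STT → LLM → TTS loop the paper evaluates can be sketched in a few lines. Everything below is illustrative: the three component functions are stubs standing in for vendor APIs (an STT service, an LLM, a TTS voice), not the systems the study tested.

```python
def transcribe(audio: bytes) -> str:
    # Placeholder STT: a real system would call a speech-to-text API.
    # Here the "audio" is just UTF-8 text so the example runs anywhere.
    return audio.decode("utf-8")

def generate_reply(transcript: str) -> str:
    # Placeholder LLM: a real system would prompt an interviewer model.
    return f"Thanks. You said: {transcript!r}. Can you elaborate?"

def synthesize(text: str) -> bytes:
    # Placeholder TTS: a real system would return synthesized speech audio.
    return text.encode("utf-8")

def interview_turn(candidate_audio: bytes) -> bytes:
    """One STT -> LLM -> TTS turn of an AI interview loop."""
    transcript = transcribe(candidate_audio)
    reply = generate_reply(transcript)
    return synthesize(reply)

audio_out = interview_turn(b"I led a team of five engineers")
print(audio_out.decode("utf-8"))
```

The point of the structure is that each stage is swappable, which is exactly what lets a study compare five production configurations of the same pipeline.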

Why This Matters to You

This research offers practical guidance for anyone building or using voice-based conversational AI, where understanding component performance is crucial. One stack stood out: Google’s STT, GPT-4.1, and Cartesia’s TTS. It outperformed the other configurations on objective quality metrics and also earned higher user satisfaction scores. Imagine you’re a recruiter using an AI interviewer and you want the best experience for candidates: this research helps you choose the right tools, and your candidates get a smoother, more effective interaction.

Here are some key findings from the study:

  • Best Performing Stack: Google’s STT, GPT-4.1, and Cartesia’s TTS.
  • Evaluation Method: Automated LLM-as-a-Judge framework.
  • Data Scale: Over 300,000 AI-conducted job interviews analyzed.
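The LLM-as-a-Judge method listed above can be sketched roughly as follows. This is a hypothetical illustration, not the paper’s actual setup: `call_llm` is a stand-in that returns a canned verdict so the example runs, and the rubric wording and score fields are assumptions.

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real judge-model API call; returns a fixed JSON verdict.
    return json.dumps({"conversational_quality": 4, "technical_accuracy": 5})

def judge_turn(question: str, answer: str) -> dict:
    """Ask a judge model to score one interview exchange on a 1-5 scale."""
    prompt = (
        "Rate this interview exchange for conversational_quality and "
        "technical_accuracy on a 1-5 scale. Reply as JSON.\n"
        f"Interviewer: {question}\n"
        f"Candidate: {answer}"
    )
    return json.loads(call_llm(prompt))

scores = judge_turn(
    "Tell me about a project you led.",
    "I led a migration of our billing service to a new platform.",
)
print(scores)
```

At the scale of 300,000 interviews, this kind of automated scoring is what makes a head-to-head comparison of configurations tractable at all.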

As Rumi Allbert and the team state in their abstract, “Our findings provide practical guidance for selecting components in multimodal conversations and contribute a validated evaluation methodology for human-AI interactions.” This means you can make more informed decisions. Are you prioritizing technical perfection or user comfort? This study suggests you might need to consider both. How might this influence your next AI project?

The Surprising Finding

Here’s the twist: objective quality metrics correlated only weakly with user satisfaction scores. This is counterintuitive for many. You might assume that a technically superior system would always lead to happier users. However, the paper states that user experience in voice-based AI systems depends on factors beyond technical performance, which challenges the common assumption that raw technical power translates to real-world success. For example, a system might transcribe speech perfectly, yet if its synthesized voice sounds unnatural, users could still be dissatisfied. The ‘feel’ of the interaction is just as important as its accuracy.
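What “weak correlation” means here is easy to see with a small calculation. The numbers below are made up for illustration (they are not the study’s data): five hypothetical system configurations where objective quality ranks them one way while user satisfaction ranks them another.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative per-configuration scores (invented, not from the paper).
objective_quality = [0.91, 0.88, 0.86, 0.84, 0.80]   # e.g. transcription accuracy
user_satisfaction = [4.2, 3.6, 4.1, 4.3, 3.9]        # e.g. 1-5 survey rating

r = pearson(objective_quality, user_satisfaction)
print(f"correlation: {r:.2f}")  # close to zero: rankings barely agree
```

A coefficient near zero, as in this toy example, means picking components by benchmark scores alone tells you little about how satisfied users will be.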

What Happens Next

These findings will likely influence how AI interview systems are developed. Expect more focus on the holistic user experience, not just individual component performance. For example, future AI systems might prioritize natural intonation over pure word accuracy, which could lead to more nuanced evaluation methods. The authors also contribute a validated evaluation methodology that could become a standard for assessing human-AI interactions. The industry implications are significant: companies may start investing more in user perception studies, making your feedback on AI systems even more valuable. The goal is to create AI that feels truly conversational, and this research is a vital step toward that future.