Tuesday, May 26, 2026 streamed.news From video to newspaper
Tech & AI

Diagnostic Metrics Crucial for Identifying Voice Agent Failures

Original source: ServiceNow


This video from ServiceNow covered a lot of ground. Streamed.News selected 7 key moments and summarises them here. Everything below links directly to the timestamp in the original video.

When your voice agent struggles, understanding *why* it fails is as important as knowing *that* it failed. Diagnostic metrics can highlight specific transcription challenges for critical information, such as confirmation codes or addresses.


Diagnostic Metrics Crucial for Identifying Voice Agent Failures

Understanding why a voice agent fails requires diagnostic metrics, in particular how accurately it transcribes key entities. An incorrect confirmation code or flight detail blocks task progression, and even an entity that is transcribed correctly at first can be misinterpreted by subsequent Large Language Model (LLM) processing. These diagnostics pinpoint specific failure modes that an overall accuracy score hides, such as dropping part of the flight details or erroneously inserting numbers into a confirmation code, both of which are often challenging for transcription models.

"If the agent cannot transcribe the confirmation code correctly, it will never be able to then retrieve the reservation and move forward in any way."

▶ Watch this segment — 16:28
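
The key-entity diagnostic described above can be sketched in a few lines. This is a minimal illustration, not the method shown in the video: the function names, the normalization rule, and the error labels ("insertion", "deletion", "substitution", "missing") are all assumptions made for the example.

```python
import re
from typing import Dict, Optional


def normalize(value: str) -> str:
    """Uppercase the value and keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())


def is_subsequence(short: str, long: str) -> bool:
    """True if every character of `short` appears in `long`, in order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def diagnose_entity(expected: str, heard: Optional[str]) -> str:
    """Label one entity as correct, missing, insertion, deletion or substitution."""
    if heard is None or not normalize(heard):
        return "missing"
    exp, got = normalize(expected), normalize(heard)
    if exp == got:
        return "correct"
    if len(got) > len(exp) and is_subsequence(exp, got):
        return "insertion"   # extra characters were added to the entity
    if len(got) < len(exp) and is_subsequence(got, exp):
        return "deletion"    # part of the entity was dropped
    return "substitution"


def entity_report(expected: Dict[str, str], heard: Dict[str, str]) -> Dict[str, str]:
    """Diagnose every expected entity against what the agent transcribed."""
    return {name: diagnose_entity(value, heard.get(name))
            for name, value in expected.items()}


def entity_accuracy(report: Dict[str, str]) -> float:
    """Fraction of entities transcribed correctly (0.0 for an empty report)."""
    if not report:
        return 0.0
    return sum(label == "correct" for label in report.values()) / len(report)


# Hypothetical example values, invented for illustration only.
expected = {"confirmation_code": "ZK3FFW", "flight_number": "UA 482"}
heard = {"confirmation_code": "ZK 3 4 FFW", "flight_number": "ua482"}
report = entity_report(expected, heard)
print(report, entity_accuracy(report))
```

Reporting a label per entity, rather than a single score, is what lets a reviewer see that the confirmation code failed because of an inserted digit while the flight number was fine.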


Voice Agent Evaluation Focuses on Task Completion, Speech Fidelity, and Faithfulness

The evaluation of voice agents relies on three accuracy metrics: task completion, agent speech fidelity, and faithfulness. Task completion assesses whether the agent calls the correct tools and reaches the intended end state, while speech fidelity verifies that the agent accurately vocalizes the text output from the LLM. Faithfulness checks that the agent adheres to instructions and policies without hallucinating or violating policy. In a flight rebooking scenario, the agent completed the task and maintained speech fidelity, but it committed a faithfulness error by misrepresenting the fare of one of the flight options it presented. That gap shows why granular evaluation matters.

"Faithfulness is basically about checking whether the agent followed the right instructions, followed the policies, you know, followed the user's criteria and basically didn't hallucinate or make any policy violations along the way."

▶ Watch this segment — 12:59
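
Two of the accuracy metrics above lend themselves to a small sketch. The code below is an assumption-laden illustration, not the framework's implementation: task completion is approximated as "all required tools were called and the final state matches", and speech fidelity is approximated with word error rate between the LLM text and a transcript of the spoken audio. The tool names and state keys are hypothetical.

```python
from typing import Dict, List


def task_completed(required_tools: List[str], called_tools: List[str],
                   expected_state: Dict[str, str],
                   final_state: Dict[str, str]) -> bool:
    """True if every required tool was called and the end state matches."""
    tools_ok = set(required_tools) <= set(called_tools)
    state_ok = all(final_state.get(key) == value
                   for key, value in expected_state.items())
    return tools_ok and state_ok


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical tool names and state, invented for illustration only.
done = task_completed(
    required_tools=["get_reservation", "rebook_flight"],
    called_tools=["get_reservation", "search_flights", "rebook_flight"],
    expected_state={"status": "rebooked"},
    final_state={"status": "rebooked", "seat": "12A"},
)

# Speech fidelity: what the LLM wrote versus what was actually spoken.
fidelity_error = word_error_rate("your new flight departs at nine",
                                 "your new flight departs at five")
print(done, fidelity_error)
```

Faithfulness is left out on purpose: judging whether a fare was misrepresented needs the policy and the tool results as context, which a simple string comparison cannot capture.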


Voice Agent Experience Metrics Include Conciseness, Progression, and Turn-Taking

Evaluating voice agent user experience involves metrics such as conciseness, conversation progression, and turn-taking. Conciseness measures whether each response is an appropriate length for a spoken conversation, while progression assesses the agent's ability to move the conversation forward efficiently without repetition. Turn-taking examines the timing of the agent's speech, flagging delayed responses and early interruptions. During a demonstration, the agent's early responses were concise, but later turns became excessively long and hard for a listener to follow. The agent also responded late at the start of the conversation and occasionally interrupted early, both of which degrade the natural flow of conversation.

"Conciseness basically measures if the turn is appropriate for a spoken conversation."

▶ Watch this segment — 14:23
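
As a rough sketch of how conciseness and turn-taking could be checked from a timestamped trace, consider the following. The `Turn` structure, the 40-word limit, and the 1.5-second gap are assumptions chosen for the example and do not come from the video.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_end: float     # seconds: when the user stopped speaking
    agent_start: float  # seconds: when the agent started speaking
    agent_text: str     # what the agent said in this turn


def classify_timing(turn: Turn, max_gap: float = 1.5) -> str:
    """Label a turn as early, delayed, or on_time (threshold is assumed)."""
    gap = turn.agent_start - turn.user_end
    if gap < 0:
        return "early"    # agent spoke before the user had finished
    if gap > max_gap:
        return "delayed"  # agent left too long a silence
    return "on_time"


def is_concise(turn: Turn, max_words: int = 40) -> bool:
    """True if the turn is short enough to follow by ear (limit is assumed)."""
    return len(turn.agent_text.split()) <= max_words


# Hypothetical trace, invented for illustration only.
trace = [
    Turn(user_end=4.0, agent_start=7.2, agent_text="Sure, I can help with that."),
    Turn(user_end=15.0, agent_start=14.6, agent_text="One moment please."),
    Turn(user_end=30.0, agent_start=30.8, agent_text="option " * 60),
]
for turn in trace:
    print(classify_timing(turn), is_concise(turn))
```

A word-count limit is a crude stand-in for conciseness. It catches the overly long turns described above but says nothing about whether a short reply was actually useful.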


Voice Agent Demo Reveals Transcription Errors and Recovery During Flight Rebooking

A demonstration of a user-agent interaction for flight rebooking highlighted the challenges posed by transcription errors, even though the agent eventually succeeded. The conversation trace revealed discrepancies between the user's spoken words and the agent's transcribed understanding, particularly for critical details like confirmation codes and flight information. Despite significant early mistranscriptions of whole terms and specific details, the agent managed to obtain the correct confirmation code, access the reservation, and present flight options. Voice agents can recover from such errors, but initial transcription accuracy remains a hurdle in conversational AI.

"There were some transcription errors, some pretty significant ones, but eventually the agent is able to get the confirmation code from the user and is able to pull up the reservation and call the right tools."

▶ Watch this segment — 10:12


Voice AI Models Struggle with Balancing Accuracy and User Experience

Current voice AI models generally exhibit a trade-off between good user experience and high accuracy, rarely achieving both simultaneously. Speech-to-speech models often offer a better experience due to their real-time nature but frequently lack the accuracy and robust reasoning capabilities of more traditional cascade systems. A dominant failure mode across models is the incorrect transcription of named entities, such as IDs or confirmation numbers, which often prevents conversations from progressing. Furthermore, voice agents currently demonstrate inconsistency, performing well in one instance and failing in the same scenario shortly after. This variability necessitates multiple evaluations to ensure trustworthy results, highlighting the need for greater reliability in AI performance.

"Named entity transcription errors… are a dominant failure mode, and when those fail it often prevents the whole conversation from continuing."

▶ Watch this segment — 19:26
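
Because the same scenario can pass once and fail shortly after, a single run says little. One way to account for this, sketched below under stated assumptions, is to repeat each scenario several times and report both the pass rate and whether the outcomes agreed. The `run_trial` callable is hypothetical and stands in for whatever executes and scores one conversation.

```python
from typing import Callable, Dict, Union


def repeated_eval(run_trial: Callable[[int], bool],
                  trials: int = 5) -> Dict[str, Union[int, float, bool]]:
    """Run one scenario `trials` times and summarize how stable the result is."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    outcomes = [run_trial(index) for index in range(trials)]
    passes = sum(outcomes)
    return {
        "trials": trials,
        "pass_rate": passes / trials,
        # Consistent means every trial agreed, whether all passed or all failed.
        "consistent": passes in (0, trials),
    }


# Hypothetical stand-in for a real conversation run: passes on even trials only.
def flaky_trial(index: int) -> bool:
    return index % 2 == 0


summary = repeated_eval(flaky_trial, trials=4)
print(summary)
```

Keeping "consistent" separate from "pass_rate" distinguishes an agent that reliably fails a scenario from one that fails it unpredictably, which are different problems to fix.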


EVA Framework Evaluates Voice Agents Across Diverse Scenarios and Architectures

The EVA framework is designed to generate and automatically evaluate "bot-to-bot" conversations between two voice agents, supporting diverse user scenarios such as airline, HR, medical, and IT service management (ITSM) tasks. It accommodates various agent model types, including traditional cascade systems (speech-to-text, text-to-text LLM, text-to-speech) and newer real-time speech-to-speech models. The evaluation focuses on two main dimensions: accuracy, which checks that the task is completed correctly, and experience, which assesses conversation quality. This approach allows for testing different architectural trade-offs, such as the multi-component cascade system versus integrated speech-to-speech models, which bundle all three processing steps. The framework aims to provide a robust method for comparing and improving voice agent performance across different applications.

"The goal of this framework is to be able to generate conversations bot-to-bot… and then being able to evaluate those conversations automatically as well."

▶ Watch this segment — 5:59
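
The difference between the two architectures can be shown with a small sketch. It assumes each component is a plain function, and the stub components below are placeholders that only move text around; none of this reflects EVA's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeAgent:
    """Three separate stages chained together: audio -> text -> text -> audio."""
    speech_to_text: Callable[[bytes], str]
    llm: Callable[[str], str]
    text_to_speech: Callable[[str], bytes]

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self.speech_to_text(audio_in)  # stage 1: transcribe the user
        text_out = self.llm(text_in)             # stage 2: reason over text
        return self.text_to_speech(text_out)     # stage 3: synthesize a reply


# A speech-to-speech model bundles all three steps behind one call.
SpeechToSpeechAgent = Callable[[bytes], bytes]


# Placeholder stubs that treat the "audio" as UTF-8 text, for illustration only.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


cascade = CascadeAgent(fake_stt, fake_llm, fake_tts)
bundled: SpeechToSpeechAgent = cascade.respond  # same signature as a bundled model

print(cascade.respond(b"rebook my flight"))
```

Because both styles reduce to "audio in, audio out", an evaluator can treat them interchangeably. The cascade additionally exposes its intermediate text, which is what makes checks like transcription accuracy and speech fidelity straightforward to compute.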


Voice Agent Models Face Trade-off Between Accuracy and User Experience

Early aggregate results from voice agent evaluations reveal a significant trade-off between accuracy and user experience. Few models perform well on both dimensions; most excel in one and underperform in the other. Cascade systems, which chain separate speech-to-text, text-to-text LLM, and text-to-speech components, tend to be slower but achieve higher accuracy. In contrast, speech-to-speech models often offer a better user experience, with lower latency and improved turn-taking, but frequently exhibit lower accuracy because their reasoning and tool-calling capabilities are less sophisticated. This inverse relationship suggests that developers must often prioritize either precision in task completion or the fluidity of the user interaction. While the field is evolving rapidly, a sweet spot where models consistently deliver both high accuracy and an excellent experience remains largely elusive for now.

"It tends to be that models are either better at experience and worse on accuracy, or if they have good accuracy, they're not as good on experience."

▶ Watch this segment — 17:48


Summarised from ServiceNow · 21:47. All credit belongs to the original creators. Streamed.News summarises publicly available video content.
