A Blog post by ServiceNow-AI on Hugging Face
Voice agent failures are often highly domain-specific. A system that flawlessly processes alphanumeric confirmation codes in flight re-booking transactions might stumble when handling complex policies in HR systems. Different domains test an agent's ability to adapt to different vocabulary, workflow complexities, and user expectations. So with this release, EVA-Bench expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). Together they span 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage from our original release. Every scenario was validated for solvability against three frontier models (OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6), ensuring the benchmark is both challenging and fair. All three datasets are open-source and available for download.
EVA-Bench is built for multiple audiences. If you're evaluating a voice agent, you can run it against a diverse set of realistic enterprise scenarios spanning 35+ distinct workflows. If you're building your own evaluation dataset, this post describes our end-to-end generation and validation process in enough detail to serve as a practical reference. We walk through how each domain was designed and generated and take a deep dive into the two new additions. We also preview our upcoming multilingual extension, which widens the benchmark's reach beyond English-only enterprise deployments.
Five principles guided the design of the EVA-Bench datasets across all three domains.
Voice-first scope. Not every enterprise workflow belongs in a voice benchmark. We started by identifying which tasks within each domain are handled over the phone in practice, then selected the most common flows from that subset. This kept the scenarios grounded in realistic call patterns.
Realism. Tool schemas were modeled after the kinds of APIs a production platform uses. Scenario policies were drawn from actual enterprise constraints. For the Healthcare HRSD domain, this meant grounding scenarios in actual US healthcare policy and administration systems, including NPI numbers, FMLA, and insurance coverage, so that the benchmark reflects the domain as practitioners encounter it in real life.
Variety. Scaling a dataset by simply repeating identical tasks offers limited evaluation signal. To avoid this, we defined specific workflows for each domain and sampled across three scenario types: single-intent calls, multi-intent calls with up to four intents in a single conversation, and adversarial calls where callers attempt to bypass troubleshooting steps, misclassify urgency, or access records they are not authorized to view. Within single- and multi-intent scenarios, we also included cases where the user's goal is not satisfiable, because real call volume is not all happy-path, and in our experience models tend to struggle more with unsatisfiable goals than with successful interactions.
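To make the sampling idea concrete, here is a minimal Python sketch of how a scenario specification could be drawn across the three scenario types. It is not the EVA-Bench generation code: the workflow names, the `ScenarioSpec` type, the `sample_spec` function, and the 25% unsatisfiable rate are all assumptions chosen for illustration. Only the three scenario types, the cap of four intents, and the restriction of unsatisfiable goals to single- and multi-intent calls come from the description above.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

SCENARIO_TYPES = ("single_intent", "multi_intent", "adversarial")
MAX_INTENTS = 4  # multi-intent calls carry up to four intents

# Hypothetical workflow names, for illustration only.
EXAMPLE_WORKFLOWS = [
    "rebook_flight",
    "reset_password",
    "request_leave",
    "update_benefits",
    "report_outage",
]


@dataclass(frozen=True)
class ScenarioSpec:
    scenario_type: str
    workflows: Tuple[str, ...]
    # None for adversarial calls, where satisfiability is not sampled here.
    satisfiable: Optional[bool]


def sample_spec(
    workflows: Sequence[str],
    rng: random.Random,
    unsatisfiable_rate: float = 0.25,  # assumed value, not from the post
) -> ScenarioSpec:
    """Draw one scenario spec; requires at least two workflows."""
    scenario_type = rng.choice(SCENARIO_TYPES)

    if scenario_type == "multi_intent":
        # Between two and four distinct intents in one conversation.
        n_intents = rng.randint(2, min(MAX_INTENTS, len(workflows)))
    else:
        n_intents = 1
    chosen = tuple(rng.sample(list(workflows), n_intents))

    if scenario_type == "adversarial":
        satisfiable = None
    else:
        # Some single- and multi-intent goals are deliberately unsatisfiable.
        satisfiable = rng.random() >= unsatisfiable_rate

    return ScenarioSpec(scenario_type, chosen, satisfiable)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is reproducible
    for _ in range(5):
        print(sample_spec(EXAMPLE_WORKFLOWS, rng))
```

Sampling the scenario type first and the workflows second keeps the two axes independent, so a workflow can appear in happy-path, unsatisfiable, and adversarial variants instead of being repeated in one form.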