EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

EVA-Bench Data 2.0 is a new release of the EVA-Bench voice agent benchmark dataset, covering 3 enterprise domains with 121 tools and 213 evaluation scenarios. It is meant to test how well voice agents adapt to domains that differ in vocabulary, workflow complexity, and user expectations. The release offers researchers and developers roughly 4x the scenario coverage of the original.

Published Thursday, June 4, 2026 at 2:24 PM

Original article excerpt

Server-side extracted preview paragraphs from the original source.

A blog post by ServiceNow-AI on Hugging Face

Voice agent failures are often highly domain-specific. A system that flawlessly processes alphanumeric confirmation codes in flight rebooking transactions might stumble when handling complex policies in HR systems. Different domains test an agent's ability to adapt to different vocabulary, workflow complexity, and user expectations. So with this release, EVA-Bench expands from one enterprise domain to three: Airline Customer Service Management (CSM), Enterprise IT Service Management (ITSM), and Healthcare HR Service Delivery (HRSD). Together they span 213 evaluation scenarios across 121 tools, a roughly 4x increase in scenario coverage from our original release. Every scenario was validated for solvability against three frontier models (OpenAI GPT-5.4, Google Gemini 3.1 Pro, and Anthropic Claude Opus 4.6), ensuring the benchmark is both challenging and fair. All three datasets are open source and available for download:
