Original article excerpt
Server-side extracted preview paragraphs from the original source.
In this post, we walk you through calling the detector functions to diagnose real agent failures. You learn how to interpret their structured output: categorized failures with confidence scores, causal chains linking root causes to downstream symptoms, and fix recommendations specifying whether a change belongs in your system prompt or tool definitions. You also learn how to integrate detection into your evaluation pipeline for automated diagnosis on every test run.
When your AI agent fails in production, knowing that it failed is only the beginning. The harder question is why it failed and what to fix. Traditional evaluation tells you “this agent scored 60 percent on goal completion,” but leaves you manually reviewing execution traces to understand what went wrong. For teams operating agents at scale, this manual diagnosis becomes the bottleneck between detecting a problem and shipping a fix. Detectors in the Strands Evals SDK remove this bottleneck by automatically identifying failures in agent execution traces and performing root cause analysis, so you can reduce diagnosis time from hours to minutes.
In this post, we walk you through calling the detector functions to diagnose real agent failures. You learn how to interpret their structured output: categorized failures with confidence scores, causal chains linking root causes to downstream symptoms, and fix recommendations specifying whether a change belongs in your system prompt or tool definitions. You also learn how to integrate detection into your evaluation pipeline for automated diagnosis on every test run.
Detectors complement the evaluation framework introduced in a previous post by answering not only “how well did the agent do?” but also “why did it fail and how do I fix it?”
The Strands Evals framework provides reliable quality signals through Cases, Experiments, and Evaluators: goal success rates, tool selection accuracy, and helpfulness scores. These are important for catching regressions and understanding performance at a statistical level. But consider what happens after you detect a regression. Your agent’s goal success rate drops from 85 percent to 70 percent after a deployment or after prompt or tool changes in build-time testing. Evaluators confirm the drop. Now what?
You must identify which specific behaviors caused failures, distinguish root causes from downstream symptoms, determine whether the fix belongs in the system prompt or tool definitions, and prioritize by impact. This diagnosis workflow has traditionally required senior engineers to manually inspect traces span by span and correlate failures across hundreds of steps, and this process doesn’t scale.
Detectors automate this workflow. Evaluators answer “how well did the agent do?” by producing scores at the per-case level. Detectors answer “why did it fail?” by producing diagnoses at the per-span level with categorized failures, causal chains, and fix recommendations.
