Original article excerpt
Server-side extracted preview paragraphs from the original source.
Agent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchmark alongside your changing real-world traffic. Managing test cases for evaluation baselines as a dataset in Amazon Bedrock AgentCore brings the discipline of versioned test fixtures […]
Agent evaluation is most powerful when you combine fast-moving online signals with stable offline baselines. To understand whether your agent is truly improving over time, you need a fixed benchmark alongside your changing real-world traffic.
Managing test cases for evaluation baselines as a dataset in Amazon Bedrock AgentCore brings the discipline of versioned test fixtures to agent evaluation. You can author scenarios with inputs, expected outputs, assertions, and tool sequences, then publish them as immutable numbered versions that don’t shift beneath a run. You can iterate freely on a mutable draft until you’re ready to lock a checkpoint. And when something breaks in production, that failure becomes a permanent test case that every future change gets evaluated against.
In this post, we walk through the full workflow with a financial market-intelligence agent. We capture failures from production traces, build a versioned dataset, run an evaluation, fix the agent, and confirm the improvement against the same locked inputs.
Agents are non-deterministic by design. The same input can produce different outputs across runs, which makes a single evaluation result nearly meaningless. You can’t tell if a score moved because the agent changed or because the model sampled differently. Consistent measurement across stable inputs is the only way to know whether a change actually helped.
But stable inputs alone aren’t enough. A large language model (LLM) judge can tell you whether a response sounds helpful. It cannot tell you whether the stock price is accurate, whether the broker workflow ran in the right order, or whether personally identifiable information (PII) leaked between sessions. For those checks you need ground truth: the expected response, the required tool sequence, and the assertions that must hold regardless of how the response is phrased. Ground truth is what turns a subjective score into a verifiable measurement. Without it, you’re measuring the appearance of correctness, not correctness itself.
Versioned datasets give you both. They hold the inputs still so scores are comparable across runs, and they carry the ground truth that makes those scores mean something. This matters most in the two places where agent evaluation actually happens.
