Original article excerpt
Server-side extracted preview paragraphs from the original source.
Generate recommendations from production traces, validate them with batch evaluation and A/B testing, and ship with confidence. AI agents that perform well at launch don’t stay that way. As models evolve, user behavior shifts, and prompts get reused in new contexts they were never designed for. Agent quality quietly degrades. In most teams, the improvement […]
Generate recommendations from production traces, validate them with batch evaluation and A/B testing, and ship with confidence.
AI agents that perform well at launch don’t stay that way. As models evolve, user behavior shifts, and prompts get reused in new contexts they were never designed for. Agent quality quietly degrades. In most teams, the improvement process still looks the same: without automatic feedback loops, when a user complains, a developer reads through traces, forms a hypothesis, rewrites the prompt, tests a handful of cases, and ships the fix. Then the cycle repeats, often introducing a new issue for a different user. Up until today, Amazon Bedrock AgentCore provided the pieces for you to debug it manually or build custom implementations: check the evaluation scores to detect quality drop, deep dive into the traces to determine the root cause and update the agent with an improved configuration. The developer is the performance engine relying on intuition rather than on systematic data-backed evidence. Dedicated science teams and large centralized benchmarks help, but they are neither a practical nor timely solution for most product teams. Even when you have that machinery, it tends to move on weekly or monthly cycles, while agents drift in production every day.
AgentCore is the platform to build, connect, and optimize agents at scale, with security enforced at the infrastructure layer. Thousands of developers already use AgentCore to build agents that reason, plan, and act across complex workflows. Today we are announcing new capabilities in AgentCore that complete the observe, evaluate, improve loop for agent performance and quality: recommendations and two ways to validate them.
Recommendations analyze production traces and evaluation outputs to optimize your system prompt or tool descriptions for the evaluator you specify. Batch evaluation helps test the recommendation against a pre-defined test dataset and reports aggregate scores, catching regressions on cases you know matter. When hand-authored scenarios aren’t enough, you can also simulate a dataset using an LLM-backed actor to play the role of an end user. A/B testing runs a controlled comparison between versions of an agent through AgentCore Gateway, splitting live production traffic at the percentage you configure and reporting results with confidence intervals and statistical significance. Recommendations propose changes, batch evaluation and A/B testing validate them, and together they replace the manual cycle of reading traces, guessing at fixes, and deploying blind.
“Continuously evaluating and improving agents is essential for driving data-driven value creation. Processes that traditionally required weeks of manual prompt tuning have evolved into rapid, repeatable cycles through the use of AgentCore. By deriving improvement recommendations from production trace data and validating their impact through A/B testing, organizations can optimize performance while ensuring accuracy and effectiveness. This approach enables continuous, highly efficient improvement at scale.” Yoshiharu Okuda, Head of Generative AI Business Strategy Department, NTT DATA
Here is how the loop runs for the model upgrade scenario. The pattern is the same for any change: a prompt refactor, a tool set update, a framework upgrade.