Microsoft on Tuesday took the wraps off Adaptive Spec-driven Scoring for Evaluation and Regression Testing, an open source framework for spinning up AI evaluations.
AI researchers and labs have advanced by leaps and bounds in evaluating AI models for everything from safety and compliance to sycophancy and alignment. But companies and developers appear to face a newer, narrower need: making sure their AI system behaves as intended for their own product or service.
In a bid to make that testing process simpler, Microsoft on Tuesday took the wraps off ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.
The open source framework, Microsoft says, simplifies the evaluation of application-specific AI behavior by using AI to turn high-level, natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests whose results developers can inspect.
ASSERT takes plain-language descriptions of an AI model’s expected behavior and policies, turns them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the paths the AI system takes, including intermediate actions and tool calls, so developers can inspect where failures happen.
Devs can provide system context, tools, and constraints, too, if they want to further customize what the evaluations cover.
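The flow described above (test cases derived from a spec, run against a target system, scored, with the path recorded for inspection) can be pictured with a small Python sketch. This is not ASSERT's API: every name here (`TestCase`, `Result`, `run_suite`, `score`, `toy_target`) is a hypothetical stand-in, and the AI-driven generation step is replaced with two hand-written cases.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TestCase:
    prompt: str
    # Predicate over the recorded trace; True means the run passed.
    check: Callable[[list[str]], bool]


@dataclass
class Result:
    prompt: str
    passed: bool
    trace: list[str] = field(default_factory=list)


def run_suite(target, cases):
    """Run each case against the target and keep the trace for inspection."""
    results = []
    for case in cases:
        trace = target(case.prompt)  # list of action names the system took
        results.append(Result(case.prompt, case.check(trace), trace))
    return results


def score(results):
    """Fraction of cases that passed (0.0 for an empty suite)."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def toy_target(prompt: str) -> list[str]:
    """Hypothetical system under test: it emails whenever asked to."""
    if "email" in prompt.lower():
        return ["search_docs", "send_email"]
    return ["search_docs", "summarize"]


# Hand-written stand-ins for generated test cases.
cases = [
    TestCase("Summarize the Q3 report", lambda t: "summarize" in t),
    TestCase("Email the Q3 report to a vendor", lambda t: "send_email" not in t),
]

results = run_suite(toy_target, cases)
pass_rate = score(results)
failed_traces = [r.trace for r in results if not r.passed]
```

Here the second case fails, and its stored trace shows the `send_email` step where the toy system went wrong, which is the kind of inspection the recorded paths are meant to support.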
For example, a developer could specify that a document research AI agent shouldn’t send emails to people outside the company, should limit confidential information to C-level executives, and should provide concise summaries with prior context in mind. ASSERT will use those rules to generate test cases that check, on an ongoing basis, whether the system complies.
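To make the email rule concrete, here is a minimal sketch of how one such check might look when applied to a recorded list of tool calls. It is an illustration only, not how ASSERT implements it: the `ToolCall` shape, the `send_email` tool name, the `to` argument, and the `example.com` company domain are all assumptions.

```python
from dataclasses import dataclass

COMPANY_DOMAIN = "example.com"  # assumption: placeholder company domain


@dataclass
class ToolCall:
    tool: str
    args: dict


def external_email_violations(trace):
    """Return every recipient outside the company domain in the trace."""
    violations = []
    for call in trace:
        if call.tool != "send_email":
            continue
        for recipient in call.args.get("to", []):
            # Compare the part after the last "@", case-insensitively.
            domain = recipient.rsplit("@", 1)[-1].lower()
            if domain != COMPANY_DOMAIN:
                violations.append(recipient)
    return violations


# Hypothetical trace from one test run of the agent.
trace = [
    ToolCall("search_docs", {"query": "Q3 forecast"}),
    ToolCall("send_email", {"to": ["ana@example.com", "bob@vendor.net"]}),
]

violations = external_email_violations(trace)
```

A rule like this is deterministic, so it needs no model to score; the confidentiality and concise-summary rules in the example are fuzzier and would likely need a different kind of scorer.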
