Original article excerpt
Server-side extracted preview paragraphs from the original source.
OpenAI shares guidance on third-party AI evaluations, covering how to assess model capabilities, safeguards, and validity for frontier systems.
What matters for effective independent evaluations of safeguards and capabilities for frontier models.
Independent, trusted third party evaluations play a critical role in strengthening the safety ecosystem. These evaluations are conducted on frontier models to provide additional evidence for claims about critical capabilities and safety mitigations. In this post, we share lessons we’ve learned so far, and recommend approaches for designing evaluations that can validly assess frontier models that we hope help inform emerging standards in the space.
Earlier, many evaluations treated models like chatbots: the evaluation prompted a model as though it were a user asking a question, the model answered, and an evaluator judged the output. Today’s frontier models can do much more: they can use tools, keep track of information across many steps, and act within a larger workflow. This means that performance depends not only on the model, but also on the environment in which the task takes place, and on the setup that facilitates its actions. This surrounding setup, which we call the “harness,” can change key aspects of the system’s performance, including how it uses tools, keeps track of information, or recovers from mistakes.
This changes how evaluations need to be conducted, and what readers should look for in evaluation reports. In our view, the most useful reports explicitly describe two things beyond the result itself: First, they specify what claim the evaluation setup was designed to test, and second, they share the available evidence that the evaluation result is valid.
Evaluation reports also need to explain how evaluators checked for effects that could impact the validity of a result. These include:
We’ve observed that the role of the harness is especially important for systems that act over longer trajectories. When models can use tools, maintain state, and recover from mistakes across many steps, the harness can change the observed level of performance, and even determine whether the capability that’s being assessed appears in the evaluation at all. For example, a harness that preserves state and retries failed actions may let a model finish a multi-step task that the same model never completes in a simpler harness.
