Original article excerpt
Server-side extracted preview paragraphs from the original source.
We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.
We’re releasing a human-validated subset of SWE-bench that more reliably evaluates AI models’ ability to solve real-world software issues.
As part of our Preparedness Framework, OpenAI develops a range of metrics to track, evaluate, and forecast models’ abilities to act autonomously. The ability to autonomously complete software engineering tasks is a key component of our Medium risk level in the Model Autonomy risk category. Evaluating these capabilities is challenging due to the complexity of software engineering tasks, the difficulty of accurately assessing generated code, and the challenge of simulating real-world development scenarios. Therefore, our approach to Preparedness must also involve careful examination of evaluations themselves, to reduce the potential for underestimating or overestimating performance in important risk categories.