Why we no longer evaluate SWE-bench Verified

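OpenAI announced that it will stop evaluating models on the SWE-bench Verified benchmark, saying the benchmark is increasingly contaminated and mismeasures frontier coding progress. Its analysis points to flawed tests and training leakage, and it recommends SWE-bench Pro instead. The change matters because it signals a shift in how AI performance on coding tasks is assessed.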

Published: Monday, February 23, 2026 at 12:00 PM

Original article excerpt

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro.

Since we first published SWE-bench Verified in August 2024, the industry has widely used it to measure the progress of models on autonomous software engineering tasks. After its release, SWE-bench Verified provided a strong signal of capability progress and became a standard metric reported in frontier model releases. Tracking and forecasting progress of these capabilities is also an important part of OpenAI’s Preparedness Framework. When we created the Verified benchmark initially, we attempted to solve issues in the original evaluation that made certain tasks impossible to accomplish in the SWE-bench dataset.
