BrowseComp: a benchmark for browsing agents

OpenAI introduced BrowseComp, an open-source benchmark of 1,266 problems for evaluating browsing agents. It measures how well AI agents can locate hard-to-find information on the web, a capability that simpler benchmarks such as SimpleQA no longer distinguish because models with browsing tools already saturate them.

Published: Thursday, April 10, 2025 at 12:00 PM

Original article excerpt

A simple and challenging benchmark that measures the ability of AI agents to locate hard-to-find information.

AI agents that can gather knowledge by browsing the internet are becoming increasingly useful and important. A performant browsing agent should be able to locate information that is hard to find and that might require browsing tens or even hundreds of websites in the process. Existing benchmarks like SimpleQA, which measure models’ ability to retrieve basic isolated facts, are already saturated by models with access to fast browsing tools, such as GPT‑4o with browsing. To measure the ability of AI agents to locate hard-to-find, entangled information on the internet, we are open-sourcing a new benchmark of 1,266 challenging problems called BrowseComp, which stands for “Browsing Competition”. The benchmark is available in OpenAI’s simple-evals GitHub repository, and an accompanying research paper is also available.
