BrowseComp: a benchmark for browsing agents.
A simple and challenging benchmark that measures the ability of AI agents to locate hard-to-find information.
AI agents that can gather knowledge by browsing the internet are becoming increasingly useful and important. A performant browsing agent should be able to locate information that is hard to find, which might require browsing tens or even hundreds of websites in the process. Existing benchmarks like SimpleQA, which measure models’ ability to retrieve basic isolated facts, are already saturated by models with access to fast browsing tools, such as GPT‑4o with browsing. To measure the ability of AI agents to locate hard-to-find, entangled information on the internet, we are open-sourcing a new benchmark of 1,266 challenging problems called BrowseComp, which stands for “Browsing Competition”. The benchmark is available in OpenAI’s simple evals GitHub repository, and you can read our research paper for further details.