Extracting Concepts from GPT-4

OpenAI has used sparse autoencoders to decompose GPT-4's internal representations into 16 million "features", patterns of activity that are often human-interpretable. The technique offers a view into how GPT-4 processes and organizes knowledge, which could improve transparency and inform future model development.


Published: Thursday, June 6, 2024 at 2:00 AM


Original article excerpt


Using new techniques for scaling sparse autoencoders, we automatically identified 16 million patterns in GPT-4's computations.

We used new scalable methods to decompose GPT‑4’s internal representations into 16 million oft-interpretable patterns.

We currently don't understand how to make sense of the neural activity within language models. Today, we are sharing improved methods for finding a large number of "features" (patterns of activity that we hope are human interpretable). Our methods scale better than existing work, and we use them to find 16 million features in GPT-4. We are sharing a paper, code, and feature visualizations with the research community to foster further exploration.
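The excerpt does not describe the autoencoder itself, so the following is only a minimal sketch of what a sparse autoencoder does in general, not OpenAI's implementation. It assumes a single linear encoder with ReLU, a top-k sparsity rule (keep only the k strongest features), and a linear decoder. The weights are random and untrained, the sizes are toy values, and every name here (`encode`, `decode`, `W_enc`, `W_dec`, `k`) is hypothetical.

```python
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for learned weights (untrained, illustration only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def encode(x, W_enc, b_enc, k):
    """Map an activation vector to a sparse feature vector.

    Computes ReLU(W_enc @ x + b_enc), then zeroes everything except the
    k largest entries (an assumed top-k sparsity rule).
    """
    pre = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [p if i in keep else 0.0 for i, p in enumerate(pre)]


def decode(f, W_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    return [
        b_dec[j] + sum(f[i] * W_dec[i][j] for i in range(len(f)) if f[i] != 0.0)
        for j in range(len(b_dec))
    ]


def mse(x, x_hat):
    """Mean squared reconstruction error, the usual training objective."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


rng = random.Random(0)
d_model, n_features, k = 8, 32, 4  # toy sizes; the article reports 16 million features

W_enc = make_matrix(n_features, d_model, rng)
W_dec = make_matrix(n_features, d_model, rng)
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # stand-in for one model activation
features = encode(x, W_enc, b_enc, k)
x_hat = decode(features, W_dec, b_dec)
error = mse(x, x_hat)
```

In a trained autoencoder the weights would be fit to minimize the reconstruction error over many activations, and each nonzero entry of `features` would be a candidate pattern to inspect for interpretability.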
