Using new techniques for scaling sparse autoencoders, we automatically identified 16 million patterns in GPT-4's computations.
We used new scalable methods to decompose GPT‑4’s internal representations into 16 million oft-interpretable patterns.
We currently don't understand how to make sense of the neural activity within language models. Today, we are sharing improved methods for finding a large number of "features": patterns of activity that we hope are human interpretable. Our methods scale better than existing work, and we use them to find 16 million features in GPT‑4. We are sharing a paper, code, and feature visualizations with the research community to foster further exploration.
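To make the idea of decomposing activations into "features" concrete, here is a minimal sketch of a sparse autoencoder's forward pass. It is an illustration under stated assumptions, not the released code: the weights are random and untrained, the encoder is tied to the decoder, sparsity is enforced by keeping only the k largest activations (one common choice), and the names `make_sae`, `encode`, and `decode` are hypothetical.

```python
import random


def make_sae(d_model, n_features, seed=0):
    """Build a toy, untrained sparse autoencoder (hypothetical helper)."""
    rng = random.Random(seed)
    # One direction in activation space per feature. Assumption: the encoder
    # reuses these same directions (tied weights) to keep the sketch short.
    dec = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
    return {"dec": dec, "bias": [0.0] * d_model}


def encode(sae, x, k):
    """Return {feature_index: activation} for at most k active features."""
    centered = [xi - bi for xi, bi in zip(x, sae["bias"])]
    pre = [sum(w * c for w, c in zip(row, centered)) for row in sae["dec"]]
    # Keep the k largest pre-activations; everything else is treated as zero.
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    # ReLU: drop any selected feature whose activation is not positive.
    return {i: pre[i] for i in top if pre[i] > 0}


def decode(sae, acts):
    """Reconstruct the input as a weighted sum of active feature directions."""
    out = list(sae["bias"])
    for i, a in acts.items():
        for j, w in enumerate(sae["dec"][i]):
            out[j] += a * w
    return out


# Toy usage: an 8-dimensional "activation" explained by at most 4 of 32 features.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(8)]
sae = make_sae(d_model=8, n_features=32)
acts = encode(sae, x, k=4)
x_hat = decode(sae, acts)
```

In a trained autoencoder the weights would be learned so that `x_hat` closely matches `x`; each index in `acts` would then be a candidate feature to inspect for interpretability. With the random weights used here, the reconstruction is not expected to be accurate.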