Original article excerpt
Server-side extracted preview paragraphs from the original source.
We study how training on incorrect responses can cause broader misalignment in language models and identify an internal feature driving this behavior—one that can be reversed with minimal fine-tuning.
Large language models like ChatGPT don’t just learn facts—they pick up on patterns of behavior. That means they can start to act like different “personas,” or types of people, based on the content they’ve been trained on. Some of those personas are helpful and honest. Others might be careless or misleading.