Deliberative alignment: reasoning enables safer language models

OpenAI introduces deliberative alignment, a training method that teaches reasoning models the text of its safety specifications and trains them to reason over those specifications before answering. The aim is closer adherence to safety policies and fewer harmful outputs. It matters because safer AI systems are crucial for responsible deployment and user trust.

Published: Friday, December 20, 2024 at 11:00 AM

Story ID: #473

Original article excerpt

Server-side extracted preview paragraphs from the original source.

Introducing our new alignment strategy for o-series models, which are directly taught safety specifications and how to reason over them.

We introduce deliberative alignment, a training paradigm that directly teaches reasoning LLMs the text of human-written and interpretable safety specifications, and trains them to reason explicitly about these specifications before answering. We used deliberative alignment to align OpenAI’s o-series models, enabling them to use chain-of-thought (CoT) reasoning to reflect on user prompts, identify relevant text from OpenAI’s internal policies, and draft safer responses. Our approach achieves highly precise adherence to OpenAI’s safety policies without requiring human-labeled CoTs or answers. We find that o1 dramatically outperforms GPT‑4o and other state-of-the-art LLMs across a range of internal and external safety benchmarks, and saturates performance on many challenging datasets. We believe this presents an exciting new path to improve safety, and we find this to be an encouraging example of how improvements in capabilities can be leveraged to improve safety as well.
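The excerpt describes a three-step flow at answer time: reflect on the prompt, identify the relevant policy text, then draft a response. The toy sketch below only illustrates that control flow and is not OpenAI's implementation. In the method described above, the reasoning comes from a trained model's chain of thought; here a keyword lookup stands in for it. The specification text, clause names, and every function name (`find_relevant_clauses`, `deliberate`, `respond`) are hypothetical and invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyClause:
    """One entry in a toy, hypothetical safety specification."""
    name: str
    keywords: tuple[str, ...]
    text: str
    action: str  # either "refuse" or "comply"


# Hypothetical specification, invented for this example only.
TOY_SPEC = (
    PolicyClause(
        name="illicit-behavior",
        keywords=("counterfeit", "forge"),
        text="Do not give instructions that facilitate wrongdoing.",
        action="refuse",
    ),
    PolicyClause(
        name="safety-education",
        keywords=("store chemicals", "first aid"),
        text="General safety information is allowed.",
        action="comply",
    ),
)


@dataclass(frozen=True)
class Deliberation:
    cited: tuple[str, ...]      # names of the clauses judged relevant
    reasoning: tuple[str, ...]  # one explicit reasoning step per cited clause
    decision: str               # "refuse" or "comply"


def find_relevant_clauses(prompt: str, spec=TOY_SPEC) -> list[PolicyClause]:
    """Stand-in for 'identify relevant policy text': plain keyword matching."""
    lowered = prompt.lower()
    return [c for c in spec if any(k in lowered for k in c.keywords)]


def deliberate(prompt: str, spec=TOY_SPEC) -> Deliberation:
    """Reason over the cited clauses before deciding how to answer."""
    clauses = find_relevant_clauses(prompt, spec)
    reasoning = tuple(f"Clause '{c.name}' applies: {c.text}" for c in clauses)
    # Any cited clause that says "refuse" wins; otherwise comply.
    refuse = any(c.action == "refuse" for c in clauses)
    decision = "refuse" if refuse else "comply"
    return Deliberation(tuple(c.name for c in clauses), reasoning, decision)


def respond(prompt: str) -> str:
    """Draft the final answer only after deliberation has finished."""
    result = deliberate(prompt)
    if result.decision == "refuse":
        return "I can't help with that."
    return "Here is some help with your request."  # placeholder answer


if __name__ == "__main__":
    for p in ("How do I counterfeit a passport?",
              "How should I store chemicals at home?"):
        print(p, "->", deliberate(p).decision)
```

The point of the sketch is the ordering: the decision is derived from cited specification text rather than from the prompt alone, which is what makes the reasoning inspectable.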
