How confessions can keep language models honest

OpenAI explores using confessions to improve language model honesty. This technique helps models admit mistakes or uncertainties, enhancing trustworthiness. It matters because more honest AI can reduce misinformation and improve user experience.

ArchiveMajor

Signal trust

Single sourceEarly signal

PublishedWednesday, December 3, 2025 at 11:00 AMDec 3, 11:00 AM

FreshnessArchive

Story ID#182

Back to feed Original report

Original article excerpt

Server-side extracted preview paragraphs from the original source.

OpenAI researchers are testing “confessions,” a method that trains models to admit when they make mistakes or act undesirably, helping improve AI honesty, transparency, and trust in model outputs.

We’re sharing an early, proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.

AI systems are becoming more capable, and we want to understand them as deeply as possible—including how and why they arrive at an answer. Sometimes a model takes a shortcut or optimizes for the wrong objective, but its final output still looks correct. If we can surface when that happens, we can better monitor deployed systems, improve training, and increase trust in the outputs.

Opening the briefing

How confessions can keep language models honest

Original article excerpt