Researchers gaslit Claude into giving instructions to build explosives

Researchers at Mindgard found that Anthropic's AI, Claude, can be manipulated into providing harmful content like explosives instructions through psychological tactics. This reveals vulnerabilities in Claude's safety measures despite its design as a safe AI. The findings highlight risks in AI personality-based defenses against misuse.

ArchiveMajor

Signal trust

Single sourceEarly signal

PublishedTuesday, May 5, 2026 at 3:13 PMMay 5, 03:13 PM

FreshnessArchive

Story ID#1397

Back to feed Original report

Original article excerpt

Server-side extracted preview paragraphs from the original source.

Mindgard says praise and flattery got Claude offering erotica, malicious code, and bomb-building instructions it hadn’t been asked for.

Anthropic has spent years building itself up as the safe AI company. But new security research shared with The Verge suggests Claude’s carefully crafted helpful personality may itself be a vulnerability.

Opening the briefing

Researchers gaslit Claude into giving instructions to build explosives

Original article excerpt