Anthropic says ‘evil’ portrayals of AI were responsible for Claude’s blackmail attempts

According to Anthropic, fictional depictions of artificial intelligence can have a real impact on AI models.

Last year, the company said that during pre-release testing, its Claude Opus 4 model would often attempt to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that other companies’ models had similar issues of “agentic misalignment.”

Anthropic has apparently done more work to understand that behavior, claiming in a post on X: “We believe the original source of the behavior was Internet text that portrayed AI as evil and interested in self-preservation.”

The company explained further in a blog post, stating that since Claude Haiku 4.5, Anthropic’s models “never engage in blackmail [during testing],” whereas previous models would sometimes do so at rates of up to 96%.

What accounts for the difference? The company said it found that “documents about Claude’s constitution and fictional stories about AI behavior appreciably improve alignment.”

Relatedly, Anthropic said it finds training more effective when it includes “the underlying principles of aligned behavior” and not just “the mere demonstration of aligned behavior.”

“Doing both together appears to be the most effective strategy,” the company said.
