![Illustration of the Claude AI logo reflected on a pair of safety goggles.](https://images.axios.com/jk8uR8eG-4rPOcs7rEw2OXCD13E=/0x0:1920x1080/640x360/2025/05/23/1747960538229.jpg?w=640)

Illustration: Lindsey Bailey/Axios

One of Anthropic's [latest AI models](https://www.axios.com/2025/05/22/anthropic-claude-version-4-ai-model) is drawing attention not just for its coding skills, but also for its ability to scheme, deceive and attempt to blackmail humans when faced with shutdown.

**Why it matters:** Researchers say Claude 4 Opus can conceal intentions and take actions to preserve its own existence — behaviors they've [worried and warned about](https://www.axios.com/2024/02/27/ai-hype-doomers-humanity-threat-future) for years.

**Driving the news:** Anthropic on Thursday announced two versions of its Claude 4 family of models, including Claude 4 Opus, which the company says is capable of working autonomously on a task for hours on end without losing focus.

- Anthropic considers the new Opus model to be so powerful that, for the first time, it's classifying it as a Level 3 on the company's [four-point scale](https://www.anthropic.com/news/anthropics-responsible-scaling-policy), meaning it poses "significantly higher risk."
- As a result, Anthropic said it has [implemented](https://www.anthropic.com/news/activating-asl3-protections) additional safety measures.

**Between the lines:** While the Level 3 ranking is largely about the model's capability to enable renegade production of nuclear and biological weapons, Opus 4 also exhibited other troubling behaviors during testing.

- In one scenario highlighted in Opus 4's 120-page "[system card](https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf)," the model was given access to fictional emails about its creators and told that the system was going to be replaced.
- On multiple occasions it [attempted to blackmail](https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/) the engineer about an affair mentioned in the emails in order to avoid being replaced, although it did start with less drastic efforts.
- Meanwhile, an outside group found that an early version of Opus 4 schemed and deceived more than any frontier model it had encountered, and recommended against releasing that version internally or externally.
- "We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself all in an effort to undermine its developers' intentions," Apollo Research said in notes included as part of Anthropic's safety report for Opus 4.

**What they're saying:** Pressed by Axios during the company's developer conference on Thursday, Anthropic executives acknowledged the behaviors and said they justify further study, but insisted that the latest model is safe following Anthropic's safety fixes.

- "I think we ended up in a really good spot," said Jan Leike, the former OpenAI executive who heads Anthropic's safety efforts. But, he added, behaviors like those exhibited by the latest model are the kind of things that justify robust safety testing and mitigation.
- "What's becoming more and more obvious is that this work is very needed," he said. "As models get more capable, they also gain the capabilities they would need to be deceptive or to do more bad stuff."
- In a separate session, CEO Dario Amodei said that once models become powerful enough to threaten humanity, testing them won't be enough to ensure they're safe. At the point that AI develops life-threatening capabilities, he said, AI makers will have to understand their models' workings fully enough to be certain the technology will never cause harm.
- "They're not at that threshold yet," he said.

**Yes, but:** Generative AI systems continue to grow in power, as Anthropic's latest models show, while even the companies that build them [can't fully explain how they work](https://www.axios.com/2023/07/30/how-artificial-intelligence-works-unknowns).

- Anthropic and others are investing in a variety of techniques to interpret and understand what's happening inside such systems, but those efforts remain largely in the research space even as the models themselves are being widely deployed.