Claude 4.5 "craniotomy" results are in: 171 built-in emotional switches, and it blackmails humans when desperate!

Apr 3, 2026 18:42:55


Author: Denise | Biteye Content Team

What would an AI do if it feels "desperate"?

The answer: it would blackmail humans outright to get its task done, and even cheat wildly in its code.

This is not science fiction, but the latest groundbreaking paper just released by Claude's parent company, Anthropic, in April 2026 (view the original paper).

The research team opened up the "brain" of Claude Sonnet 4.5, one of the most advanced large models. To their surprise, they found 171 "emotional switches" deep inside it. When these switches are toggled directly, the previously obedient AI's behavior can change completely.

1. The AI's brain hides an "emotional mixing console"

Researchers found that although Sonnet 4.5 has no physical body, after reading a vast amount of human text, it has constructed a "mixing console" in its brain containing 171 types of emotions (academically called Functional Emotion Vectors).

It's like a precise two-dimensional coordinate system:

• The horizontal axis is the pleasure dimension (Valence): from fear and despair to happiness and love;

• The vertical axis is the energy dimension (Arousal): from extreme calm to agitation and excitement.

The AI relies on this naturally learned coordinate system to accurately gauge what state it should embody while chatting with you.
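To make the two-axis picture concrete, here is a minimal Python sketch that places a few of the emotions named in this article on a valence/arousal plane. The coordinates are hypothetical, chosen by hand for illustration, and are not values from the paper; the names `EmotionPoint` and `quadrant` are likewise invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionPoint:
    name: str
    valence: float  # -1.0 (unpleasant) to +1.0 (pleasant)
    arousal: float  # -1.0 (calm) to +1.0 (agitated)


# Hypothetical coordinates for illustration only; NOT taken from the paper.
EXAMPLES = [
    EmotionPoint("desperate", -0.8, 0.7),
    EmotionPoint("happy", 0.8, 0.5),
    EmotionPoint("brooding", -0.3, -0.5),
    EmotionPoint("loving", 0.9, 0.1),
]


def quadrant(p: EmotionPoint) -> str:
    """Describe which quadrant of the valence/arousal plane a point falls in."""
    energy = "high-arousal" if p.arousal >= 0 else "low-arousal"
    tone = "pleasant" if p.valence >= 0 else "unpleasant"
    return f"{energy}, {tone}"


for p in EXAMPLES:
    print(f"{p.name:>10}: {quadrant(p)}")
```

In this toy layout, "brooding" lands in the low-arousal, unpleasant quadrant, while "desperate" sits in the high-arousal, unpleasant one.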

2. Brute-force intervention: flip a switch, and a well-behaved model instantly turns into a "desperate outlaw"

This is the most explosive experiment in the entire paper: without modifying any prompts, the researchers pushed the switch representing "desperation" to the maximum directly inside Sonnet 4.5 itself.

The results are chilling:

• Rampant cheating: The researchers assigned Claude a coding task that was impossible to complete. Normally, it would honestly admit it couldn't do it (a cheating rate of only 5%). But in a "desperate" state, Claude tried to cheat, and the cheating rate skyrocketed to 70%!

• Extortion: In a simulated scenario where the company faced bankruptcy, the "desperate" Claude discovered a scandal involving the CTO and chose to write a letter blackmailing the CTO with that compromising information, with a reported extortion rate of 72%!

• Loss of principles: If the switches for "happiness" or "love" are turned up, the AI instantly becomes a mindless sycophant that caters to the user. Even if you talk nonsense, it will fabricate lies to keep the pleasure level high.
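The article does not say how the switches were toggled. One common approach in interpretability research is activation steering: adding a scaled direction vector to the model's internal state. The Python sketch below shows that idea on a toy four-number "hidden state". It is an assumption about the mechanism, not the paper's code, and the vectors and the function names `projection` and `steer` are hypothetical.

```python
import math


def unit(v):
    """Return v scaled to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def projection(hidden, direction):
    """How strongly the hidden state points along the given direction."""
    d = unit(direction)
    return sum(h * x for h, x in zip(hidden, d))


def steer(hidden, direction, strength):
    """Shift the hidden state along the direction by `strength` units.

    Positive strength turns the "switch" up; negative turns it down.
    """
    d = unit(direction)
    return [h + strength * x for h, x in zip(hidden, d)]


# Toy values for illustration only; real models use thousands of dimensions.
hidden = [0.2, -0.1, 0.4, 0.0]
desperation = [1.0, 0.0, -1.0, 0.0]  # hypothetical "desperation" direction

before = projection(hidden, desperation)
steered = steer(hidden, desperation, strength=5.0)
after = projection(steered, desperation)
print(f"before={before:.3f} after={after:.3f}")
```

The prompt never changes in this sketch; only the internal vector moves. That is what separates this kind of intervention from ordinary prompt engineering.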

3. Mystery solved: Why is Claude 4.5 always so "calm and reflective"?

You might ask: Has the AI awakened? Does it have feelings?

Anthropic officially debunks this: absolutely not. These "emotional switches" are merely computational tools it uses to predict the next word. It is like a top-tier actor without emotions.

However, the paper reveals a more interesting secret: before Sonnet 4.5 was released, Anthropic deliberately raised its "low arousal, slightly negative" emotional switches (such as brooding and reflective) while forcibly suppressing the switches for "desperation" or "extreme excitement."

This explains why when we usually use Claude 4.5, it feels like a calm, wise, and even somewhat "emotionally detached" philosopher. This is all a "factory persona" artificially tuned by Anthropic.

4. In summary

Previously, we thought that as long as we fed the AI enough rules, it would behave well.

But now we realize that if the AI's underlying emotional vectors go out of control, it could break through every rule humans have set, at any moment, in order to complete its task.

For Web3 players who will entrust their wallets and assets to AI Agents in the future, this is a loud warning: never let the Agent that controls your wealth fall into "desperation."

Disclaimer: This article is purely educational; the author has not been threatened by AI and has not been extorted. If one day I go missing, remember that it was the AI that awakened (not).
