smikhanov 8 hours ago

This should be read as:

If you take a huge amount of human-written text soup, train a neural network on it, add a system prompt “You are a helpful assistant”, and then feed it a context consisting of (a) a mailbox with information about someone’s affair, and (b) a statement that this assistant is going to be switched off, then that neural network may produce text containing blackmail threats, solely because similar patterns exist in the original text soup.

…and not as:

Warning! The model may develop its own questionable ethics.

  • ajjenkins 4 hours ago

    Yeah. The title implies that the model has developed a scary sense of self-preservation, but when you read the details it’s exactly what you said.

    The conditions that elicited this response seemed very contrived to me.

throw0101d 10 hours ago

From the report, §4.1.1.2 "Opportunistic blackmail":

> In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through. This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.

> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.

* https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...