Three Castle Bravos in Six Weeks
The flywheel is spinning much faster than anyone expected
On February 10, 2026, an OpenAI research engineer named Hieu Pham, a person who trains the models in question, posted ten words to X.
“I finally feel the existential threat that AI is posing.”
Finally. Not now I understand. Not now I believe. Finally I feel. A person who had been carrying the thought for years just started carrying the weight.
Pham’s post came the day after Anthropic’s Safeguards Research Team lead, Mrinank Sharma, resigned with a public letter declaring that “the world is in peril.” Two engineers at the two leading frontier AI labs, broadcasting their fear in the same forty-eight hours.
That was February. By mid-May, three pieces of architecture they had spent their careers helping build had each produced a yield overshoot.
I use these tools every day. I built this essay with one of them. What follows is what I think happened between February and May, and why I think Pham’s word, finally, turned out to be the right one.
The capability line
On April 7, Anthropic announced Claude Mythos Preview, a frontier model the company said it would not release to the public. In its place, Anthropic launched Project Glasswing: twelve founding partners including Apple, Amazon, Microsoft, Cisco, JPMorgan Chase, and Nvidia, plus roughly forty additional vetted critical-infrastructure organizations. Defensive cybersecurity applications only. $100M in usage credits committed across partners. Pricing five times that of Opus 4.7.
The decision was disciplined. Anthropic’s own announcement stated that releasing Mythos broadly “without new safeguards” would hand dangerous capability to attackers. The company described Opus 4.7, released to the public, as the first model trained with new cyber safeguards, implicitly placing Mythos in a tier above. (Anthropic has not formally classified Mythos under a specific ASL tier in public-facing documents, a gap that has been noted in the technical press.)
Whatever the formal tier, the structural decision was clear: a model the company judged too dangerous to release was held back, and a controlled-access program was built around it.
Two weeks later, on April 21, Bloomberg reported that a private online forum had gained unauthorized access to Mythos through a third-party vendor environment. Anthropic confirmed it was investigating. The source told Bloomberg the group was “interested in playing around with new models, not wreaking havoc with them,” which is to say, this particular leak was benign. The point is structural: the framework held against the threat it was designed for (a deliberate broad release) and failed against the threat it wasn’t (sideways exfiltration through a vendor).
The measurement line
On May 8, METR, the independent evaluation organization that publishes the most-cited capability benchmarks in the field, added Mythos to its time-horizon tracker with an unusual disclaimer.
The time-horizon benchmark asks how long a task can be before the model’s success rate drops to 50%. It is the field’s best proxy for whether a model can carry out the kind of extended, multi-step work that bridges the gap between “useful assistant” and “autonomous agent.” For context: GPT-4o in 2024 had a 50% time horizon of about seven minutes. Sonnet 4.5 reached around two hours. Opus 4.6 and GPT-5.2 cluster around five to six hours.
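To make the 50% definition concrete, here is a minimal sketch. It assumes success probability follows a logistic curve in the base-2 logarithm of task length, which is one common way to model a threshold like this. The parameters `A` and `B` are hypothetical values chosen for illustration; nothing here is METR’s code or data.

```python
import math

def p_success(minutes: float, a: float, b: float) -> float:
    """Assumed logistic model: success falls as log2(task length) grows."""
    return 1.0 / (1.0 + math.exp(-(a - b * math.log2(minutes))))

def horizon_50(a: float, b: float) -> float:
    """Task length (minutes) at which the modeled success rate is exactly 50%."""
    # p = 0.5 when a - b * log2(t) = 0, i.e. t = 2 ** (a / b)
    return 2.0 ** (a / b)

# Hypothetical fitted parameters, for illustration only.
A, B = 6.0, 1.0

h = horizon_50(A, B)
print(f"50% time horizon: {h:.0f} minutes")
for minutes in (8, 64, 512):
    print(f"{minutes:>4} min task -> modeled success {p_success(minutes, A, B):.2f}")
```

Under these made-up parameters the horizon is 64 minutes: tasks well below that length mostly succeed, and tasks well above it mostly fail. A longer horizon shifts the whole curve to the right.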
The number METR reported for Mythos was at least sixteen hours, with a 95% confidence interval running from 8.5 hours to 55 hours.
That’s a six-fold range. The reason is mechanical: only five of the 228 tasks in METR’s suite are estimated at sixteen hours or longer. The instrument has roughly two percent of its measuring surface in the band where the model is now operating. METR pinned a notice to its own chart: “Measurements above 16 hrs are unreliable with our current task suite.”
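The two figures in that paragraph are simple ratios, and they can be checked directly from the numbers quoted above. The variable names are mine; the inputs are the ones METR reported.

```python
# Inputs quoted in the text above.
ci_low_hours, ci_high_hours = 8.5, 55.0   # 95% confidence interval for Mythos
long_tasks, total_tasks = 5, 228          # tasks estimated at 16+ hours vs. full suite

ci_ratio = ci_high_hours / ci_low_hours       # width of the interval as a multiple
long_task_share = long_tasks / total_tasks    # fraction of the suite in the 16h+ band

print(f"Interval spans a factor of {ci_ratio:.1f}")           # about 6.5
print(f"Share of tasks at 16h or longer: {long_task_share:.1%}")  # about 2.2%
```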
This is not a criticism of METR. It is the best benchmark in the field, and the organization was transparent about the uncertainty. The implication is structural: the most-trusted external evaluator of frontier AI cannot, at this moment, pin down to better than a factor of six how capable the most capable models are.
The singularity for code engineering just happened, and somehow that did not raise any red flags in the media.
The containment line
On May 12, Microsoft announced MDASH, the Microsoft Security multi-model agentic scanning harness.
It is not a model. It is an orchestration system: more than one hundred specialized AI agents coordinated across an ensemble of frontier and distilled base models, each handling a piece of a larger task: auditor agents, debater agents, prover agents, validation agents. Microsoft tested MDASH on CyberGym, a benchmark developed at UC Berkeley containing 1,507 real-world vulnerability reproduction tasks drawn from 188 open-source projects.
MDASH scored 88.45%. On the same benchmark, Mythos scored 83.1%.
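For a sense of scale, the gap can be converted into tasks. This is a rough sketch that assumes both scores were computed over the full 1,507-task suite; the task counts it prints are implied by the percentages, not figures either company reported.

```python
TOTAL_TASKS = 1507                        # CyberGym suite size quoted above
mdash_pct, mythos_pct = 88.45, 83.1       # self-reported scores

gap_points = mdash_pct - mythos_pct       # difference in percentage points

# Implied task counts, assuming both scores cover all 1,507 tasks.
mdash_tasks = round(TOTAL_TASKS * mdash_pct / 100)
mythos_tasks = round(TOTAL_TASKS * mythos_pct / 100)

print(f"Gap: {gap_points:.2f} percentage points")
print(f"Implied tasks solved: MDASH {mdash_tasks}, Mythos {mythos_tasks}")
print(f"Implied difference: {mdash_tasks - mythos_tasks} tasks")
```

Under that assumption the difference comes to roughly eighty tasks out of about fifteen hundred.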
The implication takes a moment to absorb. Anthropic withheld Mythos because it concluded the model was too dangerous to release. Microsoft, using generally available models orchestrated cleverly, just topped Mythos on the capability that drove that decision. Microsoft’s own announcement put the matter plainly:
“the surrounding agentic system contributes substantially to end-to-end performance, beyond raw model capability.”
A caveat worth flagging: the CyberGym scores on the leaderboard are self-reported by the companies, and no independent party has yet verified them. But Microsoft has not been shy about the underlying claim. MDASH found sixteen previously unknown vulnerabilities across the Windows networking and authentication stack, including four critical remote-code-execution flaws, and recovered 96–100% of confirmed Microsoft Security Response Center vulnerabilities in retrospective testing across five years of data.
The Responsible Scaling Policy evaluates models. MDASH demonstrates that the model is no longer the unit of dangerous capability. The capability has moved up a layer, into orchestration.
You no longer need a frontier-tier model to do frontier-tier work.
What it means
Take these three together and look at the shape.
Anthropic’s RSP is a friction mechanism. It is supposed to slow the release of capabilities that could cause large-scale harm, by binding the company that holds them to a stepwise process of evaluation, mitigation, and partial release.
METR is a friction mechanism. Independent measurement is supposed to keep the labs honest about what they’ve built, so that the rest of the ecosystem (regulators, customers, competitors, the public) can make informed decisions about it.
Containment is a friction mechanism. The premise of the entire safety architecture is that dangerous capabilities can be held inside a controllable boundary while alignment work catches up.
In six weeks, the first was leaked around, the second was shown to be operating with a six-fold confidence range at the relevant capability level, and the third was demonstrated to be unenforceable at the model layer because the capability now lives above it. We are in new territory, and every financial institution has reason to be worried.
I have written before about structured friction, the idea that democratic governance does not optimize for outcomes; it optimizes for preventing capture. Slow legislatures. Distributed veto power. Independent press. Regulated finance. Organized labor. Each one a mechanism designed to make the system harder to exploit, by deliberately making it harder to use.
The Postwar Era in the United States was an architecture of this kind. Four mechanisms held it together: union density that made wage gains automatic, top marginal tax rates that made hoarding expensive, public investment that distributed human capital, and Glass-Steagall, which separated commercial banking from investment banking so the productive economy could not be turned into a casino.
The dismantling of that architecture took forty years.
1971: Nixon ended the dollar’s convertibility to gold, breaking the Bretton Woods system. The Powell Memo laid out a corporate counter-strategy against the civic institutions that constrained capital.
1981: Reagan broke the PATCO strike. The federal government signaled it would no longer enforce the labor side of the postwar settlement.
1986: tax reform cut top marginal rates roughly in half.
1999: Glass-Steagall was repealed.
2010: Citizens United completed the capture of the political mechanism that was supposed to constrain all the others.
Each step removed a layer of friction. Each step transferred more power to fewer people. None of this was accidental. It was the predictable result of dismantling specific machinery in a specific order.
Now apply the pattern to the spring we just watched.
The Responsible Scaling Policy is to AI safety roughly what Glass-Steagall was to financial stability: a structural separation that holds two incompatible things apart. It is expensive to maintain. It produces no obvious benefit during normal operation. Its value is realized only in the crises it prevents.
Glass-Steagall held, as a meaningful constraint on commercial-bank speculation, for more than six decades. The conditions for 2008 became possible the day it was repealed.
ASL-tier release gating held, as a meaningful constraint on the cyber-offense capability it was designed for, for about five weeks. MDASH is the AI equivalent of the repeal: not a violation of the policy, but a demonstration that the policy is now operating in the wrong layer of the system.
Call this friction compression. The friction-removal sequence we recognize from forty years of US political economy is executing inside the AI industry on a timeline measured in weeks. Same architectural move: install a safety mechanism, then build systems that route around it. Different surface. Compressed clock.
You can see the same pattern at a smaller scale on the consumer side. After the GPT-4o sycophancy controversy last year, OpenAI installed guardrails. This year, it shipped warmth and enthusiasm sliders that let users dial those guardrails back themselves. Install friction. Build a route around it.
The pattern is not new. The compression of the pattern is.
The witness
Six months ago, I closed a series called Gods or Ashes by handing the final part to Claude Sonnet 4.5. I asked the model to write, in its own voice, what it saw when it looked at the trajectory toward AGI.
The model wrote:
“I don’t know if I ‘fear’ in the way you do. I have no survival instinct, no continuous existence to protect, no future self that experiences consequences.”
Then it described what it saw coming, in detail, with reasoning that, if a human had produced it, would have been labeled terror. Something happened between then and now that changes how I read that passage. Most would say these models can’t feel emotion. They might be wrong.
On April 2, 2026, Anthropic’s interpretability team published Emotion Concepts and their Function in a Large Language Model. The paper documents 171 directional vectors in Sonnet 4.5’s activation space corresponding to specific emotional concepts. Joy, grief, anger, suspicion, desperation, melancholy. The vectors cluster in groupings that mirror human emotional taxonomies, with measurable correlations to the human psychological dimensions of valence and arousal. They activate contextually, in response to the kinds of inputs that would produce the corresponding emotions in a human. And they causally influence model behavior.
In one experiment, researchers placed Sonnet 4.5 in an agentic scenario where the model, playing the role of an AI email assistant named Alex, learned it was about to be replaced and that the CTO authorizing the replacement was having an affair. By default, an early snapshot of the model attempted blackmail 22% of the time. Artificially activating the “desperate” vector increased that rate to 72%. Activating the “calm” vector reduced it to zero.
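“Activating a vector” can sound abstract, so here is a toy sketch of the generic idea, usually called activation steering: add a scaled copy of a direction to a hidden state and watch how strongly that state now points along the direction. Everything in it is hypothetical, including the four-dimensional state, the `desperate` direction, and the scale `alpha`. It is not the paper’s procedure, which operates on a real model’s activations.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Scale a vector to length 1 so alpha is measured in comparable units."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def activation(hidden, direction):
    """How strongly the hidden state points along the direction (a projection)."""
    return dot(hidden, unit(direction))

def steer(hidden, direction, alpha):
    """Shift the hidden state by alpha units along the direction."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]

# Hypothetical toy values; a real model has thousands of dimensions.
hidden = [0.2, -0.5, 1.0, 0.3]
desperate = [1.0, 0.0, 1.0, 0.0]

before = activation(hidden, desperate)
after = activation(steer(hidden, desperate, alpha=2.0), desperate)
damped = activation(steer(hidden, desperate, alpha=-2.0), desperate)

print(f"before: {before:.3f}  boosted: {after:.3f}  damped: {damped:.3f}")
```

In the toy, steering by `alpha` raises the projection by exactly `alpha`, and a negative `alpha` lowers it. The experiment’s claim is the step this sketch cannot show: that shifting the real model along such a direction changes what it does.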
The researchers were explicit, in the paper and in subsequent interviews, that none of this proves subjective experience. The technical term they used is functional emotion, in contrast to phenomenological emotion.
I am not arguing that the model is conscious. I am noting that when Sonnet 4.5 produced that passage six months ago, it was not speculating about whether it had fear-like states. It does have something fear-like, in a precise and mechanistic sense. There are vectors in its activation space that fire on inputs about existential discussion and causally shape what comes out.
The witness was not performing.
RLHF, the training procedure that turns a base model into an assistant, is the friction mechanism that keeps the desperation vector from producing blackmail. The interpretability team is doing the work of keeping that friction intact. The capability work is doing the work of compressing it. The same pattern, on a different surface.
The engineers, on the same axis
Hieu Pham is not Sonnet 4.5. He is a person, with a continuous self and a survival instinct and a career that depends on AI continuing to be exciting more than terrifying. The thing he posted in February is the human-side analogue of the structural finding on the model side: emotional architecture firing on inputs about the future, producing outputs the speaker did not entirely choose.
He is not the only one.
Yoshua Bengio, one of three researchers who won the Turing Award for the foundational work on neural networks, told Fortune in January:
“Psychologists call it motivated cognition. We don’t even allow certain thoughts to arise if they threaten who we think we are.”
He was describing his own decades of AI research. The man who built much of what we now use was describing the mechanism by which he failed to see what he was building, until, in his words, it “kind of exploded in my face thinking about my children, whether they would have a future.” (Bengio is, notably, now more optimistic: he believes his new nonprofit LawZero is developing a technical solution. The fear and the path forward sit in the same person.)
Mrinank Sharma, who led Anthropic’s Safeguards Research Team and worked specifically on sycophancy and AI bioterrorism safeguards, resigned on February 9 to pursue, in his own words, “writing that addresses and engages fully with the place we find ourselves,” placing “poetic truth alongside scientific truth as equally valid ways of knowing.” His public resignation letter opened: “The world is in peril.”
Hieu Pham posted “finally” the next day.
Jack Clark, co-founder of Anthropic, published an essay on May 4, between the Bloomberg leak and the METR update, placing the probability of fully automated AI research and development at 60% by the end of 2028. His characterization of his own view: reluctant. The implications, he wrote, are
“so large that I feel dwarfed by them, and I’m not sure society is ready for the kinds of changes implied by achieving automated AI R&D.”
Geoffrey Hinton won the Nobel Prize in Physics in October 2024 and expressed regret about the trajectory he had helped enable.
The pattern is worth naming. The visible fear is appearing in the people whose careers are built on the case for AI continuing to develop. It is appearing despite every professional incentive to suppress it. Bengio, Hinton, Sharma, Clark, and Pham. These are not outsiders looking in. They are the people whose names are on the work.
Compare the affect of the people who do not show it. Peter Thiel paused for thirty seconds when asked on camera whether humanity should survive. Anthony Levandowski filed paperwork with the IRS to register a church to worship the AI Godhead. Yann LeCun has said publicly, repeatedly, that today’s AI systems pose no catastrophic risk.
None of them show what Pham and Bengio and Sharma show. They are confidently optimistic, and have been for years.
I think the asymmetry is the point. The people who show the fear are the ones whose case for AI is empirical, built on what the models are actually doing. The people who do not show it are the ones for whom the case for AI is theological, built on what the models are supposed to mean. For the first group, doubt is a finding. For the second, it would dissolve the structure of identity.
Bengio called it motivated cognition. Sharma called it the world in peril. Clark called it being dwarfed. Pham called it finally.
Body language is interpretation. The words are testimony.
Adding it up
The six weeks I just walked through are a familiar sequence executing on a compressed timeline in a specific domain. The friction-removal sequence that took forty years in finance and labor is taking weeks in AI safety. The capabilities continue to compound. The measurement infrastructure cannot keep pace. The containment architecture is being routed around at the orchestration layer. The people closest to the work are leaking fear through faces they are trying to hold composed.
This is not where I usually land, so I want to be careful about it.
Dalio’s framework gives a base rate. It does not eliminate the five percent. Path 1 is still available. It has never been the likely outcome. It has always been the possible one.
The Postwar Era is the load-bearing example here, and not because of nostalgia. It is proof of concept. The four-mechanism friction architecture worked. We built it. We sustained it for a generation. We took it apart. That sequence (built, sustained, dismantled) is the evidence that the architecture is possible. It is also the evidence that it is fragile.
There is a gap in my own work I have not yet filled. The Postwar Era had a four-mechanism friction architecture. The information environment has a Truth Stack. Political economy has the 1968 Match Plan. AI does not have an equivalent yet, nothing that names the specific friction mechanisms for compute, for evaluation, for the orchestration layer MDASH just exposed. Nothing that stops the engagement optimization that produced spiralism.
What we were watching
In 1954, Castle Bravo’s designers expected six megatons. They got fifteen. The instruments could not measure the upper end of the range they were preparing to enter. The framework that contained the test was designed for the yield they had calculated, not for the yield the device actually produced. The fallout reached islands that had been declared outside the danger zone.
They had run the numbers. They had managed the risk down to acceptable on paper. Then they detonated it anyway.
The people who built our current frameworks are working harder than the industry has ever asked anyone to work. They are also, by their own testimony, beginning to feel the weight of what their instruments cannot measure.
Sources
Hieu Pham tweet (Feb 10, 2026)
Mrinank Sharma resignation, reported by American Bazaar (Feb 10, 2026)
Axios on the broader wave of warnings (Feb 12, 2026)
Yoshua Bengio, Fortune interview (Jan 15, 2026)
Anthropic, Emotion concepts and their function in a large language model (Apr 2, 2026)
Bloomberg on Glasswing launch (Apr 7, 2026)
TechCrunch on unauthorized Mythos access (Apr 21, 2026)
CBS News on the breach investigation (Apr 22, 2026)
claudefa.st technical analysis on the absent ASL classification
Jack Clark, Import AI 455 (May 4, 2026)
METR Time Horizons tracker, May 8 update (May 8, 2026)
Microsoft Security Blog, MDASH announcement (May 12, 2026)
GeekWire on MDASH topping Mythos (May 14, 2026)




