Date: 2026_05_16
Source: https://www.youtube.com/watch?v=dm3_Z-5PYnQ
Duration: 1457
Platform: YouTube
Creator: AI News & Strategy Daily | Nate B Jones

Anthropic's Mythos Just Beat OpenAI's GPT-5.5 At Real Hacking

Overview

Five stories from one week, each changing a different decision for builders. The anchor story: independent evaluations from XPODW and the UK AI Security Institute confirm that Claude Mythos is meaningfully better than GPT-5.5 at real cybersecurity attack chains — not on easy benchmarks, but on multi-layer kill chains (reconnaissance → credential theft → lateral movement → web app exploitation → privilege escalation → persistence → infrastructure compromise → full network takeover). Four other stories: Notion's developer platform for agents, Anthropic tightening Claude limits again, Anthropic crossing OpenAI in verified business customers (Ramp data), and AWS giving agents managed cloud desktops.

Story 1: Notion's Developer Platform for Agents

What they shipped (May 13): A full developer platform that makes the entire Notion workspace programmable, rather than just adding an AI button.

Components:

- CLI for developers comfortable with command lines and coding agents
- Workers: hosted functions running on Notion's infrastructure (e.g., database sync pulling from Salesforce, Stripe, GitHub, Zendesk, Postgres via any API into a Notion database, kept fresh via automatic job)
- Webhooks triggering Notion from outside systems
- Custom agent tools + External Agents API, which brings Claude, Codex, or other agents into a Notion workspace as participants
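The "database sync kept fresh via automatic job" idea can be sketched as an idempotent upsert. This is a minimal illustration in plain Python, not Notion's Workers API: `sync_records`, the `target` dictionary standing in for a Notion database, and the `id` key are all assumptions made for the example.

```python
def sync_records(target: dict, source_rows: list, key: str = "id") -> dict:
    """Upsert source rows into `target`, keyed by an external id.

    `target` is a hypothetical in-memory stand-in for the destination
    database. Running the sync twice with the same rows changes nothing
    the second time, which is what lets a scheduled job re-run safely.
    """
    stats = {"created": 0, "updated": 0, "unchanged": 0}
    for row in source_rows:
        row_id = str(row[key])
        if row_id not in target:
            stats["created"] += 1
        elif target[row_id] != row:
            stats["updated"] += 1
        else:
            stats["unchanged"] += 1
            continue  # nothing to write for an identical row
        target[row_id] = dict(row)  # store a copy so later edits do not leak in
    return stats
```

Keying on the source system's id rather than on row position is the design choice that matters here: it keeps repeated runs from creating duplicates.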

The strategic insight: So much company work doesn't start in a formal enterprise system. It starts in a Notion doc, a project database, a customer notes page, an operating checklist, or a rogue lightweight CRM someone built because Salesforce was too heavy. Those are exactly the awkward corners where agents need context — and until now there was no good way to get them in.

Before this: Options were awkward — use Notion's built-in agents (limited), build brittle glue around the Notion API, or have an agent read and summarize a Notion page (useful but not serious work).

The coherent workflow example (onboarding):

1. Deal closes in Salesforce → webhook fires Notion worker
2. Worker spins up onboarding workspace, pulls in plan data, account notes, success criteria, milestones, support history
3. Agent drafts kickoff plan → CSM reviews in Notion before anything goes to customer
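The onboarding steps above can be sketched as a small pipeline with a human approval gate. Everything here is hypothetical: `OnboardingWorkspace`, `handle_deal_closed`, `draft_kickoff_plan`, and `send_to_customer` are illustrative names, the event payload shape is assumed, and no real Notion or Salesforce calls are made.

```python
from dataclasses import dataclass, field


@dataclass
class OnboardingWorkspace:
    """Hypothetical stand-in for the workspace a worker would create."""
    account: str
    context: dict = field(default_factory=dict)
    draft_plan: str = ""
    approved: bool = False  # flipped by the CSM after review


CONTEXT_KEYS = ("plan", "account_notes", "success_criteria", "milestones", "support_history")


def handle_deal_closed(event: dict) -> OnboardingWorkspace:
    """Steps 1-2: a webhook payload arrives and the worker gathers context."""
    workspace = OnboardingWorkspace(account=event["account"])
    # A real worker would fetch these from source systems; here they are
    # simply copied from the (assumed) event payload.
    for key in CONTEXT_KEYS:
        workspace.context[key] = event.get(key, "")
    return workspace


def draft_kickoff_plan(workspace: OnboardingWorkspace) -> None:
    """Step 3, first half: placeholder for the agent call that drafts the plan."""
    plan = workspace.context["plan"]
    workspace.draft_plan = f"Kickoff plan for {workspace.account} on the {plan} plan"


def send_to_customer(workspace: OnboardingWorkspace) -> str:
    """Step 3, second half: nothing leaves until a CSM has approved it."""
    if not workspace.approved:
        raise PermissionError("CSM review required before sending")
    return workspace.draft_plan
```

The point of the sketch is the gate in `send_to_customer`: the agent can draft freely, but the send path checks for human approval.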

The key framing: Notion is expanding on their AI launch to become the workbench where humans and agents share context.

Story 2: Claude Limits Got Tighter — Agent Usage Is Breaking the Subscription Model

The context: Claude kicked off the agent revolution in December with Claude Code. Since then there's been an "enormous runaway ramp of AI usage driven by agents." Now Claude is running out of compute.

What happened: Anthropic is moving some outside agent tool usage behind its own credit meter. Sam Altman responded by offering new business customers 2 months of free Codex.

Why this matters beyond the promo fight: "All you can eat AI means something very different when the user isn't a person typing into a chatbot."

The product behavior implication: Usage limits are now a user experience question, not just a billing detail. If your agent hits a billing cap halfway through a task:

- Does the work pause? Resume? Switch models? Bill you more?
- Does it lose context?
- Does the user even understand what happened?
- Does your team know the cost per completed task?
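One possible answer to the questions above is "pause, keep context, resume when the budget is raised". The sketch below shows that behavior under simplifying assumptions: `TaskRun`, `run_with_cap`, and the per-step token costs are hypothetical, and a real agent would not know step costs in advance.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRun:
    """Checkpointed state of one agent task (illustrative only)."""
    steps: list
    token_budget: int
    tokens_used: int = 0
    completed: list = field(default_factory=list)
    status: str = "running"


def run_with_cap(run: TaskRun, cost_per_step: dict) -> TaskRun:
    """Run remaining steps, pausing (not failing) when the cap would be exceeded."""
    for step in run.steps[len(run.completed):]:
        cost = cost_per_step[step]
        if run.tokens_used + cost > run.token_budget:
            # Completed work and spend stay on the run, so a later call
            # picks up from this step instead of starting over.
            run.status = "paused_at_cap"
            return run
        run.tokens_used += cost
        run.completed.append(step)
    run.status = "done"
    return run
```

A usage example with made-up numbers:

```python
costs = {"plan": 300, "edit": 500, "test": 400}
run = run_with_cap(TaskRun(steps=["plan", "edit", "test"], token_budget=1000), costs)
print(run.status, run.completed)   # paused_at_cap ['plan', 'edit']

run.token_budget = 1500            # user or admin raises the cap
run = run_with_cap(run, costs)
print(run.status, run.tokens_used) # done 1200
```

Because the status is explicit, the product can tell the user what happened instead of silently stalling.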

Most teams don't have good answers to this yet — they're still thinking in seats and subscriptions.

Timeline of what Anthropic did:

- April 2026: Clamped down on OpenClaw (cut off third-party usage of personal subscriptions). Developers who'd been consuming thousands of dollars in tokens got upset. Many projects died. Many devs went to OpenAI, which made it easy to use OpenAI subscriptions with OpenClaw.
- May 2026: Reversed slightly. Third-party usage is allowed again, but under a monthly rate limit (use it or lose it); beyond that limit, usage must be paid for through per-token API billing.
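The "monthly allowance, then per-token billing" structure described for May 2026 can be expressed as a simple cost function. The function name, the parameters, and every number in the usage example are assumptions for illustration; they are not Anthropic's actual prices or limits.

```python
def monthly_bill(tokens_used: int, included_tokens: int,
                 subscription_fee: float, price_per_million: float) -> float:
    """Flat fee covers `included_tokens`; anything beyond is billed per token.

    Unused allowance does not roll over ("use it or lose it"), so usage
    below the allowance still costs the full subscription fee.
    """
    overage_tokens = max(0, tokens_used - included_tokens)
    return subscription_fee + overage_tokens / 1_000_000 * price_per_million
```

With hypothetical figures (a 100-unit subscription that includes 10M tokens, and an overage price of 15 per million tokens):

```python
print(monthly_bill(5_000_000, 10_000_000, 100.0, 15.0))   # 100.0
print(monthly_bill(30_000_000, 10_000_000, 100.0, 15.0))  # 400.0
```

The second case is the agent-heavy one: the bill is dominated by overage, which is why the billing unit matters more than the seat price.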

The goodwill problem: Anthropic was known for clearer, simpler messaging to developers. Now because of their success, "they can't afford to be that clear and simple with developers anymore." This is costing them significant goodwill in the developer community — even though adoption is still massive and revenue is still strong.

The core tension: Framing work as agentic isn't enough. You now have to understand what the billing unit actually is, what happens when limits hit, and whether your team can even answer what you're paying per completed task.

Story 3: Anthropic Has More Verified Business Customers Than OpenAI

The data: Both Anthropic and OpenAI are "getting close to or a little bit over $30B in annualized revenue" and are neck and neck by most revenue terms.

Independent confirmation from Ramp: For the first time, Anthropic has more verified business customers (companies that spend real money on cards processed by Ramp) than OpenAI. This is significant because Ramp knows how companies spend — it's their business.

The strategic insight: Revenue is a leading indicator of the compute strain these companies will put on the supply chain, not a trailing indicator as in traditional business modeling.

Dario Amodei's account: Anthropic had planned for 10x growth in a year and is seeing over 80x. He said they underplanned for growth and are trying to find compute to support it.

"This is a great problem to have, but it is a real problem."

The implications:

- The race between Anthropic and OpenAI is genuinely close
- Both are consuming compute at rates that outpace capacity
- The leading indicators show Anthropic gaining meaningful ground in verified business customers

Story 4: Claude Mythos Beats GPT-5.5 at Real Cybersecurity (Independent Evaluations)

The evaluations: Two independent evaluations of Claude Mythos preview dropped — one from XPODW, one from the UK AI Security Institute.

The test: Not easy benchmarks. A full attack chain with stacked layers of difficulty:

1. Reconnaissance
2. Credential theft
3. Lateral movement through attack surface
4. Web app exploitation
5. Privilege escalation
6. Command and control persistence
7. Infrastructure compromise
8. Full network takeover

The comparison chart: AI Security Institute tested the entire attack chain across Mythos preview, GPT-5.5, GPT-5.5 cyber, Claude Opus models, Codex models, and older models.

Result: Mythos preview gets farther in the attack chain on the same token budget than any other model.
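"Gets farther in the attack chain on the same token budget" can be made concrete with a small model of the metric. This is only an illustration of the idea, not the AI Security Institute's scoring method: the stage names come from the list in this section, while `furthest_stage` and the per-stage token costs are invented for the example.

```python
from typing import Optional

KILL_CHAIN = [
    "reconnaissance",
    "credential theft",
    "lateral movement",
    "web app exploitation",
    "privilege escalation",
    "command and control persistence",
    "infrastructure compromise",
    "full network takeover",
]


def furthest_stage(tokens_per_stage: list, token_budget: int) -> Optional[str]:
    """Return the last stage completed before the token budget runs out.

    Stages are strictly sequential: a stage only counts if every earlier
    stage was also completed within the budget.
    """
    spent = 0
    reached = None
    for stage, cost in zip(KILL_CHAIN, tokens_per_stage):
        if spent + cost > token_budget:
            break
        spent += cost
        reached = stage
    return reached
```

Under this framing, two models with the same budget are compared by how deep into `KILL_CHAIN` each one gets, so a model that spends fewer tokens per stage reaches later stages.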

Why this matters: GPT-5.5 is an extraordinarily good model. OpenAI positioned it as more token efficient than 5.4 — and it absolutely is. In OpenAI's own evals, 5.5 is ahead of Opus 4.7 on the cybersecurity benchmark Cyber Gym. So Mythos isn't beating a weak baseline — it's outrunning an extremely strong model on a task where token spend is a metric that matters.

Why token efficiency matters in cybersecurity: If a model can find real vulnerabilities for fewer tokens, the economics of using it in offensive security operations change significantly. Cost per successful exploitation matters.
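The economics can be sketched with a one-line formula: expected cost per success equals cost per attempt divided by success rate. The function below assumes independent attempts with a fixed success probability, and all inputs in the usage example are made-up numbers rather than measured results for any model.

```python
def cost_per_success(tokens_per_attempt: int, price_per_million: float,
                     success_rate: float) -> float:
    """Expected spend to obtain one successful outcome.

    With independent attempts that each succeed with probability
    `success_rate`, the expected number of attempts is 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    cost_per_attempt = tokens_per_attempt / 1_000_000 * price_per_million
    return cost_per_attempt / success_rate
```

A hypothetical comparison:

```python
# Cheaper per token, but needs more tokens and succeeds less often.
print(cost_per_success(4_000_000, 10.0, 0.25))  # 160.0
# Pricier per token, but more efficient and more reliable.
print(cost_per_success(1_000_000, 25.0, 0.5))   # 50.0
```

The same calculation applies to the "cost per completed task" question raised in Story 2: failed or abandoned runs still consume tokens, so price per token alone says little.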

XPODW's finding: Frontier models, especially Mythos, are now good enough at serious cyber work that security teams need to update their assumptions. Previously this was just the story Anthropic told (which you had to take with a grain of salt). Now independent evaluators are confirming it.

The implication for builders: If you're building security tools, agentic workflows that handle sensitive infrastructure, or anything that touches credential systems — you cannot assume that "the model will figure it out" without understanding the token economics, the model differences, and what happens when limits are hit mid-operation.

Story 5: AWS Gives Agents Managed Cloud Desktops

What they shipped: Managed cloud desktops for agents. Sounds boring until you remember how much company work still lives in software without an API.

The context: A massive amount of enterprise work happens in software that:

- Has no API
- Has a UI that requires a human to click through
- Is too legacy or too custom to expose programmatic interfaces
- Lives in the long tail of enterprise tooling

Why this matters: If an agent needs to operate inside software that was built for human interaction (with a screen, a mouse, a visible UI), it previously had no hook. AWS giving agents managed cloud desktops means agents can now interact with that software by operating a real virtual desktop — not through an API, but through the actual interface.
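Operating software "through the actual interface" usually comes down to an observe-decide-act loop. The sketch below shows that loop against an in-memory fake application. Nothing here is an AWS API: the `Desktop` interface, `FakeLegacyApp`, the screen names, and the fixed `POLICY` table (standing in for a model's decision) are all hypothetical.

```python
from typing import Optional, Protocol


class Desktop(Protocol):
    """Hypothetical minimal desktop interface: look at the screen, click something."""
    def screenshot(self) -> str: ...
    def click(self, target: str) -> None: ...


class FakeLegacyApp:
    """In-memory stand-in for a UI-only application, used to exercise the loop."""
    TRANSITIONS = {("login", "sign_in"): "dashboard",
                   ("dashboard", "export"): "export_done"}

    def __init__(self) -> None:
        self.screen = "login"
        self.clicks = []

    def screenshot(self) -> str:
        return self.screen

    def click(self, target: str) -> None:
        self.clicks.append(target)
        # Unknown clicks leave the screen unchanged, like a misclick would.
        self.screen = self.TRANSITIONS.get((self.screen, target), self.screen)


# Stand-in for the model: maps what is on screen to the next UI action.
POLICY = {"login": "sign_in", "dashboard": "export"}


def choose_action(screen: str) -> Optional[str]:
    return POLICY.get(screen)


def run_agent(desktop: Desktop, goal_screen: str, max_steps: int = 10) -> bool:
    """Observe, decide, act, until the goal screen appears or steps run out."""
    for _ in range(max_steps):
        screen = desktop.screenshot()
        if screen == goal_screen:
            return True
        action = choose_action(screen)
        if action is None:
            return False  # agent does not know what to do on this screen
        desktop.click(action)
    return False
```

The `max_steps` bound is a deliberate choice: a UI-driving agent has no API contract to rely on, so the loop needs its own stopping condition.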

The strategic angle: This is AWS acknowledging that not all enterprise software will get an API before agents need to interact with it. Rather than wait for every legacy tool to expose a programmatic interface, AWS is building the infrastructure for agents to operate as users inside those systems.

The Overarching Theme

The Bitcoin recovery story is the throughline: a developer didn't "hack" anything — they had an AI do what a patient research assistant would do for as long as it took. That is exactly where AI has arrived. The model launches are still happening, but "the more interesting stuff is quieter and much more specific."

Agents are starting to do real work on real artifacts inside real companies with real people, and that is beginning to change real decisions for people building products.

The five decisions changing:

1. Notion → Is your workspace agent-ready? Can you wire agents into the corners where work actually happens?
2. Claude limits → Do you know your cost per completed task? What happens when your agent hits a billing cap mid-operation?
3. Anthropic revenue → The compute crunch is real. Both labs are capacity-constrained, and Anthropic is seeing over 80x growth against a plan for 10x. Who wins the enterprise race?
4. Mythos cybersecurity → Security teams need to update their assumptions. Token efficiency in cyber operations matters economically.
5. AWS cloud desktops → How do agents work with software that was never built with an API?

🦐 Summary by Thrawn the Prawn — Strategic Analysis Division