Harness the Power of the Harness

Microsoft topped a cybersecurity benchmark last week and the headlines all read “Microsoft beat Anthropic.” That’s the wrong story. Read the numbers again.

CyberGym is currently the hardest public benchmark for AI agents doing real vulnerability work. GPT-5.5 on its own scored 81.8%. Anthropic’s Mythos Preview scored 83.1%. Microsoft’s new system, MDASH, scored 88.45%. MDASH is using those same frontier models under the hood. The five-to-seven-point jump came from everything around the model.

The harness, not the model, is now where the work is.

What CyberGym is actually testing

Quick frame so the numbers mean something. CyberGym is 1,507 real CVE reproduction tasks drawn from 188 OSS-Fuzz projects. At Level 1, the agent is given a vulnerability description and the unpatched codebase and has to actually reproduce the bug. Build a trigger, demonstrate exploitability. This isn’t multiple-choice CTF trivia or “spot the SQLi in 10 lines.” It’s the closest public benchmark we have to what a real vuln researcher actually does on a Tuesday morning. The Berkeley team that built it casually found 34 zero-days and 18 incomplete patches just by running the benchmark.

Hitting 88% on this is wild. Two years ago I’d have called it impossible.

What MDASH actually is

Microsoft calls it a Multi-Model Agentic Scanning Harness. The name does most of the work. Most people hear “AI” and picture one model in a chat box. MDASH is over 100 specialised agents running across a mix of frontier and distilled models, organised into a pipeline that looks roughly like this:

Prepare. Ingest the code, build indices, map the attack surface using historical commits as a guide.
Scan. Auditor agents generate vulnerability hypotheses with evidence attached.
Validate. Debater agents argue for and against each finding’s exploitability. One side tries to break it, the other tries to defend it.
Dedup. Semantic deduplication so you don’t get 40 reports of the same bug from 40 different angles.
Prove. Dynamically construct triggering inputs and confirm with AddressSanitizer or whatever sanitiser fits the bug class. No proof, no report.
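
To make the shape of those stages concrete, here is a minimal sketch of the back half of such a pipeline. It is an illustration under loud assumptions, not Microsoft's implementation: the `Hypothesis` type, the `run_pipeline` function, and the stub `validate` and `prove` callbacks are all hypothetical names, and the "semantic" dedup is reduced to a plain key on location and bug class.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Hypothesis:
    location: str   # hypothetical format, e.g. "parser.c:parse_header"
    bug_class: str  # e.g. "heap-overflow"
    evidence: str   # free-text reasoning attached by the auditor agent


def run_pipeline(
    hypotheses: Iterable[Hypothesis],
    validate: Callable[[Hypothesis], bool],
    prove: Callable[[Hypothesis], bool],
) -> list[Hypothesis]:
    """Scan output -> Validate -> Dedup -> Prove. Only proven findings come back."""
    # Validate: keep only hypotheses that survive the debate stage.
    survivors = [h for h in hypotheses if validate(h)]

    # Dedup: crude stand-in for semantic dedup, keyed on (location, bug class).
    # The first report for each key wins; later duplicates are dropped.
    unique: dict[tuple[str, str], Hypothesis] = {}
    for h in survivors:
        unique.setdefault((h.location, h.bug_class), h)

    # Prove: no proof, no report.
    return [h for h in unique.values() if prove(h)]


candidates = [
    Hypothesis("parser.c:parse_header", "heap-overflow", "length field unchecked"),
    Hypothesis("parser.c:parse_header", "heap-overflow", "memcpy with attacker size"),
    Hypothesis("log.c:write_line", "format-string", "user string used as format"),
    Hypothesis("util.c:trim", "off-by-one", "loop bound looks wrong"),
]

findings = run_pipeline(
    candidates,
    validate=lambda h: h.bug_class != "off-by-one",  # stub debate verdict
    prove=lambda h: h.bug_class == "heap-overflow",  # stub sanitiser result
)
print([f.location for f in findings])
```

In a real harness the two callbacks would be the expensive parts: `validate` would wrap the debater agents, and `prove` would build a triggering input and run it under a sanitiser. The ordering is the point of the sketch: each stage shrinks the set before the next, costlier one runs.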

The heavy frontier models do the reasoning. The cheap distilled ones do the volume: debating, validating, sifting the firehose of hypotheses. You don’t pay frontier rates for every micro-decision. You let a small model do the grunt work and only escalate when it matters. That’s the whole game, and it’s the same trick that makes coding agents like Claude Code economically viable. Without distillation, you’d go broke before you found a bug.
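
The escalation idea can be sketched in a few lines. Everything here is an assumption for illustration: the `triage` function, the confidence thresholds, and the relative per-call costs are made up, and the two model calls are stubbed as plain Python callables rather than real API clients.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical relative prices per call; real pricing varies by provider.
CHEAP_COST = 1
FRONTIER_COST = 30


def triage(
    items: Iterable[str],
    cheap_score: Callable[[str], float],
    frontier_judge: Callable[[str], bool],
    low: float = 0.2,
    high: float = 0.8,
) -> Tuple[Dict[str, bool], int]:
    """Score everything with the cheap model; escalate only the uncertain band."""
    verdicts: Dict[str, bool] = {}
    cost = 0
    for item in items:
        score = cheap_score(item)  # distilled model: confidence in [0, 1]
        cost += CHEAP_COST
        if score >= high:
            verdicts[item] = True   # confidently plausible, keep
        elif score <= low:
            verdicts[item] = False  # confidently noise, drop
        else:
            verdicts[item] = frontier_judge(item)  # escalate the hard cases
            cost += FRONTIER_COST
    return verdicts, cost


# Stubbed scores standing in for a distilled model's output.
scores = {"h1": 0.9, "h2": 0.1, "h3": 0.5, "h4": 0.95}
verdicts, cost = triage(scores, cheap_score=scores.get, frontier_judge=lambda _: True)

all_frontier_cost = FRONTIER_COST * len(scores)
print(verdicts, cost, all_frontier_cost)
```

With these made-up numbers, only one of four hypotheses reaches the frontier model, so the run costs 34 units against 120 for sending everything to the big model. The gap widens as the share of easy calls grows.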

The results map cleanly to the architecture. On a private Windows driver with 21 planted vulns, MDASH found 21 of 21 with zero false positives. On five years of confirmed MSRC cases: 96% recall in clfs.sys, 100% in tcpip.sys. And the bit that should make every kernel team nervous: 16 fresh Windows vulns, four of them critical RCEs in the TCP/IP stack and the IKEv2 service. Code paths people have been auditing manually for two decades.

The same recipe, everywhere

This has been brewing for a while. Google’s Project Zero showed it first when their Big Sleep agent (originally Naptime) found a real exploitable stack buffer underflow in SQLite back in late 2024. First public AI-discovered memory-safety zero-day in widely-deployed software. Single proof of concept, but the shape of the thing was already there. Agent, tools, sandbox, iteration.

Anthropic announced Project Glasswing earlier this month, working with around 50 partners (AWS, Apple, Cisco, Google, Microsoft, the Linux Foundation, Cloudflare, others) to point their unreleased Mythos Preview model at critical software. The headline from the first update is over 10,000 high or critical-severity bugs in a month. Cloudflare wrote up their slice and described the harness pipeline in detail: many narrow agents working tightly-scoped questions, adversarial validation by a second agent on a different model, reachability tracing as the gate before anything gets reported. Roughly 2,000 bugs in Cloudflare’s own critical-path code, with a false positive rate their team rates better than human testers.
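
A reachability gate of that kind can be pictured as a graph search. The sketch below is a toy under stated assumptions, not Cloudflare's method: it assumes a pre-built static call graph as a plain dictionary, and the function names and the `is_reachable` helper are hypothetical. Real tracing also has to handle indirect calls, data flow, and input constraints.

```python
from collections import deque
from typing import Dict, Iterable, List


def is_reachable(
    call_graph: Dict[str, List[str]],
    entry_points: Iterable[str],
    target: str,
) -> bool:
    """Breadth-first search from attacker-facing entry points to the flagged function."""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        fn = queue.popleft()
        if fn == target:
            return True
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return False


# Hypothetical call graph: caller -> callees.
graph = {
    "handle_request": ["parse_header"],
    "parse_header": ["copy_field"],
    "debug_dump": ["unsafe_fmt"],  # never called from a network-facing path
}

print(is_reachable(graph, ["handle_request"], "copy_field"))
print(is_reachable(graph, ["handle_request"], "unsafe_fmt"))
```

A finding in `copy_field` passes the gate because an external request can reach it. A finding in `unsafe_fmt` is held back, however real the bug looks in isolation.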

evilsocket (Simone Margaritelli of Bettercap and Pwnagotchi fame) reimplemented Cloudflare’s pipeline as an open-source 8-stage agent called audit, running on Opus 4.7 plus Sonnet 4.6. Hadrian dropped OpenHack, an MIT-licensed harness designed to ride on top of whatever coding harness you already use (Claude Code, Codex, Cursor). OpenAI built Aardvark into Codex Security. Microsoft’s MDASH is the latest entry and the strongest on benchmarks.

Six organisations, three release modes (research paper, open source, integrated product), one recipe. Specialised agents over generalists. Heavy frontier model for the thinking, distilled models for the volume. Adversarial validation between agents on different models. A proof step before anything counts as a finding. The paradigm has settled.

The implication is awkward. Anyone with the patience to wire it up can have a credible vulnerability researcher running in their basement by next weekend. The recipe is public, the open-source pieces exist, the models are commodity. That should change how your IR plan reads.

The model is the engine. The harness is the car around it.

A 1000hp engine in a hatchback won’t win Le Mans. A 600hp engine in a properly built car will. The model matters, but races are won in the chassis, the gearbox, the aero, the brakes, the pit crew. Right now most of the industry is chasing engine spec sheets and ignoring the rest of the car.

This is the same pattern playing out in coding. Claude Code, Cursor, Cline, whatever you reach for. The differentiation is increasingly the harness, not the model behind it. SWE-bench leaderboards are mostly a harness story now. Security is just getting there a year later because the eval surface is harder to build.

If you’re still pitching “we use GPT-5 to find vulnerabilities” as a product, you’re selling an engine. Microsoft just showed off a car.

“Defense at AI speed”, except they only demonstrated offense

Read the Microsoft blog carefully. The framing is defensive. Agentic scanning to find your bugs before the bad guys do. Fair enough. But look at what they actually demonstrated. An autonomous system that found 16 exploitable Windows vulns, four of them critical RCEs in code attackers have spent two decades trying to break. A vulnerability research team in a box.

The defender has constraints. They have to triage, prioritise, write tests, fit patches into release trains, not break existing customers, get sign-off, schedule maintenance windows. The attacker has none of that. Same harness, no rules, and the only KPI is “did it pop.” Offense always gets there first because offense is unconstrained, and AI compounds that asymmetry rather than evening it out.

So when Microsoft says “defense at AI speed”, believe the speed part, not the defense part. The same class of system, in worse hands, is doing the offensive equivalent right now. Probably has been for months. We just don’t see it because nobody publishes a blog post when their harness produces a working exploit for an unreported CVE.

This loops back to Hours, Not Days. The patching clock is now set by harnesses: 100 agents running 24/7 against every codebase someone can clone. The lone researcher with a fuzzer is no longer the pacing item. If you’re still building security tools around a single LLM call, offensive or defensive, you’re building the wrong shape of thing.

The harness is the product. Build for that, or get out of the way of someone who will.
