I’ve been banging on about mean time to exploit for a while. Recent events make it worth saying again.
The head start is gone
Remember the Exchange ProxyLogon mess from 2021? Microsoft runs the Microsoft Active Protections Program (MAPP), where security vendors get advance notice of critical vulns so they can build detections and signatures before the patch ships. At the time, roughly 80 partners were on the distribution list (it’s over 100 today). Microsoft notified MAPP partners on Feb 23. Mass exploitation started on Feb 28, using code that looked suspiciously similar to Microsoft’s own PoC. The out-of-band patch shipped Mar 2. Microsoft publicly suspected one of its Chinese MAPP partners had leaked.
Think about what that means. The disclosure-to-patch window was already short. Depending on tier, MAPP partners get between a few hours and a couple of weeks of advance warning, not the month a lot of people assume. And even that compressed window blew up. The defenders’ head start became the attackers’ head start.
It happened again with SharePoint in July 2025. MAPP partners were notified June 24, exploitation began July 7, and Microsoft eventually restricted Chinese MAPP partners from receiving PoC code. Same pattern.
With LLMs reading patch diffs and writing PoCs, the window keeps shrinking. Seven days isn’t enough. We need to be talking about patching in hours.
The mechanics are solved
That sounds insane to most enterprises. But the patching itself isn’t the hard part. Meta live-patches several million Linux servers with kpatch, gradual reboots, and container draining. Google rolls updates across its fleet effectively at once. During Log4j, AWS engineers told me in a meeting that they patched something on the order of 1.5 million Linux servers in 48 hours. That number’s anecdotal, but their public Log4j hotpatch work is consistent with that kind of scale. The mechanics are solved.
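To make that concrete, here’s a toy sketch of the live-patch primitive those fleets are built on, not a claim about how any of them actually run it. It assumes a prebuilt kpatch module and plain SSH; the hostnames and module path are invented, and a real fleet would drive this through orchestration rather than a for-loop.

```python
# Toy sketch of the live-patch primitive, not how any particular fleet does it.
# Assumes a prebuilt kpatch module and plain SSH; the hostnames and module path
# below are made up, and real fleets drive this through orchestration tooling.
import subprocess

HOSTS = ["web-01.internal", "web-02.internal"]          # hypothetical inventory
MODULE = "/var/lib/kpatch/livepatch_cve_2024_xxxx.ko"   # placeholder patch module

def ssh(host: str, cmd: str) -> subprocess.CompletedProcess:
    """Run a single command on a remote host and capture the result."""
    return subprocess.run(["ssh", host, cmd], capture_output=True, text=True)

needs_reboot = []
for host in HOSTS:
    # Load the live patch into the running kernel: no reboot, no drain.
    if ssh(host, f"sudo kpatch load {MODULE}").returncode != 0:
        needs_reboot.append(host)   # fall back to the slow path: drain, patch, reboot
        continue
    # Rough sanity check that a patch module is now loaded.
    listing = ssh(host, "sudo kpatch list")
    print(host, "live-patched" if "livepatch" in listing.stdout else "verify manually")

print("schedule a drain + reboot for:", needs_reboot)
```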
What’s not solved is the system design underneath.
For too long we’ve accepted single-region, single-AZ, single-point-of-failure architectures because real HA costs serious money and engineering time. MVP to prod, ship it, move on. That tech debt is now a security problem. If you can’t move workloads from server A to server B without anyone noticing, you can’t patch in hours. And if you can’t patch in hours, you’re going to get breached.
This is the BeyondCorp prediction landing late. SRE and security are the same problem now. Google literally wrote the book on it. A system that can’t survive a random node dying can’t survive a zero-day either.
Start with the compute unit
So forget the patching team for a second. The real question for an enterprise is: what’s the smallest unit of compute you run?
If it’s a VM, that VM is a single point of failure. You’ll need scheduled downtime, probably a few hours a week, just to patch reliably. Most enterprises pretend this isn’t true and then panic when a CVE drops.
If it’s a container on Kubernetes or similar, you’re in much better shape. Drain a node, patch it, bring traffic back, repeat (there’s a sketch of that loop below). But your containers have to behave. Read-only filesystem. No local state. Idempotent. Horizontally scalable. If killing a container loses data, you’re back to VM problems with extra YAML.
If it’s a function, like Lambda or an in-house FaaS along the lines of Meta’s XFaaS (trillions of calls a day across 100,000+ servers), patching is essentially free. The platform handles it for you.
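For the container case, the mechanical loop is small enough to sketch. This assumes kubectl access to the cluster; the patch_node stub is a placeholder for whatever actually updates and reboots the node in your environment.

```python
# A rough sketch of the drain / patch / uncordon loop for the container case.
# Assumes kubectl access to the cluster; patch_node is a stand-in for whatever
# actually updates and reboots the node (SSH, Ansible, your cloud's node pools).
import subprocess, time

def run(*cmd: str) -> str:
    """Run a command, fail loudly, return stdout."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def patch_node(node: str) -> None:
    # Placeholder: apply OS updates and reboot via your own mechanism.
    print(f"patch and reboot {node} with your own tooling")

nodes = run("kubectl", "get", "nodes",
            "-o", "jsonpath={.items[*].metadata.name}").split()
for node in nodes:
    # Evict workloads; this is only painless if pods are stateless and replicated.
    run("kubectl", "drain", node, "--ignore-daemonsets", "--delete-emptydir-data")
    patch_node(node)
    run("kubectl", "uncordon", node)   # let the scheduler move traffic back
    time.sleep(60)                     # crude settle time; real rollouts watch health signals
```

The loop itself is trivial. The prerequisite is that the workloads tolerate eviction, which is exactly the hygiene list above.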
So when people ask how to patch faster, my answer is: don’t start by yelling at the Linux sysadmin team to push changes quicker or throwing more headcount at change management. Go back and look at the architecture.
Design for failure first
A few things become non-negotiable. Smallest unit of compute possible: function beats container beats VM beats physical box. Real load balancing, failover, and GSLB on any critical system that can’t be down. Continuous testing of failure. Netflix’s Chaos Monkey isn’t a stunt, it’s the only way to know your design actually works. Kill containers at random. If anything breaks, you find out now instead of during an incident.
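A bare-bones version of that test, assuming kubectl access, looks something like this; the namespace and interval are illustrative, and you’d scope it to systems you actually own.

```python
# Bare-bones chaos test in the Chaos Monkey spirit: kill one random pod on a
# timer and let your monitoring tell you whether anyone noticed. The namespace
# and interval are illustrative; scope this to systems you actually own.
import random, subprocess, time

NAMESPACE = "checkout"          # hypothetical target namespace
INTERVAL_SECONDS = 300

def run(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

while True:
    pods = run("kubectl", "get", "pods", "-n", NAMESPACE,
               "-o", "jsonpath={.items[*].metadata.name}").split()
    if pods:
        victim = random.choice(pods)
        print("killing", victim)
        # If the service blips when one pod dies, it isn't ready for
        # hours-scale patching either.
        run("kubectl", "delete", "pod", victim, "-n", NAMESPACE)
    time.sleep(INTERVAL_SECONDS)
```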
And for the legacy stuff that’s stuck on VMs or bare metal: stop treating the OS as something you mutate in place. Look at rpm-ostree (the engine behind Fedora CoreOS and Silverblue), or NixOS. Image-based and declarative systems give you atomic deployments and a real rollback path. That’s how you get 90-95% of patching done without a human in the loop, which is the only way the numbers work at scale.
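Here’s a hedged sketch of what that unattended loop might look like on an rpm-ostree host: stage the update, reboot into it, and roll back if a post-boot health check fails. The health check is a placeholder for whatever signal you trust, and the post-boot phase would be wired up via something like a systemd unit that runs after boot.

```python
# Sketch of unattended, image-based patching on an rpm-ostree host: stage the
# new deployment, reboot into it, roll back if a health check fails. The
# health check is a stand-in for whatever "the service is fine" means to you;
# the --post-boot phase would be triggered by a unit that runs after boot.
import subprocess, sys

def run(*cmd: str) -> int:
    return subprocess.run(cmd).returncode

def health_check() -> bool:
    # Placeholder probe: "is the system up and are my services happy?"
    return run("systemctl", "is-system-running", "--quiet") == 0

if "--post-boot" in sys.argv:
    # Phase two, after rebooting into the new deployment.
    if not health_check():
        run("rpm-ostree", "rollback", "--reboot")   # atomically boot the previous image
else:
    # Phase one: stage the updated image; the running system is untouched until reboot.
    if run("rpm-ostree", "upgrade") == 0:
        run("systemctl", "reboot")
```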
The onslaught isn’t slowing down. AI-assisted exploit dev is going to keep compressing the timeline. The systems you’re designing today need to assume hours-not-days patching is normal operating procedure, not an incident response.
Most enterprises will only learn this after the breach. Try to be in the other group.