CrowdStrike Took Down Half the Planet. Your Runbook Should Have Caught It.
A single CrowdStrike Falcon update bricked 8.5 million Windows machines. This post-mortem is not about CrowdStrike. It is about how nobody held their vendor accountable.
On 19 July 2024 a CrowdStrike Falcon sensor configuration update bricked roughly 8.5 million Windows endpoints inside a few hours. Airlines stopped. Hospitals stopped. Banks stopped. The London tube boards went dark. It was the largest outage in IT history by reach, and the root cause was a malformed channel file that tripped a bug in the sensor's kernel-mode driver and sent hosts into a blue-screen boot loop.
Everyone has written the technical post-mortem. I want to talk about the part nobody is writing.
CrowdStrike did not fail alone
A lot of senior engineers I respect spent the week pointing at CrowdStrike. Bad QA. No staged rollout. Kernel-mode driver. Channel files outside the standard release pipeline. All true. All damning. None of it is the lesson.
The lesson is that thousands of organisations had granted a single third-party vendor unilateral, unstaged, kernel-level write access to their entire fleet, and had no compensating control. That is not a CrowdStrike failure. That is a procurement failure, a security architecture failure, and a runbook failure on the customer side.
If you signed the contract and ticked "auto-update on", you co-authored this outage.
The vendor risk questions you did not ask
When you onboarded Falcon, did you ask:
- Can we stage updates by ring? Pilot, canary, broad?
- Can we delay channel files by N hours behind the vendor?
- Do we have a documented procedure to roll back a sensor without booting into safe mode on every host?
- Do we have a manual override path if the vendor's cloud is itself down?
- Does our DR plan assume the EDR is the failure mode, or only that the EDR catches the failure?
Most teams I have audited in the last five years answered "no" to all five. CrowdStrike was the bill arriving for that.
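For the first two questions, the answer is worth capturing as policy-as-code rather than as a wiki page, so a drill can assert it. Here is a minimal sketch in Python of what a staged-ring policy might look like; the ring names, delays, and hash-bucket assignment are illustrative assumptions, not any vendor's actual update API.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Ring:
    name: str
    delay_hours: int    # how long we hold a vendor release before this ring may take it
    fleet_percent: int  # rough share of the fleet in this ring

# Illustrative policy, not any vendor's schema: ops machines first, broad ring three days behind.
RINGS: List[Ring] = [
    Ring("pilot",  delay_hours=0,  fleet_percent=1),
    Ring("canary", delay_hours=24, fleet_percent=9),
    Ring("broad",  delay_hours=72, fleet_percent=90),
]

def ring_for_host(hostname: str) -> Ring:
    """Stable hash-bucket assignment so a host does not hop rings between releases."""
    bucket = int(hashlib.sha256(hostname.encode()).hexdigest(), 16) % 100
    cumulative = 0
    for ring in RINGS:
        cumulative += ring.fleet_percent
        if bucket < cumulative:
            return ring
    return RINGS[-1]

def earliest_install(hostname: str, vendor_release_time: datetime) -> datetime:
    """Earliest moment this host may take a release published at vendor_release_time."""
    return vendor_release_time + timedelta(hours=ring_for_host(hostname).delay_hours)

if __name__ == "__main__":
    release = datetime(2024, 7, 19, 4, 9)  # roughly when the bad channel file shipped (UTC)
    for host in ("cfo-laptop-17", "er-workstation-3", "build-agent-42"):
        print(f"{host}: ring={ring_for_host(host).name}, "
              f"earliest install {earliest_install(host, release)}")
```

The point is not the thirty lines of Python. It is that the delay becomes a number an audit can query, not a sentence in a policy document.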
What good looked like on 19 July
A handful of organisations rode the outage well. The pattern was consistent:
- They held updates 24-72 hours behind the vendor's auto-channel by policy.
- They had pre-staged BitLocker recovery keys in a system that was not itself dependent on the bricked endpoints.
- They had a documented "boot into safe mode and delete file X" runbook that a non-specialist could execute.
- They had a phone tree, not just Slack, because Slack ran on laptops that were in a boot loop.
None of that was novel. All of it was boring discipline.
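A hold policy only counts if it is enforced, and that is a one-page audit if your inventory can tell you when each host installed the content version it is running. A minimal sketch, with hypothetical field names and a 24-hour floor as the assumed policy:

```python
from datetime import datetime, timedelta

MIN_HOLD = timedelta(hours=24)  # assumed policy floor for the broad ring

# Hypothetical inventory export: when the content a host runs was published by the
# vendor, and when the host actually installed it.
inventory = [
    {"host": "er-workstation-3", "published": datetime(2024, 7, 19, 4, 9),
     "installed": datetime(2024, 7, 19, 4, 31)},
    {"host": "build-agent-42",   "published": datetime(2024, 7, 17, 9, 0),
     "installed": datetime(2024, 7, 19, 10, 0)},
]

violations = [r for r in inventory if r["installed"] - r["published"] < MIN_HOLD]

for r in violations:
    lag = r["installed"] - r["published"]
    print(f"{r['host']} took a vendor release after {lag}; policy requires {MIN_HOLD}")
```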
The wider pattern
CrowdStrike is the loudest example of the same shape we keep seeing:
- 2017 AWS S3 us-east-1 outage. Shared dependency, no regional isolation.
- 2021 Fastly. Single config push, global blast radius.
- 2021 Facebook BGP. Internal tooling sat on the same network that went down.
- 2024 CrowdStrike. Auto-update with no staging.
The common cause is convenience. Centralising a control plane is faster, cheaper, and easier to operate, right up to the moment it fails. Then it is the worst day of your career.
What to do this quarter
Do not wait for the next CrowdStrike. Pick three vendors with kernel-, hypervisor-, or root-level access to your fleet. For each:
- Find out if you can stage their updates. If not, raise it in your next QBR.
- Write a recovery runbook for "this vendor pushed a bad change". Drill it.
- Verify your recovery dependencies are not themselves on the affected fleet. Recovery keys on the bricked laptop is the canonical anti-pattern.
Do the same for your CI/CD vendor, your auth provider, and your DNS provider. Endpoint agents, CI/CD, auth, and DNS: those are the four ways your business actually goes dark.
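The third check in that list is worth automating into the drill itself. A minimal sketch, assuming you keep a catalogue of recovery dependencies tagged with the platform they run on and the agents installed on them; the names and fields are hypothetical:

```python
# Hypothetical catalogue of recovery dependencies and what they themselves run on.
RECOVERY_DEPENDENCIES = {
    "bitlocker-key-escrow": {"platform": "windows", "agents": ["falcon"]},
    "paging-service":       {"platform": "saas",    "agents": []},
    "runbook-wiki":         {"platform": "linux",   "agents": ["falcon"]},
}

# The scenario being drilled: every Windows host running the Falcon sensor is in a boot loop.
def in_blast_radius(dep: dict) -> bool:
    return dep["platform"] == "windows" and "falcon" in dep["agents"]

failures = [name for name, dep in RECOVERY_DEPENDENCIES.items() if in_blast_radius(dep)]

for name in failures:
    print(f"FAIL: {name} sits on the fleet it is supposed to recover")
print("drill result:", "blocked" if failures else "recovery path independent of the fleet")
```

Swap the scenario definition per vendor and rerun it whenever the catalogue changes. A check that only runs during the annual DR exercise is a check that drifts.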
A grim closing thought
The reason CrowdStrike-scale outages will keep happening is that the economics still favour the vendor. Auto-update is a feature. Staged rollout is friction. Customers say they want safety and then sign contracts that prioritise speed. Until enterprise procurement starts treating "we control the update cadence" as a hard requirement on par with SOC 2, the same bell will keep ringing.
The next one is already in someone's git commit. The only question is whose runbook holds.