When cybersecurity goes wrong, the results can be catastrophic.
I am, of course, referring to last Thursday's Crowdstrike update, which caused the largest global IT outage ever seen.
Shortly after the news broke, the Fresh Security team started speculating on the root cause.
We were broadly split between two explanations: a faulty update that should have been caught by testing, and a deeper problem with how Crowdstrike's software operates inside the Windows kernel.
The recovery process has started, but the outage's root cause is yet to be published.
That said, both explanations look to be partly correct.
What we do know is that Crowdstrike released a faulty update on Thursday.
A fully functioning testing process would have stopped this.
Crowdstrike EDR also operates in a very sensitive part of the Windows operating system called "Ring 0 Kernel Space".
Ring 0 Kernel Space is sensitive because software running there has privileged access to all of the computer's memory and hardware.
If an error occurs there, Windows is designed to stop the entire system immediately (crash) and display a "Blue Screen of Death".
This is because carrying on with a faulty kernel could have far worse consequences, such as silently corrupting data.
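To see why that boundary matters, here is a minimal user-mode sketch in C. It is an illustration only, not Crowdstrike's code: the same invalid memory access that merely terminates one program in user space will halt the entire machine if it happens in Ring 0.

```c
/* user_mode_crash.c - illustration only, not Crowdstrike code.
 *
 * Run this ordinary (Ring 3) program and it will crash, but Windows
 * simply terminates the process; every other program keeps running.
 * The identical invalid memory access inside a Ring 0 kernel driver
 * cannot be contained that way, because there is no safe process
 * boundary to fall back on, so the system halts with a bugcheck:
 * the "Blue Screen of Death".
 */
#include <stdio.h>

int main(void)
{
    int *ptr = NULL;   /* an obviously invalid address */

    printf("About to dereference a NULL pointer...\n");

    /* User mode: access violation, this process dies, the OS survives.
     * Kernel mode: the same fault stops the entire machine.           */
    return *ptr;
}
```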
Under normal circumstances, the software that runs in Ring 0 Kernel Space undergoes extensive testing by Microsoft before it is certified and allowed to be installed. This means the software rarely changes, which reduces the chance of errors.
However, it appears that the Crowdstrike software that Microsoft tested has a "special feature." It can load software from updates, like the one on Thursday.
Typically, this is not a problem. However, in this case, the untested update caused an error when it was loaded into the Ring 0 Kernel Space.
When the error was detected, Windows did what it was designed to do… stop everything immediately.
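To make that failure mode concrete, here is a heavily simplified, hypothetical sketch in C (kept in user mode for safety). The update format and handler table are invented for illustration; this is not Crowdstrike's code, and the real root cause has not been published. It shows how code that passed testing with well-formed content can still crash the moment it is handed malformed content it never validates.

```c
/* content_loader_sketch.c - hypothetical illustration only.
 * Shows certified code that blindly trusts the content it loads.
 */
#include <stdio.h>

/* Imaginary on-disk update record: an index into a handler table. */
struct update_record {
    unsigned int handler_index;
};

typedef void (*handler_fn)(void);

static void handler_a(void) { puts("handler A"); }
static void handler_b(void) { puts("handler B"); }

/* The loader itself was tested and certified with well-formed updates. */
static handler_fn handlers[2] = { handler_a, handler_b };

static void load_update(const struct update_record *rec)
{
    /* Missing defence: handler_index is trusted as-is. A malformed
     * update with handler_index = 12345 calls through a garbage
     * pointer. In user space that kills one process; in Ring 0 it
     * takes down the whole machine.                                  */
    handlers[rec->handler_index]();
}

int main(void)
{
    struct update_record good = { .handler_index = 1 };
    load_update(&good);                /* fine: prints "handler B"    */

    struct update_record bad = { .handler_index = 12345 };
    load_update(&bad);                 /* undefined behaviour: crash  */
    return 0;
}
```

A single bounds check on the index before the call would reject the malformed update instead of crashing, which is exactly the sort of validation a fully exercised testing process tends to force into place.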
Shortly after the incident, questions about resilience were posed: how can one software bug have such a profound and immediate impact?
Friends and family asked me similar questions: why were so many banks, airlines, hospitals, and flagship brands affected?
In short, Crowdstrike is (usually) very good at what it does… I have some old friends who work there, and they are among the most talented and competent people I have ever worked with.
But in this case, a costly mistake was made.
Unfortunately, it now means that every affected Windows laptop and server running Crowdstrike must be fixed manually.
That is going to be a colossal amount of work.
Thankfully, outages requiring manual intervention like this are rare - especially compared to the Windows XP days.
This is due mainly to the impressive work Microsoft has completed to make its software more resilient and robust.
But - inevitably - outages do occur and will continue to do so.
And so, how can your business plan ahead for such an incident?
I'll cover that in part 2, "Plan B: Resilient Operations - Lessons Learnt", next week.
If you know of an IT admin who is picking up the pieces after this outage, offer them a drink and stay well out of their way.
They will not be having a fun week.