Last week, I wrote about Crowdstrike causing the largest global IT outage ever recorded.
In that newsletter, I shared two speculations from the Fresh Security team about the root cause of the problem.
Early on Wednesday morning, Crowdstrike published their Preliminary Incident Review.
In summary, the root cause appears to be a combination of bad testing and being "too clever."
Mistakes happen, and the speed at which Crowdstrike customers have been able to respond is impressive.
Many hardworking IT administrators have done a fantastic job.
However, what struck me most about the preliminary report, and the Crowdstrike response in general, is the amount of cybersecurity double-speak...
For example, the change that caused the outage "is not code"… it is just "expressed in code".
Inevitably, outages will occur.
Observing the Crowdstrike issue allows you to anticipate and plan a response before your next outage.
Here are five lessons that can be learnt from this incident.
Lesson 1: Do Nothing
While often overlooked, "Do Nothing" is a valid lesson from this outage.
The argument is that this type of event is very rare: one of Taleb's "black swan" events that cannot be planned for.
Therefore, operate as normal.
In the event of a disaster, it will be figured out at the time… but in the meantime, no unnecessary cost or effort will be "wasted."
This is a valid approach, but it creates fragile systems that will inevitably fail at the worst possible time.
Lesson 2: Break Glass Accounts
When third-party services fail, how do they impact your business?
One of the most significant points of failure is your identity and authentication system.
If staff can't authenticate with MFA, they can't access their data and work.
This happened in 2019, when an MFA outage locked staff out and caused significant problems.
As a result, every M365 and GWfE tenant should have two "break glass" accounts, which:
- are excluded from MFA and conditional access policies;
- use long, randomly generated passwords stored securely offline; and
- are monitored, so any sign-in triggers an immediate alert.
In the event of a problem with the identity and authentication system, the "break glass" accounts are used to change the configuration, allowing staff to regain access and the business to function—albeit at a heightened risk.
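To make the monitoring point concrete, here is a minimal Python sketch that polls the Microsoft Graph sign-in log and surfaces any activity from the break glass accounts. It assumes you already have a Graph access token with the AuditLog.Read.All permission; the account names are hypothetical placeholders.

```python
# Minimal sketch: surface any sign-in activity from "break glass" accounts.
# Assumes a Microsoft Graph access token with AuditLog.Read.All permission;
# the account names below are hypothetical placeholders.
import requests

GRAPH_SIGNINS = "https://graph.microsoft.com/v1.0/auditLogs/signIns"
BREAK_GLASS_ACCOUNTS = [
    "breakglass1@example.com",
    "breakglass2@example.com",
]

def recent_break_glass_signins(access_token: str) -> list[dict]:
    """Return recent sign-in events for the break glass accounts."""
    headers = {"Authorization": f"Bearer {access_token}"}
    events = []
    for upn in BREAK_GLASS_ACCOUNTS:
        params = {"$filter": f"userPrincipalName eq '{upn}'", "$top": "10"}
        resp = requests.get(GRAPH_SIGNINS, headers=headers, params=params)
        resp.raise_for_status()
        events.extend(resp.json().get("value", []))
    return events

if __name__ == "__main__":
    token = "..."  # obtain via your usual OAuth flow (e.g. MSAL)
    for event in recent_break_glass_signins(token):
        # Outside a declared emergency, any use of these accounts should page someone.
        print(event["createdDateTime"], event["userPrincipalName"], event.get("ipAddress"))
```

Outside a genuine emergency, these accounts should never appear in the sign-in log, so even a crude alert like this is valuable.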
Lesson 3: Test Updates
It's easy to push all software updates to all devices, but if an issue occurs, your problems will escalate quickly.
The best practice is to test updates on a small number of systems to confirm they are working correctly before pushing them to all devices.
This is the reasoning behind cyber insurance companies' stance that critical patches should be applied within 7-14 days of release rather than instantly.
It allows time for testing to occur.
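To illustrate the idea, here is a minimal Python sketch of a ring-based rollout: the update goes to a small canary group first, soaks for a period, and only reaches the wider fleet if those devices stay healthy. The device names and the deploy_update / devices_healthy helpers are hypothetical placeholders for whatever your endpoint management tooling provides.

```python
# Minimal sketch of a ring-based (staged) update rollout.
# Device names and helper functions are hypothetical placeholders.
import time

ROLLOUT_RINGS = {
    "canary": ["it-laptop-01", "it-laptop-02"],   # IT's own machines first
    "pilot": ["sales-pc-01", "finance-pc-03"],    # a small cross-section of the business
    "everyone": [],                               # the remaining fleet goes here
}

SOAK_TIME_SECONDS = 24 * 60 * 60  # let each ring run for a day before expanding


def deploy_update(devices: list[str]) -> None:
    """Placeholder: push the update via your endpoint management tooling."""
    print(f"Deploying update to {len(devices)} devices: {devices}")


def devices_healthy(devices: list[str]) -> bool:
    """Placeholder: confirm the devices still boot, check in, and pass health checks."""
    return True


def staged_rollout() -> None:
    for ring_name, devices in ROLLOUT_RINGS.items():
        deploy_update(devices)
        time.sleep(SOAK_TIME_SECONDS)
        if not devices_healthy(devices):
            # Stop here: a small ring absorbs the damage instead of the whole fleet.
            raise RuntimeError(f"Update failed health checks in ring '{ring_name}'")
```

The exact ring sizes and soak times matter less than the principle: no update should reach every device at once.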
Lesson 4: Review how many devices are managed by each staff member
Automation has allowed more IT devices to be managed by fewer people.
In good times, this provides cost savings.
In troubled times, such as when manual intervention is needed on every machine, the time to restore operations grows significantly.
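A quick back-of-envelope calculation shows why. The figures below are illustrative assumptions, not numbers from the incident.

```python
# Back-of-envelope sketch: hands-on remediation time per administrator.
# All figures are illustrative assumptions.
devices_per_admin = 500       # assumed fleet size per administrator
minutes_per_device = 15       # assumed hands-on time to remediate one machine
working_minutes_per_day = 8 * 60

total_minutes = devices_per_admin * minutes_per_device
print(f"{total_minutes / working_minutes_per_day:.1f} working days per admin")
# With these assumptions: roughly 15.6 working days of manual effort per administrator.
```

Halving either the devices per administrator or the hands-on time per device halves the recovery time.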
Lesson 5: What are your priority systems?
Whether it's ransomware or a Crowdstrike update, some systems are more significant to your business's future than others.
Take this opportunity to prioritise.
If you were going through this now, which systems should you fix first?
Learning from others' misfortune is a critical skill in cybersecurity.
Even if your business was not impacted by the Crowdstrike outage, there remains an opportunity to learn crucial lessons.