Following on from last week's newsletter about the need for resilience... What are your options? What's available, and what does it all mean?
"Never say - redundant. It implies unnecessary. And there's no budget for unnecessary."
There should never be a budget for unnecessary. But when spending gets tight, resilience can start to look increasingly unnecessary.
So why do you have an extra firewall? A just-in-case laptop? A storage or database cluster?
Resilience.
Resilience is measured in downtime.
How much time is it ok to be non-productive?
How much will it cost to have operations shut down?
But there's also the question of how operations are brought back online after a failure.
If it's automatic, we call it "failover". If it's not… you need to ensure the knowledge and experience of bringing operations back are also resilient.
There are only five tiers of resilience:
In cybersecurity, we often overcomplicate the idea of resilience. But businesses already deal with it every day.
Which tier you choose is entirely dependent on your business need.
Whilst this situation is clearly undesirable, it's the one new businesses operate under daily. Working in a world of unknowns. But more things become known when you've been in business for a few years. This tier is one to leave behind and hope never to return to.
This is the situation all cybersecurity professionals fear. In the event of a failure, businesses often decide the show must go on. This means ignoring security processes and significantly increasing the risk of further failures. In the worst case, one failure leads to another, and so on...
This is the scrappy part of business life. Someone decides to leave your business with immediate effect, but luckily you know someone. It will take time to get up to speed, but a few days of downtime and underperformance is ok.
This is often the situation when recovering from ransomware attacks. Standard configurations are used in the business for laptops, networking, and server security. There's a spare laptop, access point, or firewall on a shelf somewhere, with vague instructions on how to set it up.
Sometimes, the person with the critical information isn't in the office. They're on holiday or taking the family to the zoo. This is the byproduct of undocumented processes, which cause disruption and inconvenience, but it isn't the end of the world. A call, an interruption, and you can be back working within minutes.
There's a spare system ready to work. System configurations are tested and ready to go. It just needs to be connected to the right place. This is where those "infrastructure as code" people assume every business is operating… In reality, it's rarely used. This approach comes with the extra burden of maintaining software and configs, but it can appear to be magic when it works.
You're now entering the automatic zone...
This is like someone calling in sick for their shift at a restaurant. Their workload will be automatically picked up by other members of the team. No one will notice if it's a slow night and the team isn't busy. But if it's a busy Friday night - service will be slower than usual. The performance will be degraded.
This is where security and IT folk feel comfortable. It's the most common way to deal with large central file servers and networks. If something fails, the entire business will be offline. But it's ok because another system will take up the workload a few seconds later. Performance may be degraded, but at month end, that is much more tolerable than no data.
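As a rough illustration of the "another system takes up the workload" idea, here is a minimal sketch in Python. The names `call_with_failover`, `primary`, and `secondary` are hypothetical stand-ins, not any real product's API, and the failure is simulated.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    # Every backend failed: surface the last error to the caller.
    raise RuntimeError("all backends failed") from last_error


def primary() -> str:
    # Hypothetical primary system; the outage is simulated for the example.
    raise ConnectionError("primary file server is down")


def secondary() -> str:
    # Hypothetical standby system that picks up the workload.
    return "served by secondary"


result = call_with_failover([primary, secondary])
print(result)
```

The caller never needs to know which system answered, which is the point of failover. Real failover also involves health checks and keeping data in sync, which this sketch leaves out.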
Oddly, there are few parallels to this approach in business experience. The closest is seen on production lines, where extra quality inspectors check each other's work in real time. The salaries of the additional workers are much less than the cost of stopping the line or having defective parts.
This is used in financial, medical, and oil production databases. Downtime is measured in fractions of a second, and failures will go unnoticed, which is just as well when outages can cost millions per minute. You probably don't need this level of resilience.
Cybersecurity vendors and professionals often start by assuming zero downtime or total resilience is the answer. But other areas of business rarely operate in this way. A five-minute or even a five-hour delay may be acceptable - it's a business decision.
In challenging economic times, decisions to trim down resilience are inevitable, whether that means people or technical processes. Knowing what level of downtime is acceptable is essential.
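One way to put a number on "acceptable" is to compare the expected yearly cost of downtime with the yearly cost of the resilience that would reduce it. The sketch below does that in Python. Every figure in it is an illustrative assumption (two failures a year, a five-hour manual recovery versus a five-minute automatic one, 1,000 per hour of lost operations, 6,000 a year for the failover setup), and the function name is hypothetical.

```python
def expected_downtime_cost(failures_per_year: float,
                           hours_down_per_failure: float,
                           cost_per_hour: float) -> float:
    """Rough yearly cost of downtime: failures x hours down x cost per hour."""
    return failures_per_year * hours_down_per_failure * cost_per_hour


# Illustrative assumptions only; substitute your own business figures.
FAILURES_PER_YEAR = 2
COST_PER_HOUR = 1000          # lost productivity per hour of downtime
FAILOVER_ANNUAL_COST = 6000   # assumed yearly cost of the automatic setup

manual = expected_downtime_cost(FAILURES_PER_YEAR, 5, COST_PER_HOUR)          # five hours down
automatic = expected_downtime_cost(FAILURES_PER_YEAR, 5 / 60, COST_PER_HOUR)  # five minutes down

saving = manual - automatic
worth_it = saving > FAILOVER_ANNUAL_COST

print(f"manual recovery:    {manual:.2f} per year")
print(f"automatic failover: {automatic:.2f} per year")
print(f"failover pays for itself: {worth_it}")
```

With different inputs the answer flips: at one failure a year and 500 per hour, the same failover setup would cost more than the downtime it prevents.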