Developing a resilient business is about identifying what your business can’t afford to lose and planning for how to prevent loss should a disaster occur. While this may seem a daunting task, determining your business’s resiliency strategy is more straightforward than you might think.
This resilience toolkit developed by Facebook provides a framework for small businesses that may not have the time or resources to create an extensive plan to recover from business interruptions.
You don’t have to use Facebook’s crisis response features for this approach to be effective – the value comes from the taking the time to assess the risks and plan you response strategy.
Modern digital technology underpins the shift that enables businesses to implement new processes, scale quickly and serve customers in a whole new way.
Historically, organisations would invest in their own IT infrastructure to support their business objectives and the IT department’s role would be focused on keeping the ‘lights on’.
To minimise the chance of failure of the equipment, engineers traditionally introduced an element of redundancy in the architecture. That redundancy could manifest itself on many levels. For example, it could be a redundant datacentre, which is kept as a ‘hot’ or ‘warm’ site with a complete set of hardware and software ready to take the workload in case of the failure of a primary datacentre. Components of the datacentre, like power and cooling, can also be redundant to increase the resiliency.
On a lesser scale, within a single datacentre, networking infrastructure elements can be redundant. It is not uncommon to procure two firewalls instead of just one to configure them to balance the load or just to have a second one as a backup. Power and utilities companies still stock up on critical industrial control equipment to be able to quickly react to a failed component.
The majority of effort, however, went into protecting the data storage. Magnetic disks were assembled in RAIDs to reduce the chances of data loss in case of failure and backups were relegated to magnetic tapes to preserve less time-sensitive data and stored in separate physical locations.
Depending on specific business objectives or compliance requirements, organisations had to heavily invest in these architectures. One-off investments were, however, only one side of the story. On-going maintenance, regular tests and periodic upgrades were also required to keep these components operational. Labour, electricity, insurance and other costs were adding to the final bill. Moreover, if a company was operating in a regulated space, for example if they processed payments and cardholder data, then external audits, certification and attestation were also required.
With the advent of cloud computing, companies were able to abstract away a lot of this complexity and let someone else handle the building and operation of datacentres and dealing with compliance issues relating to physical security.
The need for the business resilience, however, did not go away.
Cloud providers can offer options that far exceed (at comparable costs) the traditional infrastructure; but only if configured appropriately.
One example of this is the use of ‘zones’ of availability, where your resources can be deployed across physically separate datacentres. In this scenario, your service can be balanced across these availability zones and can remain running even if one of the zones goes down. If you build your own infrastructure for this, you would have to build one datacentre in each location and you better have a solid business case for that.
It is important to keep this in mind when deciding to move to the cloud from the traditional infrastructure. Simply lifting and shifting your applications to the cloud may not work. These applications are unlikely to have been developed to run in the cloud and take advantage of these additional resiliency options. Therefore, I advise against such migration in favour of re-architecting.
Cloud Service Provider SLAs should also be considered. Compensation might be offered for failure to meet these, but it’s your job to check how this compares to the traditional “5 nines” of availability in a traditional datacentre.
You should also be aware of the many differences between cloud service models.
When procuring a SaaS, for example, your ability to manage resilience is significantly reduced. Instead, you are relying on your provider to keep the service up and running, potentially raising the provider outage concern. Even if you have access to the data itself, your options are limited without a second application on-hand to process that data. Study the historical performance and pick your SaaS provider carefully.
IaaS gives you more options to design an architecture for your application, but with this great freedom comes great responsibility. The provider is responsible for fewer layers of the overall stack when it comes to IaaS, so you must design and maintain a lot of it yourself. When doing so, assume failure rather than thinking of it as a (remote) possibility. Availability Zones are helpful, but not always sufficient. What scenarios require consideration of the use of a separate geographical region? The European Banking Authority recommendations on Exit and Continuity can be an interesting example to look at from a testing and deliverability perspective.
Be mindful of characteristics of SaaS that also affect PaaS from a redundancy perspective. For example, if you’re using a proprietary PaaS then you can’t just lift and shift your data and code.
Above all, when designing for resiliency, take a risk-based approach. Not all your assets have the same criticality – know your RPOs and RTOs. Remember that SaaS can be built on top of AWS or Azure, exposing you to supply chain risks.
Even when assuming the worst, you may not have to keep every single service running should the worst actually happen. For one thing, it’s too expensive – just ask your business stakeholders. The very worst time to be defining your approach to resilience is in the middle of an incident, closely followed by shortly after an incident. As with other elements of security in the cloud, resilience should “shift left” and be addressed as early in the delivery cycle as possible. As the Scout movement is fond of saying – “be prepared”.
Image by Berkeley Lab.