Many organizations underestimate the need for disaster recovery strategies for their cloud-based applications. Even those that do understand the issues sometimes struggle to put effective plans in place.
Unlike routine IT tasks, these plans can only be completed with close collaboration and a firm commitment from multiple parties. Many IT services now rely on multiple application components, some of which may run in the cloud and others in data centers. Building an effective disaster recovery plan therefore requires a structured, cross-functional approach that focuses on the resilience of IT services as a whole, not just individual workloads.
Answering the tough questions
To address disaster recovery planning, companies need to question their approach, even if doing so is uncomfortable. The exercise is particularly useful because exposing the gaps allows companies to redirect their efforts and engage stakeholders who have been overlooking the risks.
When a workload fails, the service it supports is interrupted, hurting user productivity and eroding customer confidence. Restoring the service requires coordination across teams, and above all it must be carried out quickly to limit the extent of the damage. It is important to remember that it is the responsibility of companies (not cloud service providers) to ensure that disaster recovery procedures are in place.
Developing a disaster recovery plan
Effective disaster recovery planning begins with an assessment of the impact of downtime on the business. This cross-functional exercise identifies all the IT services used by the business, determines the impact (operational and financial) that a service outage could have, and therefore the disaster recovery requirements for each service. Many IT organizations maintain a service catalog and configuration management database (CMDB) to simplify the process of identifying a comprehensive list of IT services. In the absence of such a catalog, the inventory must be established through a discovery process.
To determine the disaster recovery requirements for each service, it is useful to consider two critical metrics: recovery time objective (RTO) and recovery point objective (RPO). The RTO represents the amount of downtime (usually measured in hours, days or weeks) that the business can tolerate for a given IT service. The RPO, on the other hand, is the amount of data loss (usually between almost zero and a few hours' worth of data) that the company can accept for each of those same services.
In practice, there is often a trade-off between these two objectives: for example, IT services may be restored quickly but with greater data loss, or with minimal data loss but more slowly. Demanding RTOs and RPOs also usually require more expensive technology solutions.
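As an illustration of how the two metrics can be checked against a protection setup, here is a minimal Python sketch. It is not a method described in this article: the `RecoveryObjectives` class, the `check_objectives` function and the example figures are all hypothetical. It also assumes periodic backups, so that the worst-case data loss equals the interval between two backups.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business for one IT service (hypothetical type)."""
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, expressed as a time window


def check_objectives(objectives, backup_interval, estimated_restore_time):
    """Return the list of objectives that are NOT met; an empty list means both are met.

    Assumption: with periodic backups, the worst-case data loss equals the
    backup interval (a failure just before the next backup would have run).
    """
    gaps = []
    if backup_interval > objectives.rpo:
        gaps.append("RPO")
    if estimated_restore_time > objectives.rto:
        gaps.append("RTO")
    return gaps


# Hypothetical example: an order system with a 4-hour RTO and a 15-minute RPO,
# protected by hourly backups that take about 2 hours to restore.
orders = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(minutes=15))
print(check_objectives(orders, timedelta(hours=1), timedelta(hours=2)))  # prints ['RPO']
```

In this example the restore time fits the RTO, but hourly backups cannot satisfy a 15-minute RPO, so closing the gap would require more frequent backups or replication, which is where the cost trade-off appears.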
Dependency Mapping and Technology Assessment
After determining the RTOs, RPOs, and the impact that an outage may have on individual IT services, the next step is to understand all of the IT application components on which they depend. Creating a dependency map for each IT service will ensure that the appropriate recovery measures are in place for all necessary application components, whether they are running in data centers or in the cloud.
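A dependency map of this kind can be represented as a simple graph. The sketch below is only an illustration under stated assumptions: the service and component names in `DEPENDS_ON` are invented, and the helper functions are hypothetical. It collects every component a service needs, directly or indirectly, and derives an order in which each component is restored after the components it depends on, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the components listed.
DEPENDS_ON = {
    "online-store": ["web-frontend", "order-api"],
    "web-frontend": ["order-api"],
    "order-api": ["orders-db", "payment-gateway"],
    "orders-db": [],
    "payment-gateway": [],
}


def all_dependencies(service, depends_on):
    """Collect every component the service needs, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        component = stack.pop()
        if component not in seen:
            seen.add(component)
            stack.extend(depends_on.get(component, []))
    return seen


def recovery_order(service, depends_on):
    """Order the service and its components so dependencies are restored first."""
    needed = all_dependencies(service, depends_on) | {service}
    # TopologicalSorter expects {node: predecessors}; dependencies are predecessors.
    graph = {name: list(depends_on.get(name, [])) for name in needed}
    return list(TopologicalSorter(graph).static_order())


order = recovery_order("online-store", DEPENDS_ON)
print(order)
```

The exact order of independent components may vary, but the database and payment gateway always come before the API that relies on them, and the service itself comes last. A cyclic map would raise `graphlib.CycleError`, which is itself a useful signal that the mapping needs review.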
Next, organizations should assess their data protection and resiliency capabilities for each application, including whether they can meet RTOs and RPOs collectively. This assessment should be done holistically, taking into account the impact of the most severe outage. For example, the right technology may already be in place to recover a single application within the required recovery time, but can that technology recover dozens, hundreds or even thousands of applications in parallel? Can organizations use the same technical solutions in the data center as they do in the cloud? The need for multiple tools will undoubtedly complicate recovery procedures. After assessing current technology capabilities, organizations can then identify additional technical solutions to fill the gaps.
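The question of parallel recovery can be made concrete with a rough estimate. The following Python sketch is a simplification, not a vendor method: the function name, the restore durations and the number of parallel restore slots are all assumptions. It uses a greedy schedule (longest restores first, each assigned to whichever slot frees up earliest) to estimate the wall-clock time needed to restore many applications at once.

```python
import heapq


def estimated_recovery_hours(restore_hours, parallel_slots):
    """Estimate wall-clock hours to restore every application.

    Greedy approximation: handle the longest restores first and give each one
    to the slot that becomes free earliest. Real tooling may schedule differently.
    """
    if parallel_slots < 1:
        raise ValueError("parallel_slots must be at least 1")
    slots = [0.0] * parallel_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for hours in sorted(restore_hours, reverse=True):
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + hours)
    return max(slots)


# Hypothetical figures: each application restores in 30 minutes,
# and the recovery tooling can run 10 restores at the same time.
print(estimated_recovery_hours([0.5], 10))        # one application: 0.5
print(estimated_recovery_hours([0.5] * 200, 10))  # 200 applications: 10.0
```

Under these assumed figures, a restore that comfortably meets a 4-hour RTO for a single application takes 10 hours when 200 applications fail together, which is exactly the kind of gap a holistic assessment is meant to surface.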
Document and test recovery steps
While deploying the right recovery tools is critical, technology alone is not enough to ensure disaster recovery. A critical step is to create a hierarchical set of recovery plans that can be used to guide the business through the recovery process. Higher-level plans will document how recovery activities are coordinated, while lower-level plans will include step-by-step procedures to ensure the recovery of each IT service. Developing and maintaining these plans is a significant investment, but they are essential to ensuring effective recovery from a major incident.
To ensure that the plans will work well in practice, they must be tested regularly. Testing should be done at least once a year, and even more frequently for critical applications. Tests can themselves pose a risk of incident if they involve the use of live data. However, testing is an essential part of disaster recovery planning that should not be ignored.
Building Resilience
The public cloud offers enterprises a highly scalable and resilient platform for hosting workloads. When used properly, it can strengthen the resiliency of IT departments. However, adopting the public cloud does not relieve the enterprise of its responsibility for service availability and disaster recovery. While the cloud offers many building blocks to support a recovery strategy, organizations must use them in combination with other technologies and procedures to develop a cohesive plan.
Achieving multicloud resiliency requires a holistic approach to data assets, some elements of which overlap with the disaster recovery process. Disaster recovery in a multicloud environment raises further questions about where data is stored, which dependencies exist, and how data and workloads can be recovered if something goes wrong at the cloud provider.
The objective of disaster recovery planning and testing is to ensure that recovery is possible in accordance with RPO and RTO objectives. In particular, this gives the enterprise's customers, both internal and external, assurance that the impact of any downtime on them will stay within the agreed limits.