Aaron Ricadela | Senior Writer | July 25, 2024
Disasters of many types can knock critical systems offline, damage offices and data centers, or render the databases and applications needed to run normal business operations temporarily unusable. A disaster recovery plan is a business’s process and technology roadmap for getting its most important systems and applications back up quickly so it can resume work while restoring others.
Disaster recovery (DR) encompasses a business’s technical plans for getting its computing workloads back online after a disruptive event, as well as the methods for testing the playbook before calamity strikes. In a disaster recovery plan, workloads are ranked in order of importance. Businesses aim to minimize computing downtime and lost data while balancing the cost of doing so for each workload.
While disaster recovery has long been an important component of IT operations, cloud computing and software architectures designed for the internet are lowering the cost and work of implementing comprehensive disaster recovery plans.
Disaster recovery describes the policies, technologies, and budget that businesses devote to bringing important IT systems back online after unexpected downtime caused by operator errors, malfeasance, software bugs, natural disasters, or other calamities. Before a disruption occurs, businesses need to identify which mission-critical applications must be restored immediately after a disaster and rank others in groups of importance, called tiers. Then they need to decide how much downtime and data loss the business can withstand for each application and plan IT strategies accordingly.
Disaster recovery is important because unplanned downtime caused by disruptive events can lead to substantial financial losses—on the order of US$100,000 per hour, according to industry estimates. Prolonged downtimes can also harm a brand’s reputation and result in regulatory reprimands or penalties. In some highly regulated industries, including financial services, energy, and healthcare, companies need to restore data and computing operations faster than conventional backup data copies allow.
Unplanned downtime can cost lives, too, in fields such as emergency services and healthcare. If there’s a catastrophic event—such as a hurricane, tornado, or earthquake—then all services are at risk. Can information flow where it needs to in order to save lives?
There are two critical disaster recovery metrics: recovery time objective (RTO), which measures the maximum amount of time a system can remain offline, and recovery point objective (RPO), which measures how much data a business can afford to lose and is associated with the frequency of backups or replication. For both, shorter thresholds are better but costlier. IT organizations often set an RTO and RPO for each system they run, allowing them to balance costs with criticality.
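To make the two metrics concrete, here is a minimal Python sketch that checks a protection setup against a system's objectives. The `Objectives` class, the `meets_objectives` function, and every figure in it are illustrative assumptions, not part of any product or standard. It treats the backup or replication interval as the worst-case data loss and an estimated restore time as the worst-case downtime.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    """Targets agreed for one system (values used below are illustrative)."""
    rto: timedelta  # maximum tolerable time offline
    rpo: timedelta  # maximum tolerable window of lost data


def meets_objectives(
    objectives: Objectives,
    backup_interval: timedelta,
    estimated_restore_time: timedelta,
) -> dict[str, bool]:
    """Compare a protection setup against the RTO and RPO.

    In the worst case a failure strikes just before the next backup or
    replication cycle, so up to one full interval of data is lost.
    """
    return {
        "rpo_met": backup_interval <= objectives.rpo,
        "rto_met": estimated_restore_time <= objectives.rto,
    }


# Hypothetical order-entry system: 1 hour RTO, 15 minute RPO.
targets = Objectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))

# Nightly backups with a six-hour restore versus five-minute replication
# with a thirty-minute restart.
nightly = meets_objectives(targets, timedelta(hours=24), timedelta(hours=6))
replicated = meets_objectives(targets, timedelta(minutes=5), timedelta(minutes=30))
print(nightly)
print(replicated)
```

In this example the nightly backup misses both targets, while frequent replication meets them, which is the cost trade-off the metrics are meant to expose.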
DR is a well-established practice area, but greater use of cloud services, combined with so-called “pilot light” deployments that pair live, up-to-date data with standby services to restart a system in a cloud data center, is helping planners deliver strong RTO and RPO metrics for less money. That’s because cloud providers invest in redundancy at every infrastructure layer, allowing for automated and semiautomated failover and recovery processes. These are investments that their customers no longer need to make. In addition, pilot light deployments can reduce the time needed to get services back up and running to minutes.
Cloud-based DR deployments are covered in more detail below.
Many types of disasters can affect IT systems, including cyberattacks, hardware failures, natural disasters, and outages caused by human error. Some you can anticipate. For example, all organizations can be targeted by cyberattacks. Some companies are based where natural disasters, such as hurricanes, earthquakes, and floods, are more likely to occur. Human error is a constant.
The job is to be ready to react when something goes wrong.
Unplanned outages are unexpected interruptions in a system or service that result in downtime and disruption to normal operations. These outages can occur due to the factors just discussed and can have serious consequences for businesses, including lost revenue, reputational damage, decreased customer satisfaction, and even loss of life. It’s essential to have recovery plans in place to minimize the impact of unplanned outages and ensure the rapid restoration of services.
High-availability technologies keep workloads running by replicating data among the nodes in a cluster or by clustering servers so they can fail over to one another, which supports very high IT service levels. These technologies seek to eliminate single points of failure and generally are backed by service level agreements that guarantee uptime percentages. In cloud computing, high availability protects physical infrastructure, including power, cooling, storage, networks, and servers. Application-level load balancing software also helps ensure high levels of uptime.
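As a rough illustration of what an uptime percentage means in practice, the sketch below converts a guarantee into permitted downtime per year and shows how redundancy raises availability. The function names are hypothetical, and the cluster formula assumes node failures are independent, which real deployments only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Yearly downtime permitted by an uptime guarantee such as 99.9."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of a cluster that stays up while any one node is up.

    Assumes failures are independent; shared power or networking can
    take out every node at once, so treat the result as an upper bound.
    """
    return 1 - (1 - node_availability) ** nodes


for sla in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% uptime allows {minutes:.1f} minutes of downtime per year")

print(f"Two 99% nodes together: {combined_availability(0.99, 2):.4%}")
```

A 99.9% guarantee, for instance, still permits roughly 526 minutes, or just under nine hours, of downtime a year, which is why removing single points of failure matters.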
Disaster recovery, on the other hand, protects against multiple points of failure and aims to restore critical workloads to an operational state after an extreme disruption, such as when an earthquake or hurricane takes a facility down. DR sites are typically geographically distant from one another.
Both high-availability and DR technologies should be part of a comprehensive business continuity plan.
The primary goal of a disaster recovery plan is to ensure that business units can continue working during a crisis. DR plans include processes for quickly restarting computing services and limiting data—and dollar—losses. They also aim to satisfy regulatory requirements governing business continuity and data retention.
The two primary metrics for disaster recovery plans are recovery time objective (RTO) and recovery point objective (RPO). Each system a business runs may have different RTO and RPO requirements depending on the service level agreements between IT and the relevant business units.
For each application or service, the RTO is the maximum allowable downtime after an unplanned outage, while the RPO is the maximum amount of data loss a business is willing to tolerate, usually expressed as a window of time. Tighter thresholds are better but generally more expensive. IT organizations can set an RTO and RPO for each system they run to balance costs with criticality.
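One way to picture the balance between cost and criticality is to pick, for each system, the cheapest approach that still meets its targets. The following Python sketch does that over a small catalog. The catalog entries, their recovery times, and their costs are placeholder assumptions chosen for illustration, not benchmarks for any real service.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_rto: timedelta
    typical_rpo: timedelta
    monthly_cost: float  # arbitrary illustrative units


# Hypothetical catalog; every figure here is a placeholder.
CATALOG = [
    Strategy("backup and restore", timedelta(hours=24), timedelta(hours=24), 100),
    Strategy("pilot light", timedelta(minutes=30), timedelta(minutes=5), 400),
    Strategy("full-capacity standby", timedelta(minutes=1), timedelta(0), 1500),
]


def cheapest_strategy(
    rto: timedelta, rpo: timedelta, catalog: list[Strategy] = CATALOG
) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, or None."""
    candidates = [
        s for s in catalog if s.typical_rto <= rto and s.typical_rpo <= rpo
    ]
    return min(candidates, key=lambda s: s.monthly_cost, default=None)


# A reporting tool that can wait two days versus an order system that cannot.
reporting = cheapest_strategy(timedelta(hours=48), timedelta(hours=24))
orders = cheapest_strategy(timedelta(hours=1), timedelta(minutes=15))
print(reporting.name if reporting else "no match")
print(orders.name if orders else "no match")
```

When no entry satisfies the targets, the function returns `None`, which signals that either the objectives or the budget needs to be revisited.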
DR plans include thorough assessments of the potential risks of catastrophic events, the damage to operations they’d potentially cause, how employees and external stakeholders may be affected, and the financial losses or regulatory fines that could be incurred as a result.
As part of developing a DR plan, companies need to identify executive sponsors and affected teams; catalog physical and IT assets that could be harmed during a disaster; and consider the potential impacts on customers, suppliers, partners, and other stakeholders.
IT departments need to decide which workloads can be restored from backups, which require live data combined with services running at lower capacity, and which workloads need full capacity. In some cases, active systems that are down will automatically switch over to standby systems, incurring minimal downtime and zero data loss. In other cases, the switchover will be manual. IT teams will want to select backup sites and craft a plan that lets them quickly restart applications. Cloud regions can serve as those backup sites without the expense of a second company-owned data center. Businesses also need to look for IT dependencies that could impede restarting operations, such as cases where one offline application prevents bringing another back online.
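Dependency mapping of this kind lends itself to a topological sort, which yields a restart order in which every service comes up after the services it needs. Below is a minimal sketch using Python's standard `graphlib` module. The service names and the dependency map are hypothetical examples.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: each service lists what must be running before it starts.
DEPENDS_ON = {
    "database": [],
    "identity-service": ["database"],
    "order-api": ["database", "identity-service"],
    "web-frontend": ["order-api"],
}


def restart_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return services in an order that starts dependencies first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = restart_order(DEPENDS_ON)
print(" -> ".join(order))
```

A circular dependency, where two applications each wait on the other, raises an error here. Finding such loops during planning is far cheaper than discovering them in the middle of a recovery.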
In addition to these technical aspects, executive leadership and lines of business should have emergency communication and response plans in place as well as provisions for training employees on the DR plan, testing and rehearsing it via tabletop testing or walk-throughs, and continuously improving it.
Every DR plan should include a risk assessment of events that could interrupt business operations, an impact analysis of the applications that could be affected, and an estimate of the resultant financial losses. The business impact analysis should include RTOs and RPOs for each application. Businesses can then decide on their recovery plans and choose where it makes sense to trade higher costs for shorter recovery time and recovery point objectives.
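Approaches to backup and recovery fall along a performance-cost spectrum, from restoring workloads from backups, to keeping live data with services running at reduced capacity, to maintaining standby systems at full capacity.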
It isn’t enough to create an IT inventory, determine application tiers, and map dependencies. For DR to work at the level the business expects, every technology, from operating systems to applications, needs to be redundant. DR success also depends on regular testing, whether that be tabletop exercises, in which stakeholders run through the steps verbally, or a physical walk-through of the measures IT departments will take and testing of the system components that are used only during disasters.
Financial reporting and data protection regulations also impact DR plans. For example, the Sarbanes-Oxley Act, a US corporate financial reporting regulation, sets data retention requirements. The US Health Insurance Portability and Accountability Act (HIPAA) requires contingency plans for protecting electronic health information during a disaster, and the European Union’s General Data Protection Regulation (GDPR) requires organizations to be able to restore the availability of and access to personal data in a timely manner after a physical or technical incident.
Disaster recovery as a service (DRaaS) is a cloud service that lets enterprises run applications in a public cloud or hybrid cloud, with a DR plan enacted in the cloud providers’ facilities instead of an on-premises data center. Cloud-based DRaaS offerings let companies transition compute, database, and application loads among cloud regions remotely and automate the steps needed to recover business systems without re-architecting them or using specialized management software. It’s crucial that a cloud provider’s DRaaS solution is designed for high availability at the standby region to ensure the service is accessible and functional during a catastrophic event.
Businesses can use DR in the cloud to plan for recovering data after a natural disaster that destroys infrastructure or after a cyber incident, such as a ransomware attack, where access to local network resources is cut off. Because the data can be stored in a regional cloud, the strategy can be made compliant with data protection regulations such as the GDPR. DRaaS can also be a good solution when budgets are tight, since costs can be lower than those of setting up redundant recovery sites.
Developing a disaster recovery plan should start with a risk assessment of potential catastrophic events and their impact on IT systems and business processes. Then IT and line-of-business teams, supported by management, should rank assets and systems by their importance and assign DR strategies to protect each, considering the desired RTOs and RPOs and the available budget. DR plans are part of broader business continuity plans for bridging the time from a disaster, cyberattack, or outage caused by a technical error to recovery. They need to be continually tested and updated.
Traditional DR relies on redundant servers and storage devices located in a company-owned data center, or on backing up business data and application instances to remote data centers so that a problem in one geographic area is unlikely to damage the remote copies. Cloud-based DR strategies, by contrast, let businesses save on up-front costs by storing smaller or standby copies of application instances in a public cloud, scaling them up by adding computing resources when they need to be activated in an emergency. Businesses can also distribute mission-critical applications across multiple cloud regions.
A disaster recovery workflow contains an overview of the steps and sequences needed to restart systems, recover data, and communicate during a crisis. DR runbooks go into more detail on recovery processes and the associated documentation. They provide easy-to-follow checklists for moving digital operations to safety during emergencies, and they can ease testing or failover during an emergency. Workflows and runbooks show businesses how to stage a recovery in phases, and they identify critical systems and service level agreements.
DR workflows cover risk assessments, the committees and management sponsors involved in the plan, recovery strategies, and testing procedures. Runbooks may contain detailed checklists for different databases, servers, and networking gear so staff can carry out recovery steps under time pressure.
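A runbook checklist can be pictured as an ordered list of steps that is executed in sequence and halts at the first failure so staff can intervene. The Python sketch below models that idea. The `Step` class, the step descriptions, and the stand-in actions are assumptions made for illustration; a real runbook would call a provider's tooling or scripts at each step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    description: str
    action: Callable[[], bool]  # returns True when the step succeeded


def run_runbook(steps: list[Step]) -> list[str]:
    """Run steps in order and stop at the first failure."""
    log = []
    for number, step in enumerate(steps, start=1):
        ok = step.action()
        status = "done" if ok else "FAILED"
        log.append(f"{number}. {step.description}: {status}")
        if not ok:
            break  # later steps depend on this one, so do not continue
    return log


# Placeholder actions; the third step fails to show the halt behavior.
runbook = [
    Step("Promote standby database", lambda: True),
    Step("Start application servers", lambda: True),
    Step("Redirect network traffic", lambda: False),
    Step("Notify stakeholders", lambda: True),
]

result = run_runbook(runbook)
for line in result:
    print(line)
```

The same structure works for rehearsals: swapping in dry-run actions lets a team test the sequence without touching production systems.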
A disaster recovery operation is the process of executing each predetermined step or task in a DR plan that’s required to restore an organization’s infrastructure, databases, and applications to a fully operational state. Two terms, failover and switchover, are used to describe an application stack’s transition to a different location.
Failover provides a quick shift to a backup system during unexpected crises, including power outages and equipment failure. It’s employed when applications, databases, and virtual machines have crashed and resources such as storage, data, and operating systems are in an unstable state.
Switchover is the orderly transition to a secondary system during planned downtime for maintenance. It allows applications, databases, and virtual machines or servers to be shut down in an orderly way. In this case, both the primary and standby regions operate normally, and IT operations staff move systems from one region to another for maintenance or to complete rolling upgrades.
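The practical difference between the two transitions can be sketched in a few lines of Python: a switchover begins with an orderly shutdown on a healthy primary, while a failover skips that step because the primary is assumed to be unavailable or unstable. The names and step descriptions below are hypothetical and simplified.

```python
from enum import Enum


class Transition(Enum):
    FAILOVER = "failover"      # unplanned; primary has crashed or is unstable
    SWITCHOVER = "switchover"  # planned; primary and standby are both healthy


def choose_transition(primary_healthy: bool) -> Transition:
    """Pick the transition type from the state of the primary site."""
    return Transition.SWITCHOVER if primary_healthy else Transition.FAILOVER


def transition_steps(kind: Transition) -> list[str]:
    """List simplified steps for moving an application stack to standby."""
    steps = []
    if kind is Transition.SWITCHOVER:
        # Only possible when the primary can still respond to commands.
        steps.append("shut down applications, databases, and servers on primary")
    steps.append("activate databases and applications on standby")
    steps.append("redirect users to standby")
    return steps


for healthy in (True, False):
    kind = choose_transition(healthy)
    print(kind.value, transition_steps(kind))
```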
Cloud computing’s flexibility lets businesses implement DR strategies that fit their requirements without overextending their budgets. Hybrid cloud arrangements, in which some computing resources run on-premises and some in a public cloud, can lower the cost of disaster recovery. Cloud architectures, including microservices, let software components run on distributed virtual servers, making them less vulnerable to many types of disasters.
Cross-regional disaster recovery solutions protect organizations from outages, such as those caused by hurricanes, that would knock out access to systems hosted in just one data center. Services can run in fault-tolerant, geographically separate and isolated availability domains outside the impact zone. An entire application stack for a given system, including virtual machines, databases, and applications, can be transitioned to a different cloud region in another location.
Hybrid cloud is a popular architecture that lets enterprises transition some workloads from their own data centers to cloud infrastructure. It can be helpful for disaster recovery too. Adopting a hybrid architecture generally requires running workloads on virtual servers so the underlying hardware within the cloud data center can easily change without affecting operations.
Once workloads are virtualized, they can be restarted in a cloud environment when primary data centers become unavailable. Cloud data centers can be economical alternatives to arrays of geographically dispersed data centers.
Multicloud DR solutions protect applications and data by spreading applications’ components across the cloud infrastructures of two or more providers. This strategy can suit businesses that use more than one cloud provider, letting them set recovery time and point objectives for different applications while managing costs and making decisions about geographic dispersion. A multicloud DR process might also derive from how services and applications were developed.
Disaster recovery orchestration and management services can provide comprehensive DR for all the layers of an application stack, including infrastructure, databases, and middleware. DRaaS reduces human error and minimizes recovery time by quickly executing disaster recovery workflows to restore application stacks in different regions.
Oracle Cloud Infrastructure (OCI) Full Stack Disaster Recovery lets customers manage the transition of infrastructure, databases, and applications between OCI regions worldwide. Customers can use Full Stack DR without redesigning or redeploying existing infrastructure, databases, or applications, while eliminating the need for specialized storage or management servers.
Why is disaster recovery important for businesses?
Unplanned enterprise outages are expensive. More than two-thirds of them cost more than US$100,000, according to the IT advisory group Uptime Institute, while a quarter of unplanned IT outages cost more than US$1 million.
What are the key components of a disaster recovery plan?
A disaster recovery plan includes a company’s strategy for selecting backup sites or deploying computing workloads in a public cloud in a way that lets it swiftly restart operations. Organizations also need to rank their mission-critical and important business applications and map dependencies among them that could stand in the way of getting software back online.
How does disaster recovery differ from data backup?
Backing up data to a remote server or site is one aspect of disaster recovery, but modern DR plans cover much more. Companies need to consider technology strategies that balance data replication with service availability to keep costs in check while letting them restart applications from a small, standby instance.
How does cloud computing impact disaster recovery?
Cloud technologies can provide safeguards during a disaster by separating cloud regions into availability domains that are isolated from one another and fault tolerant. Companies can replicate systems for high availability and disaster recovery using the facilities and utilities often provided by the cloud vendor.