DR 101: Recovery Time Objectives (RTOs) — Definition and Drivers (Updated June 2023)
In the event of a disaster that disrupts your business, the time it takes to recover your systems may be the difference between thousands or millions of dollars of lost revenue and productivity. In some cases, disruptions may even affect emergency services or utilities necessary for our health and well-being. Recovering quickly is crucial to mitigating all the possible costs of a disruption.
Every organization has a different tolerance for downtime and disruption. And within the same organization, different departments and services may have varying tolerances too. Understanding those tolerances and defining recovery time objectives (RTOs) for systems is a key part of disaster recovery (DR) planning.
In this post, we define RTOs in detail, explore the factors that affect them, and explain how to achieve the best RTOs for your organization.
RTO: A Target to Mitigate Downtime
In disaster recovery, an RTO is the target time for recovering data and applications. In other words, an RTO is the maximum amount of downtime you plan to tolerate before IT operations, including data and application services, are back online following a disruption.
To understand how to use RTOs in your DR planning, keep two key considerations in mind.
First, consider your desired or necessary recovery time. As “objective” implies, an RTO is a target to mitigate downtime as much as possible. When an RTO is defined by a service-level agreement (SLA) or by regulation for the services you provide, it becomes a hard objective with defined consequences for failing to achieve it.
Second, consider the recovery time your organization can actually achieve with the current technology and plan it has in place. You may have an SLA that requires four-hour RTOs, but can you achieve that during a downtime event with your current recovery capabilities? Could you surpass your SLA and recover within minutes rather than hours? DR testing can show your actual, currently achievable RTO, as opposed to your target recovery time.
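As a minimal sketch of the comparison described above, the following Python snippet checks measured DR test results against a target RTO. The test durations and the four-hour target are hypothetical values chosen for illustration, and `rto_gap` is a helper name introduced here, not part of any product.

```python
from datetime import timedelta

# Hypothetical DR test results: time from disaster declaration to services restored.
test_recovery_times = [
    timedelta(hours=5, minutes=10),
    timedelta(hours=3, minutes=45),
    timedelta(hours=4, minutes=30),
]

# Assumed SLA target, matching the four-hour example in the text.
target_rto = timedelta(hours=4)


def rto_gap(measured, target):
    """Return (worst_case, gap). A positive gap means the target was missed."""
    worst = max(measured)
    return worst, worst - target


worst, gap = rto_gap(test_recovery_times, target_rto)
meets_target = gap <= timedelta(0)
print(f"worst measured recovery: {worst}, target: {target_rto}, meets target: {meets_target}")
```

The sketch judges the plan by its worst test run rather than the average, on the assumption that an SLA is breached by a single slow recovery.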
The Main Drivers for RTOs
RTOs are influenced by many factors, including the level of manual vs. automated recovery processes, such as boot ordering of VMs, redirecting network traffic, and recovering application stacks consistently. Let’s review the most prominent of these.
1) Planning and documenting the recovery
Before the recovery day arrives, you need a plan. It should answer questions like: What recovery steps do you need to follow? What tools do you need? What is your desired RTO? Who is the incident response team that will handle the recovery?
Documenting your recovery plan is crucial to a successful recovery. After the plan is documented, teams need to be trained on their role in the recovery process, whether they are part of initial incident response or the DR team. The more informed teams are, the more quickly they can respond to a disruption effectively.
2) Assessing the disruption
When a disruption occurs, the first step is to determine whether to declare a disaster event, which initiates your DR plans. Enacting a DR plan could mean shifting production workloads to a remote facility or public cloud. Depending on the particular DR plan, this process may be measured in minutes or hours. More importantly, failing back to your primary production environment afterward may take many more hours or days.
Disruption assessment is critical because in some cases the event may not be a disaster: short-term network or power outages, for example, are resolved quickly. A one-hour power outage may prove less disruptive than executing your disaster recovery plan and still allow you to meet your RTOs. Your incident response team should be well trained in making the assessment and choosing the option that best mitigates the particular disruption.
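The assessment above can be sketched as a simple decision rule. This is an illustration only: the function name, its inputs, and the rule itself are assumptions, and a real incident response team would weigh more factors, such as the cost of failing back.

```python
from datetime import timedelta


def should_declare_disaster(estimated_outage, failover_time, rto):
    """Hypothetical rule: fail over only if waiting would break the RTO
    and failing over is faster than waiting for the outage to end."""
    if estimated_outage <= rto:
        # Waiting it out still meets the RTO, so avoid the cost of failover.
        return False
    return failover_time < estimated_outage


rto = timedelta(hours=4)            # assumed target
failover = timedelta(minutes=30)    # assumed time to bring up the recovery site

# A one-hour power outage: riding it out still meets the RTO.
print(should_declare_disaster(timedelta(hours=1), failover, rto))

# A twelve-hour outage: waiting would exceed the RTO, so failover is warranted.
print(should_declare_disaster(timedelta(hours=12), failover, rto))
```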
3) Automating and orchestrating the recovery processes
When a disaster is declared and recovery is initiated, the process can be complex, especially if it is manually driven. Recovery is rarely as simple as just starting up all systems in a recovery site. There may be dozens or even hundreds of applications, VMs, or containers to recover! What’s more, there are likely dependencies between applications that require some to be started before others.
Orchestrating and automating recovery logically can greatly reduce recovery time from hours or days to just a few minutes. When VMs and containers are logically grouped together in recovery groups, you can recover applications in the correct order. Doing so allows hundreds or thousands of workloads to be recovered in a very short time and with very low RTOs.
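To illustrate dependency-aware recovery ordering, here is a minimal sketch using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The group names and their dependencies are hypothetical, and this shows the general technique rather than how any particular DR product orders recovery.

```python
from graphlib import TopologicalSorter

# Hypothetical recovery groups, each mapped to the groups it depends on.
dependencies = {
    "database": set(),
    "message-queue": set(),
    "app-server": {"database", "message-queue"},
    "web-frontend": {"app-server"},
}


def recovery_waves(deps):
    """Group workloads into waves; everything in one wave can start in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(recovery_waves(dependencies), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Grouping workloads into waves lets independent groups start in parallel while still respecting the boot order between dependent groups.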
4) Managing network traffic
A DR scenario typically involves bringing up applications at a remote site or cloud. Depending on how networking is configured across the primary and recovery sites, the recovery process may require steps to redirect network traffic to applications. There are many ways you can configure virtual networks, DNS updates, redirects, and routing to make sure users and systems can connect.
Planned networking ensures a smooth and quick recovery to meet RTOs. If applications are running but users cannot connect, downtime is still occurring. You can plan ahead by understanding your current network architecture and the steps necessary to allow users to connect to the recovered applications within the desired RTOs.
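Because downtime continues until users can actually connect, a recovery runbook might include a connectivity check such as the sketch below. It uses only `socket.create_connection` from the Python standard library; the function names are introduced here for illustration, and a TCP check confirms only that a port accepts connections, not that the application behind it is healthy.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def unreachable_endpoints(endpoints, timeout=2.0):
    """Return the (host, port) pairs that users could not connect to."""
    return [(host, port) for host, port in endpoints
            if not is_reachable(host, port, timeout)]
```

After failover, calling `unreachable_endpoints([("app.example.com", 443)])` with your own hostnames (the one shown is a placeholder) returns an empty list once traffic is being routed to the recovered applications.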
5) Recovering applications consistently
The recovery time for an application depends on the application consistency of a recovery point. Many solutions can recover an application running on a single VM in an application- or crash-consistent state, but when an application spans multiple VMs or containers, most solutions will struggle to recover it consistently. Inconsistent recovery results from solutions that are unable to treat a group of VMs or containers as a single entity.
Without consistency during recovery, applications may not start up. Instead, they require recovery to a more consistent point in time or need a second recovery process designed to synchronize data across application components. These extra steps take time, potentially pushing the recovery process beyond the desired RTOs.
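One way to picture treating a group of VMs as a single entity is to look for the newest recovery point that every VM in the group shares. The sketch below is a simplified illustration with hypothetical VM names and checkpoint timestamps; real solutions coordinate checkpoints across VMs rather than searching for overlaps after the fact.

```python
# Hypothetical checkpoint timestamps (in seconds) available for each VM in one application.
checkpoints = {
    "web-vm": [100, 105, 110, 115],
    "app-vm": [100, 105, 110],
    "db-vm": [100, 110, 112],
}


def latest_common_checkpoint(per_vm):
    """Return the newest timestamp at which every VM has a checkpoint, or None."""
    if not per_vm:
        return None
    common = set.intersection(*(set(times) for times in per_vm.values()))
    return max(common) if common else None


print(latest_common_checkpoint(checkpoints))
```

In this example the newest checkpoint on any single VM is 115, but the newest point shared by all three is 110, so recovering the whole application consistently means going back to 110.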
6) Conducting DR testing frequently
DR testing may be the most crucial factor in achieving low RTOs. A recovery plan that is not tested regularly accumulates problems that increase recovery time. IT environments and teams change, and a recovery plan that was once bulletproof can quickly become filled with gaps. Testing reveals gaps in documentation, training, and execution that are not easily visible on paper.
Ideally, recovery testing should be performed multiple times per year to properly train teams and update recovery plans. The reality is that organizations delay recovery testing because it can be resource intensive and disruptive to operations, often due to recovery tools that test slowly and impact production. But executing an untested recovery plan can significantly increase recovery time and exceed desired RTOs.
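A simple way to keep testing from slipping is to track when each recovery plan was last tested and flag the ones that are overdue. The sketch below assumes a 180-day maximum interval (roughly twice a year) and uses hypothetical plan names and dates; choose an interval that fits your own policy.

```python
from datetime import date, timedelta

# Hypothetical record of the last DR test date for each recovery plan.
last_tested = {
    "billing": date(2023, 1, 10),
    "crm": date(2022, 6, 1),
    "intranet": date(2023, 5, 20),
}


def overdue_plans(records, today, max_age=timedelta(days=180)):
    """Return the names of plans whose last test is older than max_age."""
    return sorted(name for name, tested in records.items()
                  if today - tested > max_age)


print(overdue_plans(last_tested, today=date(2023, 6, 30)))
```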
Optimizing Your RTO
A successful disaster recovery solution should include certain key features that optimize RTOs.
• Orchestration
Orchestrating recovery at scale across hundreds of VMs, containers, and applications is essential to achieving low RTOs. Easily grouping workloads for recovery makes it far easier to recover applications in order, which leads to quick recovery.
• Automation
Too many manual steps increase recovery time and create room for human error. A solution that automates recovery and provides APIs to integrate with third-party management tools significantly reduces recovery time, especially when recovering dozens or hundreds of workloads and applications.
• Application-Consistent Recovery Points
A recovery solution should provide application- or crash-consistent checkpoints for applications. Otherwise, recovery time will be wasted on reconciling differences between application components that prevent them from functioning.
• Recovery Testing
Recovery testing that doesn’t disrupt your normal operations is key to testing regularly. A good recovery solution has built-in mechanisms for automated testing that doesn’t interrupt protection or production services. When testing is automated and nondisruptive, it can be carried out during normal business hours and takes less time and resources to complete.
Achieving the Lowest RTOs in the Industry with Zerto
In today’s digital business world, it’s more important than ever to keep RTOs as low as possible. Zerto consistently achieves the lowest RTOs in the industry to help businesses become ransomware resilient. Using its own continuous data protection (CDP) technology, which combines real-time replication, recovery point journaling, and application-centric protection, Zerto adapts your organization’s DR strategy to minimize downtime.