Disaster Recovery: Set, Change and Fine-tune RPO and RTO

One hour -- just one hour -- of downtime costs organizations an average of $84,650, according to a report from Veeam. The average outage lasts 79 minutes. During that time, businesses can’t function properly, leading to lost productivity, data, revenue, customer confidence and security protections.
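To put those two figures together, here is a rough back-of-the-envelope sketch in Python. It assumes downtime cost accrues evenly per minute, which the Veeam figures do not claim; the function name is illustrative only.

```python
# Figures from the Veeam report cited above.
HOURLY_DOWNTIME_COST_USD = 84_650
AVERAGE_OUTAGE_MINUTES = 79


def outage_cost(minutes: float, hourly_cost: float = HOURLY_DOWNTIME_COST_USD) -> float:
    """Estimate the cost of an outage, ASSUMING cost accrues linearly per minute."""
    return hourly_cost * minutes / 60


average_outage_cost = outage_cost(AVERAGE_OUTAGE_MINUTES)
print(f"Estimated cost of an average outage: ${average_outage_cost:,.0f}")
```

Under that linear assumption, an average 79-minute outage works out to roughly $111,000.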

Organizations are conditioned to do whatever it takes to avoid downtime. Because that’s not always possible, most choose to set specific targets for both recovery time and recovery points.

A recovery time objective (RTO) is the amount of time a specific application or workload can be down without significantly affecting the business. IDC research shows that the most common RTO is four hours for mission-critical applications and eight hours for non-critical applications.

A recovery point objective (RPO) is the amount of data an organization is willing to lose during an outage. Calculating RPO requires knowing how much data the application generates and determining the acceptable level of loss. It’s typically a trade-off between lost data and the cost of systems needed to avoid that loss. According to IDC, the most common RPO is one hour for mission-critical applications and four hours for non-critical applications.
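One common way to reason about an RPO is as a time window: with periodic backups or snapshots, the worst-case data loss is the longest gap between two consecutive recovery points. The sketch below checks a schedule against the IDC figures on that basis. The schedule and function names are hypothetical, and the sketch assumes every recovery point is usable.

```python
from datetime import datetime, timedelta

# Most common RPO targets reported by IDC, as cited above.
RPO_MISSION_CRITICAL = timedelta(hours=1)
RPO_NON_CRITICAL = timedelta(hours=4)


def worst_case_data_loss(recovery_points: list[datetime]) -> timedelta:
    """Return the longest gap between consecutive recovery points.

    A failure just before the next recovery point loses everything written
    since the previous one, so the longest gap bounds the data loss.
    """
    ordered = sorted(recovery_points)
    if len(ordered) < 2:
        raise ValueError("need at least two recovery points to measure a gap")
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def meets_rpo(recovery_points: list[datetime], rpo: timedelta) -> bool:
    """True if no gap between recovery points is longer than the RPO."""
    return worst_case_data_loss(recovery_points) <= rpo


# Hypothetical schedule: one backup every two hours, starting at midnight.
start = datetime(2024, 1, 1)
schedule = [start + timedelta(hours=2 * i) for i in range(12)]

print("Meets mission-critical RPO:", meets_rpo(schedule, RPO_MISSION_CRITICAL))
print("Meets non-critical RPO:", meets_rpo(schedule, RPO_NON_CRITICAL))
```

A two-hour backup interval satisfies a four-hour RPO but not a one-hour RPO.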

Today, RPO and RTO are arguably the most important disaster recovery metrics for a company. However, getting RPO and RTO right -- and keeping the pair of metrics on track -- is as much an art as it is a science.

Read on for RPO and RTO advice from the experts.

There is no ‘one size fits all’ when it comes to determining RPO and RTO.

Your organization’s tolerance for downtime can depend on many factors, including your company’s size and the industry you are in. In general, the more you rely on IT to deliver your product or service, the lower your tolerance. Financial organizations, for example, require near-instant protection, because even an hour of downtime can be prohibitively expensive.

The same is true of online retailers and streaming services. For them, downtime means losing business to a competitor.

On the other end of the spectrum are industries like manufacturing, where compute is important but not always critical for keeping workers busy and business moving. These businesses generally can tolerate longer recovery times and recovery points.

IT should never be responsible for setting the company’s RTO and RPO.

While this rule of thumb might seem counterintuitive, it makes a lot of sense. “The IT department typically gets asked to decide all sorts of things, like how long to keep backups and how quickly a system should respond to a disaster, but those decisions should be made by the business based on the impact of those metrics,” said W. Curtis Preston, chief technical evangelist at Druva, a data resiliency company.

Of course, IT should help guide the business in the right direction. “If you asked the average business stakeholder what they want for RPO and RTO, they would say ‘zero and zero’,” Preston said. “IT’s job is to tell them how much it would cost to get to that point so they can make an effective decision.”

Be realistic about your metrics.

Too few organizations truly calculate their cost of downtime. As a result, they accept RPOs and RTOs whose resulting downtime may cost far more than the infrastructure needed for fast recovery, said Phil Goodwin, a research vice president at IDC.
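The comparison Goodwin describes can be sketched as simple arithmetic: add the yearly cost of each recovery option to the downtime cost its RTO implies, then compare totals. Everything below is hypothetical except the hourly downtime figure, which is the Veeam average cited earlier. The option names, infrastructure costs and one-outage-per-year assumption are invented for illustration.

```python
from dataclasses import dataclass

HOURLY_DOWNTIME_COST_USD = 84_650  # Veeam average cited earlier in the article


@dataclass(frozen=True)
class RecoveryOption:
    name: str
    rto_hours: float              # time to restore service with this option
    annual_infra_cost_usd: float  # hypothetical yearly cost of the option


def expected_annual_cost(option: RecoveryOption, outages_per_year: float,
                         hourly_cost: float = HOURLY_DOWNTIME_COST_USD) -> float:
    """Infrastructure cost plus the downtime cost implied by the option's RTO."""
    downtime_cost = outages_per_year * option.rto_hours * hourly_cost
    return option.annual_infra_cost_usd + downtime_cost


# Hypothetical options and prices, for illustration only.
options = [
    RecoveryOption("tape restore", rto_hours=24, annual_infra_cost_usd=50_000),
    RecoveryOption("cloud DR", rto_hours=4, annual_infra_cost_usd=400_000),
    RecoveryOption("hot standby", rto_hours=0.25, annual_infra_cost_usd=1_500_000),
]

cheapest = min(options, key=lambda o: expected_annual_cost(o, outages_per_year=1))
for option in options:
    total = expected_annual_cost(option, outages_per_year=1)
    print(f"{option.name}: ${total:,.0f} per year")
print("Lowest expected cost:", cheapest.name)
```

With these made-up numbers, the cheapest infrastructure (tape restore) turns out to be the most expensive choice once a single 24-hour outage is priced in.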

One of the best ways to calculate RPO and RTO is to define criticality tiers that group applications and workloads by how essential they are to the business. Mike Isaak, a sales engineer with Recovery Point, an IT infrastructure and business resilience provider, suggests using at least four tiers:

Tier Zero for systems that need always-on connectivity, such as Active Directory and DNS;
Tier One for systems that need to be recovered in four hours or less;
Tier Two for systems that need to be recovered within 24 hours; and
Tier Three for systems that need to be recovered within 72 hours.
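As an illustration, the four tiers above could be captured in a small lookup that assigns each system to the least strict tier that still fits its tolerable downtime. The recovery windows come from Isaak's list. The system names and tolerances in the inventory are hypothetical, and treating Tier Zero as a zero-hour window is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    max_recovery_hours: float  # 0 is used here to mean "always on"


# Tiers as listed above, ordered from most to least critical.
TIERS = [
    Tier("Tier Zero", 0),
    Tier("Tier One", 4),
    Tier("Tier Two", 24),
    Tier("Tier Three", 72),
]


def assign_tier(tolerable_downtime_hours: float) -> Tier:
    """Pick the least strict tier whose recovery window fits the tolerance."""
    if tolerable_downtime_hours < 0:
        raise ValueError("tolerable downtime cannot be negative")
    eligible = [t for t in TIERS if t.max_recovery_hours <= tolerable_downtime_hours]
    return eligible[-1]  # Tier Zero is always eligible, so this never fails


# Hypothetical inventory: system name -> tolerable downtime in hours.
inventory = {
    "Active Directory": 0,
    "order database": 4,
    "internal wiki": 48,
    "archive storage": 72,
}

for system, hours in inventory.items():
    print(f"{system}: {assign_tier(hours).name}")
```

A system that can tolerate 48 hours of downtime lands in Tier Two, because Tier Three's 72-hour window would exceed its tolerance.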

Measure and test early and often.

While it’s common to test RPO and RTO when they are first assigned or when a new backup or disaster recovery system is implemented, testing often falls by the wayside after that. That’s a mistake.

While older systems must be tested manually, it’s much easier to test RPO and RTO in cloud-based systems. Newer recovery-specific software such as Zerto also offers full automation and orchestration capabilities, making testing simpler. But testing still takes some manual effort, Isaak said. After a scheduled automated test, end users should also test and validate the system hands-on. “It’s about testing it end to end -- not just bringing up the systems, applications and databases,” Isaak said.
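A recovery drill is only useful if its results are compared against the targets. The sketch below shows one possible way to record a drill and flag what fell short, including the end-user validation step Isaak describes. It is not tied to Zerto or any other product; the class, field names and sample values are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrillResult:
    """Outcome of one recovery drill (all field names are hypothetical)."""
    system: str
    rto: timedelta
    rpo: timedelta
    measured_recovery_time: timedelta
    measured_data_loss: timedelta
    validated_by_users: bool


def find_failures(result: DrillResult) -> list[str]:
    """List every way the drill missed its targets; empty means it passed."""
    failures = []
    if result.measured_recovery_time > result.rto:
        failures.append("recovery time exceeded RTO")
    if result.measured_data_loss > result.rpo:
        failures.append("data loss exceeded RPO")
    if not result.validated_by_users:
        failures.append("end users have not validated the recovered system")
    return failures


# Hypothetical drill: recovered in time, but lost too much data and
# skipped the end-to-end check by end users.
drill = DrillResult(
    system="order database",
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    measured_recovery_time=timedelta(hours=3),
    measured_data_loss=timedelta(minutes=90),
    validated_by_users=False,
)

for failure in find_failures(drill):
    print(f"{drill.system}: {failure}")
```

Treating missing user validation as a failure reflects the point that bringing systems back up is not the same as testing end to end.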

Reevaluate RPO and RTO frequently.

Business processes, business models, technology and external inputs change continually, and any one of those things can change your RPO and RTO needs.

Changes could include the addition of a new application or system, the retirement of legacy applications, execution of a merger or acquisition, or a shift in business direction. Even if nothing changes, Preston recommended that organizations do a full RPO and RTO reevaluation at least annually.
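The reevaluation rule above can be expressed as a simple check: review when a triggering change has occurred, or when a year has passed, whichever comes first. This is a minimal sketch. The trigger labels paraphrase the list above, and the function and variable names are hypothetical.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # "at least annually"

# Change types named in the article; the labels themselves are illustrative.
CHANGE_TRIGGERS = {
    "new application or system",
    "legacy application retired",
    "merger or acquisition",
    "shift in business direction",
}


def review_due(last_review: date, today: date, changes_since_review: set[str]) -> bool:
    """True if a triggering change occurred or the annual review is overdue."""
    triggered = bool(changes_since_review & CHANGE_TRIGGERS)
    overdue = today - last_review >= REVIEW_INTERVAL
    return triggered or overdue


# Hypothetical example: three months since the last review, one acquisition.
print(review_due(date(2024, 1, 15), date(2024, 4, 15), {"merger or acquisition"}))
```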
