DR 101: Recovery Point Objective (RPO)
By Joshua Stenhouse, Zerto Technical Marketing Manager
In disaster recovery, the Recovery Point Objective (RPO) is a widely used term. However, it is seldom fully understood: few can explain it simply or quantify the impact it has on a recovery scenario.
“The Recovery Point Objective is the point in time you can recover to in the event of a disaster.”
Put into context, this means that if you have a disaster that halts production operations, such as a database corruption, power outage, SAN failure or user error, you will lose all of the data back to the achieved RPO. If your critical applications have an RPO of 4 hours, you would lose 4 hours of data, because 4 hours ago is the last point in time to which you can recover.
The cost of just 1 hour of lost data can be significant for a business of any size, and the impact only grows as you scale. Take a sample organization with a turnover of $100m: dividing that annual revenue by 365 days and then 24 hours puts a single lost hour at roughly $11,400.
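The simplistic calculation described here (annual revenue divided by days in a year, then hours in a day) can be sketched as follows; the function name and the $100m sample figure are illustrative:

```python
def hourly_data_loss_cost(annual_revenue: float) -> float:
    """Rough revenue attributable to one hour of operation:
    annual revenue / 365 days / 24 hours."""
    return annual_revenue / 365 / 24

# The sample organization with a $100m turnover:
cost = hourly_data_loss_cost(100_000_000)
print(f"${cost:,.0f} per hour of lost data")  # roughly $11,416
```

A 4-hour RPO on this model therefore represents around $45,000 of exposure per incident, which is why the per-application SLAs below matter.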
Of course, this is a very simplistic calculation, but any attempt to be more exact would be a vast, complicated exercise in arriving at an answer of "I don't know". The reason is that it is impossible to know when a disaster will strike and over what period the data loss will occur. You could be lucky and have a disaster outside working hours and lose no data, but this assumes your organization even has the concept of "out of working hours", which many in this day and age do not. Or you could be really unlucky and have a disaster at peak usage during your busiest period, and of the two it always seems to be the latter!
Due to the impact of RPO on data loss, it is recommended to agree an acceptable and achievable RPO on a per-application basis with all of the relevant business units and owners. Once this has been done you might have a basic SLA such as the one below:
- CRM System — 1 hour RPO
- Finance System — 1 hour RPO
- Email — 2 hour RPO
- File Servers — 4 hour RPO
- Directory Service — 8 hour RPO
- Print Servers — 24 hour RPO
- Development Servers — 24 hour RPO
If you have implemented a BC/DR technology that delivers the above RPOs by replicating on a schedule defined by the SLA, you may think you are covered, but you could be wrong.
The reason is that you would always be "red lining" your achieved RPO against your SLA. By replicating on an hourly basis, perhaps with a SAN-based snapshot, the best you will ever do is meet the SLA. If there is a large amount of data change you might start to miss that SLA and not be able to recover to a point acceptable to the business.
This is why you should always aim for the lowest RPO possible, then configure alerts to warn when the achieved RPO gets close to your defined SLA. To ensure low-priority applications don't impact the RPO of high-priority applications, a priority and Quality of Service (QoS) setting should be applied to individual replication streams, so that they are prioritized accordingly under high IO and/or constrained bandwidth.
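One way to picture the alerting described above is a simple check of achieved RPO against the SLA, warning once a threshold is crossed. This is a hypothetical sketch: the SLA figures come from the example list earlier, while the function name and the 75% warning ratio are assumptions, not any product's behavior:

```python
WARN_RATIO = 0.75  # assumed threshold: alert at 75% of the SLA

# Per-application RPO SLAs from the example above, in minutes
sla_minutes = {
    "CRM System": 60,
    "Finance System": 60,
    "Email": 120,
    "File Servers": 240,
}

def rpo_status(app: str, achieved_minutes: float) -> str:
    """Classify an application's achieved RPO against its SLA."""
    sla = sla_minutes[app]
    if achieved_minutes > sla:
        return "BREACH"          # recovery point is older than the SLA allows
    if achieved_minutes >= sla * WARN_RATIO:
        return "WARN"            # "red lining": achieved RPO nearing the SLA
    return "OK"

print(rpo_status("CRM System", 20))     # OK
print(rpo_status("Email", 100))         # WARN
print(rpo_status("File Servers", 300))  # BREACH
```

The point of the WARN state is to give you time to act (add bandwidth, reschedule replication) before the business-agreed SLA is actually breached.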
By applying QoS you can ensure that any available bandwidth is used to maintain a consistently low RPO across all of your applications, while if bandwidth becomes constrained, only the high-priority applications continue to maintain that low RPO.
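The priority behavior described here can be sketched as a strict-priority allocator: high-priority replication streams are satisfied first, and lower priorities receive whatever bandwidth remains. The stream names, demands, and allocation scheme are illustrative assumptions, not a description of any specific product:

```python
def allocate_bandwidth(streams, available_mbps):
    """Strict-priority allocation of replication bandwidth.

    streams: list of (name, priority, demand_mbps) tuples,
             where a lower priority number means higher priority.
    Returns {name: allocated_mbps}.
    """
    allocation = {}
    remaining = available_mbps
    for name, _priority, demand in sorted(streams, key=lambda s: s[1]):
        grant = min(demand, remaining)  # satisfy as much demand as is left
        allocation[name] = grant
        remaining -= grant
    return allocation

streams = [
    ("CRM replication", 1, 50),    # high priority
    ("Email replication", 2, 40),  # medium priority
    ("Dev replication", 3, 30),    # low priority
]
# With only 80 Mbps available, CRM gets its full 50, Email the
# remaining 30, and the low-priority Dev stream gets nothing.
print(allocate_bandwidth(streams, 80))
```

Under plentiful bandwidth every stream gets its full demand and all RPOs stay low; under constraint, only the high-priority streams hold their RPO, which is exactly the trade-off the SLA discussion above asks the business to agree to in advance.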
I hope this has given you a good insight into RPO and that you can now use this knowledge to help define your BC/DR strategy and SLAs.