By Sean M Masters, Manager of Solutions Engineering, Americas West
Blog originally published at https://www.linkedin.com/pulse/disaster-recovery-public-facing-applications-sean-m-masters
One of the questions I’m often asked is how to protect and recover public-facing applications, that is, applications accessible over the Internet. These commonly include e-commerce sites, SaaS platforms, supplier and purchasing solutions, and more, and they are often major revenue drivers for the business. With increasing B2B and B2C demands, rising competition, and tightening budgets, it grows ever more critical to guarantee minimal data loss and a rapid return to operations in the event of a disaster.
My experience at an e-commerce startup in the not-too-distant past showed me first-hand the challenges internal stakeholders and IT teams face during major system outages. Frankly, the internal pressures can pale in comparison to those applied by public consumers or business partners banging down the door when an outage occurs. Given the public visibility of many outages today, these incidents are often referred to as “résumé-generating events”.
With planning, design, and continuous testing, we can address these concerns, reducing the revenue impact of an outage and shortening the time to recovery.
The risks facing these platforms include extended downtime, data loss, lost revenue, and the reputational damage that follows a highly visible outage.
Before we look at one way to address these problems today, let’s review how we tackled them with legacy approaches.
The number of servers and systems involved meant either a team of people monitoring for uptime 24/7 or a heavy reliance on complex monitoring systems. Rapid data growth meant constant challenges in adding and maintaining database features such as log shipping, mirroring, and clustering; these features often bring storage growth or sprawl, increased licensing costs, new personnel skill requirements, and even additional headcount. More recently, many organizations have tried to reduce risk with storage-based or in-guest replication technologies, but both have their pros and cons, and worse still, both fail to address the rise of virtualization, the growing complexity of recovery, and the need to do more with less.
Today, public outcry over downtime events brings management, and even shareholder or board, pressure across the IT organization. What is needed is a data protection and recovery solution that protects the full application stack, preserves data integrity in replication and recovery, does not impact the running application, and recovers rapidly to minimize user downtime. By coupling such a platform with a managed pilot-light service and DNS services, we can mitigate or virtually eliminate all of our defined risk factors.
Step 1 – Protect the Core Application
By leveraging a modern data protection solution, we group the complete application into a single wrapper in which we can replicate the full stack as one unit, preserve data integrity in replication and recovery, avoid impacting the running application, and recover the entire application rapidly.
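The specifics vary by product, but the idea is to describe the whole stack, its recovery targets, and its boot order in one place. Here is a minimal sketch in Python; ProtectionGroup, its fields, and the VM names are purely illustrative assumptions, not any particular vendor’s API:

```python
from dataclasses import dataclass, field

@dataclass
class ProtectionGroup:
    """Hypothetical wrapper that groups an application's VMs so they
    replicate and recover together as a single unit."""
    name: str
    rpo_seconds: int                 # target recovery point objective
    journal_history_hours: int       # how far back in time we can rewind
    boot_order: list = field(default_factory=list)  # tiers started in sequence on recovery

# Group the full e-commerce stack so it is protected as one application.
storefront = ProtectionGroup(
    name="ecommerce-prod",
    rpo_seconds=15,
    journal_history_hours=72,
    boot_order=[
        ["db-01", "db-02"],             # databases first
        ["app-01", "app-02"],           # application tier next
        ["web-01", "web-02", "lb-01"],  # web tier and load balancer last
    ],
)

if __name__ == "__main__":
    for step, vms in enumerate(storefront.boot_order, start=1):
        print(f"Recovery step {step}: power on {', '.join(vms)}")
```

The key design point is that the unit of protection is the application, not the individual server, so a recovery brings back every tier to a consistent point rather than a mismatched collection of restores.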
Step 2 – Prepare the Infrastructure for Disaster
To ensure rapid recovery for users and customers, we need to pre-configure the infrastructure components that exist outside the core application with the features needed to support a disaster declaration. This involves standing up a pilot-light environment at the recovery site and pre-staging supporting services, most notably DNS, so that traffic can be redirected the moment a disaster is declared; the DNS piece is sketched below.
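DNS is a good example of pre-staging: failover records can be created long before a disaster so that a declaration only changes which endpoint answers. Here is a minimal sketch using Amazon Route 53 via boto3 as one possible implementation (the source does not prescribe a provider); the hosted zone ID, domain, IP addresses, and /healthz path are placeholders, and any DNS service with health-checked failover records would work similarly:

```python
import uuid
import boto3  # AWS SDK for Python

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000000000"   # placeholder hosted zone
DOMAIN = "shop.example.com."
PRIMARY_IP, RECOVERY_IP = "203.0.113.10", "198.51.100.10"  # documentation IPs

# Health check against the primary site; DNS flips to SECONDARY if it fails.
health = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "IPAddress": PRIMARY_IP,
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/healthz",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

def failover_record(role, ip, health_check_id=None):
    """Build an UPSERT change for a PRIMARY or SECONDARY failover A record."""
    record = {
        "Name": DOMAIN,
        "Type": "A",
        "SetIdentifier": f"{DOMAIN}{role.lower()}",
        "Failover": role,                      # "PRIMARY" or "SECONDARY"
        "TTL": 60,                             # short TTL keeps failover fast
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        failover_record("PRIMARY", PRIMARY_IP, health["HealthCheck"]["Id"]),
        failover_record("SECONDARY", RECOVERY_IP),
    ]},
)
```

With records like these staged ahead of time, redirecting users during a declaration is a matter of the health check tripping, or of a deliberate record change, rather than an emergency DNS edit made under pressure.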
Step 3 – Continuously Validate Recovery Plans
To ensure a rapid return to operations in the event of a real disaster, it is critical to test and validate your achievable recovery time objective (RTO) frequently; some organizations test on a weekly basis. This testing must also be simple enough that anyone on the Emergency Response Team can perform it in as few steps as possible. Such testing often includes non-disruptive test failovers, verification that each tier of the application comes online and responds, and measurement of the actual recovery time against the target RTO; a simple drill of that shape is sketched below.
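As a rough illustration of how such a drill can be scripted, here is a sketch in Python. The start_test_failover function is a hypothetical stand-in for whatever your recovery platform exposes, and the health-check URL and 15-minute RTO target are assumptions for the example:

```python
import time
import urllib.request

RTO_TARGET_SECONDS = 15 * 60          # assumed target: recover within 15 minutes
HEALTH_URL = "https://dr-test.shop.example.com/healthz"  # placeholder test endpoint

def start_test_failover(group_name):
    """Hypothetical call into the recovery platform to bring the protection
    group up in an isolated test network. Replace with your vendor's API or CLI."""
    print(f"Starting test failover for {group_name}...")

def application_is_healthy(url):
    """Return True once the recovered application answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_drill(group_name="ecommerce-prod"):
    started = time.monotonic()
    start_test_failover(group_name)
    while not application_is_healthy(HEALTH_URL):
        if time.monotonic() - started > RTO_TARGET_SECONDS:
            print("FAIL: application not healthy within the RTO target")
            return False
        time.sleep(10)
    achieved = time.monotonic() - started
    print(f"PASS: recovered in {achieved:.0f}s (target {RTO_TARGET_SECONDS}s)")
    return True

if __name__ == "__main__":
    run_drill()
```

A drill this simple can be run by anyone on the Emergency Response Team, and logging the achieved time on every run turns the RTO from an assumption into a measured, trending number.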
Putting it all Together
This is just a high-level overview of technology risk management planning for public-facing applications. By protecting the core application, preparing secondary systems for disaster, and continually testing for preparedness, any organization can achieve true push-button disaster recovery in minutes. The result is a new level of application availability: a rapid return to operations and revenue, with reduced cost from data loss or downtime in the event of a disaster.