Stretched Server Clustering is Not a DR Plan!

By Zerto, on 4 December, 2012

Today’s blog post was written by Joshua Stenhouse, Zerto’s UK-based Solutions Engineer.

A stretched server cluster / stretched SAN is a fantastic solution for protecting against the failure of a single SAN. However, in the modern datacenter, with redundant power supplies, generators, controllers and networking, a single SAN failure is a most unlikely event. Given the high cost of implementation, which requires high-speed, low-latency links, matched infrastructures, same-vendor storage and datacenters in relatively close proximity, stretched SAN clusters are not going to deliver the BC/DR the business expects.

There are four key reasons that stretched server clustering is not right for Disaster Recovery:

1. Stretched SAN clusters are not going to protect you from logical failures.

A key issue with stretched SAN clusters is the most likely of DR events: a logical failure. This can be caused by a system-wide virus, human error or corruption of multiple databases. With a stretched SAN cluster, a logical failure is immediately written to both sites, so the second site holds the same damaged data and offers no quick way to recover, which means no low Recovery Time Objective (RTO) for such DR events.

Zerto protects against logical failures with the ability to fail over a single application or a whole site to a previous point in time within minutes, protecting against the most likely of DR events.
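To make the difference concrete, here is a minimal sketch in Python. It is a toy model, not Zerto's implementation or any SAN vendor's: the class names `SyncMirror` and `JournaledReplica`, the block name and the timestamps are all hypothetical, and the journal is assumed to receive writes in time order. It only shows why a synchronous mirror carries a corrupting write to both sites, while a replica that keeps a timestamped journal can rebuild the state from just before it.

```python
from dataclasses import dataclass, field


@dataclass
class SyncMirror:
    """Toy synchronous mirror: every write lands on both sites at once."""
    site_a: dict = field(default_factory=dict)
    site_b: dict = field(default_factory=dict)

    def write(self, block, data):
        self.site_a[block] = data
        self.site_b[block] = data


@dataclass
class JournaledReplica:
    """Toy replica that keeps timestamped writes (assumed appended in time order)."""
    journal: list = field(default_factory=list)

    def write(self, timestamp, block, data):
        self.journal.append((timestamp, block, data))

    def state_at(self, timestamp):
        # Replay every write made at or before the chosen point in time.
        state = {}
        for ts, block, data in self.journal:
            if ts <= timestamp:
                state[block] = data
        return state


mirror = SyncMirror()
replica = JournaledReplica()

# Hypothetical write history: the write at t=3 is the logical failure.
writes = [(1, "db", "good v1"), (2, "db", "good v2"), (3, "db", "CORRUPT")]
for ts, block, data in writes:
    mirror.write(block, data)
    replica.write(ts, block, data)

print(mirror.site_a["db"], mirror.site_b["db"])  # CORRUPT CORRUPT
print(replica.state_at(2)["db"])                 # good v2
```

In this sketch, both mirrored copies end up holding the corrupt block, while the journaled replica can still produce the last good version by choosing a point in time before the bad write.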

2. Stretched server clustering can’t protect you from site link failures.

The second most likely DR event is a site link failure. With stretched SAN solutions this can lead to a split-brain scenario, with different VMs running in different sites, and an end result of no business continuity.

With Zerto, failover is initiated manually, and Zerto then automatically handles the entire failover process. This ensures that performing a DR failover is a business decision and not an automatic, incorrect reaction to a link failure.
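The split-brain risk can be sketched with a toy two-site model in Python. Everything here is an assumption made for illustration: the function `active_sites`, the site names "A" and "B", and above all the premise that no arbitrator, witness or quorum device is configured. It does not describe how any particular cluster product or Zerto behaves.

```python
from typing import Optional, Set


def active_sites(link_up: bool, automatic_failover: bool,
                 operator_choice: Optional[str] = None) -> Set[str]:
    """Return the set of sites running the workload (toy model, no arbitrator)."""
    if link_up:
        return {"A"}  # normal operation: A is active, B is standby
    if automatic_failover:
        # B cannot tell "link down" from "site A down", so it promotes itself
        # while A, which is still healthy, keeps running: split brain.
        return {"A", "B"}
    # Manual failover: nothing changes until a person picks the surviving site.
    return {operator_choice} if operator_choice else {"A"}


print(sorted(active_sites(link_up=False, automatic_failover=True)))   # ['A', 'B']
print(sorted(active_sites(link_up=False, automatic_failover=False)))  # ['A']
print(sorted(active_sites(link_up=False, automatic_failover=False,
                          operator_choice="B")))                      # ['B']
```

With an automatic reaction and no arbitration, a link failure alone leaves both sites active. When the decision is left to an operator, only one site is ever active at a time.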

3. Close proximity of stretched clusters won’t protect you from environmental disasters.

Regional power cuts, flooding, earthquakes and hurricanes are all events that are commonly blamed for datacenter outages and failures. Because stretched SAN cluster datacenters must be close together to ensure adequate performance, they are not likely to protect against these events.

As Zerto can replicate over much longer distances with an RPO of seconds, it is a more reliable line of defense against such environmental DR events.

4. Finally, stretched server clustering is high cost.

Bandwidth & storage are not cheap in a stretched cluster scenario.

Zerto is storage agnostic and replicates over existing IP networks. Existing storage can be reused in the DR site irrespective of model, connectivity or vendor; even local disks can be used. Replication can run over links as small as 5 Mbps with no sensitivity to latency. Couple all of the above with built-in bandwidth compression, and the cost of implementing Zerto is significantly lower than that of a stretched SAN solution.

What if you have already implemented a stretched SAN solution?

Zerto can be a great complement to an existing environment by offering replication to a true offsite location using an IP link with no maximum latency requirement, without any performance or operational impact on the existing stretched SAN solution.

If the data and applications are valuable enough to warrant the size of the investment required for a stretched SAN cluster, then the added layer of protection offered by Zerto is a no-brainer.

In an enterprise infrastructure with sufficient resources, the ideal solution is a stretched SAN cluster plus Zerto replicating to a different part of the world. If I had to choose between a stretched SAN solution and Zerto, I’d select the solution protecting against the most common DR events: Zerto.

2 comments on “Stretched Server Clustering is Not a DR Plan!”

  1. Reply

    Hi Joshua,

    I read your article. When the first stretched cluster solution was introduced on OpenVMS Clusters, it was positioned as a disaster tolerant computing solution to provide automatic site failover in the event that one of the two locations went offline. It was only up to 40km with FDDI, then extended up to 800km with T3 and ATM, using OpenVMS Host Based Mirroring to mirror data between sites.

    Since that time, stretched cluster solutions have become available on UNIX and Windows using host based mirroring (via the volume manager), and with SRDF and other storage based replication through an integration effort, again with the emphasis on automatic site failover. The distances on these solutions ended up being up to 100km with synchronous mirroring/replication.

    Comments on your 4 key reasons:

    1. Since host based mirroring and storage based replication are block level (physical) mirroring/replication, they are not going to protect against logical corruption. But logical corruption does not translate into a need for having automatic failover induced, so I see these as a separate issue from why a customer would deploy a stretched cluster.

    2. No stretched cluster used with synchronous mirroring/replication configured for automatic failover is without the requirement of an arbitrator or quorum capability to prevent split brain.

    3. I agree that, depending on the disaster event, having datacenters up to 100km apart may not ensure that both sites are not impacted by the same event (more of a metro/regional event). This is where you would look to deploy additional replication from one of the two sites configured in a stretched cluster designed for automatic failover to a 3rd site in an out-of-region location.

    4. Yes, stretched clustering is an expensive solution. But depending on the cost of downtime, one would think that a customer deploying such a solution has done a cost justification that has a potential ROI that makes it a good investment.

    Stretched clusters deployed with synchronous host mirroring or storage based replication are about zero data loss (zero RPO) and, just as important, very low RTO. In the case of OpenVMS and Oracle RAC, the maximum RTO would be the time it takes for the cluster state transition to complete, which is typically no longer than 20-30 seconds.

    Glad you wrote on the subject.

    Best regards,
    Dave

  2. Zerto

    Thank you David, for reading and commenting on my post. We do see that for many customers who have requirements for very low RTOs & RPOs, the additional requirement of replication to a third, remote location, is also a necessity. Zerto offers offsite replication with no maximum latency requirement, and without any performance or operational impact on the existing stretched SAN solution – a solution that works well for customers with requirements for very low RTOs and RPOs.

    All the best, Joshua
