Downtime & The Importance of Cloud Resiliency | Zerto

No Cloud is Immune to Downtime

The Importance of Establishing a Cloud Continuity Platform and Using Multiple Clouds for IT Resilience

By Keith Taylor, Product Specialist at Zerto

The core tenet here at Zerto is “A World of Uninterrupted Technology for Everyone.” This is something we passionately believe should be achievable in this modern age. The ever-increasing adoption of the Cloud by businesses of all sizes is perhaps the first step (or maybe the second, after virtualisation?) towards achieving this ‘flawless’ mobility of data. But are we getting ahead of ourselves by rushing to adopt this trend without giving full consideration to the resilience of these services?

The adoption of Cloud platforms, particularly for enterprise-scale businesses, is being led by the likes of Amazon Web Services (AWS) and Microsoft Azure, with the Google Cloud Platform still trying to catch up with those earlier innovators. We’ve previously done away with the notion of being “too big to fail” in a post about the now-infamous Salesforce NA14 outage – the point being that we shouldn’t necessarily rely on the provision of services from a single provider simply because they’ve got a big name behind them. You’d be right to point out here that these big players have the resources, the fail-safes and the financial incentives to prevent many of the issues that might cause more of a problem for us little fish, but sometimes even that isn’t enough. Problems happen to everyone and there are plenty of examples of Cloud outages that have affected names from the biggest to the smallest over the last few years. We’re going to recap a few here – private as well as public – and look at the implications they’ve had for their users.

Last month saw (yet more) severe weather events around the globe – in particular the savage storms that hit the East coast of Australia, impacting a number of data centre operations in the Sydney area. You probably know where we’re going here … that’s right, the biggest player so far, Amazon Web Services, got knocked out. This presented some obvious problems for all the businesses relying on this platform to conduct their day-to-day business, which included:

• Pizza delivery company Domino’s
• Food and drinks delivery company Menulog
• Car-sharing service GoGet
• Ticket-selling service TryBooking
• TV and film-streaming service Stan
• TV-streaming platform Foxtel Play

The problems were caused by the somewhat inclement weather, but more specifically a power failure at a regional substation – something well beyond the control of any Cloud provider. Of course, as you’d expect, Amazon had not one but two backup power supplies available for just such an occurrence – but on this occasion, they both failed.

Now, in theory, part of the benefit of utilising a service like AWS is having multiple availability zones, so that in the event that one goes down, workloads should automatically fail over to an alternative bit barn. Due to a combination of “DNS resolution failures” and “a latent bug in our instance management software,” this automated failover never happened for a number of users.
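To make the idea of automated failover a little more concrete, here is a minimal sketch in Python. It is not how AWS does it; the zone names and the health check are invented purely for illustration. The point it shows is that a failed health check (a DNS resolution error, say) has to be treated as "this zone is down, try the next one" rather than being allowed to stop the failover altogether.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_zone(
    zones: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first zone that passes its health check, or None if none do."""
    for zone in zones:
        try:
            if is_healthy(zone):
                return zone
        except Exception:
            # A health check that blows up (e.g. a name-resolution error)
            # is treated the same as an unhealthy zone: move on to the next.
            continue
    return None


def fake_check(zone: str) -> bool:
    """Hypothetical health check: 'zone-a' simulates a DNS failure."""
    if zone == "zone-a":
        raise OSError("name resolution failed")
    return True


# Hypothetical zone names, in order of preference.
chosen = pick_healthy_zone(["zone-a", "zone-b", "zone-c"], fake_check)
print(chosen)  # zone-b
```

If every zone fails its check, the function returns None, which is the cue to look beyond a single provider.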

It is perhaps worth noting that this particular AWS instance is located in a CoLo data centre (operated by Equinix) – and it wasn’t the only one affected by the poor weather. The Western Sydney Data Center (operated by Fujitsu) also experienced a power outage. One of its clients is First Data, a third-party payment provider, which provides transactional services for several banks in Australia. Customers of Westpac, Bank of Queensland and ME Bank all reported being unable to make card transactions and use other services – a very serious disruption to a critical industry, which goes beyond impacting just the business, to directly affecting its customers as well.

This isn’t the first time that AWS has had problems – in September 2015 Amazon’s cloud services also failed thanks to a problem with one of their own features – DynamoDB, the “fast, fully managed, NoSQL Database Service.” In short, the AWS servers failed after being overloaded with metadata requests from DynamoDB, knocking offline a number of popular apps and websites like Tinder, Netflix, IMDb and Reddit.

There are plenty of other Cloud failures that have occurred in the last few years as well – I’ll gloss over the details of these or we’ll be here all day. But before you dismiss these events as ‘odd’ or ‘unusual’ occurrences, consider the frequency with which they occur and the number of providers that suffer similar problems.

Microsoft Azure – In November 2014 Azure suffered a massive outage that affected a number of MS services including Office 365, the Windows Store, Xbox Live and OneDrive for Business. Then again, in March 2015, Azure was knocked offline – that was fixed quite swiftly, only for a second outage to hit the very next day.

Microsoft Exchange – I couldn’t help but laugh at this one. During Microsoft’s Worldwide Partner Conference in July 2015 – with a focus on Cloud stability – attendees found they could neither send nor receive emails after the Exchange platform died on them. Oh, sweet irony…

Google Cloud Platform – February 2015 saw Google’s Compute Engine service go down, and then again the following month. Both were related to a faulty configuration patch.

August 2015 saw a number of Google customers running Persistent Disks suffer significant problems after a series of successive lightning strikes killed the power to the storage systems. What’s truly terrible here is that these poor customers were left with no choice but to recover from snapshots of all things (and that’s a dirty, dirty word here at Zerto!). Next – November 2015, when a network admin invoked a manual procedure that resulted in traffic dropping left, right and centre. December 2015 – another patch-related outage, this time caused by a single typo on a case-sensitive variable… (David and Goliath anyone?). And finally (for now at least), in April 2016, Google broke its own Cloud yet again with yet another software bug – which, when they tried to correct it, revealed a second bug in the system. The life of a Google Cloud engineer must be all kinds of fun!

Apple iCloud – Not so much of an issue for businesses, but in March 2015, Apple’s iCloud service went down affecting as much as 40% of its 500 million users.

Starbucks – A slightly different example but with a clear impact to the business. Starbucks’ CRM system went offline in April 2015, which might not have been such a problem had it not been linked to the PoS systems in such a manner that caused the cash registers in a massive number of stores to stop working. Many branches closed early as a consequence, but some more magnanimous store owners distributed free coffee instead. I can’t imagine that was a particularly profitable day either way.

PayPal – Last example for you and it’s an old one I’ll admit. All the way back in November 2010 a networking hardware failure coupled with unresponsive recovery systems caused PayPal’s online payment service to fail for 5 hours on a Friday. At the time, it was estimated that PayPal were processing $2,000 a second in online transactions – that’s $36 million in lost e-commerce transactions over that 5-hour period. Some individual businesses estimated that losing this facility cost them 25-30% in lost earnings for that day.
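If you want to check that PayPal sum, or plug in your own numbers, the arithmetic is simple enough to sketch in a few lines of Python. The function name is ours, made up for illustration, and it assumes a flat transaction rate for the whole outage, which is obviously a simplification.

```python
def lost_transaction_value(rate_per_second: float, outage_hours: float) -> float:
    """Value of transactions not processed during an outage,
    assuming a constant transaction rate throughout."""
    outage_seconds = outage_hours * 60 * 60
    return rate_per_second * outage_seconds


# The PayPal figures quoted above: $2,000 a second for 5 hours.
paypal_loss = lost_transaction_value(rate_per_second=2_000, outage_hours=5)
print(f"${paypal_loss:,.0f}")  # $36,000,000
```

Five hours is 18,000 seconds, and 18,000 × $2,000 gives the $36 million quoted.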

The Cloud is a wonderful thing and it offers many benefits to businesses in terms of cost, efficient utilisation of resources, flexibility, mobility and security – but at the end of the day nothing is immune to failure. We certainly should be seeking ways to make the most of the Cloud and all its advantages, but we still need to maintain responsibility for our data and ensure that the information itself is resilient, not just the infrastructure it’s currently residing on.

The answer is really quite simple – one reader on The Register summed it up in a rather pointed, albeit not particularly elegant, manner:

“… and the answer is still as remarkably simple and exactly the same now as it was back then: put your sh*t in more than one data centre! Preferably with the second (or better, third) at least in a different city if not another country all together. Even better, use different business partners/providers for the others so there isn’t a business-level SPOF along with the physical ones.
This shizzle isn’t hard, people. In fact, in this day and age, the likes of AWS make it positively trivial…”
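That advice boils down to a check you can run against any list of sites: do they all share the same provider, or the same city? Here is a rough sketch in Python. The Site type, the function and the provider names are all hypothetical, and a real review would look at far more than two attributes (power feeds, network carriers and so on).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Site:
    name: str
    provider: str
    city: str


def shared_failure_domains(sites: List[Site]) -> Dict[str, str]:
    """Return each attribute (provider, city) that every site has in common.

    Anything returned here is a single point of failure for the whole setup.
    """
    spofs = {}
    for attr in ("provider", "city"):
        values = {getattr(site, attr) for site in sites}
        if len(values) == 1:
            spofs[attr] = values.pop()
    return spofs


# Hypothetical placement: two cities, but still only one provider.
sites = [
    Site("production", provider="ProviderA", city="Sydney"),
    Site("dr", provider="ProviderA", city="Melbourne"),
]
print(shared_failure_domains(sites))  # {'provider': 'ProviderA'}

# Adding a third copy with a different provider removes the business-level SPOF.
sites.append(Site("offsite-backup", provider="ProviderB", city="Singapore"))
print(shared_failure_domains(sites))  # {}
```

An empty result doesn’t prove you’re safe, but a non-empty one tells you exactly which basket all the eggs are in.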

So there you have it. Putting all your eggs in one basket is just asking for trouble. If you’re not taking advantage of the flexibility and data mobility that the Cloud enables by putting your data in multiple locations, then you’re missing out on a good part of the value that embracing the Cloud offers.

This brings us to Zerto as a Cloud Continuity Platform. Whatever way you want to consume the Cloud, we’ll help you to do it. Whether you’re just considering dipping your toes in the water to find out what all the fuss is about or you’ve already built up your Cloud strategy and are looking to incorporate that extra layer of resilience, Zerto can help you to realise the full value of the Cloud.

You might want to reduce your infrastructure costs and look at leveraging AWS as a secondary site, but maybe you’re not sure how you’re going to get your data there in the first place? We’ve got that covered. Migrate your whole site to AWS, or maybe just one encapsulated application at a time, without any disruption to the production environment and without any break in the replication. And remember, everything replicated to AWS remains dormant until you need it, meaning you’ll only pay for the storage of compressed data until you need to fail over from production.

Perhaps you’ve decided that your Disaster Recovery process isn’t something you want to manage yourself and the idea of DRaaS has become more and more appealing – we’ve got that covered too. The Zerto Cloud Ecosystem consists of 250+ CSPs with all the skills you need to take the hassle out of DR. In the 2015 Gartner Magic Quadrant for DRaaS, 8 out of the top 14 DRaaS providers listed are using Zerto Virtual Replication (ZVR) to power their DR services.

With Zerto’s Athena release coming later this year we’ll be adding support for Microsoft Azure (at last!), enabling you to make the most of all those credits that Microsoft keeps giving you. Add to that the ability to replicate from “One-to-Many” and suddenly you can have multiple copies of your data wherever you need them in near real-time. Keep a secondary copy stored locally at production for speedy file-level recovery, replicate into Azure for DR and then keep an Offsite Backup in AWS – the possibilities are promising to say the least!

Underlying all of this flexibility is one key factor – independence. Zerto’s revolution of virtual Disaster Recovery is the keystone to unlocking the Cloud through freeing us all from lock-in at the physical layer.

Come talk to us at any of our events, give us a call or download the Zerto Cloud Continuity Platform Whitepaper to find out more.