By Zerto, on 13 October, 2014
Turn on your lights and lock your doors! It’s time for Tales from the IT Crypt…
We’ve all got a story, an IT disaster story, that spooked us into taking DR seriously. Unplanned outages, deletions, natural disasters… This year Zerto is asking for your most frightening IT story. Plus, we’ve got some good spooky ones to share with you on our “Master of Disaster: The Scariest Disaster Recovery Stories – EVER!” webinar on October 28th!
Share your story and WIN:
- Every story we add to this post wins the submitter a special-edition Zerto t-shirt
- Win the big prize for scariest story, which will be announced during the webinar!
- Share these stories and scare your friends on Twitter using the hashtag #ScaryDRStories
Make sure to tune into our webinar on October 28th to find out which scary story wins the big prize, and to hear some other Tales from the IT Crypt!
Read on below for the stories that have already been shared – they are sure to give you GOOSEBUMPS!
Matt C.'s Story
As the clock chimed 3 times, then 4, then 5, signaling the 5 o’clock hour was upon us, and with the slightest glimpse of sun peeking in through the crack of my 3rd floor window, the beads of sweat running down my brow became more plentiful. “People on the east coast are going to be waking up shortly, and not a single critical system is running” I said to the other 4 engineers on the conference call...“Let’s just keep focused and keep working on it” was the response. Any hope that my clock was wrong, and it wasn’t really 5am was shattered by the familiar sound of garbage trucks in the alley behind my downtown apartment. The time was correct, but last I could remember, it was 9pm.
See, it was just 8 hours earlier that a major power outage took out 10 city blocks, including my main office location, where the core of my infrastructure was located. At first glance, things seemed to be breaking my way. The timing was perfect: everyone was done working for the day, and I had all night to fail over the systems to my remote location. Not that I’d need all night, but it was nice to have, right?
As 9pm turned to 10, then 11, then midnight struck, it became painfully obvious that failing over these systems was going to be far more challenging than anyone imagined. With a failover procedure this complex, and with so many intricate details involved, odds were that one of the pieces would not work as expected and would take some time to resolve. Nobody expected that every portion of the plan would behave that way, though. “This whole process needs to be evaluated,” I commented as I tried over and over, unsuccessfully, to get the VMFS volumes to mount.
By 4am, a few non-critical systems started to come online. This was a minor victory that nobody was going to break out the champagne for. The mood of everyone on the call was positive, but at times you could hear a pin drop: the eerie silence of utter focus on the tasks at hand. The elephant in the room was painfully obvious. We were hours away, if not more, from having the systems online, and as the clock continued to move relentlessly forward, without forgiveness or empathy for sleep deprivation, we all knew phones were going to start ringing and questions were going to be asked. Questions that we didn’t have the answers to.
As I sit, surely 4-5 pounds lighter from the loss of body fluid in the form of sweat, a break happens. “Guys, the power just came back on,” I said with a crackle in my voice akin to a 13-year-old boy going through puberty. Could this really be happening, or had I fallen asleep and dreamed of this fortuitous power restoration? It was real. After analyzing the situation, I speak up: “We can halt the failover procedure right now and bring everything back up in production.” The response is direct and succinct: “Do it.”
By 6am all systems are back online in production, and for this night, crisis was averted. I couldn’t escape the feeling that this was some type of warning. I knew it, everyone knew it. The entire process had to be evaluated and re-architected from the ground up, and fast.
Travis F.'s Story
After a large storm rolled through the area where I work, it knocked out power to our building, and the generators did not start when the power went out. That morning I had over 100 users trying to log in to our network, and no one could. We had to go up to our server room, manually boot each server, and re-post end of day... Nothing too dramatic, but then we had over 20 remote locations trying to log in to our terminal server. We told them it would be a few minutes until they could log in, and most of them never logged in at all that day, because they figured if they couldn’t log in right then, they wouldn’t have to at all.
Matt M.'s Story
This one time I came into work through roads with over 3 feet of snow, only to find out that we were on generator power. There wasn’t enough coffee in the building to keep me going that day... We had to run extension cords to emergency power and duct-tape them down all over the place. We were down for 28 hours. That is why during wintertime I carry a bag of clothes and shower supplies in my car. Just in case. I would elaborate more, but winter is coming and this story makes me sad, very sad.
Dino T.'s Story
As a Systems Administrator, most of you can relate to the Boss who doesn’t care and probably couldn’t understand RTO/RPO. At the time of this story, I was working in a data center where Murphy ruled and the laws of physics didn’t exist. One year my Liebert backup unit failed, and I only had one unit at the time... After the outage and bringing up all the servers, I had funds to purchase a backup Liebert unit. No data loss; downtime: 1–2 hours.
The next outage was really fun. With the newly configured battery backup systems and a gas generator for redundant power, my data center could last 45 minutes on batteries alone. Plenty of time for the generator to kick in (usually about 48 seconds). Apparently, gas generators have a $4.00 fuse that needs to be replaced annually. Guess whose was never replaced? So during a rather unique summer storm, when power was knocked out on a Saturday night, my backup generator did not kick in. I didn’t know about the outage until Sunday morning, when my monitoring software quit talking to me (email). Not such a big issue; again I went in and brought everything up, and all was well come Monday morning.
Then it happened. Something that could not happen. I was on vacation, driving to family for Christmas, when my phone rang. “Nothing is working, no one can access anything!” I was over 150 miles away and realized there was a major outage, but I could not get back in time to implement DR procedures. I called a local vendor and asked a favor. An hour later I got a call: “Your primary SAN lost 3 drives, and both fibre connections to the secondary SAN have failed.” Remember how I talked about Murphy ruling the data center? Disks 0, 1, and 2 failed at the same time, in a RAID 10 configuration! In fact, LUNs on both the primary and backup SANs were corrupt. No backup, no virtual machines, no production anything: no DCs, no DNS, no print or file servers, no email, no SQL servers – NOTHING. All I had was tape. 1.68 TB of data to restore and restart.
Again, remember how I said the Boss didn’t understand RTO/RPO? Well, I had everything on tape, but for the next few weeks I kept trying to explain why backups happen “so quickly” (as the Boss put it) and yet it was going to take 3–4 weeks to restore. In the end, all was restored except for 4 .pst files: 10 MB out of 1.68 TB of data. But my disaster recovery plan was a complete failure. During the restore operation, a user accidentally deleted a newly created file and needed it restored immediately. I told the user they would have to wait.
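The gap between “backups happen quickly” and a multi-week restore mostly comes down to throughput arithmetic. As a rough, illustrative sketch – the drive speed and efficiency figures below are assumptions for the example, not numbers from the story:

```python
# Rough restore-time estimate: raw tape throughput vs. real-world overhead.
# All numeric inputs below are illustrative assumptions, not details from the story.

def restore_hours(data_tb, drive_mb_per_s, efficiency=1.0):
    """Hours to stream data_tb terabytes at drive_mb_per_s MB/s,
    scaled by an efficiency factor: seeks, cataloging, tape swaps,
    verification, and serialized restore jobs all push efficiency
    well below 1.0 in practice."""
    seconds = (data_tb * 1_000_000) / (drive_mb_per_s * efficiency)
    return seconds / 3600

# 1.68 TB at an assumed 80 MB/s native drive speed:
ideal = restore_hours(1.68, 80)                       # ~5.8 hours if the drive never stops
# With an assumed 5% effective efficiency (tape changes, restoring
# servers one at a time, rebuilding applications as you go):
realistic = restore_hours(1.68, 80, efficiency=0.05)  # ~117 hours of tape time alone
print(f"ideal: {ideal:.1f} h, realistic: {realistic:.1f} h")
```

Even before counting application reinstallation and validation, that is weeks of working days – which is how a backup that “happens so quickly” turns into a 3–4 week restore.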
CF's Story
My scariest story... an old server room of a very large manufacturing company was under a sewage tank. One fateful day, the cleaners accidentally punched a hole in the sewage tank on the floor above... The sewage literally fell on the NetApp boxes in the server room below. I was called around lunchtime. The customer told me, “CF, we are in deep sh*t! Literally sh*t!” I will never forget that moment. I had to scramble to borrow NetApp gear from resellers. I got the stuff together and drove 5 hours up north to deliver it to this manufacturer. I got there at 1am, and the other engineers were cleaning 5 NetApp systems, everything stripped down to its barest parts. By 4am, we had cleaned almost all of them with industrial cleaning fluid and put the NetApp systems back together again. As the fans turned, the smell of the data center room was something I will never forget. Surprisingly, 4 of the 5 NetApp systems ran without a hitch. The customer decided to use their insurance to replace the machines.
Jayme W.'s Story
What seemed to be a normal day soon turned dark and dreary as evidence of a malevolent force began to stack up! A force from the Crypt-o-locker!!!... Yes! That evil, menacing ransomware had reached our network and begun its attack on kind and gentle file servers. These servers did nothing but good! They stored data for end users, held databases for production-floor manufacturing equipment, provided a safe location for archived emails – all the good things you would expect from a file server.
This evil Crypt-o-locker took nearly all these kind files and locked them away in its dark underworld, and users cried out for help. Where are my files?? What happened to my archived email?? Why can’t we run the floor equipment?? Many more cries could be heard as we rushed to fight this evil intruder. Soon we found that the attack had reached our replica file server too, and there was no longer a quick fix; hope was quickly being lost. Days ticked by as we searched for answers, and the evil was backed into a corner and forced back to its dark crypt.
Many days passed before the cries for help began to diminish. Slowly the tide turned towards the good, but at a heavy cost. Servers were lost, data had to be restored, from a media not heard from in these days!……Something called tape? No matter, this tape served a purpose this time but would there be another? Would this crypt open once again and how could we keep such evil at bay??
We found a weapon to combat crypto and quickly deployed it for protection. NOW when the danger comes knocking again a simple reset to a point in time using Zerto will instantly give the good people access to their data and vanquish crypto back to the depths from where it came. No more ancient technology and doubts of victory. Next time we KNOW we will be victorious over the evil!
Rik G.'s Story
Before there were really cool things like Zerto and virtualization, DR for the hospital group I worked for was a disaster itself. Our plan consisted of physically taking tape on a 5-hour drive, restoring to rental equipment from SunGard (which took over 2 days just to restore email), and then bringing systems online one at a time... With over 1500 servers at the time, we really had no DR plan that would work. Fortunately, we had built in a lot of HA to begin with, but I can’t think of anything scarier than not having a functional DR plan in a critical environment. If we had Zerto back then, we would have had a very solid DR capability for the majority of our systems.
Greg S.'s Story
I was the hands and feet at a bank in Vancouver, doing an upgrade while the guys back East did their job. The SQL server crashed, and I had to stay at the bank for 16 hours. I slept on the couch they had in the lunchroom... It was fixed eventually. It was a re-imaging process for all the computers, and the SQL server was not getting the new computer names right. I believe it was a Friday, and I left before any employees came in, if anyone even worked that Saturday. I don’t think it really impacted the business, but it sure impacted my OT hours!