We’re seeing some really high RTO values when running a failover test into Azure with our larger servers (and I assume a failover/move would see the same). I know I’ve experienced even smaller servers being slower to come up in Azure than in vCenter, so I expected a bit of a difference, but for a couple of servers we’re seeing it take over an hour to finish the Testing Failover job.
These servers are pretty large, around 3.5TB each, with a couple of disks over 1TB and 10-15 disks total on each of them. They do eventually make it, which is good, but is this kind of timing expected? Has anyone else experienced the same?
What Journal History do you have configured for the VPGs in question? The longer it is, the more data we potentially have to move during recovery, which can extend the RTO. If you try a test failover to the oldest point in time in the journal, do you see the same recovery times?
Yeah, it seems like the journal was the cause. Some Zerto engineers I spoke to in a ticket asked me to check that as well. Lowering it from the default 24 hours to 4 hours definitely gave an improvement, but it could still take an incredibly long time (probably when a lot of recent changes were stored in it). Going from the oldest checkpoint was pretty quick though.
Here’s hoping that when Zerto supports premium disks it manages to speed this process up, as it would be tough to decide between up-to-date data or a 4-hour RTO.
I would like to jump in on this post as well and say we have experienced very large RTOs for big servers going to Azure, sometimes 2 hours for a 2TB server. Yeah, not good. So you’re saying to change the journal history down to a shorter time, right? Also, I really want Zerto to have the ability to recover to managed disks with premium SSD or standard SSD. Right now when we fail over from vCenter to Azure, we have, say, 1 virtual machine with 3TB of SSD storage, but when it recovers into Azure the disks will be standard HDD. This is not good because …well, SSDs are always faster than HDDs.
Correct, lowering the journal history time should speed up recovery on failover. As I mentioned, an easy way to validate this is to test a failover to one of your older checkpoints and see whether it recovers quicker than from a newer checkpoint. Keeping VPGs small (fewer VMs) can also help here.
Great question on the premium storage. We are actually adding the capability to recover into premium managed storage in 6.5, which is due out in the next couple of weeks.
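For anyone who wants to put numbers on the checkpoint comparison suggested above, here is a minimal Python sketch. It assumes you note down the test failover durations by hand (it does not call any Zerto API), and the function name `compare_checkpoint_rto` and the sample timings are made up purely for illustration.

```python
from statistics import mean


def compare_checkpoint_rto(oldest_minutes, latest_minutes):
    """Compare hand-recorded test failover durations (in minutes).

    oldest_minutes: durations of tests run from the oldest checkpoint.
    latest_minutes: durations of tests run from the latest checkpoint.
    Both lists are hypothetical inputs that you collect yourself.
    """
    if not oldest_minutes or not latest_minutes:
        raise ValueError("need at least one timing for each checkpoint type")
    if min(oldest_minutes) <= 0 or min(latest_minutes) <= 0:
        raise ValueError("timings must be positive")

    oldest_avg = mean(oldest_minutes)
    latest_avg = mean(latest_minutes)
    return {
        "oldest_avg_min": oldest_avg,
        "latest_avg_min": latest_avg,
        # A ratio well above 1 suggests the journal contents are
        # what is stretching the recovery time.
        "latest_vs_oldest": latest_avg / oldest_avg,
    }


if __name__ == "__main__":
    # Placeholder values only, not measurements from a real environment.
    result = compare_checkpoint_rto([10, 20], [60, 30])
    for key, value in result.items():
        print(f"{key}: {value}")
```

If the ratio stays close to 1 after shortening the journal history, the journal is probably not the main factor and something else (disk count, VPG size) is worth looking at.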
According to the release notes for 6.5, RTO has been improved in this release.