• This topic has 13 replies, 5 voices, and was last updated December 9, 2021 by Philip M.

Beware of making your journal too small

  • Zerto is great but recently when we needed it most it failed us.

    Zerto replicates in real time, and within xx seconds the remote site has an up-to-date replica, so in the event of a failover it can bring up the server.

    But, there is a major flaw.

    We have a server that we only need up to date as of the last checkpoint. The server data is fairly static and holds all our company’s passwords. A new password might get added once a week.

    So I set the journal to 1 hour to save on disk space.

    We had an issue and the replication stopped for all VPGs.

    After an hour or so we decided we needed to bring up the Server to access the passwords.

    Nope, can’t do it. Because we had gone over the 1 hour journal limit before bringing it up, Zerto would not allow us to bring up the server.

    Why? The server is up to date as of the last checkpoint, regardless of when that was. Just bring it online, surely that is simple?

    Nope, apparently you just cannot do it. The Zerto engineer I called said we had to go through a long process to copy the data (removing the locks), which was 450 GB, to a new location on our datastore and then create a new server and attach the disks. That would have taken hours.

    Just allow us to power on to the last checkpoint, please!!

    Thanks

    Very timely post!

    Our standard for several years has been to reduce the Journal History from the default value of 1 day down to 4 hours, but there isn’t any way we would be able to restore all our critical infrastructure and then critical applications in that timeframe.

    This possibility occurred to me earlier this week due to some other BC/DR evaluations we are working on, and I have been working to test it with our system.  We haven’t experienced it in a real life event yet, but I’m sure we will be changing our settings now that you have confirmed my suspicion!

    Just curious – did you submit the option to fail over with the last checkpoint as an enhancement request?  I’m sure we aren’t the only ones in this situation, and I’m also sure it would take a lot longer to restore from a large outage than most people think it will.  😉  Having the option to restore the last checkpoint would be a great fallback!  If so, let me know and I’ll give it an up-vote!  If not, I would recommend doing that.  You would have more leverage behind it than I would since you had a real world issue and opened a ticket.

    Thanks for taking the time to post…Jeff

    The engineer on the call said he would submit the request for me. Not sure if he ever did.

    If he did you should have received an email with a link to the enhancement request.  If you didn’t get that notification, he probably either didn’t submit the enhancement request or didn’t link it to your ticket.  You can reach out to your Zerto rep and ask them to check on it for you.

    I have submitted the request now

    Thank you!  Do you have a reference number or something I can use to go up-vote it?

    Hi Philip,

    You can post an enhancement request here for others to vote on. This portal is the entrance to all Product Management Feature Requests, so if you enter it there the PM team is alerted and will assess it. https://www.zerto.com/myzerto/support/feature-requests/

    Cheers

    Bob

    Philip,

    Just to clarify, are you saying a failover, restore, or move wouldn’t work because it had been over an hour since your last replication was successful?

    Andrew F., we only wanted to fail it over and could not. The engineer I called explained why, which is daft.

    Philip –
    Andrew and I work for the same organization.  I did some testing and was unable to duplicate your results.  I assume you were on VMware platform since you talk about datastores, but I was only able to test in Azure at this time.

    Here is what I did:

    • VPG with server in West-US replicating to West-US-2.
    • 1:43 pm – shut down ZCA01WESTUS.  Earliest Recovery Point is 10/13/2021 @12:08 pm.
    • 1:48 pm – shut down ZertoTSt1 (server in the VPG).
    • 2:00 pm – West-US-2 Zerto console shows 16 minute RPO and “Recovery is possible”.
    • 3:12 pm – Journal History says Earliest Recovery Point:  NA, but still says “Recovery is possible” on the RPO tab.
    • 3:15 pm – attempting Failover Live, warning message at first screen – but able to continue.
    • I can see 46 checkpoints that I can select from, ranging from 12:08 pm to the latest at 1:45 pm.
    • 3:19 pm – Cancelling Failover Live.  I’m going to wait longer just to make sure it’s a valid test.

    Next Day 10/14/2021

    • 9:31 am – VPG shows 20 hours RPO and Recovery is possible in the West-US-2 console.
    • 9:32 am – started Failover Live. 
      • Same “inconsistency risk” error as in screenshot above.
      • Still show 46 checkpoints available to select for recovery.
      • Reverse Protection is NOT available since source ZCA is shut down (expected).
      • Started failover at 9:34.
      • 9:39 am – Failover ended (commit or roll back), but no ping yet.
      • 9:40 am – server up and pinging.
      • 9:43 am – FQDN ping (server updated DNS with new IP address in West-US-2).
      • 9:44 am – clicked button to commit failover.
      • VPG state is Site Disconnection, but failover is successful.
      • RDP’d into server running in West-US-2 successfully.
      • Shutdown Event Tracker popped at login since restored to checkpoint which was made before the server was shut down gracefully (expected).
    • 9:49 am – logged into Azure portal and powered on ZCA.
    • 9:53 am – ZCA is up and running.  VPG now shows “Needs configuration” (expected).
    • 10:01 am – finished reconfiguring VPG.
    • 10:05 am – Preparing preseed
    • 10:20 am – began delta syncing.

    So here are the results of my testing <drum roll>…

    After shutting down the ZCA and the server in the VPG being replicated and waiting WAY longer than the 1 hour Journal History (roughly 20 hours), I still had access to all 46 recovery points and was able to recover the server to the recovery site while the source ZCA was down.
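    For anyone who wants to check the length of the gap in the test above, here is a small sketch using the timestamps from the timeline (ZCA shut down at 1:43 pm, latest checkpoint at 1:45 pm, failover started at 9:34 am the next day). The variable names are just for illustration.

```python
from datetime import datetime

# Timestamps copied from the test timeline above (10/13/2021 and 10/14/2021).
zca_shutdown = datetime(2021, 10, 13, 13, 43)
latest_checkpoint = datetime(2021, 10, 13, 13, 45)
failover_started = datetime(2021, 10, 14, 9, 34)

# How long the source ZCA had been down when the failover was started.
outage = failover_started - zca_shutdown

# How old the newest available checkpoint was at that moment.
checkpoint_age = failover_started - latest_checkpoint

print("Outage before failover:", outage)
print("Age of latest checkpoint:", checkpoint_age)
```

    Either way you measure it, the gap is many times the 1 hour Journal History.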

    Thanks

     

    Maybe it is a VMware thing or version 9 fixes the issue.

    I will upgrade to 9 soon and see if it works in that.

    Philip – I probably should have specified above, but we are currently running on 8.5 U3.

    This is an interesting post. I think what’s happening is the destination side is enforcing the policy, without regard to the current state of operations.

    The remote side is told to keep only 1 hour, so it’s constantly pruning the data to meet that criterion. The problem is that if replication stops for longer than 1 hour, the remote side keeps pruning until all the data is gone once the hour has elapsed.

    If I’m understanding correctly, the feature you’re looking for is some kind of intelligence so that when Zerto gets into a condition where replication stops, data pruning on the remote side should also stop. That way you’d have the last hour of data, and in theory could use all those checkpoints within the previous hour to failover to.
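    To make the theory concrete, here is a toy model in Python. It is only a sketch of the idea described above, not how Zerto actually works internally (and the Azure test earlier in this thread on 8.5 U3 suggests real behaviour can differ). The `prune` function, the `pause_when_stalled` flag, and the 5 minute checkpoint spacing are all made up for illustration.

```python
from datetime import datetime, timedelta

def prune(checkpoints, now, history, replication_running, pause_when_stalled):
    """Return the checkpoints that survive one pruning pass (hypothetical model)."""
    if pause_when_stalled and not replication_running:
        # Proposed behaviour: freeze the journal while replication is stopped.
        return list(checkpoints)
    # Strict time-based retention: drop anything older than the history window.
    cutoff = now - history
    return [cp for cp in checkpoints if cp >= cutoff]

history = timedelta(hours=1)                 # journal history setting
stopped_at = datetime(2021, 1, 1, 12, 0)     # arbitrary moment replication stops

# Assume one checkpoint every 5 minutes during the hour before the stop.
checkpoints = [stopped_at - timedelta(minutes=5 * i) for i in range(12)]

# Two hours after replication stopped, run one pruning pass under each policy.
now = stopped_at + timedelta(hours=2)
strict = prune(checkpoints, now, history,
               replication_running=False, pause_when_stalled=False)
paused = prune(checkpoints, now, history,
               replication_running=False, pause_when_stalled=True)

print("strict retention keeps:", len(strict))
print("pause-when-stalled keeps:", len(paused))
```

    In this model strict retention leaves nothing to fail over to once the outage outlasts the journal window, while pausing the pruning keeps the whole last hour of checkpoints available.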

    I like your theory 🙂
