How to Perform Basic Troubleshooting for ‘Not Meeting SLA’ Alerts
- The "Not Meeting SLA" status means that the VPG is meeting neither the journal history nor the RPO SLA settings. When a VPG shows this status, the VPG's Summary tab should show a sub-status such as one of the following:
- Delta Sync (When Force Sync is not applied)
- Delta Syncing a volume
- Full Syncing
- Full Syncing a volume
- Needs configuration
- Site disconnection
- VPG has no VMs
- The “History Not Meeting SLA” status means that the VPG is meeting the RPO SLA but not the configured journal history SLA. The main causes include, but are not limited to, the following:
- WAN Outage.
- I/O write rate/burst that is bigger than the available bandwidth between the sites.
- Adding/removing a Volume/VM to the VPG.
- The “RPO Not Meeting SLA” status means that I/Os are not being replicated fast enough. The alert is triggered when the actual RPO, the time between writing checkpoints to the journal, is more than 25% above the target RPO specified in the VPG SLA settings. RPO and journal history alerts tend to be triggered when the underlying infrastructure, including source storage, WAN connection, and target storage, cannot keep up with replicating the I/O load from the production application. This can also reduce the CDP history for a VPG: after a sudden overwhelming I/O spike, a WAN slowdown, or storage performance degradation, the checkpoint data accumulated in the affected VPG's journals may start to shrink, because checkpoint data is promoted to the target storage faster than new checkpoints are written to the journal volumes.
- There are three types of 'Not Meeting SLA' alerts that can be present in a Zerto environment:
- VPG is in a “Not Meeting SLA” state - The VPG is meeting neither the journal history nor the RPO SLA settings.
- VPG is in a “History Not Meeting SLA” state - The VPG is meeting the RPO SLA setting but not the journal history SLA setting.
- VPG is in a “RPO Not Meeting SLA” state - The VPG is meeting the journal history SLA setting but not the RPO SLA setting.
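The relationship between the three states can be sketched in a few lines of Python. This is only an illustration of the definitions above, not Zerto's actual logic: the function names and parameters are hypothetical, the 25% tolerance is read as "more than 25% above the target RPO", and the journal history check (actual history shorter than configured history) is a simplifying assumption.

```python
def rpo_breached(actual_rpo_s: float, target_rpo_s: float, tolerance: float = 0.25) -> bool:
    """True when the actual RPO exceeds the target by more than the tolerance."""
    return actual_rpo_s > target_rpo_s * (1 + tolerance)


def sla_status(actual_rpo_s: float, target_rpo_s: float,
               actual_history_h: float, target_history_h: float) -> str:
    """Map RPO and journal history measurements to one of the alert states.

    Assumption: history is 'not meeting SLA' when the accumulated journal
    history is shorter than the configured history.
    """
    rpo_bad = rpo_breached(actual_rpo_s, target_rpo_s)
    history_bad = actual_history_h < target_history_h
    if rpo_bad and history_bad:
        return "Not Meeting SLA"
    if history_bad:
        return "History Not Meeting SLA"
    if rpo_bad:
        return "RPO Not Meeting SLA"
    return "Meeting SLA"
```

For example, with a target RPO of 300 seconds, an actual RPO of 375 seconds is still within tolerance, while 376 seconds would count as a breach under this reading.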
To perform Basic Troubleshooting for 'Not Meeting SLA' alerts, follow the steps below:
The "Not Meeting SLA" status can result from various scenarios. As mentioned, the general “Not Meeting SLA” error means that the VPG is meeting neither the journal history nor the RPO SLA settings. To troubleshoot this behavior, review both the “History Not Meeting SLA” and “RPO Not Meeting SLA” guidance below. If none of the suggested solutions (connectivity checks and environment adjustments) provides relief, gather the following information for further troubleshooting:
Relevant screenshots of the specific VPG in the error state, the VPGs tab, and the VRAs tab.
An exported CSV file of Zerto’s Alerts & Events (under the “Monitoring” tab in the Zerto GUI).
A brief explanation of what led up to this error: what happened beforehand, whether any changes were made in the environment, and so on.
The “History Not Meeting SLA” status does not necessarily point to a current disconnection between the sites; it is also expected after a recent disconnection that has already been resolved. If the disconnection has been cleared, expect to see this alert for a few hours after the outage (depending on the journal size), after which it clears automatically.
To begin testing that all relevant components can communicate as needed, follow the architecture review that matches your environment configuration, and verify that all required ports are open and responding to ping/telnet:
Necessary Ports for Enterprise Customers
Necessary Ports for DRaaS/ICDR Customers
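Where telnet is not available, the same TCP reachability test can be scripted. The sketch below uses only Python's standard `socket` module; the function names are hypothetical, and the hosts and ports must be taken from the ports documentation for your environment (none are assumed here).

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_endpoints(endpoints, timeout: float = 3.0):
    """Map each (host, port) pair to its reachability."""
    return {(host, port): port_is_open(host, port, timeout)
            for host, port in endpoints}
```

A call such as `check_endpoints([("recovery-site-host", 1234)])`, with the placeholder host and port replaced by real values, returns a dictionary showing which endpoints accepted a connection. A successful connection only shows that the port is reachable, not that the service behind it is healthy.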
“RPO Not Meeting SLA” can be caused by a sudden I/O burst on the VM(s) in the VPG. If these errors persist, that indicates an ongoing resource issue. The main areas to investigate are:
a. The network might be the limiting factor. Verify the following:
What is the currently available bandwidth? This can be determined using iPerf testing. For further reading on how to test the bandwidth with iPerf, please follow this KB.
What bandwidth is required to properly maintain your replication? To answer this question, Zerto provides a WAN sizing tool that estimates the bandwidth needed between the sites to meet your replication requirements, based upon your protected VMs' I/O throughput. The WAN sizing package, including the estimator itself, is available for download in the Zerto Support Portal under the "Documentation Downloads" tab.
Please review the "Journal Overview, Sizing and Best Practices" documentation.
b. VRA memory could be the bottleneck - Increase the reserved RAM for the protection and recovery site VRAs. NOTE: Do this only after a full performance review has been completed and it has been deemed a true next step. Add resources to a VRA only when absolutely necessary.
c. The recovery storage is slow - Check the storage latency on the recovery site to see whether it reports a value higher than 40 milliseconds (the default), or higher than the threshold configured in Site Settings if the default has been changed.
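Once the measurements from checks a and c have been collected, the comparison itself is simple arithmetic. The sketch below is a hypothetical helper, not part of any Zerto tool: it assumes the available bandwidth comes from your iPerf test, the required bandwidth from the WAN sizing tool, and the latency from your storage monitoring, all supplied by you in the units shown. It does not cover VRA memory (check b), which calls for a full performance review.

```python
def find_bottlenecks(available_mbps: float, required_mbps: float,
                     storage_latency_ms: float,
                     latency_threshold_ms: float = 40.0) -> list:
    """Return the areas whose measurements suggest a replication bottleneck.

    latency_threshold_ms defaults to 40 ms; pass the value from Site Settings
    if it has been changed.
    """
    findings = []
    if available_mbps < required_mbps:
        findings.append("network")
    if storage_latency_ms > latency_threshold_ms:
        findings.append("storage")
    return findings
```

For instance, `find_bottlenecks(80, 120, 55)` flags both the network and the recovery storage, whereas `find_bottlenecks(200, 120, 12)` returns an empty list, which suggests looking elsewhere.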