Article number
000003172
Affected Versions
All
Source Hypervisor
All
Target Hypervisor
All

Problem with Spikes in Ownership Update Tasks that Trigger Stuck Bitmap Syncs After VRA Upgrades

Summary

After upgrading many VRAs, an administrator may notice that the ensuing bitmap syncs become stuck and that temporary unreachable protected volume alerts appear in the ZVM GUI.

Root Cause

The ownership controller loads the processing queue with redundant requests, and the problem gets worse when many vMotion tasks occur. For this reason, only larger-scale environments are at risk of hitting this issue.

Symptoms

  • Large-scale environments with multiple host clusters and 150 or more protected hosts/VRAs.

  • Multiple tasks that trigger the ZVM to process new ownership update tasks occur within a short period of time (less than 60 minutes). Such tasks include:

    • vSphere DRS migrating protected VMs from one host to another (at a rate of about 1 per second across a 12-hour period, when using a host cluster DRS migration threshold level of 4).

    • The restart of a protected VRA* (60 or more VRAs in about 15 minutes, such as when the multi-session tool is used to apply a VRA tweak).

  • Any VPG that contains a VM using the VRAs on the hosts related to the tasks above will enter a bitmap sync that will not progress.

  • A search for the following term in the production site ZVM logs will show a large increase in the number of new ownership tasks in the time period just before the VPG sync behavior starts:

    • grep 'AddOwnershipUpdateRequest' log.*.csv

* Note: VRA restart requires the ZVM and VRA to re-establish ownership of the protected VMs running on the host the VRA is installed on.
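
To make the increase easier to spot than with a raw grep, the matching lines can be counted per hour. The following is a minimal sketch, not a supported tool. It assumes that each log line contains a timestamp of the form YYYY-MM-DD HH:MM:SS (or with a "T" separator); the actual ZVM log layout is not described in this article, so adjust the pattern to match your logs. The function name and the "unknown" bucket are hypothetical and used for illustration only.

```python
import glob
import re
from collections import Counter

TERM = "AddOwnershipUpdateRequest"

# ASSUMPTION: each line carries a timestamp such as 2023-01-31 14:05:09.
# The real ZVM log format may differ; adapt this pattern as needed.
HOUR_RE = re.compile(r"(\d{4}-\d{2}-\d{2})[ T](\d{2}):\d{2}:\d{2}")


def count_requests_per_hour(lines):
    """Count lines containing TERM, grouped by the hour in their timestamp."""
    counts = Counter()
    for line in lines:
        if TERM not in line:
            continue
        match = HOUR_RE.search(line)
        # Lines without a recognizable timestamp go into an "unknown" bucket.
        bucket = f"{match.group(1)} {match.group(2)}:00" if match else "unknown"
        counts[bucket] += 1
    return counts


if __name__ == "__main__":
    total = Counter()
    # Same file pattern as the grep command above; run from the log directory.
    for path in sorted(glob.glob("log.*.csv")):
        with open(path, errors="replace") as handle:
            total.update(count_requests_per_hour(handle))
    for bucket, count in sorted(total.items()):
        print(bucket, count)
```

An hour whose count is far higher than the preceding hours, shortly before the VPG syncs stop progressing, is consistent with the behavior described in this article.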

Solution

As a workaround, wait for the tasks that trigger the spike to become less frequent. The ZVM then finishes processing the backlog and the bitmap syncs complete without further intervention. This can take multiple hours, depending on how long it takes for the task rate to drop.

It may also be possible to reduce the recovery time by pausing all VPGs that are in sync and resuming them one VPG at a time until the sync can progress, but this approach has not yet been proven.

As a permanent fix, this issue is resolved in 8.0 Update 2. Upgrading to this version avoids the issue.