Article number
000004192
Affected Versions
6.5 Update 2
6.5 Update 1 Patch 1
6.5 Update 1
6.5
Source Hypervisor
All
Target Hypervisor
All

Problem with Zerto Virtual Replication 6.5 causing PSOD on ESXi hosts

Summary

This article describes a scenario in which a PSOD (Purple Screen of Death) can occur on ESXi hosts where volumes larger than 32TB are protected by Zerto.

Root Cause

ZVR 6.5 raised the maximum supported volume size from 32TB to 96TB. During bitmap sync operations on volumes over 32TB, or during VRA upgrades, two halves of the bitmap may be merged. One of the bitmap operations may then try to access memory outside the range that was allocated for Zerto, causing a page fault on the ESXi host. The page fault causes the host to PSOD.

Symptoms

  1. A VPG protects one or more VMs with one or more volumes over 32TB.
  2. The ESXi host hits a PSOD during synchronization operations that use bitmap syncs, or during VRA upgrades.
  3. Stack trace similar to the following in the ESXi kernel dump frag files:
2018-12-20T02:04:13.904Z cpu56:697111)@BlueScreen: #PF Exception 14 in world 697111:vmm0:Z-VRA-p IP 0x418009bbe1a2 addr 0x4307fdd97000 PTEs:0x80000001685e5023;0x800000017233b063;0x80000001ae3a9063;0x0;
2018-12-20T02:04:13.904Z cpu56:697111)Code start: 0x418008600000 VMK uptime: 5:05:32:38.118
2018-12-20T02:04:13.904Z cpu56:697111)0x439258b9b968:[0x418009bbe1a2]_zbm_l2_free@<None>#<None>+0x4e stack: 0x4307f1b63f60
2018-12-20T02:04:13.905Z cpu56:697111)0x439258b9b978:[0x418009bbe2ed]_zbm_entries_clean@<None>#<None>+0x4d stack: 0x40020
2018-12-20T02:04:13.905Z cpu56:697111)0x439258b9b9a0:[0x418009bbe38a]_zbm_swap@<None>#<None>+0x3a stack: 0x418009bd01fc
2018-12-20T02:04:13.905Z cpu56:697111)0x439258b9b9b8:[0x418009bd01fc]_zuspace_bm_hardened@<None>#<None>+0x5c stack: 0x439258b9ba18
2018-12-20T02:04:13.906Z cpu56:697111)0x439258b9ba28:[0x418009bd1f5a]zuspace_ctrl_request@<None>#<None>+0x552 stack: 0x439258b9ba58
2018-12-20T02:04:13.906Z cpu56:697111)0x439258b9ba88:[0x418009bcb07f]zfl_cmd@<None>#<None>+0x46b stack: 0xa58
2018-12-20T02:04:13.906Z cpu56:697111)0x439258b9bc98:[0x41800885ca28]VSCSI_IssueCommandBE@vmkernel#nover+0x44 stack: 0x43be40033940
2018-12-20T02:04:13.907Z cpu56:697111)0x439258b9bcd8:[0x41800885cdc9]VSCSI_HandleCommand@vmkernel#nover+0x241 stack: 0x43135fb26c70
2018-12-20T02:04:13.907Z cpu56:697111)0x439258b9bd68:[0x41800885eac3]VSCSI_VmkExecuteCommand@vmkernel#nover+0x1f3 stack: 0x439258b9bf40
2018-12-20T02:04:13.907Z cpu56:697111)0x439258b9be08:[0x41800886edea]LSIProcessReqInt@vmkernel#nover+0x2fe stack: 0x439258b9bf47
2018-12-20T02:04:13.908Z cpu56:697111)0x439258b9bf88:[0x41800886f963]LSI_ProcessReq@vmkernel#nover+0x77 stack: 0x246
2018-12-20T02:04:13.908Z cpu56:697111)0x439258b9bfb8:[0x4180086ac509]VMMVMKCall_Call@vmkernel#nover+0x139 stack: 0x4180086ac054
.
.
.
2018-12-20T02:04:13.912Z cpu56:697111)zdriver_ESX_50_a6615d5d 0x418009bb9000 .data 0x417fe0400000 .bss 0x417fe0400500
Coredump to disk. 
2018-12-20T02:04:13.962Z cpu56:697111)Slot 1 of 1.
2018-12-20T02:04:13.962Z cpu56:697111)Dump: 2352: Using dump slot size 2684354560.

Solution

This problem affects only ZVR 6.5 releases earlier than 6.5U3 (6.5 through 6.5 Update 2).
The only solution is to upgrade to ZVR 6.5U3 or higher, where the issue has been addressed.
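
The version rule above can be expressed as a small check, for example when auditing several sites. This is a minimal sketch under stated assumptions: the function name `is_affected` is hypothetical, and the accepted version string formats ("6.5", "6.5U3", "6.5 Update 1 Patch 1") are assumed from the way versions are written in this article, not from any Zerto API.

```python
import re

# Accepts strings such as "6.5", "6.5U3" or "6.5 Update 1 Patch 1".
# The format is an assumption based on how this article writes versions.
VERSION_RE = re.compile(
    r"^\s*(\d+)\.(\d+)(?:\s*(?:Update|U)\s*(\d+))?",
    re.IGNORECASE,
)


def is_affected(version: str) -> bool:
    """Return True for ZVR 6.5 releases earlier than 6.5 Update 3."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    update = int(match.group(3) or 0)  # no update suffix means the base release
    return (major, minor) == (6, 5) and update < 3


if __name__ == "__main__":
    for v in ("6.5", "6.5 Update 1 Patch 1", "6.5 Update 2", "6.5U3", "7.0"):
        print(v, is_affected(v))
```

Patch levels are ignored on purpose, since the article draws the boundary at the update level (6.5U3).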