PSA: Purple Screen of Death when collecting logs on VxRail Cluster
This is probably just an edge case but I figured I would put it out there into the `webs so admins are aware. While troubleshooting an unrelated VxRail issue (unable to do a proper data evacuation), I was collecting logs with an EMC engineer when one of the ESXi nodes purple screened.
Unfortunately, this was during the work day and had an impact on the environment. VMware High Availability kicked right in but the damage was done. The one host was running about 30 Virtual Machines that experienced a hard reboot. It is unclear at the moment what exactly happened with the log collection process that would result in a purple screen but it did underscore that unexpected events happen and reasonable efforts should be made to mitigate them. For this instance, log collection was halted and resumed after hours and leveraging maintenance mode per host. This meant that we would need to collect individual logs from each ESXi node after putting them safely into maintenance mode.
There is early speculation that this edge case occurred due to some iSCSI connections that might have been in use with the Virtual Machines. I know we have had issues in the past with vMotions of iSCSI VMs causing Purple Screens but was clearly surprised and caught off guard when a simple log collection triggered it.
Complete investigation still needs to happen but I thought I’d put this quick post together as a VxRail Public Service Announcement.
This was VxRail firmware 4.7.100 and vSphere 6.7 Update 1.
Good Luck out there!