VMFS Datastores inaccessible even after PDL is Fixed

How to fix inaccessible datastore issues after you have hit a PDL

At times a VMFS datastore hosting virtual machines enters the PDL (Permanent Device Loss) state, for example because of a SAN outage.

All paths to the device are marked as Dead and, unsurprisingly, the VMs hosted on the LUN have crashed.

In the /var/log/vmkernel.log file, you see entries similar to:

cpu2:853571)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:661: Path "vmhba4:C0:T0:L0" (PERM LOSS) command 0xa3 failed with status Device is permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
cpu2:853571)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:972:Could not select path for device "naa.60a98000572d54724a34642d71325763".
cpu2:853571)WARNING: ScsiDevice: 1223: Device :naa.60a98000572d54724a34642d71325763 has been removed or is permanently inaccessible.
cpu3:2132)ScsiDeviceIO: 2288: Cmd(0x4124403c1fc0) 0x9e, CmdSN 0xec86 to dev "naa.60a98000572d54724a34642d71325763" failed H:0x8 D:0x0 P:0x0
cpu3:2132)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60a98000572d54724a34642d71325763" is blocked. Not starting I/O from device.
cpu2:2127)ScsiDeviceIO: 2316: Cmd(0x4124403c1fc0) 0x25, CmdSN 0xecab to dev "naa.60a98000572d54724a34642d71325763" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0.
cpu2:854568)WARNING: ScsiDeviceIO: 7330: READ CAPACITY on device "naa.60a98000572d54724a34642d71325763" from Plugin "NMP" failed. I/O error.
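
If the log is long, it can help to pull out just the device IDs that were declared permanently lost. The following is a minimal sketch, not an official tool: `pdl_devices` is a hypothetical helper name, and it assumes the warning text matches the "has been removed or is permanently inaccessible" line shown above.

```shell
# Sketch: list the devices that the vmkernel log reports as permanently lost.
# pdl_devices is a hypothetical helper; the default path is the log file named above.
pdl_devices() {
    log_file="${1:-/var/log/vmkernel.log}"
    # Keep only the "permanently inaccessible" warnings, extract the naa
    # identifier in front of that message, and drop duplicates.
    sed -n 's/.*\(naa\.[0-9a-f]*\) has been removed or is permanently inaccessible.*/\1/p' "$log_file" \
        | sort -u
}
```

Run `pdl_devices` with no argument to read /var/log/vmkernel.log, or pass a path to a saved copy of the log.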

Once the LUN is made available again on the storage side and storage connectivity is restored, you may find that the datastore is still not accessible on the host after a rescan. This can happen when VM world processes are still running on the hosts for the VMs that lived on the device that was lost and declared PDL.

One way to get back to normal is to find out which VM worlds are still running against that device on each ESXi host and kill them. After killing those VM worlds on every host in the cluster that uses the LUN, perform a rescan to rediscover the datastore.

Steps:

  1. SSH to the ESXi host
  2. esxcli storage core device world list -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (to identify the worlds running on that device)
  3. esxcli vm process list (to list all VM world processes on the ESXi host)
  4. Compare the output of steps 2 and 3 and make a note of the VM processes for that device
  5. esxcli vm process kill -t soft -w <world-ID> (-t is required and takes soft, hard or force; try soft first)
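
Steps 2 to 5 can be roughly sketched as a small script. Treat this as an illustration under assumptions rather than a tested procedure: the helper names are hypothetical, the script assumes the device world list prints the device in the first column and the world ID in the second, and it assumes those world IDs match the "World ID:" values from `esxcli vm process list`. It only prints the kill commands so you can review them before running anything. Note that `esxcli vm process kill` needs a type (`-t soft`, `hard` or `force`) as well as the world ID.

```shell
# Hypothetical helpers. The first two read esxcli output on stdin,
# so they can also be tried against saved output.

# Print the second column (assumed: World ID) of rows whose first column is the device.
device_world_ids() {
    awk -v dev="$1" '$1 == dev { print $2 }'
}

# Print the numbers from the "World ID: <n>" lines of "esxcli vm process list".
vm_world_ids() {
    sed -n 's/^ *World ID: *\([0-9][0-9]*\).*/\1/p'
}

# Print (do not run) one kill command per VM world that still has the device open.
print_kill_commands() {
    dev="$1"
    vm_ids=$(esxcli vm process list | vm_world_ids)
    esxcli storage core device world list -d "$dev" | device_world_ids "$dev" |
    while read -r wid; do
        for vid in $vm_ids; do
            if [ "$wid" = "$vid" ]; then
                echo "esxcli vm process kill -t soft -w $wid"
            fi
        done
    done
}
```

For example, `print_kill_commands naa.60a98000572d54724a34642d71325763` would print one line per matching VM world on the host where it is run. Worlds that are not VM processes are left out on purpose.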

NOTE: The VM world could be running on any ESXi host in the cluster where the VM was powered on, so check every host.
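
Once the stale worlds are gone from every host, the rescan mentioned earlier could look like the sketch below. `rescan_storage` and the `ESXCLI` override variable are hypothetical names added here so the function can be dry-run; the two esxcli subcommands rescan all storage adapters and then list the mounted filesystems so you can check whether the datastore is back.

```shell
# Sketch: rescan all adapters, then list filesystems to confirm the datastore returned.
# ESXCLI is a hypothetical override; set ESXCLI=echo to print the commands instead of running them.
rescan_storage() {
    "${ESXCLI:-esxcli}" storage core adapter rescan --all &&
    "${ESXCLI:-esxcli}" storage filesystem list
}
```

Run it on each host that uses the LUN. If the datastore is missing from the list, it is worth checking again for leftover worlds on that device.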

Another option, if it was a SAN outage (so your whole VM environment is down anyway), is to reboot all ESXi hosts in the cluster. This clears out any running VM world processes, and the device is re-registered as part of the ESXi boot procedure.

Good luck!
