Month: November 2017

VMFS locking and ATS Miscompare issues

Brief about VAAI

By now, almost all VMware techies know what VAAI is, but here is a brief overview for those who do not.

VMware vSphere Storage APIs – Array Integration (VAAI), also referred to as hardware acceleration or hardware offload APIs, are a set of APIs that enable communication between VMware vSphere ESXi hosts and storage devices. The APIs define a set of “storage primitives” that let the ESXi host offload certain storage tasks, such as cloning and zeroing, to the storage array and so improve performance.

The goal of VAAI is to help storage vendors provide hardware assistance to speed up VMware I/O operations that are more efficiently accomplished in the storage hardware.

Most newer arrays (FC/iSCSI/NAS) that support vSphere 5 and later also support VAAI. You can verify this in the VMware HCL.

https://vmware.com/go/hcl

The fundamental operations are controlled by these advanced settings:

| Advanced parameter name | Description |
| --- | --- |
| HardwareAcceleratedLocking | Atomic Test & Set (ATS), which is used during creation of files on the VMFS volume |
| HardwareAcceleratedMove | Clone Blocks/Full Copy/XCOPY, which is used to copy data |
| HardwareAcceleratedInit | Zero Blocks/Write Same, which is used to zero-out disk regions |

This is a list of commonly used SCSI opcodes related to VAAI operations:

0x93 WRITE SAME(16)
0x41 WRITE SAME(10)
0x42 UNMAP
0x89 SCSI COMPARE and WRITE – ATS
0x83 EXTENDED COPY
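
When reading vmkernel logs it helps to translate the opcode in a failed command back to its VAAI primitive. The following is a minimal sketch that wraps the list above in a lookup table; the names `VAAI_OPCODES` and `describe_opcode` are made up for this illustration and are not part of any VMware tool.

```python
# SCSI opcodes commonly seen with VAAI operations, copied from the list above.
VAAI_OPCODES = {
    0x93: "WRITE SAME(16)",
    0x41: "WRITE SAME(10)",
    0x42: "UNMAP",
    0x89: "COMPARE AND WRITE (ATS)",
    0x83: "EXTENDED COPY",
}


def describe_opcode(opcode):
    """Return a readable name for a SCSI opcode given as an int or a hex string."""
    if isinstance(opcode, str):
        opcode = int(opcode, 16)  # accepts "0x89" as well as "89"
    return VAAI_OPCODES.get(opcode, f"unknown or non-VAAI opcode 0x{opcode:02x}")


if __name__ == "__main__":
    for code in ("0x89", "0x42", "0x28"):
        print(code, "->", describe_opcode(code))
```

An opcode that is not in the table (for example a plain read) falls through to the "unknown or non-VAAI" message rather than raising an error.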

 

NOTE: See the link below for more information on VAAI:

https://www.vmware.com/techpapers/2012/vmware-vsphere-storage-apis-array-integration-10337.html

In a shared storage environment, when multiple hosts access the same cluster filesystem (VMFS datastore), specific locking mechanisms are used. These locking mechanisms prevent multiple hosts from concurrently writing to the metadata and ensure that no data corruption occurs.

VMFS supports SCSI Reservations and Atomic Test and Set (ATS) locking.

ATS

ATS is a locking mechanism designed to replace SCSI reservations. Given the number of SCSI reservation conflict issues on older versions of ESX/ESXi, this was a much-needed feature.

ATS modifies only a disk sector on the VMFS volume whereas a SCSI reservation locks the whole LUN. When successful, it enables an ESXi host to perform a metadata update on the volume. This includes allocating space to a VMDK during provisioning, because certain characteristics must be updated in the metadata to reflect the new size of the file. The introduction of ATS addresses the contention issues with SCSI reservations and enables VMFS volumes to scale to much larger sizes. ATS has the concept of a test-image and set-image. So long as the image on-disk is as expected during a “compare”, the host knows that it can continue to update the lock.
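
The test-image and set-image idea can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `LockSector` is a hypothetical in-memory stand-in for one on-disk lock sector, and the byte strings are placeholders rather than real VMFS lock records. On a real array the compare and the write are performed as one atomic operation by the storage hardware.

```python
class LockSector:
    """Toy stand-in for a single on-disk lock sector on a VMFS volume."""

    def __init__(self, image: bytes):
        self.image = image

    def compare_and_write(self, test_image: bytes, set_image: bytes) -> bool:
        """Write set_image only if the sector still equals test_image."""
        if self.image != test_image:
            return False  # miscompare: someone else changed the sector
        self.image = set_image
        return True


def race_for_lock():
    """Two hosts read the sector as free, then both try to take the lock."""
    sector = LockSector(b"free")
    host_a_won = sector.compare_and_write(b"free", b"owned-by-A")
    host_b_won = sector.compare_and_write(b"free", b"owned-by-B")
    return host_a_won, host_b_won, sector.image


if __name__ == "__main__":
    print(race_for_lock())
```

Only the first host succeeds. The second host's test-image no longer matches what is on disk, so its write is rejected without the rest of the LUN ever being reserved.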

ESXi 5.5 Update 2 introduced a change in the VMFS heartbeat update method to help optimize the heartbeat process. As a result, the ESXi kernel issues significantly more ATS commands, which increases the load on the storage system. Under certain circumstances, a VMFS heartbeat using ATS can fail with an ATS miscompare, which causes the ESXi kernel to verify its access to the VMFS datastores again.

In this case, a heartbeat I/O (step 1) timed out and VMFS aborted it, but the ATS “set” had already reached the disk before the abort took effect. Because the first attempt was aborted, VMFS assumed that the ATS never made it to the disk and retried it using the original “test-image” from step 1. By then the on-disk image already held the “set” image from the first attempt, so the in-memory and on-disk images no longer matched during the ATS “test”, and the array returned “ATS miscompare”. When an ATS miscompare is received, all outstanding I/O is aborted with host status 8 (H:0x8, SCSI reset). This places additional stress and load on the storage array and degrades performance.
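
The sequence described above can be replayed in a few lines. This is a simplified sketch, not VMFS code: `HeartbeatSlot` and `heartbeat_with_lost_completion` are hypothetical names, and the timeout is modelled by simply discarding the result of the first ATS.

```python
class HeartbeatSlot:
    """Toy model of the on-disk heartbeat record for one host."""

    def __init__(self, image):
        self.image = image

    def ats(self, test_image, set_image):
        """Compare-and-write: update the slot only if it matches test_image."""
        if self.image != test_image:
            return "miscompare"
        self.image = set_image
        return "ok"


def heartbeat_with_lost_completion(slot, test_image, set_image):
    """Replay a heartbeat whose first ATS lands on disk but is seen as timed out."""
    events = []
    # Step 1: the ATS reaches the disk, but the host times out and aborts
    # the I/O, so it never learns that the write succeeded.
    slot.ats(test_image, set_image)
    events.append("timeout")
    # Step 2: the host retries with the ORIGINAL test image, which is now stale.
    events.append(slot.ats(test_image, set_image))
    return events


if __name__ == "__main__":
    slot = HeartbeatSlot("hb-gen-1")
    print(heartbeat_with_lost_completion(slot, "hb-gen-1", "hb-gen-2"))
```

The retry fails even though no other host touched the slot: the host is comparing against an image that its own earlier write has already replaced.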

Some EMC and IBM arrays have known issues with this behavior, and the recommendation is to disable the use of ATS for the VMFS heartbeat on all hosts that access the affected LUNs.

https://kb.vmware.com/s/article/2113956

Sample log messages from the ESXi vmkernel logs and events in vCenter that confirm the issue:

In the /var/run/log/vobd.log file and in vCenter Server events, you see the VOB message:

Lost access to volume <uuid><volume name> due to connectivity issues. Recovery attempt is in progress and the outcome will be reported shortly

In the /var/run/log/vmkernel.log file, you see the message:

ATS Miscompare detected between test and set HB images at offset XXX on vol YYY

In the /var/log/vmkernel.log file, you see similar error messages indicating an ATS miscompare:

2015-11-20T22:12:47.194Z cpu13:33467)ScsiDeviceIO: 2645: Cmd(0x439dd0d7c400) 0x89, CmdSN 0x2f3dd6 from world 3937473 to dev "naa.50002ac0049412fa" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
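
In that line, opcode 0x89 is COMPARE AND WRITE, D:0x2 is the SCSI CHECK CONDITION status, and sense key 0xe with additional sense code 0x1d is MISCOMPARE. The sketch below pulls those fields out of a ScsiDeviceIO failure line and flags ATS miscompares. The regular expression is written against the single sample line shown here, so treat it as an assumption about the log format rather than a complete parser; the function names are made up for this example.

```python
import re

# Pattern based on the sample line above; other ESXi builds may format
# the message differently. Both straight and curly quotes are accepted.
LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<opcode>0x[0-9a-f]+),.*?'
    r'to dev ["“](?P<device>[^"”]+)["”] failed '
    r'H:(?P<host>0x[0-9a-f]+) D:(?P<dev_status>0x[0-9a-f]+) P:(?P<plugin>0x[0-9a-f]+)'
    r'(?: Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+))?',
    re.IGNORECASE,
)


def parse_scsi_failure(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        name: (int(value, 16) if name != "device" and value is not None else value)
        for name, value in match.groupdict().items()
    }


def is_ats_miscompare(fields):
    """True for COMPARE AND WRITE that failed with CHECK CONDITION / MISCOMPARE."""
    return (
        fields is not None
        and fields["opcode"] == 0x89      # COMPARE AND WRITE (ATS)
        and fields["dev_status"] == 0x2   # CHECK CONDITION
        and fields["key"] == 0xE          # sense key MISCOMPARE
        and fields["asc"] == 0x1D         # MISCOMPARE DURING VERIFY OPERATION
    )


if __name__ == "__main__":
    sample = (
        '2015-11-20T22:12:47.194Z cpu13:33467)ScsiDeviceIO: 2645: '
        'Cmd(0x439dd0d7c400) 0x89, CmdSN 0x2f3dd6 from world 3937473 to dev '
        '"naa.50002ac0049412fa" failed H:0x0 D:0x2 P:0x0 '
        'Valid sense data: 0xe 0x1d 0x0.'
    )
    fields = parse_scsi_failure(sample)
    print(fields["device"], is_ats_miscompare(fields))
```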

Disable ATS heartbeat on VMFS5 and VMFS6 datastores:

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5

Disable ATS heartbeat on VMFS3 datastores:

esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS3
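
If the change has to be applied to many hosts, it can be convenient to generate the command lines from one place. The sketch below only builds the argument lists for the two commands above and does not run anything. `ATS_HB_OPTIONS` and `ats_heartbeat_set_argv` are hypothetical helper names, and using `-i 1` to turn the setting back on is an assumption based on the option's 0 to 1 value range.

```python
# Option paths exactly as used in the commands above.
ATS_HB_OPTIONS = {
    "vmfs3": "/VMFS3/UseATSForHBOnVMFS3",
    "vmfs5": "/VMFS3/UseATSForHBOnVMFS5",  # also governs VMFS6 datastores
}


def ats_heartbeat_set_argv(kind, enabled):
    """Build the esxcli argument list that sets ATS heartbeat on or off."""
    return [
        "esxcli", "system", "settings", "advanced", "set",
        "-i", "1" if enabled else "0",
        "-o", ATS_HB_OPTIONS[kind],
    ]


if __name__ == "__main__":
    for kind in ("vmfs5", "vmfs3"):
        print(" ".join(ats_heartbeat_set_argv(kind, enabled=False)))
```

Keeping the option paths in one dictionary avoids the capitalization slips that are easy to make when typing these long names by hand.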

To review the results of changing the options, run these commands:

esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS3
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5

You see output similar to:

Path: /VMFS3/UseATSForHBOnVMFS3
Type: integer
Int Value: 0 <-- Check this value
Default Int Value: 0
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: Use ATS for HB on ATS supported VMFS3 volumes
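
To check the setting across several hosts, the output above can be parsed rather than read by eye. This is a minimal sketch that assumes the "Key: value" layout shown in the sample output; `parse_advanced_option` and `ats_heartbeat_enabled` are illustrative names, not part of esxcli.

```python
def parse_advanced_option(output):
    """Turn 'Key: value' lines from the list command into a dict."""
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def ats_heartbeat_enabled(output):
    """True if 'Int Value' is 1, ignoring any trailing annotation."""
    fields = parse_advanced_option(output)
    return int(fields["Int Value"].split()[0]) == 1


if __name__ == "__main__":
    sample = """Path: /VMFS3/UseATSForHBOnVMFS3
Type: integer
Int Value: 0
Default Int Value: 0
Min Value: 0
Max Value: 1
String Value:
Default String Value:
Valid Characters:
Description: Use ATS for HB on ATS supported VMFS3 volumes"""
    print(parse_advanced_option(sample)["Path"], ats_heartbeat_enabled(sample))
```

An "Int Value" of 0 means ATS is not used for the heartbeat on that class of datastore, which is the state the disable commands are meant to produce.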

Some additional helpful links:

https://kb.vmware.com/s/article/2146451

https://storagehub.vmware.com/#!/vsphere-storage/vmware-vsphere-apis-array-integration-vaai-1/atomic-test-set-ats/1

https://docs.vmware.com/en/VMware-vSphere/6.0/com.vmware.vsphere.storage.doc/GUID-DE30AAE3-72ED-43BF-95B3-A2B885A713DB.html

https://kb.vmware.com/s/article/52486