RHEL Clustering in vSphere with fence_vmware_soap as the fencing device.

This article is a simple guide to configuring RHCS (Red Hat Cluster Suite) in vSphere 6 using shared disks and VMware fencing with the fence_vmware_soap device.

The cluster node VMs can use RDMs (physical or virtual compatibility mode), or they can use shared VMDKs with the multi-writer option enabled on the shared SCSI disks.

Below is my Cluster node configuration:

  • RHEL 6
  • 4 vCPU and 12 GB RAM
  • 80 GB Thick Provisioned Lazy Zeroed disk for the OS
  • 5 GB shared quorum disk on a second SCSI controller with physical bus sharing (scsi1:0 = "multi-writer", scsi1:1 = "multi-writer")
  • 100 GB shared data drive for the application
  • Single network card

After creating node 1 (cln1.mylab.local), I cloned it to create cln2.mylab.local, assigned a new IP, and added the shared resources.

I added the quorum drive and data drive on node 1, then attached the same drives to node 2 using the Add Existing Hard Disk option in vCenter.

As I wanted to keep my nodes on separate physical servers (ESXi hosts) for hardware-failure resiliency, I had to use physical bus sharing. And because my quorum and data drives are VMDKs rather than shared SAN LUNs, I had to enable the SCSI multi-writer flag in the VMs' advanced (.vmx) configuration for the shared disks. Although VMFS is a clustered file system, a given VMDK can normally be opened by only one powered-on VM at a time.
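
For reference, with the VMs powered off, the relevant .vmx entries for the shared controller and disks look roughly like the lines below. This is a hedged sketch; scsi1 is the second controller used for the quorum and data disks in this build, so adjust the IDs to your own layout.

scsi1.sharedBus = "physical"
scsi1:0.sharing = "multi-writer"
scsi1:1.sharing = "multi-writer"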

Also make sure CBT (Changed Block Tracking, the ctkEnabled parameter) is disabled for the VMs. Check the linked KB.

http://kb.vmware.com/kb/2110452
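
If Changed Block Tracking was ever enabled on these VMs, the corresponding .vmx entries have to be turned off while the VM is powered off. A hedged sketch of what that looks like for the shared controller (ctkEnabled is the standard CBT parameter; adjust the SCSI IDs to your layout):

ctkEnabled = "FALSE"
scsi1:0.ctkEnabled = "FALSE"
scsi1:1.ctkEnabled = "FALSE"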

From the Conga GUI (luci) in RHEL, follow the instructions in the KBs below to create the cluster, add the cluster nodes, and add the VMware fence device. This article will make a lot more sense once you have gone through the Red Hat KBs.

For anyone who is new to fencing, the explanation below from Red Hat is excellent.

A key aspect of Red Hat cluster design is that a system must be configured with at least one fencing device to ensure that the services that the cluster provides remain available when a node in the cluster encounters a problem. Fencing is the mechanism that the cluster uses to resolve issues and failures that occur. When you design your cluster services to take advantage of fencing, you can ensure that a problematic cluster node will be cut off quickly and the remaining nodes in the cluster can take over those services, making for a more resilient and stable cluster.

After the cluster is created, GFS has to be set up to use the shared storage in the cluster. A running RHEL cluster is a mandatory requirement for creating the clustered file system, GFS.

https://access.redhat.com/solutions/63671

https://access.redhat.com/node/68064
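
Once the cluster is up, the shared data disk can be carved into a clustered LVM volume and formatted with GFS2. The sketch below assumes a cluster named "mycluster", the shared data disk showing up as /dev/sdc, clvmd running on both nodes, and a two-node cluster (hence two journals); the Red Hat KBs above remain the authoritative procedure.

pvcreate /dev/sdc
vgcreate -c y vg_data /dev/sdc        # -c y marks the volume group as clustered (requires clvmd)
lvcreate -n lv_data -l 100%FREE vg_data
mkfs.gfs2 -p lock_dlm -t mycluster:gfs_data -j 2 /dev/vg_data/lv_data   # -t <clustername>:<fsname>, -j = journals (one per node)
mkdir -p /data
mount -t gfs2 /dev/vg_data/lv_data /data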

[Screenshot: cluster1]

Run the clustat command to verify the cluster creation.

Configure the nodes by adding the VM node details: the guest name and UUID.

[Screenshot: cluster2]

[Screenshot: cluster3]

In the shared fence device option under the cluster tab, provide the vCenter server details and the account used for fencing.

Hostname: DNS name of your vCenter

Login: the fencing account that was created. I preferred to create a domain account (fence@mylab.local), since my vCenter is Windows-based, and grant specific permissions to that domain account.

A vCenter role dedicated to the fencing task was created and assigned to the "fence@mylab.local" user. The role requires permission to perform VM power operations.

[Screenshot: cluster4]

Run the command below to list the guest names and UUIDs.

fence_vmware_soap -z -l "fence@mylab.local" -p mypasswd -a vcenter.mylab.local -o list

cln1.mylab.local, 5453d1874-b34f-711d-4167-3d9ty3f24647

cln2.mylab.local, 5643b341-39fc-1383-5e6d-3a71re4c540d
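
For reference, the fencing pieces of /etc/cluster/cluster.conf end up looking roughly like the sketch below. This is a hedged example based on my reading of the fence_vmware_soap options rather than an exact copy of my file: the device name "vmware_fence" and the passwd value are placeholders, the port/uuid values come from the list output above, and you should verify the attribute names against the Red Hat KBs.

<clusternode name="cln1.mylab.local" nodeid="1">
  <fence>
    <method name="vmware">
      <device name="vmware_fence" port="cln1.mylab.local" uuid="5453d1874-b34f-711d-4167-3d9ty3f24647"/>
    </method>
  </fence>
</clusternode>
<clusternode name="cln2.mylab.local" nodeid="2">
  <fence>
    <method name="vmware">
      <device name="vmware_fence" port="cln2.mylab.local" uuid="5643b341-39fc-1383-5e6d-3a71re4c540d"/>
    </method>
  </fence>
</clusternode>
<fencedevices>
  <fencedevice agent="fence_vmware_soap" name="vmware_fence" ipaddr="vcenter.mylab.local" login="fence@mylab.local" passwd="mypasswd" ssl="on"/>
</fencedevices>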

The cluster is now ready to be tested. If a node encounters a problem, you can expect the fence device to power it off so the surviving node can take over the services.
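
Before relying on it, fencing can be exercised by hand. A hedged sketch: query the power status of a node directly through the agent, then fence the peer node from the surviving node using whatever is configured in cluster.conf.

fence_vmware_soap -z -a vcenter.mylab.local -l "fence@mylab.local" -p mypasswd -o status -n cln2.mylab.local
fence_node cln2.mylab.local

If the fenced node powers off (or resets) and the cluster services fail over cleanly, fencing is working.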

Leave a comment if you have queries….

Some Basic and Important vCenter Alerts.

vCenter 6 comes with a lot of default alarms, and I will list the configuration of a few of them that can be very handy.

In the vSphere C# Client, when you select the vCenter object and go to the Alarms tab, it displays all the alarms vCenter has to offer.

You can also select an individual vCenter object such as a VM, datastore, host, cluster, DVS, or port group and go to Alarm Definitions to find out which alarms are available for that particular object.

Example: test1 is a VM and datastore1 is a vmfs datastore

[Screenshot: Alarms]

[Screenshot: Alarms_1]

The best way to make use of these alarms is to have vCenter send email notifications to the admin.

vCenter needs to be configured for email notifications; below are the steps.

To configure email notifications for an alarm:

  1. Log in to vCenter Server.
  2. Click the Administration tab and select vCenter Server Settings.
  3. Select the Mail option.
  4. For the SMTP Server option, enter the IP address or the DNS name of the email/exchange server to which the alert notification must be sent.
  5. For the Sender Account, enter the email address from which the alert must be sent.
  6. Add the vCenter IP to the SMTP relay list on the Exchange server so it is allowed to send mail (a quick relay test follows this list).
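
A quick way to confirm the Exchange relay actually accepts mail from the vCenter server is a manual SMTP session from the vCenter Windows box (the host names and addresses below are placeholders for your own environment):

telnet mail.mylab.local 25
HELO vcenter.mylab.local
MAIL FROM:<vcenter-alarms@mylab.local>
RCPT TO:<admin@mylab.local>
DATA
Subject: vCenter SMTP relay test

Test message from vCenter.
.
QUIT

A 250-style response to the RCPT TO command means the relay will accept alarm mail from this source.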

[Screenshot: Alarms_2]

I have chosen the alarms below as examples; however, you can configure whichever ones interest you.

Datastore Usage on disk

A very useful alarm that monitors the space utilization of a VMFS datastore. The defaults are a warning at 75% utilization and an alert at 85% utilization. I am going to go with the same thresholds, assuming they are the VMware-recommended values.

[Screenshot: Alarms_3]

[Screenshots: Alarms_4, Alarms_5]

VM Has Snapshots

Another important alarm that I use is a snapshot alarm. It's a custom alarm, and below are the steps to create it. Right-click the blank area of the Alarm Definitions page and click New Alarm.

In the trigger type, select VM Snapshot Size.

Set the trigger conditions for Warning and Alert, and put your email ID in the reporting actions so you know when a snapshot exceeds a particular size. Please note that if you create a VM snapshot with a .vmsn file (i.e. including memory), this alarm counts that in the total VM snapshot size.

For example, if the warning is set to 10 GB and you create a snapshot with memory of a VM that has more than 10 GB of RAM, the alert triggers immediately.

[Screenshots: Alarms_6, Alarms_7]

Some other useful default alarms include the ones below, and they can be configured similarly.

  1. Virtual Machine Consolidation Needed status.
  2. VM CPU/Memory Usage
  3. vSphere HA Virtual Machine Monitoring Action
  4. ESXi host CPU/Memory Usage
  5. Host Connection and Power state
  6. Network Connectivity redundancy lost
  7. Network Connectivity lost
  8. Thin Provisioned Volume Capacity threshold exceeded

For any questions regarding these configurations, please comment on my post and I will get back to you.

For someone who does not have vRealize Operations Manager, these alarms can be really helpful. In the future I will be posting blogs about monitoring with vRealize Operations Manager.

Have a Nice Day…!

VDP Stuck in Admin State and All Backups failing due to imbalanced usage of data volumes

I recently came across this issue where my VDP appliance went into Admin state and all my backups would fail.

It's a 6.1.4.30 appliance with a dedupe capacity of 546 GB, created using 3 x 256 GB data drives.

This happened after more VMs were added to the backup job. One of those VMs turned out to be the Symantec server, which consumed a lot of space in the dedupe store.

The night the backup ran, I got the below alert from VDP:

The vSphere Data Protection storage is nearly full. You can free the space on the appliance by manually deleting the unnecessary or older backups, and modifying the retention policies of the backup jobs to shorten the backup retention time.
--------------------------------------------------------------------------------

Current VDP storage usage is 84.15%.

--------------------------------------------------------------------------------

The usage kept increasing until it reported "Current VDP storage usage is 96.92%." Obviously I should have looked into it then, but I missed it, resulting in the appliance going into read-only state.

The vSphere Data Protection storage is full. The appliance runs in the read-only mode till additional space is made available. You can free the space on the appliance by manually deleting unnecessary or older backups only.

Output of status.dpn:

Wed Sep 20 12:52:30 GST 2017  [dxb02-vlp-vdp1.dib.ae] Wed Sep 20 08:52:30 2017 UTC (Initialized Wed Apr 19 09:23:11 2017 UTC)

Node   IP Address     Version   State   Runlevel  Srvr+Root+User Dis Suspend Load UsedMB Errlen  %Full   Percent Full and Stripe Status by Disk

0.0  172.22.250.109 7.2.80-129  ONLINE fullaccess mhpu+0hpu+0hpu   1 false   0.63 5563  3463473  47.3%  47%(onl:574) 47%(onl:572) 47%(onl:572)

Srvr+Root+User Modes = migrate + hfswriteable + persistwriteable + useraccntwriteable

System ID: 1492593791@00:50:56:AA:50:57

All reported states=(ONLINE), runlevels=(fullaccess), modes=(mhpu+0hpu+0hpu)

System-Status: ok

Access-Status: Admin

From the log: /space/avamar/var/mc/server_log/mcserver.log.0

avtar Info <17844>: - Server is in read-only mode due to Diskfull

avtar Info <17972>: - Server is in Read-only mode.

dpnctl status

dpnctl: INFO: gsan status: degraded

The first step I took was to manually delete old, high-capacity restore points for some clients to free up space. The appliance dedupe usage came down to 82%, and I was expecting the backup jobs to run that night after garbage collection ran during the scheduled maintenance window.

The appliance did not return to full-access state even though the old backups were deleted and the dedupe usage had decreased to 67.5%.

The output of df -h shows that my data volumes data01, data02, and data03 are not used proportionately.

admin@dxb02-vlp-vdp1:~/>: df -h

Filesystem      Size  Used Avail Use% Mounted on

/dev/sda2        32G  6.1G   24G  21% /

udev            2.9G  148K  2.9G   1% /dev

tmpfs           2.9G     0  2.9G   0% /dev/shm

/dev/sda1       128M   37M   85M  31% /boot

/dev/sda7       1.5G  187M  1.2G  14% /var

/dev/sda9       138G  7.9G  123G   7% /space

/dev/sdb1       256G  173G   84G  68% /data01

/dev/sdc1       256G  170G   86G  67% /data02

/dev/sdd1       256G  143G  114G  56% /data03

To fix this, set the freespaceunbalance value to a higher percentage, depending upon the difference in usage you notice.

Steps:

  • Stop the maintenance scheduler: "avmaint sched stop --ava"
  • Create a checkpoint: "avmaint checkpoint --ava" (allow it to finish before moving ahead)
  • Perform a rolling integrity check: "avmaint hfscheck --rolling --ava" (allow it to finish before moving ahead)
  • Verify the checkpoint using "cplist"
  • Create another checkpoint: "avmaint checkpoint --ava"
  • Check the current percentage utilization of each data volume.
  • If the difference is more than 10% (the default value), run the below command.
  • "avmaint config --ava freespaceunbalance=20" (a quick check of the configured value is shown after these steps)
  • Check the status again using status.dpn
  • Start the maintenance scheduler: "avmaint sched start --ava"
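
If you want to double-check the value the server is set to before and after the change, something like the following should show it. This is a sketch that assumes running avmaint config --ava without an assignment prints the current server settings; verify against the Avamar/VDP documentation.

avmaint config --ava | grep -i freespaceunbalance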

If the appliance status does not switch to full-access mode after changing the value, increase it to 30% or more accordingly.

Let the appliance go through its maintenance window. Once more backups are configured and run, the dedupe space is expected to be utilized more evenly across all the data volumes.

If the issue persists, open a ticket with VMware Support.

Happy troubleshooting!

VMFS Datastores inaccessible even after PDL is Fixed

How to fix inaccessible datastore issues after you have hit a PDL

At times we come across situations where a VMFS datastore hosting virtual machines has entered a PDL (Permanent Device Loss) state, perhaps due to a SAN outage.

All paths to the device are marked as Dead, and obviously the VMs hosted on the LUN have crashed.

In the /var/log/vmkernel.log file, you see entries similar to:

cpu2:853571)VMW_SATP_ALUA: satp_alua_issueCommandOnPath:661: Path "vmhba4:C0:T0:L0" (PERM LOSS) command 0xa3 failed with status Device is permanently unavailable. H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0.
cpu2:853571)WARNING: vmw_psp_rr: psp_rrSelectPathToActivate:972:Could not select path for device "naa.60a98000572d54724a34642d71325763".
cpu2:853571)WARNING: ScsiDevice: 1223: Device :naa.60a98000572d54724a34642d71325763 has been removed or is permanently inaccessible.
cpu3:2132)ScsiDeviceIO: 2288: Cmd(0x4124403c1fc0) 0x9e, CmdSN 0xec86 to dev "naa.60a98000572d54724a34642d71325763" failed H:0x8 D:0x0 P:0x0
cpu3:2132)WARNING: NMP: nmp_DeviceStartLoop:721:NMP Device "naa.60a98000572d54724a34642d71325763" is blocked. Not starting I/O from device.
cpu2:2127)ScsiDeviceIO: 2316: Cmd(0x4124403c1fc0) 0x25, CmdSN 0xecab to dev "naa.60a98000572d54724a34642d71325763" failed H:0x1 D:0x0 P:0x0 Possible sense data: 0x5 0x25 0x0.
cpu2:854568)WARNING: ScsiDeviceIO: 7330: READ CAPACITY on device "naa.60a98000572d54724a34642d71325763" from Plugin "NMP" failed. I/O error.

Once the LUN is presented again from the storage side and storage connectivity is restored, you may find that the datastore is still not accessible on the host after a rescan. This can be caused by VM worlds still running on the host(s) for the VMs that were hosted on the device which was lost and declared PDL.

One way to restore the situation to normal is to find out which VM worlds are still running against that device on the ESXi host(s) and kill them. After killing all the VM worlds on every host in the cluster that uses that LUN, a rescan will rediscover the datastore.

Steps:

  1. SSH to the ESXi host.
  2. esxcli storage core device world list -d naa.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx (to identify the worlds still running against that device)
  3. esxcli vm process list (to list all VM world processes on the ESXi host)
  4. From the output of steps 2 and 3, make a note of the VM processes holding that device.
  5. esxcli vm process kill -t soft -w <world-ID> (a worked example follows the note below)

NOTE: The VM world could be running on any ESXi host in the cluster where the VM was powered on.
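
Putting the steps together, a worked sketch looks like this. The NAA ID comes from the vmkernel.log excerpt above, and the world ID 853571 is purely illustrative; use the IDs your own host returns.

# 1. Identify the worlds still holding the PDL device
esxcli storage core device world list -d naa.60a98000572d54724a34642d71325763

# 2. Cross-reference with the VM process list to find the owning VMs
esxcli vm process list

# 3. Kill the offending world; try soft first, escalating to hard/force only if needed
esxcli vm process kill --type=soft --world-id=853571

# 4. Rescan so the device and its VMFS datastore are rediscovered
esxcli storage core adapter rescan --all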

Another option to restore the situation to normal, if it was a SAN outage (your entire VM environment is down anyway), is to reboot all the ESXi hosts in the cluster. This will obviously clear out any running VM worlds (processes), and the device will be re-registered as part of the ESXi boot procedure.

Good Luck!!!!!!!!!!!