vSphere Host Isolation Response for Nutanix

I would like to start with the references, as they are the primary source of my understanding, so credit is due.

References:

1. http://www.joshodgers.com/2013/08/07/example-architectural-decision-host-isolation-response-for-a-nutanix-environment/

2. http://www.vbrain.info/2014/05/16/nutanix-autopathing-the-search-for-the-magic/

 

Design Decisions

Descriptions

DD01

Turn off the default isolation address

Justification

The management gateway being unavailable does not necessarily mean that VMs are isolated. Nutanix presents its datastore as an NFS mount, which is network-based storage. Availability of the VMkernel management network is therefore not a suitable indicator of isolation and may cause false alerts.

Impact

This is a change from the default configuration and must be documented in the operations guide.
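
As a minimal sketch, DD01 corresponds to the vSphere HA advanced option das.usedefaultisolationaddress. The helper below only builds that key/value pair as plain Python data. The function name is hypothetical, and applying the option to a cluster (through the vSphere Client or an API) is deliberately left out.

```python
def dd01_advanced_options():
    """Return the HA advanced option that stops HA from using the
    management network's default gateway as an isolation address."""
    return {"das.usedefaultisolationaddress": "false"}


if __name__ == "__main__":
    for key, value in dd01_advanced_options().items():
        print(f"{key} = {value}")
```

With the default address turned off, HA needs at least one explicit isolation address, which is what DD02 provides.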

DD02

das.isolationaddress1: DSF cluster IP address

Configure Host Isolation Response to: Power Off

Justification

In Nutanix, the CVM is responsible for presenting storage. To reliably determine whether storage is available, the CVM cluster IP address is used as the isolation address. If the cluster IP is not reachable, a catastrophic condition has likely occurred, because the node cannot function without storage. When the cluster IP is unreachable, either the host is isolated or the entire infrastructure has gone down; in either scenario, powering off the VMs is recommended.

CVMs communicate with each other and are load balanced. If a single CVM becomes unavailable, the I/O path changes to another available CVM through Autopath. An individual CVM's IP address is therefore not a suitable isolation address. Similarly, the internal IP address of the CVM cannot be used, because it is always reachable by the ESXi host.

Impact

Virtual machines will be powered off if the CVM cluster IP is unreachable. This is a change from the default configuration and must be documented in the operations guide.
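
The reasoning in DD02 can be sketched as follows. This is a deliberately simplified model, not how the HA agent is implemented: it assumes a host considers itself isolated only when it sees no HA agent traffic and none of its isolation addresses respond, and it ignores timers and elections. The function names and the address 10.10.10.50 are placeholders.

```python
import ipaddress


def dd02_advanced_options(dsf_cluster_ip):
    """Build the HA advanced option that points isolation checks
    at the DSF cluster IP address."""
    # Raises ValueError if the string is not a valid IP address.
    ipaddress.ip_address(dsf_cluster_ip)
    return {"das.isolationaddress1": dsf_cluster_ip}


def host_is_isolated(sees_ha_traffic, reachable, isolation_addresses):
    """Simplified isolation check: the host is treated as isolated when it
    sees no HA agent traffic and cannot reach any isolation address."""
    if sees_ha_traffic:
        return False
    return not any(addr in reachable for addr in isolation_addresses)


if __name__ == "__main__":
    options = dd02_advanced_options("10.10.10.50")  # placeholder address
    addresses = list(options.values())
    # Cluster IP still answers: the host does not trigger the isolation response.
    print(host_is_isolated(False, {"10.10.10.50"}, addresses))
    # Cluster IP unreachable and no HA traffic: VMs would be powered off.
    print(host_is_isolated(False, set(), addresses))
```

In this model a single CVM failing does not change the outcome, because the cluster IP stays reachable as long as any CVM in the cluster answers for it.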

DD03

For CVMs, override the cluster settings and configure Host Isolation Response to “Leave Powered On”

Justification

When a host is isolated, its CVM still has access to local storage and can continue to run. When the isolation event is over, the CVM resynchronizes with the other CVMs in the cluster, avoiding the additional delay that would occur if the CVM had been powered off.

Impact

Overriding the cluster setting with “Leave Powered On” is a change from the default configuration and must be documented in the operations guide.
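
How the per-VM override in DD03 interacts with the cluster default from DD02 can be sketched as a simple lookup. The VM names, the "-CVM" suffix used to pick out controller VMs, and the function names are assumptions for illustration; the response labels are the ones used in the decisions above.

```python
CLUSTER_DEFAULT = "Power Off"       # cluster-wide response from DD02
CVM_RESPONSE = "Leave Powered On"   # per-VM override from DD03


def cvm_overrides(vm_names, suffix="-CVM"):
    """Build per-VM overrides for every VM whose name ends with the
    (assumed) CVM naming suffix."""
    return {name: CVM_RESPONSE for name in vm_names if name.endswith(suffix)}


def isolation_response(vm_name, overrides, cluster_default=CLUSTER_DEFAULT):
    """A per-VM override wins; otherwise the cluster default applies."""
    return overrides.get(vm_name, cluster_default)


if __name__ == "__main__":
    vms = ["NTNX-node1-CVM", "app-server-01", "db-server-01"]  # hypothetical names
    overrides = cvm_overrides(vms)
    for vm in vms:
        print(f"{vm}: {isolation_response(vm, overrides)}")
```

Only the CVMs receive an override, so every guest VM keeps following the cluster-wide setting.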