Availability & Recoverability Matrix of View Components

While working on a View design project, I had to do a lot of reading to make sure the Horizon View components were highly available. Eventually I felt it would be a good idea to create the matrix below to find each Single Point of Failure (SPOF) and show how to address it at each layer (service, OS, hypervisor, storage). Below I list the main components in View, their potential failure points, and the ways those can be addressed.

So we made a design decision to protect the Horizon View solution using Windows Server Failover Cluster (WSFC).

In most use cases, vSphere HA is sufficient to meet the availability requirements of a solution. With vSphere HA, if an ESXi host fails or reboots, the guest OS and its application are back in operation in less than 15 minutes. Note that the standard figure is less than 5 minutes, but the application also needs some time to start. I’m referring to a generic application here, not one with multiple dependencies; vCenter, for example, has downstream dependencies on its database and DNS.

Let’s first look at the scenarios in which vCenter can fail. vCenter as a VM can fail if the ESXi host fails; vCenter as a guest OS can fail if the OS gets corrupted or is hit by a virus; and vCenter as a service can fail, for example because its database is unavailable, among many other possible causes.

What are the protection mechanisms for each type of failure?

  1. vCenter as a VM and OS is protected by vSphere HA, which restarts it, but there is a 10-15 minute delay before vCenter is back in operation.
  2. vCenter services can be protected by the built-in watchdog and by a Microsoft failover cluster. In this case vCenter can return to operation in less than a minute. But the heart of vCenter, its database, still remains a SPOF.

Accordingly, each component is tabulated below.

| Component | ESXi Failure | OS Failure | Application Failure | Impact on Availability | Special Configuration |
|---|---|---|---|---|---|
| vCenter | Reboots vCenter node | VM monitoring with VMware HA will attempt to restart the OS | vCenter services are seamlessly failed over to the second node. Near-zero downtime | Minimal impact, as services are offered by the second node | 2-node MS Failover Cluster must be configured; anti-affinity rule must be configured |
| MS SQL Database | Reboots SQL Server node | VM monitoring with VMware HA will attempt to restart the OS | MS SQL services are seamlessly failed over to the second node. Near-zero downtime | Minimal impact, as services are offered by the second node in the MS Failover Cluster | 2-node MS Failover Cluster must be configured; anti-affinity rule must be enabled |
| View Connection Server | Reboots View Connection Server | VM monitoring with VMware HA will attempt to restart the OS | Load balancer removes this node from membership so that traffic is redirected to the other nodes. Zero downtime | Zero impact, as services are offered by the other nodes behind the load balancer | Anti-affinity rule must be enabled |
| View Composer | Reboots View Composer server | VM monitoring with VMware HA will attempt to restart the OS | No built-in protection. 5-10 minutes downtime | VM is rebooted; provisioning and recompose operations are impacted during the outage window | |
| Persona | Reboots file server node | VM monitoring with VMware HA will attempt to restart the OS | File services are seamlessly failed over to the second node. Near-zero downtime | Minimal impact, as services are offered by the second node in the WSFC cluster | 2-node MS Failover Cluster must be configured; anti-affinity rule must be enabled |
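
The application-failure column of the matrix can also be kept as data and scanned for components that have no seamless failover. The following is only a minimal sketch: the class, field and function names are hypothetical, and the values are transcribed from the matrix above.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    app_failover: str  # mechanism that takes over on application failure, "" if none
    app_downtime: str  # downtime on application failure, as stated in the matrix


# Values transcribed from the matrix; the structure itself is illustrative.
MATRIX = [
    Component("vCenter", "WSFC", "near zero"),
    Component("MS SQL Database", "WSFC", "near zero"),
    Component("View Connection Server", "Load balancer", "zero"),
    Component("View Composer", "", "5-10 minutes"),
    Component("Persona", "WSFC", "near zero"),
]


def components_without_failover(matrix):
    """Return the names of components with no seamless application failover."""
    return [c.name for c in matrix if not c.app_failover]


print(components_without_failover(MATRIX))  # ['View Composer']
```

View Composer is the only entry flagged, which matches the matrix: it relies on a VM restart rather than on a second node.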

Design Justification

  1. A number of components (as per the downstream dependency figure below) depend on the database (MS SQL in our case); unavailability of the database can potentially bring the entire solution down.
  2. Even if the database is protected using vSphere HA, it might take 15 minutes to become operational again. After that, vCenter might take another 15 minutes to come up, as the vCenter service depends on the vCenter database. Similarly, the View Composer service will need 15 minutes to become operational. vCenter and View Composer services can, however, come online together if they are co-located on the same server.
  3. After that, the View Connection Server will be able to send commands to vCenter and View Composer.
  4. The figure below shows the delay from the point the database server reboots until the View Connection Server is ready to send any operation command to vCenter/View Composer. If WSFC is not used, the View solution might be unavailable for a minimum of 30 minutes.

    Horizon View Availability Impact with no WSFC
  5. With a WSFC cluster, if the database server fails, database failover happens in less than 5 minutes; vCenter and View Composer services are not impacted, and therefore there is no impact on the View Connection Server.
  6. The vCenter database is the heart of vCenter and must be highly available. Since the View Connection Server depends on vCenter for various operations, unavailability of the vCenter database will impact vCenter, View Composer and the View Connection Server.
  7. WSFC will protect vCenter services against OS failure, ESXi failure and service failure.
  8. The View Composer database will be protected by WSFC. The View Composer service will not be protected using WSFC; vSphere HA for applications will protect it against OS and service failure. There is a potential downtime of 5-10 minutes.

    WSFC is complex to configure and manage. Considering the impact it can have on the overall availability of the solution, the effort of installing, configuring and managing WSFC is worth it. With WSFC, maintenance of all components can be done without significant downtime, because all components are redundant.
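
The 30-minute figure in the justification above is the sum of restart delays along the dependency chain. A rough way to reproduce that arithmetic is sketched below; the service names and the dictionary layout are hypothetical, the minute values are the estimates quoted above, and the 0-minute value for the View Connection Server is an assumption (it stays up and only waits on its dependencies).

```python
# Assumed dependency chain: a service is ready only after everything it depends on is ready.
DEPENDS_ON = {
    "database": [],
    "vcenter": ["database"],
    "view_composer": ["database"],
    "connection_server": ["vcenter", "view_composer"],
}

# Without WSFC: each tier needs about 15 minutes after its dependency is back.
NO_WSFC_MINUTES = {"database": 15, "vcenter": 15, "view_composer": 15, "connection_server": 0}

# With WSFC: database failover takes under 5 minutes (upper bound used here);
# vCenter and View Composer are described as not impacted.
WSFC_MINUTES = {"database": 5, "vcenter": 0, "view_composer": 0, "connection_server": 0}


def ready_after(service, delay, depends_on=DEPENDS_ON):
    """Minutes until `service` is usable again after a database server outage."""
    waits = [ready_after(dep, delay, depends_on) for dep in depends_on[service]]
    # Dependencies recover in parallel, so only the slowest one matters.
    return max(waits, default=0) + delay[service]


no_wsfc = ready_after("connection_server", NO_WSFC_MINUTES)
with_wsfc = ready_after("connection_server", WSFC_MINUTES)
print(no_wsfc, with_wsfc)  # 30 5
```

vCenter and View Composer recover in parallel, which is why the result is 30 minutes rather than 45.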

    Downstream Dependency Map


    As technical architects we can easily conclude that not having WSFC can bring the entire solution down. If it were just vCenter, WSFC would not be needed, but as the dependency on vCenter services increases, your architecture must balance complexity with availability.
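
To see why a single database outage reaches the whole solution, the downstream dependencies can be written as a small graph and walked upwards from the failed component. This is only a sketch: the edges below are assumptions inferred from the text (the original dependency figure is not reproduced here), and the function name is hypothetical.

```python
# service -> set of services it depends on (assumed from the text, not from the figure)
DEPENDS_ON = {
    "MS SQL Server": set(),
    "vCenter database": {"MS SQL Server"},
    "View Composer database": {"MS SQL Server"},
    "vCenter": {"vCenter database"},
    "View Composer": {"View Composer database"},
    "View Connection Server": {"vCenter", "View Composer"},
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every service that directly or indirectly depends on `failed`."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for service, deps in depends_on.items():
            if current in deps and service not in impacted:
                impacted.add(service)
                frontier.append(service)
    return impacted


print(sorted(impacted_by("MS SQL Server")))
print(sorted(impacted_by("View Composer")))
```

Under these assumed edges, a failure of the SQL server reaches every other entry in the map, while a failure of View Composer reaches only the View Connection Server.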