Objective 2.3 – Build Availability Requirements into the Logical Design

In this objective, VMware refers to five important topics:

Manageability refers to the OPEX (operational expenditure) required to run the system.

Performance is the ability to deliver a service within the required key metrics. These are usually tied to the organization’s KPIs.

Recoverability refers to system recovery after a failure. A virtual machine restarted by an HA event can be categorized under recoverability, as can disaster recovery through products like SRM.

Security refers to access controls on the system. It is important to understand which groups need access to which systems.

Availability refers to system uptime. It is usually expressed as a percentage in an SLA; for example, a production Tier 1 application should be up and running 99.99% of the time.
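To make an SLA percentage concrete, it helps to convert it into a downtime budget. The following is a minimal sketch; the function name and the 365-day year are my own assumptions, not anything VMware prescribes.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(sla_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget (in minutes) implied by an uptime SLA."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    for sla in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

For a 99.99% SLA this works out to a little under 53 minutes per year, which is a useful figure to keep in mind when deciding how much redundancy a Tier 1 application needs.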

The key to deploying high availability is to get rid of single points of failure. The failure of any non-redundant component in the infrastructure may cause an outage of the whole system. Providing resiliency at vulnerable points helps reduce or eliminate downtime caused by hardware failures, such as:

  • Server components, such as NICs or HBAs
  • Blades and Blade chassis
  • Network Switches/Cables
  • Storage Arrays and Storage Networking
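The effect of removing a single point of failure can be estimated with basic probability. The sketch below assumes component failures are independent (rarely perfectly true in practice) and uses made-up availability figures; the function names are hypothetical.

```python
import math


def series_availability(*availabilities):
    """All components are required: the chain is up only if every part is up."""
    return math.prod(availabilities)


def redundant_availability(availability, copies):
    """Any one of `copies` identical components is enough to stay up."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    switch = 0.99  # illustrative figure only, not a vendor number
    single_path = series_availability(switch, switch)  # two switches in a row
    dual_switch = redundant_availability(switch, 2)    # two switches side by side
    print(f"Two switches in series:   {single_path:.4f}")
    print(f"Two redundant switches:   {dual_switch:.4f}")
```

Chaining components multiplies their availabilities and so lowers the total, while duplicating a component raises it. This is why each item in the list above is a candidate for redundancy.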

Let us look at VMware High Availability in detail, along with some of the best practices:

Host Best Practices:

  • Always select hosts with redundant power supplies, error-correcting memory, remote monitoring and so on
  • If you’re using blades, consider deploying them across multiple chassis/racks
  • Use identical hardware. Different hardware leads to an unbalanced cluster, which directly impacts admission control
  • Size all the hosts equally; otherwise excess capacity is reserved to handle failure of the largest node
  • Avoid creating “Must” DRS rules: HA has to respect them, which can prevent VM restarts
  • Cluster sizing is important; a cluster should be neither too small nor too large. For example, in a cluster with 3 nodes, 33% of capacity needs to be reserved to tolerate a single node failure, whereas a cluster with 10 nodes requires just 10% of capacity to be reserved
  • Avoid mixing ESX/ESXi versions in the HA cluster
  • With Auto Deploy in mind, include the vmware-fdm VIB in the image profiles
  • If vCenter is running as a virtual machine, run a couple of hosts without Auto Deploy and pin the vCenter VM to those hosts with a VM-to-host rule. Also set its VM restart priority to high so that vCenter comes back up quickly and restores access to components such as Auto Deploy itself, DRS, vDS and so on
  • Keep VM sizes consistent, as this has a direct impact on the slot size, especially with full CPU/memory reservations
  • Increase the management network’s resiliency by using a combination of onboard and PCI-based NICs
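The cluster sizing figures above (33% for 3 nodes, 10% for 10 nodes) follow from a simple ratio. Here is a small sketch, assuming all hosts are sized equally as recommended; the function name is my own.

```python
def reserved_fraction(hosts, failures_to_tolerate=1):
    """Fraction of cluster capacity to keep free, assuming equally sized hosts."""
    if hosts < 2:
        raise ValueError("an HA cluster needs at least 2 hosts")
    if not 0 < failures_to_tolerate < hosts:
        raise ValueError("failures_to_tolerate must be between 1 and hosts - 1")
    return failures_to_tolerate / hosts


if __name__ == "__main__":
    for n in (3, 5, 10):
        print(f"{n} hosts: reserve {reserved_fraction(n):.0%} to survive one host failure")
```

With unequal hosts the ratio no longer holds, because capacity has to be reserved for the largest host. That is the reason behind the equal sizing recommendation.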

Networking Best Practices:

  • Enable PortFast on the physical switches. This prevents a host from incorrectly determining that a network is isolated while lengthy spanning tree algorithms run on boot
  • Disable host monitoring while performing any network maintenance; otherwise it might trigger network isolation
  • Although the new HA architecture isn’t DNS dependent, it is still considered a best practice to have hosts and VMs resolvable by DNS
  • Use standard naming conventions for port groups; this is required for DRS migrations as well
  • If firewall rules are in place, TCP/UDP port 8182 should be open on the network. Also ensure that no service running in the ESX console uses port 8182
  • Management network redundancy is a must! Use NIC teaming, with the NICs connected to redundant physical switches
  • As an alternative to NIC teaming, you can set up a second management network on a different vSwitch so that network heartbeats are sent over both networks

Storage Best Practices:

  • Use redundant HBAs, switches and cables for path failover
  • Choose an appropriate multipathing policy to provide HA as well as load balancing
  • For iSCSI, set up an additional initiator to enable multipathing
  • Let vCenter select the datastore automatically for Datastore Heartbeating
  • Ensure all the hosts in the cluster have access to all the datastores
  • Do not enable HA on a cluster that mixes ESXi 5.0 hosts with earlier versions if Storage vMotion is used extensively or Storage DRS is enabled

Admission Control considerations:

  • Number of host failures: uses slots sized by the highest reservation, and as such can lead to a conservative consolidation ratio
  • Percentage of cluster resources: gives weaker guarantees, but as of vSphere 4.1 resources can be defragmented
  • The recommended HA admission control policy is percentage of cluster resources. It’s not the default, but it is the most flexible!
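To see why the slot-based policy can be conservative, here is a simplified sketch of the slot arithmetic. It is not VMware’s exact algorithm: it ignores memory overhead, the `min_cpu_mhz` and `min_mem_mb` floors are placeholders I added to avoid dividing by zero, and all class names, function names and figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    cpu_reservation_mhz: int
    mem_reservation_mb: int


@dataclass
class Host:
    name: str
    cpu_mhz: int
    mem_mb: int


def slot_size(vms, min_cpu_mhz=32, min_mem_mb=1):
    """Slot = highest CPU reservation and highest memory reservation in the cluster."""
    cpu = max([vm.cpu_reservation_mhz for vm in vms] + [min_cpu_mhz])
    mem = max([vm.mem_reservation_mb for vm in vms] + [min_mem_mb])
    return cpu, mem


def slots_per_host(host, slot):
    """A host holds as many slots as its scarcer resource allows."""
    slot_cpu, slot_mem = slot
    return min(host.cpu_mhz // slot_cpu, host.mem_mb // slot_mem)


def usable_slots(hosts, slot, host_failures=1):
    """Slots left after setting aside the hosts with the most slots."""
    counts = sorted(slots_per_host(h, slot) for h in hosts)
    keep = max(len(counts) - host_failures, 0)
    return sum(counts[:keep])


if __name__ == "__main__":
    hosts = [Host(f"esx0{i}", 20000, 65536) for i in (1, 2, 3)]
    small = [VM("web1", 500, 1024), VM("web2", 500, 1024)]
    mixed = small + [VM("db1", 2000, 4096)]
    for label, vms in (("small VMs only", small), ("with one large VM", mixed)):
        slot = slot_size(vms)
        print(f"{label}: slot={slot}, usable slots={usable_slots(hosts, slot)}")
```

In this made-up cluster, a single VM with a large reservation cuts the number of usable slots from 80 to 20, even though most VMs are small. The percentage-based policy avoids this by working from the actual reservations of each VM.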

Fault Tolerance:

FT provides high availability for virtual machines. Built on vLockstep technology, FT provides continuous availability by running identical virtual machines (primary and secondary) in virtual lockstep on separate hosts. If the primary VM fails, the secondary VM takes over immediately, providing HA to the users. Likewise, when the secondary VM fails, a new secondary VM is started and FT is re-established within a few seconds.

Let us look at some of the best practices to deploy FT:

  • Hosts running primary and secondary VMs should operate at the same processor frequency; otherwise the secondary VM might be restarted more frequently
  • Apply the same instruction set extension configuration (enabled or disabled) to all hosts
  • Use CPUs from a compatible processor group and the same ESXi version
  • Ensure the hosts have access to all the datastores in the cluster
  • Use the same virtual machine network configuration
  • Use 10 Gbit NICs and enable jumbo frames
  • Do not run more than 4 FT-enabled VMs on any single host
  • For NFS, ensure at least 1 Gbit of dedicated network bandwidth
  • For resource pools, ensure enough memory is reserved
  • Maximum of 16 virtual disks per FT machine
  • Minimum of 3 hosts in the cluster
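The numeric limits in the list above lend themselves to a quick design check. The following sketch validates a planned FT layout against them. The dictionary layout and function name are hypothetical, and counting both primary and secondary VMs towards the per-host limit is my assumption.

```python
from collections import Counter

MAX_FT_VMS_PER_HOST = 4
MAX_DISKS_PER_FT_VM = 16
MIN_HOSTS_IN_CLUSTER = 3


def check_ft_design(host_count, ft_vms):
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    if host_count < MIN_HOSTS_IN_CLUSTER:
        findings.append(
            f"cluster has {host_count} hosts, fewer than {MIN_HOSTS_IN_CLUSTER}"
        )
    per_host = Counter()
    for vm in ft_vms:
        if vm["primary_host"] == vm["secondary_host"]:
            findings.append(f"{vm['name']}: primary and secondary share a host")
        if vm["disks"] > MAX_DISKS_PER_FT_VM:
            findings.append(f"{vm['name']}: {vm['disks']} virtual disks exceeds the limit")
        # Assumption: primaries and secondaries both count towards the host limit.
        for host in {vm["primary_host"], vm["secondary_host"]}:
            per_host[host] += 1
    for host, count in sorted(per_host.items()):
        if count > MAX_FT_VMS_PER_HOST:
            findings.append(f"{host}: {count} FT VMs exceeds the limit")
    return findings


if __name__ == "__main__":
    plan = [
        {"name": f"ft-{i}", "primary_host": "esx01", "secondary_host": "esx02", "disks": 2}
        for i in range(5)
    ]
    for finding in check_ft_design(3, plan):
        print(finding)
```

The example plan places five FT pairs on the same two hosts, so both hosts are flagged. Spreading the pairs across the third host would clear the findings.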

Difference between Business Continuity and Disaster Recovery:

Business Continuity is more of a proactive approach, where you have already decided on plans to keep your systems running smoothly after a disaster.

Disaster Recovery is more of a reactive approach to how the system is recovered from a disaster. This topic is discussed in more detail under Objective 2.6.

General considerations:

  • When a secondary Service Console is used, the recommended das.failuredetectiontime value is 20000 (milliseconds)
  • When an additional isolation address is used, the recommended value is 20000
  • When spanning tree is not set to PortFast, the recommended value is > 60000
  • Specify an additional isolation address to avoid a false positive
