This objective focuses mainly on Disaster Recovery (DR) and Business Continuity Planning (BCP). This involves implementing both non-replicated and replicated storage for virtual machines. Most storage arrays today have built-in replication capabilities. Disaster recovery can cover a single virtual machine or an entire site.
Let us look at DR and BCP in more detail.
So what is required to implement Disaster Recovery? A DR site, to start with.
A DR site should be a remote site to which all workloads can be shifted in the event of a disaster. DR sites are either dedicated or non-dedicated. Because dedicated DR sites serve no purpose other than disaster recovery, they incur high operational costs. Non-dedicated DR sites are regional data centers that can also be used for DR.
Then comes data replication, which is done from the primary site to the recovery site. This requires proper planning of WAN connectivity, storage replication technology and so on. It is also important to consider the distance to the recovery site: if it is too remote, you can have problems relocating your staff, and if it is in the same location as the primary site, a complete disaster may leave you unable to reach it at all.
Once you have identified the components, the DR site and so on, it is time to build a Disaster Recovery Plan (DRP). From a virtualization perspective, the DRP should include hypervisor setup (if applicable), mounting SAN snapshots or replicated LUNs and so on, and it is important to test the DRP with business operations in mind. Note also that DR and BCP are not the same: a DRP is a solution for handling the chaos of a disaster, while BCP is a streamlined process for operating the systems smoothly at the recovery site.
Let us look at the DRP process in detail. Some of the steps that we can follow:
Enable Management buy-in: Management must agree on the scope of DR/BCP and fund it. Managers from all the streams are required to take part in the planning process.
BIA: The Business Impact Analysis captures the key inventory, MTD (Maximum Tolerable Downtime), application dependencies and so on.
RPO: Define the Recovery Point Objective, which is the point in time to which a system must be recovered after an outage.
RTO: Define the Recovery Time Objective, which is the time required to restore services after an outage. This should also include the time taken by management to decide to hit the failover button.
Risk Assessment: Identify the risks that can lead to a potential outage. It can be anything: a natural calamity, a power outage in the building, a fire in the building and so on.
Regulatory Compliance: Ensure the RTOs and RPOs are in line with regulatory compliance requirements. Some government regulations do not allow data to be hosted in a different geographical location.
Develop DRPs: After the pre-analysis, you can start to develop a full-fledged DRP.
Finally, you create DR/BCP runbooks and perform thorough testing.
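The RPO and RTO targets from the steps above can be checked against the results of a DR test with a few lines of Python. This is only a minimal sketch: the class names, the `meets_targets` function and the sample figures are hypothetical and not part of any VMware tool. It counts the management decision time towards the RTO, as described above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrTargets:
    rpo: timedelta  # maximum acceptable data loss, expressed as time
    rto: timedelta  # maximum acceptable time to restore service


@dataclass(frozen=True)
class DrTestResult:
    data_loss: timedelta      # age of the last usable replica at failure time
    decision_time: timedelta  # time management took to approve the failover
    recovery_time: timedelta  # time to bring services up at the recovery site


def meets_targets(targets: DrTargets, result: DrTestResult) -> dict:
    # The RTO covers the decision delay as well as the technical recovery.
    total_downtime = result.decision_time + result.recovery_time
    return {
        "rpo_met": result.data_loss <= targets.rpo,
        "rto_met": total_downtime <= targets.rto,
        "total_downtime": total_downtime,
    }


# Hypothetical targets and test measurements.
targets = DrTargets(rpo=timedelta(minutes=30), rto=timedelta(hours=4))
result = DrTestResult(
    data_loss=timedelta(minutes=20),
    decision_time=timedelta(hours=1),
    recovery_time=timedelta(hours=3, minutes=30),
)
report = meets_targets(targets, result)
print(report)
```

In this made-up example the RPO is met, but the RTO is missed once the one-hour decision delay is added to the technical recovery time.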
Let us look at how VMware maps its products to HA and DR/BCP:
VMware vCenter Heartbeat: Protects vCenter Server, including automatic failover and failback of vCenter Server.
NIC teaming and storage multipathing can also be considered High Availability components, and we have already covered VMware HA and FT in the earlier objective 2.3.
VMware provides VMware Data Recovery as the backup solution. It offers agentless, disk-based backup and recovery of VMs. Some of the key features include:
- Backups occur independent of the power state and location of the VM
- Restores can be at file level or of the whole VM
- Data deduplication to save disk space
- Centralized management through vCenter Server
For a complete DR Solution, VMware provides VMware vCenter Site Recovery Manager:
SRM enables integration with array-based replication, as well as the use of a native VMware vSphere®–based replication engine, discovery and management of replicated datastores, automated migration of inventory between vCenter environments, automated reprotection, and failback of environments.
Let's look at some features of SRM:
Array-Based Replication: When using array-based replication, one or more storage arrays at the protected site replicate data to peer arrays at the recovery site. Storage replication adapters (SRAs) enable integration of SRM with a wide variety of arrays.
vSphere Replication: With vSphere Replication (VR), SRM uses vSphere replication technologies to replicate data to servers at the recovery site. VR uses the vSphere Replication Management Server (VRMS) to manage the VR infrastructure. VR requires installing the VR Server (VRS) virtual appliance and the VRMS virtual appliance, both of which can be installed with SRM during the installation process. VR does not require storage arrays: a VR replication source and target can be any regular storage device, including, but not limited to, storage arrays.
Protection Groups and Recovery Plans: A protection group is a collection of virtual machines and templates. A recovery plan specifies how the virtual machines in a specified set of protection groups are recovered. In the case of virtual machines replicated using array-based replication, protection groups are composed of virtual machines that use the same replicated datastore group.
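The rule that array-based protection groups follow replicated datastore groups can be illustrated with a small grouping sketch. The VM names, the datastore group labels and the `build_protection_groups` function are hypothetical; SRM discovers datastore groups itself, and this code does not call any SRM API.

```python
from collections import defaultdict


def build_protection_groups(vm_to_datastore_group):
    """Group VM names by the replicated datastore group they live on."""
    groups = defaultdict(list)
    for vm, ds_group in vm_to_datastore_group.items():
        groups[ds_group].append(vm)
    # Sort keys and members so the output is stable and easy to read.
    return {ds: sorted(vms) for ds, vms in sorted(groups.items())}


# Hypothetical inventory: VM name -> replicated datastore group.
inventory = {
    "web01": "dsgroup-A",
    "web02": "dsgroup-A",
    "db01": "dsgroup-B",
    "app01": "dsgroup-A",
}

protection_groups = build_protection_groups(inventory)
for ds_group, vms in protection_groups.items():
    print(ds_group, vms)
```

Every VM on the same datastore group ends up in the same protection group, which is why datastore layout should be planned with recovery in mind.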
Most DRP requirements map to SRM features:
- Site Failover
- Automated DRP Testing
- Automated DRP Testing Reporting
- Single/Multiple Site Recovery
- Customized Network/Storage at the Recovery Site
- Managed Capacity at Recovery Site
- Integration with Storage Vendor Support
Some design considerations and best practices for SRM:
- Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are the two most important performance metrics you need to keep in mind while designing and executing a disaster recovery plan. For array-based replication, RPO is fulfilled by the replication schedules configured on the storage array
- The minimum RPO you can set with VR is 15 minutes. The VR algorithm adjusts the replication schedule dynamically in order to fulfill the RPO
- Specify a non-replicated datastore for swap files
- It is recommended that the SRM database be installed as close to the SRM server as possible, to reduce the round-trip time between them. This way recovery time will not suffer greatly because of round trips to the database server
- It is a good practice to have fewer but larger NFS volumes so that the time taken to mount a large number of such volumes decreases during the recovery. This might also translate to fewer protection groups on your setup leading to reduced recovery time
- It is a good practice to have DRS enabled on a recovery site
- More hosts lead to more concurrency for recovering VMs which results in a shorter recovery time
- Before protecting VMs, bring recovery site hosts out of standby mode so that they can be used for creating placeholder VMs
- Configuring VM dependencies across priority groups instead of setting per-VM dependencies is usually the best approach, because VMs within each priority group are started in parallel
- It is strongly recommended that VMware Tools be installed in all protected virtual machines in order to accurately acquire their heartbeats and network change notifications
- Make sure no internal script or callout prompt blocks recovery indefinitely
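To see why more hosts and priority groups shorten recovery time, here is a rough back-of-the-envelope model in Python. It is a toy sketch and not how SRM schedules recovery: the function name, the per-host concurrency of 2 and the 3 minutes per VM are assumptions chosen purely for illustration.

```python
import math


def estimate_recovery_minutes(priority_groups, hosts, vms_per_host=2, minutes_per_vm=3):
    """Toy estimate: groups run in order, VMs within a group start in parallel.

    Assumes (hypothetically) that each host can recover `vms_per_host` VMs at
    once and that every VM takes `minutes_per_vm` minutes to recover.
    """
    concurrency = hosts * vms_per_host
    total = 0
    for vm_names in priority_groups:
        # VMs in one priority group are recovered in parallel "waves".
        waves = math.ceil(len(vm_names) / concurrency)
        total += waves * minutes_per_vm
    return total


# Hypothetical recovery plan: priority group 1 first, then 2, then 3.
plan = [
    ["dc01", "dns01"],
    ["db01", "db02", "db03", "db04"],
    ["web%02d" % i for i in range(1, 13)],  # 12 web servers
]

few_hosts = estimate_recovery_minutes(plan, hosts=1)
more_hosts = estimate_recovery_minutes(plan, hosts=3)
print(few_hosts, more_hosts)
```

Under these assumptions, tripling the number of recovery hosts cuts the estimate by more than half, because each priority group needs fewer waves to complete.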