Virtual Machines for Disaster Recovery Planning
Example Recovery Site Implementation
Let's take an example of a corporate site and a recovery site, comparing physical and virtual implementations for servers. For this discussion, we'll assume that the customer must purchase or provide the hardware for the recovery site, and that this has happened before a disaster occurs.
For a very simple case, let's consider 10 servers in a physical environment at a corporate data center (site P1) with a recovery to 10 servers at a recovery site (site R1). For this implementation, we need 10 servers at the recovery site.
When performing recovery, we have two options: bare-metal restore or typical tape-backup restore. In either case, we must address OS types (which may require different software), plug-and-play issues (if the recovery site hardware is different), and re-licensing issues.
The recovery site costs roughly the same as the original site, and recovery requires two components: data recovery (typically from tape) and hardware reconfiguration.
If we take the same number of servers (10 VMs) running on one VMware ESX Server at the corporate data center (site P2), we now require only one server at the remote site (site R2). In terms of cost, the recovery phase requires a significantly smaller budget than the physical implementation above, because there is only one physical machine to set up and install.
For performing recovery, we need only the virtual disks for each machine and the virtual configuration file for each machine. We don't need an operating system, backup/restore software, or physical configuration.
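As a rough illustration of how little has to be present at the recovery site, the following Python sketch scans a restore directory and reports which VMs have both a configuration file and at least one virtual disk. The layout (one subdirectory per VM), the function name, and the `.vmx` and `.vmdk` extensions are assumptions made for the example. Adjust them to match your own environment.

```python
from pathlib import Path

# Assumed file extensions; your hosts may use different ones.
CONFIG_EXT = ".vmx"
DISK_EXT = ".vmdk"


def find_recoverable_vms(restore_root):
    """Return {vm_name: config_path} for VMs with a config file and a disk.

    Assumes a hypothetical layout of one subdirectory per VM under
    restore_root. VMs missing either piece are left out of the result.
    """
    recoverable = {}
    for vm_dir in sorted(Path(restore_root).iterdir()):
        if not vm_dir.is_dir():
            continue
        configs = sorted(vm_dir.glob("*" + CONFIG_EXT))
        disks = sorted(vm_dir.glob("*" + DISK_EXT))
        if configs and disks:
            # Keep the first config file found for this VM.
            recoverable[vm_dir.name] = configs[0]
    return recoverable
```

A check like this could run right after the restore completes, so that any VM missing its disk or configuration file is caught before you try to bring it up.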
After restoring the data, we need only register each VM and power it up. Even without going into detail, the cost for the recovery site is significantly lower than for a comparable physical implementation: cost(P1+R1) > cost(P2+R2)
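To make the inequality above concrete, here is a small Python sketch that totals hardware and setup labor for both approaches. Every figure in it (server prices, setup hours, hourly rate) is a hypothetical placeholder rather than real pricing, and the model deliberately ignores items such as software licensing, storage, and facilities. Whether the inequality holds for you depends on your own numbers.

```python
def site_cost(servers, hardware_cost, setup_hours, hourly_rate):
    """Cost of equipping one site: hardware plus setup labor per server."""
    return servers * (hardware_cost + setup_hours * hourly_rate)


def total_cost(production_site, recovery_site):
    """Combined cost of a production site and its recovery site."""
    return production_site + recovery_site


# Hypothetical figures for illustration only; not vendor pricing.
HOURLY_RATE = 100

# Physical: 10 servers at P1 and 10 servers at R1.
p1 = site_cost(servers=10, hardware_cost=5_000, setup_hours=8,
               hourly_rate=HOURLY_RATE)
r1 = site_cost(servers=10, hardware_cost=5_000, setup_hours=8,
               hourly_rate=HOURLY_RATE)

# Virtual: one larger host running 10 VMs at P2 and one at R2.
p2 = site_cost(servers=1, hardware_cost=20_000, setup_hours=16,
               hourly_rate=HOURLY_RATE)
r2 = site_cost(servers=1, hardware_cost=20_000, setup_hours=16,
               hourly_rate=HOURLY_RATE)

physical_total = total_cost(p1, r1)
virtual_total = total_cost(p2, r2)
print(f"cost(P1+R1) = {physical_total}, cost(P2+R2) = {virtual_total}")
```

Under these assumed inputs the virtual total comes out lower. Substituting your own hardware and labor figures shows how sensitive the comparison is to the price of the larger host.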
VMware offers other key items that increase the DR capabilities of its virtualization products. The first is the VMware Virtual Center server, which can manage all of the VMs in an ESX Server and GSX Server environment. It can identify the current capacity of each server and which VMs peak in CPU utilization. This last measurement differs from what you'll see from within a VM: Virtual Center shows actual utilization rather than what the VM thinks it is using.
The second item is the VMware VMotion software, which allows you to move a running VM from a highly loaded ESX Server to a less heavily loaded ESX Server. This enables on-demand computing to maximize resources, allowing systems to migrate over high-speed (typically gigabit) network connections, including certain categories of wide area networks (WANs).
The third item is a technology that VMware exposes through an application programming interface (API). It lets you add a write cache (called a REDO log) to a VM disk, so that the disk itself becomes quiescent while the write cache receives all data that would otherwise change the disk. In effect, this allows you to take snapshots of a running VM that you can then transfer to a remote system or remote recovery site without bringing down the original production system.
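The REDO log idea can be illustrated with a toy model. The Python class below is a conceptual sketch only, not VMware's API: the class and method names are hypothetical, and a disk is reduced to a dictionary of blocks. It shows writes being redirected to a write cache so that the base disk stops changing and can be copied while the VM keeps running.

```python
class VirtualDisk:
    """Toy model of a virtual disk with an optional REDO log (write cache)."""

    def __init__(self, blocks):
        self.base = dict(blocks)  # block number -> data (the virtual disk)
        self.redo = None          # None means no REDO log is active

    def add_redo_log(self):
        """Start redirecting writes so the base disk becomes quiescent."""
        if self.redo is not None:
            raise RuntimeError("a REDO log is already active")
        self.redo = {}

    def write(self, block, data):
        # With a REDO log active, the base disk is never touched.
        target = self.base if self.redo is None else self.redo
        target[block] = data

    def read(self, block):
        # The running VM sees its latest writes first, then the base disk.
        if self.redo is not None and block in self.redo:
            return self.redo[block]
        return self.base.get(block)

    def copy_base(self):
        """Copy the quiescent base disk, e.g. to ship to a recovery site."""
        if self.redo is None:
            raise RuntimeError("base disk is still changing; add a REDO log")
        return dict(self.base)

    def commit_redo_log(self):
        """Fold the cached writes into the base disk and drop the REDO log."""
        if self.redo is None:
            raise RuntimeError("no REDO log is active")
        self.base.update(self.redo)
        self.redo = None


disk = VirtualDisk({0: "boot", 1: "data-v1"})
disk.add_redo_log()
disk.write(1, "data-v2")        # lands in the REDO log, not the base disk
backup = disk.copy_base()       # consistent copy as of the REDO log start
disk.commit_redo_log()          # base disk catches up after the copy
```

The copy taken in the middle reflects the disk as it was when the REDO log was added, while the running VM continues to read its own newer writes.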
The fourth item is the VMware P2V software, which enables you to convert a physical machine (running a specific operating system) to a virtual machine. This updates the HAL, device drivers, and other key components to create a virtual bootable replica of the physical machine. The P2V software can help to prevent data loss on soon-to-fail hardware.
Example DR Use of Virtual Infrastructure
We've set up several businesses with these systems. For example, Oak Associates, an independent equity investment manager for institutions and individuals, has used virtual infrastructure to improve their DR capabilities. This technology helped them to lower costs, speed recovery, and simplify their DR implementation. Here are some excerpts from the case study.
- Hardware savings: Instead of spending money on buying or upgrading new servers for each application, Oak Associates creates five virtual machines on each physical machine. They've stopped purchasing duplicate servers to achieve redundancy at the disaster recovery site.
- Hardware independence: The virtualization software provides hardware independence, so Oak Associates does not need to standardize on one hardware vendor or platform in order to meet its business and disaster recovery needs.
- Optimized disaster recovery process: VMware virtual machines back up servers at the disaster recovery site in real time. If one machine needs rebooting or fails unexpectedly, another machine with redundant virtual machines takes over. The IT team can also quickly restore machines, making copies of virtual machines in a matter of minutes.
Scott Hill, senior technology officer for Oak Associates, says, "If something were to happen to this site, we'd just go to our disaster recovery site, and all of our VMware host machines are already up and running. With ordinary backups, you have the data, but it can take a long time to restore it and make it usable. With VMware, it's easy to shut down a machine and make a copy of it. I can back up the whole machine in three minutes."
Jeff Szastak, technology officer for Oak Associates, said using VMware has dramatically reduced hardware costs, saving the company from buying more servers for the disaster recovery site. "For our original disaster recovery system using a SAN, we were doing everything at both sites," Szastak says. "We were replicating everything, which means that if I purchased a brand new server for the disaster recovery site, it didn't actually run anything unless there was a failure. We were purchasing twice as much equipment as we really needed. With VMware, I don't have to have the same equipment there. I can use my existing server hardware there and bring in newer, higher-performing servers here."
Glossary
- Cold site: A disaster recovery site that does not include mirroring of systems or data prior to a disaster.
- Disaster: An event triggered by natural, man-made, or facility-related causes.
- Hot site: A disaster recovery site that includes the redundancy of mirroring a primary data center's production systems. This allows for the quickest recovery time, but at the highest cost.
- Mobile Internet Protocol: A standard protocol that builds on the Internet Protocol by making mobility transparent to applications and higher-level protocols, such as TCP. Some DR scenarios use this.
- Physical to virtual (P2V): A technology to convert a physical machine to a virtual machine.
- Quiescent data: Unchanging data, as found on the disk of a powered-off computer system.
- REDO log: A write cache used in association with a virtual disk file to allow quiescing of the virtual disk. This enables a full backup of the virtual disk as it was when the REDO log was started, because the virtual disk does not change while the REDO log is active.
- Snapshot: Capturing the state of a machine at runtime without affecting the running applications.
- Time to recovery (TTR): The length of time from the start of the recovery process until completion of recovery testing.
- Virtual infrastructure (VI): A computing infrastructure that consists of virtualized hardware that sits above physical hardware, providing maximum portability and consistency across the entire environment.
- Virtual machine (VM): An operating system that runs on top of virtual hardware that is identical, no matter what the base hardware is.
- Virtual machine monitor (VMM): A software layer that virtualizes all of the resources of a physical machine, defining and supporting the execution of multiple virtual machines (VMs).
- Vmres: Restore or resurrect a VM snapshot image set.
- Vmsnap: A script to snapshot a running VM, written in Perl, for use with VMware ESX Server.
John Y. Arrasjid is currently a senior member of the VMware Professional Services Organization as a consulting architect.