31 January, 2013

Disabling VMware DRS with vCloud in your environment

A common practice when one or more hosts have an issue applying DRS settings is to disable DRS altogether on the cluster containing the host(s). This quick fix is also used as a troubleshooting step by VMware GSS.

But if you run an environment that includes vCloud Director, this fix is an absolute no-no! By disabling DRS you (temporarily) remove all resource pools, and vCloud Director relies heavily on these. In fact, by disabling DRS you break your vCloud environment completely: all running VMs will keep running as usual, but you will not be able to perform any vCloud actions.

The complete explanation of why this happens can be found in an excellent blog post by Chris Colotti.

If you find yourself in the situation described above, your best bet is to contact VMware GSS and have a skilled DBA on standby, because fixing this problem requires a lot of manual editing of records in the vCenter and vCloud databases.

After experiencing this issue with a customer, I found out there is another way of fixing it. Please note that this is not the recommended solution, but it is also used by GSS in smaller environments or as a last resort.
In my case I was very lucky to be in an environment where the vCenter database was hosted on Oracle, and this Oracle system had what is called a "flashback" feature. It means a DBA can "rewind" the transaction log very precisely (down to the exact second) and almost instantly. As we knew the exact time DRS had been disabled (from the vCenter logs), we could use "flashback" to go back to exactly 1 second before it was turned off. Of course, all vCenter-related services were stopped first, and after the database restore the vCenter server was rebooted.
When the server came back online, everything was back in the state it was in before DRS was disabled, so we had a working vCloud Director again (although it also needed a reboot to start functioning properly) and we still had 1 host with DRS issues. But the main problem was solved: vCloud was working again.
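
To illustrate the "1 second before" step, here is a minimal Python sketch that takes the time DRS was disabled and builds the matching Oracle statement. It assumes the DBA used Oracle's Flashback Database (FLASHBACK DATABASE ... TO TIMESTAMP); this post does not record the exact commands that were run. The function name and the example timestamp are made up for illustration. A real flashback is done by the DBA with the database mounted but not open, and is followed by opening it with RESETLOGS.

```python
from datetime import datetime, timedelta

# Timestamp layout used for both the input and the generated statement.
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def flashback_statement(drs_disabled_at: str, margin_seconds: int = 1) -> str:
    """Build a FLASHBACK DATABASE statement for a point just before an event.

    drs_disabled_at is the time found in the vCenter logs (hypothetical
    input format: 'YYYY-MM-DD HH:MM:SS'). margin_seconds is how far before
    that moment to rewind; the post used 1 second.
    """
    event = datetime.strptime(drs_disabled_at, TS_FORMAT)
    target = event - timedelta(seconds=margin_seconds)
    stamp = target.strftime(TS_FORMAT)
    return (
        "FLASHBACK DATABASE TO TIMESTAMP "
        f"TO_TIMESTAMP('{stamp}', 'YYYY-MM-DD HH24:MI:SS');"
    )


if __name__ == "__main__":
    # Example timestamp only; use the value from your own vCenter logs.
    print(flashback_statement("2013-01-30 14:05:00"))
```

The script only generates the statement text for review; it does not connect to any database. Keeping the margin as a parameter makes it easy to rewind a little further if the logged time is not precise to the second.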

Because the Oracle "flashback" function was available and we acted quickly after the problem occurred, we minimised the number of vCenter changes and actions that were lost.

It still took some work to solve the DRS issue on 1 host of the vCloud cluster, but it saved a lot of work by not having to manually edit the databases.