26 July, 2013

The downside of "Fast Provisioned" vApps in vCloud

When you or your company replaces the existing SAN on which your vSphere and/or vCloud environment is running, you will probably replace it with a VAAI-capable storage solution.
VAAI (vStorage APIs for Array Integration) was introduced with ESX(i) 4.1 and basically provides a way to offload storage-related tasks to the storage device. This reduces the load on the ESX(i) hosts and vCenter, and it also speeds up those tasks.
vCloud Director has also supported VAAI since version 5.1, which makes it possible to offload the cloning of VMs ("Fast Provisioning", as it is called in vCloud).
This speeds up the deployment of vApps considerably, and it also helps to be very storage efficient when combined with deduplication and thin provisioning.
So far there is no downside. The catch is that when you need to move vApps to a datastore on a different storage system, or you need to move your entire vCloud environment to a new storage solution, you could run into some unpleasant surprises (limitations) that VAAI brings.

A customer whose vSphere / vCloud environment I was recently working on was running both environments on the same storage solution, but due to the explosive growth of the vCloud part they were starting to experience performance issues: the storage solution could not deliver the needed IOPS. Their current storage solution is a VAAI-capable NetApp, and they purchased a new storage solution that would be used only for their vCloud environment. This is also a VAAI-capable NetApp, only a higher-performance model.
They had enabled "Fast Provisioning" within vCloud from the get-go, so all vApps were deployed like this. This means all linked clones were actually "FlexClones". When the new high-performance storage solution was set up and ready to be used, they needed to migrate their deployed vApps to the new datastores. They knew that this would cause the linked clones to be consolidated and become full clones, as this is the only way a Storage vMotion (SvMotion) could move the VMs within the vApps.
But an unexpected error stopped this: vCenter reported an error explaining that it could not consolidate and therefore could not migrate the VMs. When I had a look at it, I first thought the error could be caused by a compatibility mismatch between vSphere and NetApp ONTAP, so to gather more information I opened a support case with VMware GSS. After I provided vCloud logs and some additional information, the SR became a PR and went to the engineering department. Shortly after that I got an answer from VMware, which basically said that we were trying to do something that is not supported.
Not the answer I was looking for! And when I continued to read through the list of unsupported actions regarding vCloud and VAAI storage I became even more unhappy: no consolidate or Storage vMotion operations are permitted on VAAI-enabled storage (for Fast Provisioned VMs and vApps, that is). Luckily, at the bottom of the email it read that the move could be done with the "relocate" method, which is not present in any UI and is only available through the vCD API. So there was a way of accomplishing the move, but it would take me some time to figure out how to do it. The mail provided a link to the vCloud 5.1 API guide.

Unsupported / not permitted actions on "Fast provisioned" or "VAAI" clones:

  • Consolidate operations are not permitted for VMs created through the VAAI cloning process, even for powered-off VMs.
  • Using Storage vMotion through the VC UI to move VMs created through the VAAI cloning process is not permitted. However, it is possible to relocate VAAI clones through the VCD API.
  • Reporting on the capacity remaining on a given datastore may be inaccurate.
  • This release supports a maximum VAAI chain length of 256. Once VAAI clones reach this limit, further clones will be full copies, handled by vCloud Director. This maximum is configurable using the db flag Config.VirtualMachine.AllowedMaxVaaiChainLength
  • Source VMs, when cloned, have a REDO log attached to them, which, if they are running, may have a negative impact on read performance (compared to a non-cloned VM).
  • There is no explicit way to prevent an admin from turning on the VAAI flag for NAS datastores that do not support VAAI.
  • VAAI clones will not work for vSphere < 5.0 (linked clones in VCD only work for VC versions after 5.0).
  • Relocate of VAAI clones (VMs residing on VAAI-enabled volumes) will not work if the VM has a user-created snapshot. The snapshots will need to be removed for the relocate to work.
  • There may be additional constraints on VAAI clone support imposed by arrays. Please contact your vendors.

On page 18 of the guide you will find the information about the required authentication and headers, and on page 231 there is information on relocating a VM to a different datastore, with example code provided.
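To give an idea of what "authentication and headers" comes down to, here is a minimal Python sketch. It is not taken from the guide. It assumes the usual vCloud API conventions as I understand them: a login request to /api/sessions with HTTP Basic credentials in the form user@org:password, an Accept header carrying the API version, and an x-vcloud-authorization token that is returned at login and sent on later calls. The function names are my own, and it only builds the headers without sending anything.

```python
import base64

API_VERSION = "5.1"  # assumption: the API version that matches vCloud Director 5.1


def build_login_headers(user: str, org: str, password: str) -> dict:
    """Headers for the login request (assumed to be a POST to /api/sessions)."""
    credentials = f"{user}@{org}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {
        "Accept": f"application/*+xml;version={API_VERSION}",
        "Authorization": f"Basic {token}",
    }


def build_session_headers(auth_token: str) -> dict:
    """Headers for calls made after login.

    auth_token is assumed to be the value of the x-vcloud-authorization
    header returned by the login request.
    """
    return {
        "Accept": f"application/*+xml;version={API_VERSION}",
        "x-vcloud-authorization": auth_token,
    }
```

Please check the header names and the version string against the guide for your own vCloud Director release before relying on them.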

With this new information I started to figure out what this relocate actually does. While looking for additional information I stumbled upon the blog of Matt Vogt. He wrote an article on VMware opening up a lot of APIs with the 5.1 release of vCloud Director, while noting that there was still a lot more to be desired. One of those things is being able to change the storage profile of VMs, and for this you need to know the Href of the storage profile you want to change to.
When you change the storage profile of a VM in vCloud it will trigger a relocate action, because the different storage profile refers to a different datastore.
Matt used a script from Jake Robinson, posted on the VMware Community, which could retrieve the Href of storage profiles, and created his own script which could change the storage profile of all VMs within a vApp. I took his script and adjusted it to my needs, which resulted in a script that can do the following.
It can change the storage profile of the VMs within the same vApp in one go (sequentially). When a VM has a chain length lower than 2 it can stay powered on; when the chain length is equal to or greater than 2, the VM needs to be powered off for the change to complete successfully. In any case the linked clone will be consolidated into a full/thick clone, so it becomes independent of its base disk(s).
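The chain-length rule is easy to get wrong when a vApp contains many VMs, so here is a small Python sketch of how you could plan a batch up front. It is not part of the script itself. The threshold of 2 comes from the rule described above; the function names and the example VM names are made up for illustration, and it assumes you have already looked up the chain length per VM.

```python
from typing import Dict, List, Tuple

POWER_OFF_THRESHOLD = 2  # from the text: chain length >= 2 requires a power-off


def must_power_off(chain_length: int) -> bool:
    """True when the VM has to be powered off before changing its profile."""
    return chain_length >= POWER_OFF_THRESHOLD


def plan_relocation(chain_lengths: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Split VMs into (can stay powered on, must be powered off first)."""
    online = sorted(name for name, length in chain_lengths.items()
                    if not must_power_off(length))
    offline = sorted(name for name, length in chain_lengths.items()
                     if must_power_off(length))
    return online, offline


if __name__ == "__main__":
    # Hypothetical vApp with three VMs and their chain lengths.
    online, offline = plan_relocation({"web01": 1, "app01": 2, "db01": 3})
    print("Can stay powered on:", online)
    print("Power off first:    ", offline)
```

Knowing in advance which VMs need downtime makes it easier to schedule the move with the vApp owners.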
When I tried the script it did not work: for some reason it would not retrieve the Href of the new storage profile. For my purpose I did not spend any time on solving this; I just hard-coded the Href in the script. Be sure to retrieve it yourself and update the script before use. This can be done easily with a PowerCLI one-liner.
Steps to get the Href of the destination (new) storage profile:

  1. Manually change the storage profile of a VM to the new storage profile; this VM will be relocated to the corresponding datastore(s).
  2. Set up a PowerCLI connection to vCloud Director (Connect-CIServer)
  3. Run $VM = Get-CIVApp "vApp name" | Get-CIVM "VM name"
  4. Run $VM.ExtensionData.StorageProfile.Name (verify that it returns the name of the destination storage profile)
  5. Run $VM.ExtensionData.StorageProfile.Href

This last line gives you an output that should look like:
This is the Href of the storage profile; update it in the script at the $profileHref line and you are ready to relocate vApps.
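To show what "changing the storage profile" means at the API level, here is a rough Python sketch. It is not the PowerCLI script mentioned here, only an illustration. It assumes that the XML representation of a VM carries a StorageProfile reference element in the http://www.vmware.com/vcloud/v1.5 namespace, and it simply points that element at a new Href. The function name and the example Hrefs are hypothetical placeholders, and sending the modified document back to vCloud Director is left out on purpose.

```python
import xml.etree.ElementTree as ET

VCLOUD_NS = "http://www.vmware.com/vcloud/v1.5"  # assumed vCloud API namespace
ET.register_namespace("", VCLOUD_NS)  # keep the output free of ns0: prefixes


def set_storage_profile(vm_xml: str, profile_href: str, profile_name: str) -> str:
    """Return the VM document with its StorageProfile pointing at a new profile."""
    root = ET.fromstring(vm_xml)
    profile = root.find(f"{{{VCLOUD_NS}}}StorageProfile")
    if profile is None:
        raise ValueError("no StorageProfile element found in the VM document")
    profile.set("href", profile_href)
    profile.set("name", profile_name)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical, heavily shortened VM document; a real one has many more elements.
    sample = (
        f'<Vm xmlns="{VCLOUD_NS}" name="vm01">'
        '<StorageProfile href="https://vcd.example.local/api/vdcStorageProfile/old" name="Old"/>'
        "</Vm>"
    )
    print(set_storage_profile(
        sample, "https://vcd.example.local/api/vdcStorageProfile/new", "New"))
```

Use the Href you retrieved with the steps above instead of the placeholder values.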

Script code

Of course you can also use the script when relocating traditional/vSphere linked clones.

19 July, 2013

Virtual Machine Disk Consolidation is needed !?

vSphere 5 introduced a new feature to clean up VM snapshot "left-overs", which can be the result of a snapshot removal action where the consolidation step has gone bad. This results in the Snapshot Manager interface telling you there are no more snapshots present, while at datastore level they still exist and could even still be in use and growing. This can cause all kinds of problems: first of all VM performance issues, as the VM is still running in snapshot mode; secondly, you are not able to alter the virtual disks of this VM; and in the long run you could potentially run out of space on your datastore because the snapshot keeps on growing.
Prior to vSphere 5 it was possible to fix such a situation through the CLI. Now with vSphere 5 you get an extra option in the VM - Snapshot menu called "Consolidate". This feature should clean up any discrepancies between the Snapshot Manager interface and the actual situation at datastore level.
I'm always a little reluctant when I'm at a customer that uses a lot of snapshots. Snapshotting is a very helpful and useful tool, but you have to use it with caution, otherwise it can cause big problems. Usually the problems start if snapshots are kept for a longer period of time or if snapshots are layered onto each other, but even if you are aware of the problems that wrong use can cause, you can still run into issues when using snapshots.
That being said, consider the features offered by the various storage vendors, in particular the VM backup solutions they offer. When a running VM is backed up, they all use a vSphere snapshot (with or without quiescing of the guest file system). So if your company uses a SAN that leverages this functionality, and it is configured to back up your VMs on a daily basis, you have an environment that uses snapshots a lot, and you could therefore run into more snapshot / consolidation issues than you would without a SAN with this functionality (nobody snapshots all their VMs manually on a daily basis, I hope).
I was recently at a large customer (2000+ VMs) that uses its storage device feature to back up complete datastores daily and also uses vSphere snapshots to get a consistent backup of running VMs.
For some reason they run into snapshot / consolidation issues pretty often. They explained to me that the Consolidate feature worked fine on VMs with a Linux guest OS, but that they almost always had a problem when trying to consolidate a VM with a Windows guest OS: it would simply fail with an error.
So I had a look at one of their VMs that could not be consolidated, although the vCenter client reported that consolidation was needed.

When looking at the VM properties I could see that it still had a "snapshot" (delta file) as a virtual disk: it had a -000001.vmdk as virtual disk instead of a "normal" .vmdk

When a VM is in this state and still operational, there is an issue but the uptime is not directly affected. Most of the time, however, the VM will be down and will not power on again because of the issue. It will simply report an error about a missing snapshot on which the disk depends.
The way I solved it at this customer (multiple times) is by editing the VM's configuration file (.vmx), re-registering the VM in vCenter and afterwards manually cleaning up the remaining snapshot files. Please note that if the VM was running in snapshot mode, all changes written to the snapshot will be lost using this procedure; in other words, the VM will return to its "pre-snapshot" situation. For this particular customer this was not an issue, because the failed snapshots were initiated for backup purposes, so no changes were made to the VM while it ran in snapshot mode.

So if you run into this issue and you know that no changes were made to the VM, or losing the changes is acceptable, you can solve it with these steps.
  1. Download the .vmx and open it with a text editor (I prefer Notepad++ for this kind of work). Find the line that configures the virtual disk file, scsi0:0.fileName = "virtual-machine-000001.vmdk", and remove "-000001" so you are left with scsi0:0.fileName = "virtual-machine.vmdk". Save the file.
  2. Rename the .vmx file on the datastore to .old and upload the edited .vmx file.
  3. Either reload the VM by using PowerCLI** or remove the VM from the Inventory and add it again.
  4. Power on the VM.
  5. If you get the "Virtual Disk Consolidation needed" message, go to the Snapshot menu and click "Consolidate". It should run correctly now and remove the message.
  6. Manually remove the unused files from the datastore: .old, -000001.vmdk and -000001-flat.vmdk (I use WinSCP to do this kind of work on a datastore).
** You can do this by using the following one-liner:

Get-View -ViewType VirtualMachine -Filter @{"Name" = "VM name"} |%{$_.reload()}
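If you have to do the .vmx edit from step 1 more than once, it can be scripted. Below is a minimal Python sketch under a few assumptions: the disk entries look like the scsi0:0.fileName example above, the leftover snapshot disks carry a six-digit suffix such as -000001, and the .vmx has already been downloaded and is handled as plain text. The function name is my own. The same warning applies as for the manual procedure: everything written to the snapshot is lost.

```python
import re

# Matches a disk entry such as: scsi0:0.fileName = "virtual-machine-000001.vmdk"
DELTA_DISK = re.compile(
    r'^(?P<key>\s*(?:scsi|ide|sata)\d+:\d+\.fileName\s*=\s*")'
    r'(?P<base>.+)-\d{6}(?P<ext>\.vmdk"\s*)$',
    re.IGNORECASE,
)


def point_disks_at_base(vmx_text: str) -> str:
    """Return the .vmx content with snapshot suffixes removed from disk entries."""
    lines = []
    for line in vmx_text.splitlines():
        match = DELTA_DISK.match(line)
        if match:
            # Rebuild the line without the "-000001" style suffix.
            line = f'{match["key"]}{match["base"]}{match["ext"]}'
        lines.append(line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    original = (
        'displayName = "virtual-machine"\n'
        'scsi0:0.fileName = "virtual-machine-000001.vmdk"\n'
    )
    print(point_disks_at_base(original), end="")
```

Lines that do not reference a snapshot disk are left exactly as they are, so the result can be uploaded as the edited .vmx in step 2.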

05 July, 2013

New Fling from VMwareLABS called "VisualEsxtop"

Most of the Flings coming from VMwareLABS are worth trying, and every once in a while there is a Fling that is really cool and above all useful (like InventorySnapshot, which I wrote a post on a while back). Just a couple of days ago some of the engineers from the VMware performance group released VisualEsxtop.
The name says it all: it is a graphical version of esxtop which can be run on Windows, Linux and Mac OS**, so it is really "cross-platform".
It works remotely and can be run on any computer that has network access to an ESX(i) host or vCenter (although I haven't been able to connect it to a vCenter successfully).

When you run it, it looks like an enhanced version of esxtop.

It will color-code important counters and issues automatically. Furthermore, it can record and play back batch output, it can create line charts for selected counters, and it shows counter descriptions when you "mouse over" them.

To download this Fling, which I recommend as it is a very useful tool to have, please go to VMwareLABS.

** As said before it is cross-platform, but to get it to run on Mac OS you need to take some extra steps. For details on how to do this, please read "How to Run VMware's New Fling VisualEsxtop on Mac OS X" on the virtuallyGhetto blog of William Lam.

Shutdown unresponsive VM on ESX(i) 4.x / 5.x

Sometimes you will run into a VM that is unresponsive and/or unreachable by RDP; even the vSphere client console will not work. The only way to solve this is to shut down the VM, but on some occasions even this will not work, nor will the power-off or reset commands.
When this is the case you will get error messages like "The operation is not allowed in current state" or "The attempted operation cannot be performed in the current state (Powered Off)", or something else along these lines.
Recently I was working at a customer that had a similar situation with 2 VMs; the admins working on it were not able to successfully shut down either of the VMs. They asked if I knew a way to do it, maybe from the CLI.
I knew I could kill the processes of these VMs through the CLI (on ESXi 5.x): run esxcli vm process list to get the World ID, followed by esxcli vm process kill -t [soft,hard,force] -w WorldNumber to kill the process running the VM. But for some reason I was not able to find the VMs concerned in the output presented in the CLI (perhaps there were too many VMs on the host and I overlooked them).
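Overlooking a VM in a long listing is easy, so in hindsight a small filter would have helped. The Python sketch below assumes that the output of esxcli vm process list has been saved as text and that each VM block contains a "World ID:" line followed later by a "Display Name:" line, which is how I remember the ESXi 5.x output. The function names and the sample values in the usage note are made up.

```python
from typing import Dict


def parse_vm_process_list(output: str) -> Dict[str, str]:
    """Map each VM display name to its World ID."""
    worlds: Dict[str, str] = {}
    world_id = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("World ID:"):
            world_id = line.split(":", 1)[1].strip()
        elif line.startswith("Display Name:") and world_id is not None:
            name = line.split(":", 1)[1].strip()
            worlds[name] = world_id
            world_id = None  # wait for the next VM block
    return worlds


def find_world_id(output: str, name_fragment: str) -> Dict[str, str]:
    """Return only the VMs whose display name contains name_fragment."""
    fragment = name_fragment.lower()
    return {name: world for name, world in parse_vm_process_list(output).items()
            if fragment in name.lower()}
```

For example, find_world_id(saved_output, "db") returns only the VMs with "db" in their name, together with the World ID you need for the kill command.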
As an alternative I also knew there was a way to kill processes from esxtop (only available on ESXi 4.x and 5.x). I did not have all the details on the steps to perform, but this was quickly solved by a search through the VMware KB.
As it turned out, I think I prefer the "esxtop way" over the other ways of doing it, for 2 reasons: first, esxtop is something you use regularly (I assume), and second, it is a very clear and "visible" way.

How to use esxtop to kill processes of unresponsive VMs:

  1. On the ESXi console, enter Tech Support mode or connect through SSH and log in as root.
  2. Run esxtop
  3. Press c to switch to the CPU resource utilization screen.
  4. Press Shift+v to limit / filter the view to virtual machines. 
  5. Press f to display the list of fields.
  6. Press c to add the column for the Leader World ID.
  7. Identify the target virtual machine by its Name and Leader World ID (LWID).
  8. Press k.
  9. At the World to kill prompt, type in the Leader World ID from step 7 and press Enter.
  10. Wait 30 seconds and validate that the process is no longer listed.
For more information on ways to shut down unresponsive VMs, please read VMware KB1014165.

*Note: if you are running ESX instead of ESXi, please refer to VMware KB1004340, as there are different ways to do this on these systems.