Virtual-Stones Blog: April 2013

26 April, 2013

vSphere and ESXi 5.1 U1 released

VMware released the much "needed" Update 1 yesterday. If you want all the ins and outs of the enhancements and bug fixes incorporated in this update, please read the ESXi release notes and the vCenter release notes.

I took a look at the release notes and I found some bug fixes that I want to highlight:

Reinstallation of ESXi 5.1 does not remove the Datastore label of the local VMFS of an earlier installation.

Re-installation of ESXi 5.1 with an existing local VMFS volume retains the Datastore label even after the user chooses the overwrite datastore option to overwrite the VMFS volume.

When the quiesced snapshot operation fails the redo logs are not consolidated

When you attempt to take a quiesced snapshot of a virtual machine, if the snapshot operation fails towards the end of its completion, the redo logs created as part of the snapshot are not consolidated. The redo logs might consume a lot of datastore space.

ESXi 5.x host appears disconnected in vCenter Server and logs the ramdisk (root) is full message in the vpxa.log file

If Simple Network Management Protocol (SNMP) is unable to handle the number of SNMP trap files (.trp) in the /var/spool/snmp folder of ESXi, the host might appear as disconnected in vCenter Server. You might not be able to perform any task on the host.

The vpxa.log contains several entries similar to the following:

WARNING: VisorFSObj: 1954: Cannot create file /var/run/vmware/f4a0dbedb2e0fd30b80f90123fbe40f8.lck for process vpxa because the inode table of its ramdisk (root) is full.

WARNING: VisorFSObj: 1954: Cannot create file /var/run/vmware/watchdog-vpxa.PID for process sh because the inode table of its ramdisk (root) is full
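If you suspect a host is heading in this direction, counting the trap files is a quick check. The following is a minimal sketch and not an official VMware procedure: the function names and the threshold of 1000 are my own assumptions, only the /var/spool/snmp path comes from the release note above.

```shell
# Count SNMP trap files (*.trp) directly inside a spool directory.
# Usage: count_trap_files [directory]  (defaults to /var/spool/snmp)
count_trap_files() {
    dir="${1:-/var/spool/snmp}"
    count=0
    for f in "$dir"/*.trp; do
        # An unmatched glob stays literal, so only count real files.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Succeed (exit status 0) when the count is above a threshold.
# The default of 1000 is an arbitrary example value, not a VMware limit.
trap_files_over_limit() {
    [ "$(count_trap_files "$1")" -gt "${2:-1000}" ]
}
```

On a host you could run `count_trap_files` without arguments, or use `trap_files_over_limit /var/spool/snmp 500 && echo "too many trap files"` in a periodic check.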

vSphere 5 Storage vMotion is unable to rename virtual machine files on completing migration

In vCenter Server, when you rename a virtual machine in the vSphere Client, the VMDK disks are not renamed following a successful Storage vMotion task. When you perform a Storage vMotion task for the virtual machine to have its folder and associated files renamed to match the new name, the virtual machine folder name changes, but the virtual machine file names do not change.

This issue is resolved in this release. To enable this renaming feature, you need to configure the advanced settings in vCenter Server and set the value of the provisioning.relocate.enableRename parameter to true.

There are a lot more resolved issues and enhancements, so take some time to go through both sets of release notes. It will benefit you for sure!

vCloud loses sync with the vCenter Inventory Service

Yesterday the vCloud admin of a customer I am working for on another project had a strange problem. He told me that it had become impossible to deploy new vApps from the catalog; the process would stop with all kinds of errors.
A few days earlier, when we were testing the deployment of vApps on NFS datastores that were on new storage devices, he had also run into a strange problem which looked quite similar. When deploying, vCloud would first generate another set of "shadow VMs" before actually deploying the vApp. This was strange because the vApp already had a set of running "shadow VMs" and it should have been "using" these.
Because yesterday's issue had stopped production on the vCloud environment, the vCloud admin opened a support request with VMware GSS.
Once they had a look at the issue, it quickly became clear what was causing these strange problems. It looked like the vCloud Director cell had lost sync with the vCenter Inventory Service. This is not uncommon and you can find several "solutions" to this problem when searching through some blogs.
In short, these are the steps you need to take to restart the syncing process (if you are running a multi-cell environment):

1. First disable the cell and pass the active jobs to the other cells.

2. Display the current state of the cell to view any active jobs.
# /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --status

3. Then quiesce the active jobs.
# /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --quiesce true

4. Confirm the cell isn’t processing any active jobs.
# /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --status

5. Now shut the cell down to prevent any other jobs from becoming active on the cell.
# /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --shutdown

6. Then restart the services.
# service vmware-vcd restart
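The steps above can be wrapped in two small functions, so you can still inspect the status output before shutting the cell down. This is only a sketch under my own assumptions: the function names and the CELL_RUN variable are hypothetical. The commands themselves are the ones listed above. By default it only prints the commands (dry run); set CELL_RUN to an empty string to really execute them on a cell.

```shell
CMT=/opt/vmware/vcloud-director/bin/cell-management-tool

# Steps 2-4: show the status, quiesce the cell, show the status again.
# $1 is the vCloud administrator user name.
cell_quiesce() {
    run="${CELL_RUN-echo}"    # "echo" = dry run, "" = execute for real
    $run "$CMT" -u "$1" cell --status       || return 1
    $run "$CMT" -u "$1" cell --quiesce true || return 1
    $run "$CMT" -u "$1" cell --status
}

# Steps 5-6: shut the cell down, then restart the service.
# Only call this after the status output shows no active jobs.
cell_shutdown_and_restart() {
    run="${CELL_RUN-echo}"
    $run "$CMT" -u "$1" cell --shutdown || return 1
    $run service vmware-vcd restart
}
```

Splitting the sequence in two is deliberate: the sketch does not try to parse the status output, so checking that no jobs are active remains a manual step between the two calls.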

If you are not running a multi-cell environment, you can just restart the services, but this will cause a loss of service for a couple of minutes. To see as soon as possible when the cell is available again, you can "monitor" or tail the cell.log file (when it reads 100%, it's done).

# tail -f /opt/vmware/vcloud-director/logs/cell.log
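If you would rather script the check than watch the tail, something like the sketch below could pull the latest percentage out of the log. Note the assumption: I am assuming that the progress lines in cell.log contain text of the form "45% complete". Verify this against your own log before relying on it. The function name is hypothetical.

```shell
# Print the most recent "<N>% complete" value found in a cell.log file,
# or nothing when the file has no such line.
# ASSUMPTION: progress lines contain text like "45% complete".
latest_init_percent() {
    grep -o '[0-9][0-9]*% complete' "$1" | tail -n 1 | sed 's/% complete//'
}
```

For example, `latest_init_percent /opt/vmware/vcloud-director/logs/cell.log` prints the last percentage logged. Keep in mind that right after a restart the last value in the file may still belong to the previous startup.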

If the above does not work, you can also reboot the complete cell (in a multi-cell environment, first pass all active tasks to the other cells). Upon reboot the vCloud cell will reconnect and sync again.

OK, back to the issue. In this case none of this worked: vCloud did not start syncing again with either approach.

The support engineer wanted to restart all vCenter services to make sure that they were all running OK. Unfortunately this did not help. In this specific environment, however, the Inventory Service runs on a separate server (VM), and after restarting the Inventory Service followed by another restart of the cell services, vCloud did sync upon starting.

After this when talking to the vCloud admin, he told me that he had found other minor issues that probably could be interpreted as signs that vCloud and Inventory Service were getting out of sync again.

He found that some VMs were present in vCloud but not in vCenter, hence you could not find them when using the search feature (which is driven by the Inventory Service) in the vSphere client. He also found that the "VMs and Clusters" view of the vSphere client had become very slow, up to the point of being unresponsive. All other views in the client were working as usual.

As this issue can occur again, we decided to keep an eye out for either of these "signs" and, when we spot one, to restart the Inventory Service ASAP.

Better to be safe than sorry.

25 April, 2013

VMware vSphere certificates just got a whole lot easier....

When deploying or upgrading to vSphere 5.1 in a customer's environment which requires the use of 3rd party signed certificates, most VMware consultants and admins already know that this is not the easiest part to manage. In fact it is a downright pain to set up the vSphere environment with 3rd party certificates.
VMware has a guide to help you with the process of replacing the certificates; the document is called "Replacing Default vCenter 5.1 and ESXi Certificates".
If you attempt to replace a certificate, please do consult this guide and keep in mind that it is not going to be an easy task.
But wait, VMware has taken their vCenter Certificate Automation Tool out of beta and made it publicly available with the introduction of version 1.0.
Details on the features can be found on the VMware blog "VMware Support Insider".
With version 1.0, the certificates of the following VMware products can be changed / updated with this tool:

vCenter Single Sign On
vCenter Server
vCenter Inventory Service
vCenter Log Browser
vSphere Web Client
VMware Update Manager (VUM)
vCenter Orchestrator (VCO)

Hopefully the next releases will add more products that can be managed by this tool, like the ESXi hypervisor itself, vCloud Director, Site Recovery Manager and vCenter Operations.

For details on deploying and using the tool, please have a look at the KB article VMware has published under kb2041600.

22 April, 2013

Great (new) fling from VMware Labs

So the title reads that it is a new fling, but this is not completely true. It was launched in its current version over a year ago. Back then I did not have the time to check it out; recently, when searching for something else, I stumbled upon this fling again and thought, why not give it a go.
So what does this InventorySnapshot fling do exactly? Well, it gives you the possibility to "snapshot" a vCenter inventory and use this to reproduce the inventory. This can be used for backup and restore, or you can use the snapshot as a template inventory. Another purpose I came up with is when you run a home lab and use the 60-day evaluation licenses for the VMware part of it, which a lot of people surely do. You could build your home lab, configure it the way you want / need it, and take a "snapshot" with this fling. All your rebuilds afterwards will be a lot quicker, as you can restore your complete vCenter inventory easily and fast, which will save you a lot of configuration time.
The snapshot can be taken at any level within the vCenter inventory.
To be able to run this fling on your computer you need two things installed: first PowerCLI (duh!!) and second Java.
You can download the binaries as a zip package through this link.
After downloading it, unpack it and run "InventorySnapshot.bat". If it doesn't work, run it from a command line and see if there is an error. The first time round I got an error about Java: it could not be found. After editing "InventorySnapshot.bat" to add the full path to Java, it started working.
The usage of the tool is pretty straightforward and it is all documented on the VMware Labs website; there is even a video that shows you how to use the tool.
So how does it work? The tool creates a PowerCLI script crafted from the vCenter inventory it is connected to, and once you run this script it will re-create the vCenter inventory.
You can restore the complete vCenter inventory or you can select parts of the inventory that you want to restore.

17 April, 2013

Is rescanning storage adapters something you can do during production hours?

First off, why do you initiate a rescan action? Usually because you have made modifications to the storage / datastore part of your vSphere environment, like adding a new datastore. In general we can presume that a "Rescan All" can be done during production hours. In theory you can perform this action without disrupting your environment, so it is acceptable to allow it during production hours. But in reality this is not always the case, so as with many decisions and actions on vSphere (and vSphere design) the correct answer to the question should be: it depends... on numerous factors.
I will try to clarify my "statement" by means of a use case combined with information gathered from a real-life incident at one of the customers I worked for.
Everything I write in this article relates to vSphere 5.1 and up. Why not the older versions? For two reasons: first, I did not investigate it on older versions, and second, starting with version 5.1 the new Inventory Service was introduced. This service is basically there to make vCenter perform better (in several areas, for instance the search feature), as it has its own database which is used by the search feature. The Inventory Service is closely linked with Profile-Driven Storage, which manages the feature known as "Storage Profiles". If the Inventory Service is not working, then "Storage Profiles" will not be usable; please keep this in mind.
Back to the rescanning actions on storage adapters: this is usually not something you do on a daily basis. It concentrates around changes to the storage area of your vSphere environment, as mentioned before. Now you can imagine that it will be used heavily when you move from one storage device to another. This is an action I recently worked on: the customer needed to change their SAN for performance and stability reasons. They decided to no longer go with FC storage but to start using NFS storage instead, which resulted in a large migration of all VMs in their vSphere environment.
When all VMs had been moved to the new NFS-backed datastores, the next step was to remove all FC-backed datastores. One of the VM admins started deleting the FC datastores from the vSphere client host by host. What happened was that every removal automatically initiated a rescan on that host.
During this time the vCloud environment (which runs on the same backend and vCenter as the "normal" vSphere environment) started to show very strange behaviour. It was not possible to deploy a vApp from the catalog, and adding a VM to a vApp did not work either. On the vSphere / vCenter side the search feature stopped working and deploying a new VM failed on most occasions.
With the search feature not working, there had to be something wrong with the Inventory Service. I opened up my vSphere Web Client to see if I could use search from there and I was presented with the error notification "Connection lost to Inventory Service". I checked the services on the server that was running the Inventory Service and it was showing "started". After a reboot of the server (the Inventory Service ran on a separate VM because of the size of the environment) and a long wait before the Inventory Service went from "starting" to "started" (about 40 minutes), the search feature started working. But vCloud still did not operate normally and all Storage Profiles were gone from vCloud as well as from vCenter. After the Profile-Driven Storage service was restarted they reappeared and vCloud started acting normally again.
Every admin returned to his normal work after this incident, so one of the VM admins went back to removing the FC datastores and within a short time the strange behaviour on vCloud and vCenter appeared again. After opening a support request with VMware GSS and talking with one of the support engineers, it became clear that the rescan action was putting too much stress on both the Inventory Service and Profile-Driven Storage, so much that it slowed down all other tasks and eventually crashed the Inventory Service completely.
I had never seen or heard of such an incident with other customers or colleagues, so I started investigating which combination of factors was responsible for this, and soon it became clear and started to make sense.
The customer uses Storage Profiles on both vCloud and vCenter (for all VMs, +/- 1700) and it is known that vCloud can create a fair amount of load on the Inventory Service by itself. Rescanning puts an extra load on the Inventory Service and also puts a strain on the host agents. Combined, this generates a load of storage traffic, which results in a service failing / crashing.
The advice of VMware GSS was to reduce rescanning and datastore manipulation to a minimum during production hours and preferably to execute these actions during the quietest hours of the day, or even better during a maintenance window.
And if rescanning is needed, do it more localised instead of host, cluster or even datacenter wide. This last part is not possible through the vSphere client, but from the command line of a host (ESXi Shell or SSH) you can run a rescan per host and even per storage adapter. This reduces the load by a big amount and makes it possible to run rescan actions during production hours without disrupting your vCloud and vSphere environment.

Example:
esxcli storage core adapter rescan -A vmhba0
This will rescan vmhba0 on this specific host, leaving all other adapters (and hosts) alone.
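If a host has several adapters, you could spread the rescans out instead of firing them all at once. Below is a minimal sketch of that idea with a few assumptions of my own: the function names, the RESCAN_RUN and RESCAN_PAUSE variables and the 30 second pause are hypothetical, and the parser assumes that each adapter row in the output of `esxcli storage core adapter list` starts with a name like vmhba0. By default the commands are only printed (dry run).

```shell
# Read "esxcli storage core adapter list" output on stdin and print
# the adapter names, one per line.
# ASSUMPTION: every adapter row starts with a name such as vmhba0.
adapter_names() {
    awk '$1 ~ /^vmhba[0-9]+$/ { print $1 }'
}

# Rescan the given adapters one at a time.
# RESCAN_RUN=echo (default) prints the commands; RESCAN_RUN="" runs them
# and sleeps RESCAN_PAUSE seconds (default 30, an arbitrary value) after each.
rescan_one_by_one() {
    run="${RESCAN_RUN-echo}"
    pause="${RESCAN_PAUSE:-30}"
    for hba in "$@"; do
        $run esxcli storage core adapter rescan -A "$hba" || return 1
        if [ -z "$run" ]; then
            sleep "$pause"
        fi
    done
}
```

On a host this could be combined as `rescan_one_by_one $(esxcli storage core adapter list | adapter_names)`; check the printed commands first before setting RESCAN_RUN to an empty string.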

So you can do a rescan action during production hours, but never do it without thinking it over and without knowing your environment. Please use it with caution and only execute it during production hours when it is absolutely needed.

More information on rescanning can be found at VMware KB1003988

10 April, 2013

vSphere client shows "wrong" disk provisioning type in VM properties after deployment from template

Last week I got a question from a senior VM administrator at a customer where I had been working on several vSphere and storage related projects.
They use templates to deploy new VMs, and these were created to use "thin" provisioned disks once a VM had been deployed. Now he told me that one of the other admins had deployed a VM on the recently replaced storage (NFS storage) and this VM got "eager zero thick" disks when it was deployed.
To rule out human error, he tried another deployment from the same template himself and found that the VM also displayed "eager zero thick" in the VM properties after the deployment.
He asked me if I knew why this happened or if I had maybe altered their templates. To both questions I had to answer no, but it got me curious. As this customer had had some disk provisioning issues before when moving from FC to NFS storage (read the blog article: Real life benefits and caveats of NFS storage with VAAI), they were very keen to find out the cause, to prevent any issues in the future.
When I started looking at this issue I knew they had the problem with a template containing Windows 2008 R2 SP1, so I thought I would clone this template and have a deeper look at it.
After cloning the template I converted the template to a VM and looked at the provisioning of the disks of this VM, and to my surprise I found the disks were "thin"! I converted the VM back to a template and deployed a new VM from this template. When completed I again looked at the VM properties and found the disk provisioning was telling me "eager zero thick". Now I was getting confused.
When I took a look at the VMDK descriptor file to check the virtual disk details, I found the following just below the "Disk Data Base" line: ddb.thinProvisioned = "1". So even though the vSphere client GUI stated the disk was "eager zero thick", the disk was actually "thin".
But I still needed to "convince" the customer that all was OK and that it was just a flaw in the GUI. When I wanted to show the customer what was actually happening, I ran over my test VMs once more.
In the meantime I had powered on some of the VMs, and when I looked at their properties I found the disk provisioning now was "thin". To verify, I did a deployment from template again.
After deployment I looked at the VM properties and the disk was "eager zero thick". I checked the VMDK descriptor and there it was "thin". I powered on the VM, went back to the properties screen of the VM and saw that the disk provisioning had changed to "thin".
To me it looks as if the vSphere client "chooses" a provisioning type after deployment, but only reads the actual value from the descriptor file upon power-on / boot and then updates the VM properties.
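The descriptor check itself is easy to repeat from a shell. Here is a minimal sketch: the function name is hypothetical, and it only looks for the ddb.thinProvisioned = "1" line mentioned above, so anything else is simply reported as "not-thin" without distinguishing lazy from eager zeroed disks.

```shell
# Report "thin" when a VMDK descriptor contains ddb.thinProvisioned = "1",
# otherwise "not-thin".
# Pass the small text descriptor (<name>.vmdk), not the -flat.vmdk data file.
vmdk_provisioning() {
    if grep -q '^ddb\.thinProvisioned *= *"1"' "$1"; then
        echo thin
    else
        echo not-thin
    fi
}
```

Running `vmdk_provisioning` on the descriptor of a freshly deployed VM lets you compare the value on disk with what the client shows in the VM properties.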

Real life benefits and caveats of NFS storage with VAAI

I was recently involved in a storage replacement project. This storage was used for a fairly large vSphere and vCloud environment, amongst others.
During the replacement the customer went from FC storage to NFS storage. The FC storage did not use VAAI (actually it supported it, but it was disabled for stability reasons). The new NFS storage did use VAAI, and the design and sizing were done with all size reduction techniques available on both the storage and vSphere in mind. Therefore thin provisioning and deduplication were must-haves to not run out of storage space before the end of the project.
In the initial phase of the project, right after the new NFS datastores were presented to the vSphere environment, the customer ran into a big surprise. The VMware administrators had started to move VMs from the FC storage to the NFS storage and they forced the disk provisioning of these VMs from "thin" to "thick lazy zero", as they were convinced that running "thin on thin" was very dangerous and could lead to an out-of-space issue without receiving any warnings or alerts from vCenter.
Not very well known, but luckily recently very well documented by Cormac Hogan on his blog, is that of the three disk provisioning types ("thin", "lazy zero thick" and "eager zero thick") only two are usable on NFS storage. Please read his blog posts on NFS best practices to get all the details, but in short: NFS does not use the "lazy zero thick" type, it will set it as "eager zero thick"! The admins thought the deduplication feature of the storage device would straighten this out by deduping all the zeros. But what actually happens with VAAI enabled is that vSphere and the storage device become aware of each other, and because vSphere has the disks provisioned as "eager zero thick", the storage device will not touch these zeroed blocks as they are reserved by vSphere (the Reserve Space feature of VAAI). So there was no deduplication at all on these eager zero thick provisioned VMs.
After the storage admins alerted them that they were about to run out of space on their datastores and that they only got up to 25% dedupe, I started looking at this and found the reason why.
The issue was solved without any loss of uptime on the VMs, by creating and using a PowerCLI script that would Storage vMotion all VMs from one datastore to another (temporary) datastore and back to the original location after the first move was completed. On the way back the disks were converted from "eager zero thick" to "thin".
After all VMs were back on their original datastore with "thin" provisioned disks, the dedupe factor went up to almost 70%.