29 April, 2014

Removing orphaned replica VMs

I have been working on a SAN replacement project with a customer, which for me also meant moving all VM workloads from the old SAN to the new SAN. I usually work on VMware vSphere environments (Data Center Virtualisation), but this project also involved moving their VMware View 5 environment and View workloads.
With this View environment not using linked clones, the storage migration was pretty straightforward, even for a DCV guy like me. At least until I came across VMs called Replica-GUID, with which I could not do much: these VMs no longer existed within the View environment (presumably leftovers from a linked clone experiment), but were still registered with vCenter.
So I thought this should be easy: just right-click and choose "Remove from inventory" or "Delete from disk". But both of these "solutions" were grayed out.
When I searched the VMware KB I came across a very detailed article on how to manually remove replica virtual machines, KB 1008704.
Before I could start removing the replicas, I wanted to double-check that the VMs were really obsolete. The View admin thought / assumed that they were leftovers from an experiment or on-site training, but as we all know, assumptions are the mother of all #$&@.
Given that they were powered off, I figured: why not check how long they have been in this state? By using the datastore browser to locate the files corresponding to the Replica VMs and checking the "Modified" column, I found that these Replica VMs had not been altered or powered on for over a year. This is what I was expecting, and it aligns with the View admin's thinking.
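If the datastore is reachable from somewhere you can run a script, the same "has anything here been touched in the last year?" check can be automated instead of eyeballing the "Modified" column. A minimal sketch; the example path and the one-year threshold are my own assumptions, not from the KB article:

```python
import os
import time

def untouched_for(path, days=365):
    """Return files under `path` whose modification time is older than `days` days."""
    cutoff = time.time() - days * 86400
    stale = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.getmtime(full) < cutoff:
                stale.append(full)
    return stale

# Hypothetical example path for a replica's folder on a mounted datastore:
# for f in untouched_for("/vmfs/volumes/datastore1/replica-folder"):
#     print(f)
```

Anything the function returns has been sitting unmodified for over a year, which matches the check done through the datastore browser.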
I started manually deleting the Replica VMs by following the steps outlined in KB 1008704. Although this is pretty straightforward, please do note that the SviConfig command used is case-sensitive.

At first I had absolutely no luck with removing the Replica VMs; it looked like there was a permission / rights issue somewhere. But I used a user account with full Administrator access to vCenter and the correct SQL user and password to access the View database, and still no luck.
As it turns out, the vCenter user account you use not only needs full Administrator privileges on vCenter, it also needs Administrator privileges on the View environment (Composer, broker).

When searching for a solution to SviConfig not working at first, I came across a blog post by Terence Luk in which he explains in good detail how to use the SviConfig command. He even provides some extra info on top of the VMware KB article.

Change the name of an ESXi host

Recently I needed to rename a considerable number of ESXi 5.x hosts. VMware has published a KB article, KB 1010821, that describes the various ways of doing this very well, but for me two things are missing from this information. The first is that nothing is written specifically about the consequences of a host rename for the distributed vSwitch(es) in which the host's physical adapters were used as dvUplinks.
When I followed the manual steps of the KB to test-run the procedure, I got errors regarding the adapters on the dvUplinks when I removed the host from vCenter (after I had first put the host in maintenance mode and disconnected it).
These errors came back after the renaming was done and I added the host to vCenter again. The host was added successfully to the HA cluster it was previously part of, but failed to reconnect its management, NFS and VM networks through the distributed vSwitches. I had to manually run the "add host" procedure to add the physical adapters to the correct distributed vSwitches: the physical adapters used for VMkernel ports were pre-selected, but for the VM networks I had to manually select the physical adapters I wanted to use. With these additional steps the procedure was successful.
The second thing that is not mentioned, although understandably so, is that you lose all historical (performance, event, task, etc.) data for the renamed host because of the "remove from vCenter" step mentioned in the KB.

Then I read a blog post by Reuben Stump on the Virtuin blog called "Rename ESXi Hosts in vCenter (Without Losing Historical Data)", which describes a way of renaming a host without removing it from vCenter, letting you keep the historical data. I started thinking it could also work around the distributed vSwitch issue of having to re-add the uplinks. The way Reuben describes it is by the use of a Perl script. Running a Perl script can be done in different ways; one I like is to have VMware vCLI (VMware vSphere Command-Line Interface) installed on a Windows computer, especially if you have it installed on the same computer as PowerCLI, because you can then easily use PowerCLI scripting to invoke a Perl script. Please take a look at the blog post by Robert van den Nieuwendijk, How to run VMware vSphere CLI perl scripts from PowerCLI, on a PowerCLI function he has written to do this.
With the host no longer being disconnected during the renaming process, you not only keep the historical data, you also do not have to re-add the uplinks to the distributed vSwitches.

Of course you will need to make sure that the DNS records are also updated to reflect the new host name. vCenter will try to resolve the DNS name when the host is added to the inventory, so make sure that the DNS records on the vCenter server are refreshed / updated before you run the script.
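A quick way to verify from a script that the new name already resolves, before continuing with the rename, is a simple forward lookup. A minimal sketch; the host name in the usage example is made up:

```python
import socket

def resolves(hostname):
    """Return the IPv4 address for `hostname`, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# Hypothetical new host name - bail out early if DNS is not ready yet:
# if resolves("esxi-newname.example.local") is None:
#     raise SystemExit("DNS record not updated yet, fix it before renaming")
```

Running this on the vCenter server itself is the most honest test, since that is the machine that will do the lookup when the host is (re)registered.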

If you don't want to resort to using Perl, please have a look at the blog post Rename an ESXi 5.x Host by Luc Dekens. As always, he has a PowerCLI solution for almost everything, although his script does remove and re-add the host from vCenter.

28 April, 2014

vMotion fails on MAC address of virtual NIC

During one of my recent projects (replacing ESXi hosts, moving from rack servers to blades) there was also a second project ongoing that touched the VMware environment: the current EMC SAN solution was being replaced by a new EMC SAN solution comprising VPLEX and VMAX components.
One of the inevitable tasks involved is moving VMs and templates to datastores that reside on the new SAN. After all VMs of a particular datacenter were moved successfully, it was time to move the templates.
As templates cannot be moved by Storage vMotion, the customer first converted them to normal VMs so they could leverage the ease of migrating them with Storage vMotion. Well, so much for that idea: about 80% of the former template VMs failed the storage migration task. They failed at 99% with an "invalid configuration for device 12" error.
When I first looked at this issue I had no idea what the cause could be, although it looked like it had something to do with the VM's virtual hardware. I took a look at the former template VMs that did go through a successful storage migration and compared their virtual hardware to the ones that failed. There was no difference between the two. The only thing different was the OS used, which was also pointed out by the customer. Now the difference in OS is not what is important, but the point in time the template was created is!
It stood out that the former template VMs with the older OSes were failing, so I asked the customer if he knew when these templates were created and, more importantly, on which version of vSphere.
As you might know, the MAC address of a virtual NIC has a relation to the vCenter instance that manages the virtual environment; I don't know the exact details, but there is a relation. I also remembered reading an old blog post about an invalid configuration for virtual hardware device 12, which related device 12 to the virtual NIC of the VM. The templates were originally created on a vSphere 4.1 environment whose vCenter was decommissioned instead of upgraded along with the rest of the environment. When you put this information (or these assumptions) together, it could very well be that the MAC address of the virtual NIC was not in a "good" relation with the current vCenter, and that this resulted in failing Storage vMotion tasks. I know it was a bit far-fetched, but still I gave it a go: I removed the current vNIC from one of the failed VMs and added a new vNIC. I checked, and the replacement changed the MAC address of the vNIC.
After the replacement I retried the Storage vMotion, and this time it succeeded! I did the same replacement on the remaining failed VMs, and they could all now be migrated successfully to the new datastores.
So for some reason, when doing a Storage vMotion, vCenter needs a VM to have a "compatible" MAC address to succeed.
In short: if you ever run into an "invalid configuration for device 12" error when trying to perform a Storage vMotion, check if the MAC address of this VM "aligns" with the MAC addresses of VMs that can be Storage vMotioned.
If they don't, replacing the virtual NIC might solve your issue.
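If I understand the vCenter-generated scheme correctly, vCenter-assigned MAC addresses start with the VMware OUI 00:50:56 and the fourth octet is tied to the vCenter instance that generated them, so a rough "alignment" check is to compare the first four octets against a VM that migrates fine. A sketch under that assumption; the sample addresses below are made up:

```python
def vcenter_prefix(mac):
    """Return the first four octets of a MAC address, lower-cased.

    Assumption: for vCenter-assigned addresses (OUI 00:50:56) the fourth
    octet is tied to the generating vCenter instance, so VMs created by
    the same vCenter share this prefix.
    """
    octets = mac.lower().replace("-", ":").split(":")
    return ":".join(octets[:4])

def aligns(mac, reference_mac):
    """True if both MACs share the same vCenter-specific prefix."""
    return vcenter_prefix(mac) == vcenter_prefix(reference_mac)

# Hypothetical example: compare an old template VM against a VM that
# Storage vMotions without problems.
print(aligns("00:50:56:8a:1b:2c", "00:50:56:8a:99:01"))  # same prefix
print(aligns("00:50:56:b3:1b:2c", "00:50:56:8a:99:01"))  # different prefix
```

This is only a heuristic for spotting the mismatch described above; the fix in my case was simply replacing the vNIC so the current vCenter assigned a fresh address.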

04 April, 2014

vSphere sees datastore as snapshot datastore

Last week a colleague contacted me to get my thoughts on an issue he was facing at a customer with a small VMware vSphere 5.5 environment.
Apparently the customer had faced a power outage for a longer period than their UPS could cope with; as a result, two hosts and an HP P2000 iSCSI SAN were powered off without a clean shutdown.
When the power was restored, one RAID set was in degraded mode while the other RAID set was OK. While the degraded RAID set was recovering just fine, there were some issues in the vSphere environment.
After the hosts had booted and vCenter was started, it was possible to connect to vCenter with the Web Client. There it looked like the first datastore was OK and the second was not; all VMs on the first datastore booted without any issues. But because the second datastore presented itself as a VMFS volume on a snapshot LUN, the VMs that resided on this datastore couldn't be powered on. The real second datastore was not visible at all from vCenter.
I came up short on ideas during the phone call, so my colleague resorted to VMware support (GSS), and they came up with a rather quick solution to this issue, which I thought I would share.
The first thing that was done was to rename the snapshot datastore. Next, they added storage and selected the existing LUN with the resignature option. After this completed, the only thing left was to re-register the VMs that resided on the second datastore with vCenter.
For me the solution that GSS came up with was a good one; it solved the issue quickly without too much effort.

Adding new datastores to an existing vSphere environment

Today a VMware admin at a customer asked me how he could prevent, or maybe schedule, storage rescans.
He asked me this because he was adding 25 new datastores to 12 ESXi 5.1 hosts in an existing cluster, and every time he added a datastore, a rescan of the HBA adapters was automatically initiated. As the cluster was already under a pretty heavy workload, the "rescan storm" started by his actions was having an impact on the performance of most of the VMs running in the cluster.
As far as I know it is not possible to schedule storage rescans, and I don't see any added value in such a feature anyway.
But what is possible is disabling the automatic host rescan of HBA adapters. This is done at the vCenter level with the advanced setting "config.vpxd.filter.hostRescanFilter" set to "False".

VMware has a KB article about this, so if you want a reference, or want to know how to apply this advanced setting from the Web Client, please have a look at KB 1016873.
One very important thing not to forget: change the value of the advanced setting back to "True" as soon as you have finished adding the datastores!
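Because forgetting to flip the setting back is so easy, it helps to wrap the whole bulk-add in a construct that restores the filter even when something fails halfway. A sketch of that pattern; `set_host_rescan_filter` is a hypothetical stand-in for however you actually apply the vCenter advanced setting (API call, automation tool, etc.):

```python
from contextlib import contextmanager

# Hypothetical stand-in for vCenter's advanced settings store; in a real
# environment you would call the vCenter API or an automation tool instead.
settings = {"config.vpxd.filter.hostRescanFilter": "True"}

def set_host_rescan_filter(value):
    settings["config.vpxd.filter.hostRescanFilter"] = value

@contextmanager
def rescan_filter_disabled():
    """Disable the automatic host rescan, and always re-enable it afterwards."""
    set_host_rescan_filter("False")
    try:
        yield
    finally:
        # Runs even if adding a datastore raised an error, so the
        # filter can never be left disabled by accident.
        set_host_rescan_filter("True")

# Usage sketch (new_luns and add_datastore are hypothetical):
# with rescan_filter_disabled():
#     for lun in new_luns:
#         add_datastore(lun)
```

The try/finally is the whole point here: the "set it back to True" step is guaranteed rather than relying on the admin remembering it at the end of a long change window.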

01 April, 2014

VM stops at POST screen

Recently a customer asked me to have a look at two VMs that supposedly did not boot well after receiving and installing Microsoft patches. These VMs had been running just fine up until the mandatory reboot after patching.
They showed strange boot behaviour: usually you would expect the boot to halt or go wrong when loading the Windows OS, but these two VMs wouldn't even get beyond the BIOS POST screen.

For troubleshooting purposes I created a new diskless VM and attached the system disk of one of the failing VMs to it; this combination resulted in a successful boot. So the boot issue was not related to the recently installed Microsoft patches; it had to be something within the VM configuration.
When looking more closely at the configuration of the two VMs, I found that both of them had RDMs.

I checked whether the RDMs had active paths to their LUNs, and it turned out that about half didn't.
Once I removed the RDMs with dead paths and powered the VM on again, it successfully booted the OS.
I never thought a dead RDM path would prevent a VM from getting through its BIOS POST screen. I checked whether there was a VMware KB article about this VM behaviour, but came up with only one blog that had info about this issue, the Enterprise IT blog. The author also has some good pointers and checks to verify that there is no other cause of the issue, so do check out the article.
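The "which RDMs still have a live path?" check boils down to a simple filter once you have the path state per device, however you collect it. A sketch with an entirely made-up inventory format (the device names and path states below are hypothetical examples):

```python
def dead_path_rdms(rdms):
    """Return the names of RDMs that have no active path left to their LUN."""
    return [r["name"] for r in rdms
            if not any(p == "active" for p in r["paths"])]

# Hypothetical inventory of a VM's RDMs and their path states:
rdms = [
    {"name": "naa.60a9800001", "paths": ["active", "active"]},
    {"name": "naa.60a9800002", "paths": ["dead", "dead"]},
    {"name": "naa.60a9800003", "paths": ["active", "dead"]},
]
# Only the device with no active path at all is a removal candidate:
print(dead_path_rdms(rdms))
```

Note that a device with one dead and one active path still works; it is the devices with no active path at all that kept these VMs stuck at POST.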