27 February, 2014

VMkernel port challenge on a Distributed Switch

Recently I was working on a project which consisted of adding new hosts to an existing vSphere 5.1 environment. As the form factor and specs were far from what the customer already had in place, the new hosts were put in new HA-clusters. Because the new hosts have four 10GbE NICs, the virtual networking had to be redesigned.
I designed one Distributed Switch for all traffic and for all clusters, regardless of the purpose of the clusters. Security-wise this was agreed on by the customer; they only needed logical network separation in the form of VLANs. This would keep the design, and also the physical network configuration, fairly simple.
One of the customer's requirements was that the pace at which they could vMotion should be a lot higher than on their current hosts; having a large number of hosts, they wanted hosts to be able to go into maintenance mode quickly. They also wanted to speed up the patching and updating of the hosts by Update Manager.
So within the Distributed Switch design I added the use of multi-NIC vMotion. I got a lot of information from the blogs of Frank Denneman and Duncan Epping, especially these two blog posts: How to setup Multi-NIC vMotion on a distributed vSwitch and Multiple-NIC vMotion in vSphere 5…. Of course NIOC (Network I/O Control) is also used to control / guarantee the required bandwidth for the various sorts of network traffic.
There is a challenge, though: the new hosts will use a newly created VLAN for vMotion, but the current workload needs to be moved from the current hosts to the new hosts by the use of vMotion (to prevent VM downtime), and vMotion traffic does not route! A simple solution is to temporarily use an extra VMkernel interface for vMotion traffic in the VLAN that is also used on the current hosts for vMotion, and to remove it again after the workload is completely moved.
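Such a temporary interface can be added and removed with a couple of PowerCLI one-liners. A minimal sketch, assuming a host named esx01, the dvSwitch from this environment and an existing distributed portgroup named dvPG_vMotion_legacy (portgroup name and IP details are made up for illustration):

$vds = Get-VDSwitch -Name "dvSwitch_PROD_02"
$esx = Get-VMHost -Name "esx01.lab.local"

# Add a temporary vMotion-enabled VMkernel interface on the legacy vMotion portgroup
New-VMHostNetworkAdapter -VMHost $esx -VirtualSwitch $vds -PortGroup "dvPG_vMotion_legacy" `
    -IP "192.168.10.21" -SubnetMask "255.255.255.0" -VMotionEnabled:$true

# Remove it again once the workload has been moved (the vmk number is an example)
Get-VMHostNetworkAdapter -VMHost $esx -Name "vmk4" | Remove-VMHostNetworkAdapter -Confirm:$false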
All of the multi-NIC vMotion VMkernel interfaces were created by the use of a PowerCLI script, which you can find in this previous post.
The temporary vMotion VMkernel interfaces, on the other hand, were created manually, and on one host something went wrong. I am not sure what, but it looks like a duplicate IP was used for the interface. To correct this I first removed the VMkernel interface, so it would not interfere with the environment anymore. After I saw that there were more IP-related issues on this host, and at some point the host lost its connection to vCenter, I used both a direct vSphere client connection to the host and the DCUI to straighten things out and get at least the management VMkernel interface back up and running with the correct IP. Keep in mind this is all still on the Distributed Switch, so when the host successfully re-connected to vCenter it showed an error that the Distributed Switch configuration was out of sync with vCenter. This synchronization usually takes place every five minutes; after some time passed the error went away and all was good.
Then I could start to re-add the VMkernel interfaces for multi-NIC vMotion. I ran my script and this resulted in errors; adding the VMkernel interfaces manually also resulted in errors.
It looked like I would not be able to sort this out from the vSphere client, and the quick fix from VMware I found on the internet was to restart the vCenter service, which was not a possibility for me at the time. So I resorted to the CLI and esxcli.
I connected to the host by SSH and listed all its VMkernel IP interfaces with "esxcli network ip interface list", which resulted in the following output:

vmk0
   Name: vmk0
   MAC Address: 00:50:56:6a:21:2e
   Enabled: true
   Portset: DvsPortset-0
   Portgroup: N/A
   VDS Name: dvSwitch_PROD_02
   VDS UUID: 78 94 06 50 7b 69 04 20-9e 3c 07 0f e2 0f cf f5
   VDS Port: 9059
   VDS Connection: 767794954
   MTU: 1500
   TSO MSS: 65535
   Port ID: 50331658

vmk1
   Name: vmk1
   MAC Address: 00:50:56:61:57:86
   Enabled: false
   Portset: DvsPortset-0
   Portgroup: N/A
   VDS Name: dvSwitch_PROD_02
   VDS UUID: 78 94 06 50 7b 69 04 20-9e 3c 07 0f e2 0f cf f5
   VDS Port: 9436
   VDS Connection: 415422000
   MTU: 0
   TSO MSS: 0
   Port ID: 0

So there are indeed not one but two VMkernel interfaces, while only one shows up in the vSphere client. If you look at the output you see that vmk1 has Enabled: false, which is probably why it is not visible in the vSphere client.
With "esxcli network ip interface remove --interface-name=vmk1" it should be possible to remove this hidden VMkernel interface. After running the command I checked the VMkernel IP interfaces of the host again:

vmk0
   Name: vmk0
   MAC Address: 00:50:56:6a:21:2e
   Enabled: true
   Portset: DvsPortset-0
   Portgroup: N/A
   VDS Name: dvSwitch_PROD_02
   VDS UUID: 78 94 06 50 7b 69 04 20-9e 3c 07 0f e2 0f cf f5
   VDS Port: 9059
   VDS Connection: 767794954
   MTU: 1500
   TSO MSS: 65535
   Port ID: 50331658
~ #

Problem solved; afterwards I was able to re-add the multi-NIC vMotion VMkernel interfaces without any problems. In the end all it took were a couple of esxcli commands.

26 February, 2014

My take on how to deploy ESXi hosts quicker with PowerCLI

When building a new or expanding an existing vSphere environment, there are a lot of helpful features and tools that can help you speed up the deployment and configuration of new hosts.
For instance, VMware Auto Deploy and Host Profiles can be very useful; they will also help you to set up a Cluster or Datacenter with consistently configured hosts.
Nevertheless there are several situations where you either can't or don't want to use Auto Deploy and/or Host Profiles. Please note that I believe Host Profiles is a very powerful feature which, if the licensing allows it, should always be used to ensure that you run an environment with consistently configured hosts.
For one, if you use FC shared storage, you need to provide the FC adapter WWNs to your storage team before you are able to access that storage, so they can set up the LUN zoning correctly.
You can get the information you need by clicking through the vSphere (web) client and copy/pasting the info of all your hosts' FC adapters you are going to use. But if you have 10+ hosts this becomes tedious, time consuming and, not to forget, prone to mistakes. Why not use PowerCLI? This will make the job a lot easier and quicker; below you find a script which will get the WWNs per host and also provide vendor/type info (in case you have multiple types of FC adapters present in the host, as with most CNA / FCoE cards).
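A minimal sketch of such a script, using the standard properties exposed by Get-VMHostHba (the hex formatting and sorting are just one way to present the WWNs):

# List the FC adapters of all connected hosts with device, model and WWPN
# (PortWorldWideName is stored as a number, so it is formatted as hex; no colon separators)
Get-VMHost | Get-VMHostHba -Type FibreChannel |
    Select-Object @{N="Host";E={$_.VMHost.Name}}, Device, Model,
                  @{N="WWPN";E={"{0:x}" -f $_.PortWorldWideName}} |
    Sort-Object Host, Device | Format-Table -AutoSize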



With most of the newer server hardware there is either an internal USB or flash (SD) slot that can be used with media to boot an OS from. When running ESXi off such media it is advised to move the scratch location to persistent storage, so ESXi has a persistent location to store its log files. This is described in VMware KB1033696, which also explains in great detail all the ways to set up a persistent scratch location.
The script below will help you automate these steps with PowerCLI.
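A minimal sketch of this approach, assuming a datastore named datastore1 is visible to all hosts and each host gets its own .locker folder (names are made up for illustration; the new scratch location only becomes active after a host reboot):

$datastoreName = "datastore1"   # assumed datastore, visible to all hosts

foreach ($esx in Get-VMHost) {
    $ds = Get-Datastore -Name $datastoreName

    # Create a per-host folder on the datastore, e.g. .locker-esx01
    $folder = ".locker-" + $esx.Name.Split(".")[0]
    New-PSDrive -Name "scratch" -PSProvider VimDatastore -Root "\" -Datastore $ds | Out-Null
    New-Item -Path ("scratch:\" + $folder) -ItemType Directory -ErrorAction SilentlyContinue | Out-Null
    Remove-PSDrive -Name "scratch"

    # Point the scratch location of the host to the new folder (requires a reboot to take effect)
    Get-AdvancedSetting -Entity $esx -Name "ScratchConfig.ConfiguredScratchLocation" |
        Set-AdvancedSetting -Value ("/vmfs/volumes/" + $datastoreName + "/" + $folder) -Confirm:$false
}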



Two other settings that are frequently configured when deploying new ESXi hosts are NTP and Syslog; when using host profiles these settings are applied as part of the host profile. With the script below you are able to set them on multiple hosts with ease, without the use of host profiles.
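A minimal sketch of such a script, assuming made-up NTP servers and syslog target (adjust these to your environment):

$ntpServers   = "0.pool.ntp.org", "1.pool.ntp.org"   # assumed NTP servers
$syslogTarget = "udp://syslog.example.local:514"     # assumed syslog server

foreach ($esx in Get-VMHost) {
    # NTP: add the servers, set the ntpd service to start with the host and start it now
    Add-VMHostNtpServer -VMHost $esx -NtpServer $ntpServers
    $ntpd = Get-VMHostService -VMHost $esx | Where-Object { $_.Key -eq "ntpd" }
    Set-VMHostService -HostService $ntpd -Policy On -Confirm:$false
    Start-VMHostService -HostService $ntpd -Confirm:$false

    # Syslog: point the host to the remote syslog target and open the firewall for it
    Get-AdvancedSetting -Entity $esx -Name "Syslog.global.logHost" |
        Set-AdvancedSetting -Value $syslogTarget -Confirm:$false
    Get-VMHostFirewallException -VMHost $esx -Name "syslog" |
        Set-VMHostFirewallException -Enabled:$true
}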



I hope the scripts provided will help you to do quicker and more consistent ESXi deployments.

12 February, 2014

Orphaned VM after failed migration

When a VM is orphaned, I usually check on which datastore, cluster and folder the VM is placed. Then I remove the VM from the inventory and add the VM again through the datastore browser (or by using PowerCLI, check at the bottom of this post).
Lately I have been seeing orphaned VMs at a customer that cannot be re-added because the .vmx is locked by another process / host.
vCenter "thinks" the VM is still active on this host, when trying to go in to maintenance it will stop at 80% when you send a shutdown command to the host on the DCUI it will notify you that there are still VM's active on it. And from the VM' standpoint it is still active, the VM will still be running and accessible by RDP for instance !
But when you look at esxtop on that host and check the running VM processes, there is no VM process active.
A quick and somewhat dirty way out is to put the host into maintenance mode; most of the time this will stall at either 65% or 80%, and at this point you will see that the only remaining VM on this host is the orphaned VM. Log in to the DCUI (or use PowerCLI or the CLI) to send a reboot command to the host. The host will tell you that you still have active VMs on it; ignore this and have the host reboot anyway. The VM will be forcefully powered off.
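For reference, that reboot can also be sent with a PowerCLI one-liner (host name made up for illustration); -Force is needed because the host never fully entered maintenance mode:

# Forcefully reboot the host even though it still reports an active VM
Restart-VMHost -VMHost "esx01.lab.local" -Force -Confirm:$false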
After the reboot, you will find the host in maintenance mode with one powered-off VM present: the previously orphaned VM.
Take the host out of maintenance mode and power on the VM; it will probably boot normally and, if DRS is used, some VMs will be moved to the host to balance out the HA-cluster the host is in.
The method described above is, as mentioned before, a "quick and dirty" way and should in my opinion be a last-resort option, as it results in an outage, i.e. downtime, for the VM in question.
So before you go with this method, make sure that there is no other way to resolve the issue. For instance, if there is no lock on the .vmx you can easily re-register it to vCenter without any reboots needed. If you need to do this for a larger number of VMs, have a look at one of my previous posts: "VM's grayed out (Status Unknown) after a APD (All Paths Down) event on NFS datastores".
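The remove-and-re-add route mentioned at the start of this post can also be done with PowerCLI. A minimal sketch, with made-up VM name, datastore path and host:

$vmName  = "VM01"                            # assumed VM name
$vmxPath = "[datastore1] VM01/VM01.vmx"      # assumed datastore path to the .vmx
$esxHost = Get-VMHost -Name "esx01.lab.local"

# Remove the orphaned entry from the inventory; without -DeletePermanently the files on the datastore stay untouched
Get-VM -Name $vmName | Remove-VM -Confirm:$false

# Register the VM again from its .vmx file
New-VM -VMFilePath $vmxPath -VMHost $esxHost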