30 May, 2013

Where is my VM?

I think every VM admin has experienced the following situation: for some reason, planned or unplanned, your vCenter server is down. In itself this is no real issue, because it does not affect the VMs running in your vSphere environment; even HA will continue to work.
But how do you know where a specific VM is located (on which host) when you don't have vCenter?
I already had a small script which allowed me to search for a VM on a selection of hosts, or on all active hosts of an environment. If your vCenter is down unplanned, there might be other servers / VMs that are also down due to the same root cause. If that is the case, you will not only have to worry about getting vCenter back up-and-running, but you will also get a lot of questions on how to find and access the other unresponsive VMs. This is a challenge for any admin who relies solely on the vSphere (web) client, because that admin will have to log on to every host manually and separately with the vSphere (web) client, and search the inventory per host to find the VM he is looking for.
This is not a big problem if the environment has 5 or 6 hosts, but it grows with the environment; just imagine the amount of work and time it takes to search through 50 hosts!
I recently had a "vCenter down" situation at a customer where some other VMs became unresponsive as well. These VMs did not respond to RDP, and because vCenter was down there was no direct way to know on which of the 58 (!) hosts they were running. So I got out my old script to look up some of these VMs. One of the admins saw this and asked if he could also run it to find some VMs. This admin had little experience with vSphere, so he had some trouble providing the input the script needs (the IP address or DNS name of the ESXi host to connect to, and the name under which the VM is registered in vSphere), but with a little help he managed.
After this event I thought I would simplify the script by using pre-created files with host information of the vSphere environment, and a selection menu to choose from; the only input you need to provide is the full name of the VM.
Of course you will need to create the files, but you can do this while vCenter is up, when it is easy to retrieve this kind of information.

You need to create one file per search selection; in my script I have one file for every cluster and one file for the complete environment. Each file is a plain text file with one host IP address or DNS name per line.
For me this is a script (with or without the menu) that you want to have available in case of an emergency: it takes little time to set up and will help you big time when there is an issue.
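A minimal sketch of how such a menu-driven lookup could look. The file location, credential prompt and output are assumptions; adapt them to your own environment.

```powershell
# Sketch: search for a VM across ESXi hosts listed in pre-created text files.
# Assumes files like C:\Scripts\Hosts\Cluster01.txt with one host IP / DNS name per line.
$fileDir = "C:\Scripts\Hosts"                      # assumed location of the host files
$files   = Get-ChildItem -Path $fileDir -Filter *.txt

# Build a simple selection menu from the available files
for ($i = 0; $i -lt $files.Count; $i++) {
    Write-Host ("{0}) {1}" -f ($i + 1), $files[$i].BaseName)
}
$choice = [int](Read-Host "Select the cluster / environment to search") - 1
$vmName = Read-Host "Full name of the VM to find"

# One credential for all hosts (typically root); prompted interactively
$cred = Get-Credential -Message "ESXi host credentials"

foreach ($esx in Get-Content $files[$choice].FullName) {
    try {
        $conn = Connect-VIServer -Server $esx -Credential $cred -ErrorAction Stop
        if (Get-VM -Name $vmName -Server $conn -ErrorAction SilentlyContinue) {
            Write-Host "$vmName found on host $esx" -ForegroundColor Green
        }
        Disconnect-VIServer -Server $conn -Confirm:$false
    }
    catch { Write-Warning "Could not connect to $esx" }
}
```

Because each host is contacted directly, this works even when vCenter is completely down.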

27 May, 2013

VMs grayed out (Status Unknown) after an APD (All Paths Down) event on NFS datastores

Last week, during a change on one of the core switches of the NFS storage network at a customer, we ran into a big problem causing an outage of 50% of all VMs for around 4 hours.
The problem started with a network-related error, on which I will not elaborate other than to say that the result was an unstable NFS network, causing randomly disconnected NFS datastores on the majority of the customer's ESXi hosts. On top of that it also caused latencies, which triggered vMotion actions that ramped up the latencies even further, resulting in a storm of failing vMotions.
In theory this could never have happened, as the NFS network of the customer is completely redundant, but in real life it turned out completely different in this particular case.
After putting DRS into "partially automated" mode the vMotion storm stopped, but the latency on the NFS network continued, and this also had its effect on the responsiveness of the ESXi hosts. Only after powering down the core switch (the one on which the change was made) did everything return to normal: datastores were connected to the ESXi hosts again and the latency disappeared. When looking into the vSphere client I found lots and lots of VMs with an inaccessible or invalid status. Trying to power on such a VM would not work; you would get an "action not allowed in this state" message. The only way I knew at the time to get them accessible again was to unregister the VMs from vCenter (Remove from Inventory) and add them again by browsing to the .vmx file with the Datastore Browser and selecting "Add to Inventory". This was time-consuming and tedious work, but the only quick fix to get those VMs back into vCenter. Mind you, most of the VMs were still up-and-running, just in no way manageable through vCenter.
By the time I had all VMs registered again (some also needed a reboot, as their OS had crashed due to the high disk latencies), I was contacted by the vCloud admin: he had also lost around 100 VMs from his vCloud environment. It looked to be another long task of getting those VMs back, but we faced an extra problem. vCloud relies heavily on MoRef IDs for identification of VMs; in other words, if the MoRef ID changes, vCloud will no longer recognise the VM, as it cannot match it to anything in its database.
But removing a VM from the inventory and re-adding it changes its MoRef ID, so the quick fix I had used could not be applied to the VMs in vCloud. Luckily the vCloud admin found VMware KB 1026043; it looked like VMware had the solution to our problem, but for some reason this solution was not working for us, and it required having the host of the affected VMs in maintenance mode. It did help us in the search for a working solution, which the vCloud admin found shortly after on www.hypervisor.fr, a French VMware-related blog by Raphael Schitz. He wrote an article, "Reload du vmx en Powershell" (Reload a vmx with PowerShell), on how to reload VMs into the inventory without needing maintenance mode on your host(s); it all comes down to a PowerCLI one-liner that does the trick. You can alter the command to run it against an entire datacenter or just a cluster.
In the end it saved our day by reloading all inaccessible and invalid VMs within just 5 minutes. This is a very useful one-liner, as NFS is more and more used as the preferred storage.
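The trick in question, as I remember it, comes down to calling the Reload() method on every affected VM object; this re-reads the .vmx in place, without unregistering the VM and therefore without changing its MoRef ID. A sketch:

```powershell
# Reload the .vmx of every inaccessible / invalid VM in place.
# Narrow the scope with Get-View's -SearchRoot parameter (e.g. a cluster's MoRef)
# if you don't want to walk the whole vCenter inventory.
Get-View -ViewType VirtualMachine |
    Where-Object { $_.Runtime.ConnectionState -match "invalid|inaccessible" } |
    ForEach-Object { $_.Reload() }
```

Because Reload() is an API call against vCenter itself, no host maintenance mode is required.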

17 May, 2013

Database redundancy for your vCenter database(s)

The most important database within a vSphere environment is without a doubt the vCenter database. VMware therefore provides detailed instructions on how to set up and configure this database, with a guide for every supported database type. Recently I ran into a situation which made me believe that VMware "forgot" some details in this database configuration guide, at least when your vCenter database runs on Oracle.
A customer had chosen to put their vCenter database on Oracle, as this was their preferred database knowledge-wise. They also set it up to be resilient, by placing an active and a standby database on two different database servers in separate datacenters. To me it looked like a very solid solution. On the vCenter side they modified the TNSNames.ora in such a way that it included both database server addresses and also contained the parameters for connect-time failover and load balancing.
By doing this they made sure that vCenter could (almost) always connect to one of the two database servers; it would simply fail over when the connection attempt timed out. In this case the failover would not be quick enough to keep vCenter up-and-running; it would need a reboot (or at least a restart of its services) to get a connection again. But this would not affect the running VMs at all.
For maintenance on the database servers, we had to switch from the active server to the standby server. As this was a planned action, we could first gracefully stop the vCenter services and then switch to the standby database server. After the switch all vCenter services were started again and vCenter came up and ran as it was supposed to.
One issue that occurred during this database server switch was that VMware Orchestrator, which was installed on a separate server, stopped working, logging all kinds of database-related errors. With a quick look at the database configuration of Orchestrator I remembered that it cannot cope with multiple database server addresses, and it was set to connect to the database server that had now become the standby. By changing the database server and starting the Orchestrator services again, this problem was solved.
At least until the next day, when I took a look at the vCenter Operations dashboard and found that the health score of vCenter was 0.

When I looked in more detail at what caused this, I found the error "VMware vCenter Storage Monitoring Service - Service initialization failed". The only thing I found that could link this alert to the database failover was the timestamp: it was recorded at exactly the same time the failover had happened.


Not really knowing where to start investigating on the vCenter server, I first tried to find some information in the VMware KB, and the first article that came up described the exact same error message. When reading KB 2016472 I quickly found confirmation that this issue was related to the database failover, although it refers to vCenter 4.x and 5.0 with a SQL database instead of vCenter 5.1 with an Oracle database.
It appears that this vCenter Storage Monitoring Service does not use the TNSNames.ora for its database connection; it has its own configuration / connection file called vcdb.properties. This file contained only the first of the two database server addresses.
Through the information in the KB article I knew what to change to point the connection at the standby database server, and after a restart of the vCenter Server service the vCenter Storage Monitoring Service initialized OK and started without any error.
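For reference, the change boils down to editing the JDBC URL in vcdb.properties so it points at the standby server. The exact path, key names and server name below are examples and may differ per vCenter version; check the KB article for your specific release.

```
# %ProgramData%\VMware\VMware VirtualCenter\vcdb.properties (example location)
# Before: url pointed at the formerly active server; after the failover,
# point it at the standby and restart the vCenter Server service.
url = jdbc:oracle:thin:@standby-db.example.local:1521:VCDB
```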
So my conclusion is that even when you have redundancy or failover set up at the vCenter database level, there are still some vCenter-related products and services that need manual action to continue working in case of a (planned) database failover.

16 May, 2013

VMware Horizon View Optimization Guide for Windows 7 and 8

VMware has published a new Horizon View Optimization Guide for VDI / View Windows 7 and 8 virtual machines. This guide contains recommended configuration settings to optimize the OS and provide overall better scalability and performance in a Horizon View environment.
Horizon View 5.2 supports the Windows 8 Metro-style user interface and supports the basic touch gestures in a View client running on an Intel-based Surface tablet.
With Windows 8 you can improve performance by using the new service startup type "Manual (Triggered Start)": services are started only when triggered, for example when a user accesses a component that requires the service.

New SDS (Software Defined Storage) solutions introduced

Just a quick post on new products / solutions that have been introduced or are about to be. EMC introduced ViPR at the EMC World convention last week. Great articles on ViPR can be found on the StorageIO blog by Greg Schulz, who wrote a 3-part series on this new EMC solution: "EMC ViPR virtual physical object and software defined storage (SDS)".
Another take on this new solution has been written by Duncan Epping on his blog: "EMC ViPR; My take".
Also, NetApp gave notice that they are going to introduce Cobra Commander during Q3 2013, maybe just in time to have some interesting information presented in one of the NetApp-related VMworld Europe sessions....

15 May, 2013

Have a look in the crystal ball to predict the future (capacity) of your vSphere environment...

A question regularly asked: how many resources does my vSphere environment have at the moment? Or more specifically: how many VMs can I add to my environment while still making sure that I have enough resources available in case of host failure?
It is a valid question, but the answer is not easy or straightforward. To be clear: if you want this kind of real-time information, you should be looking at an analysis tool like VMware vCenter Operations or Dell vOPS (formerly vKernel / Quest). These tools can (when you purchase the right licenses) give you the exact number of VMs you could add to your environment while retaining the needed failover capacity; vOPS even has the ability to "play" what-if scenarios and show you the results before you actually change anything in real life.
I will try to write one (or more) blog posts in the near future, after I have taken these tools for another test run.

In the meantime the question remains, especially if your company can't or doesn't want to invest a considerable amount of money in one of these tools. And believe me, they are pricey (although the useful information they provide surely makes up for the price).
How many VMs can I add to my current environment, taking into account the needed (or requested) failover capacity?
For me, an alternative way of getting an idea of how many VMs I can add to a specific environment is a PowerCLI script that computes the average CPU and memory use of the current VMs, while you manually specify what the maximum CPU and memory resource usage per cluster may be (only for non-HA clusters).
On a side note: in theory you should be able to calculate how many VMs you could add by using the HA slot size, but in real life there are too many dependencies and variables for this to be useful, especially if you don't use reservations on your VMs. Anyway, I am not going to write too much about HA, DRS, slot sizes, reservations... because there are other (Dutch) people that can (and actually did) write books on these topics. A must-read for anyone who wants to be a better VMware admin, consultant or architect is the VMware vSphere 5.1 Clustering Deepdive by Duncan Epping and Frank Denneman.
So back to the quick, simple and somewhat educated-guess way of getting an idea of how many VMs you can add to your environment. As said earlier, I think the quickest way of getting such a result is to have a PowerCLI script compute the average values of all VMs currently running, while you set the maximum resource usage allowed yourself.
The script I use is fairly simple and quick to run, and will give you results that you can use as a guideline when you get this question and need to answer it.

By default the output goes directly to the CLI, but you could also output to CSV, XLSX or HTML if you want. The CLI output will look something like this.
Both clusters in the screenshot above have more than enough CPU resources, but when you look at the memory resources they both allow only around 10% growth, based on the current load and failover level. These results are from HA-enabled clusters, so no manual configuration is needed. In this case the script uses the "EffectiveCpu", "EffectiveMemory", "FailoverLevel" and "NumEffectiveHosts" values from "DasConfig" to calculate the results.
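A rough sketch of how such a calculation can be done per HA cluster. It assumes the "host failures the cluster tolerates" admission control policy (so that FailoverLevel is populated) and estimates headroom on the memory side, which in my experience is usually the limiting resource.

```powershell
# Estimate how many "average" VMs still fit per HA cluster, using the
# cluster summary (EffectiveCpu in MHz, EffectiveMemory in MB, NumEffectiveHosts)
# and the configured failover level from DasConfig.
foreach ($cluster in Get-Cluster) {
    $cv  = $cluster | Get-View
    $sum = $cv.Summary
    # Assumes the host-failures admission control policy is in use
    $failoverLevel = $cv.Configuration.DasConfig.AdmissionControlPolicy.FailoverLevel

    # Usable capacity after reserving the failover host(s)
    $reserveFactor = 1 - ($failoverLevel / $sum.NumEffectiveHosts)
    $usableMemMB   = $sum.EffectiveMemory * $reserveFactor
    $usableCpuMhz  = $sum.EffectiveCpu    * $reserveFactor

    $vms = $cluster | Get-VM | Where-Object { $_.PowerState -eq "PoweredOn" }
    $avgMemMB = ($vms | Measure-Object -Property MemoryMB -Average).Average
    $usedMemMB = ($vms | Measure-Object -Property MemoryMB -Sum).Sum

    $roomForVMs = [math]::Floor(($usableMemMB - $usedMemMB) / $avgMemMB)
    "{0}: usable {1} MHz / {2} MB, room for roughly {3} more average-sized VMs (memory-based)" -f `
        $cluster.Name, [int]$usableCpuMhz, [int]$usableMemMB, $roomForVMs
}
```

Treat the outcome as a guideline, not a guarantee; it deliberately ignores reservations, overcommit and peak loads.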

13 May, 2013

VMware snapshots good and bad....

Almost every IT admin "knows", or at least thinks he knows, what a VMware snapshot is and how to use it (in a good and sensible way).
I hope that the myth that a VMware snapshot can be used as a full backup has died all over the IT world, because it is the worst myth in VMware land.
So we all know snapshots can be very helpful in day-to-day operations, like quickly reverting to a snapshot when a software patch or update turns out bad. They are also widely used in the process of backing up a VM; most 3rd-party backup solutions use a snapshot to capture a VM in a stable state during their backup operation.
But it is far less known that a snapshot can also cause serious trouble when something goes wrong with one of the snapshots of a VM. You could potentially lose a lot of data when a snapshot (delta) file becomes unusable, not to mention the performance loss a VM suffers while it has one or more active snapshots.
So VMware snapshots are very useful, safe and good when used in the right way: the way they are designed and intended for. This is easily said and done within a small environment with only one or two VMware admins, but it becomes more error-prone when the environment is larger and/or there are more VMware admins (or other IT people with permissions to create snapshots).
If you look at a bigger company with a larger VMware environment, most of the time there will be separate departments for IT infrastructure, hardware and software. In a lot of cases the people working in the software department will have permissions to create (revert to, and delete) snapshots, while the people working in the IT infrastructure department are responsible for keeping the VMware vSphere environment running and healthy.
VMware has acknowledged the risk of having unnoticed snapshots, and has included a vCenter alarm for it since vSphere 4.x, but this alarm uses snapshot size (in GB) as its trigger. Note that the growth of a snapshot is linked to the changes made on the VM's disks, so it is not directly linked to the age of the snapshot.
Also note that this particular alarm is known to be a bit unpredictable since version 5.1; please read VMware KB 1018092 for details.
To keep track of snapshots in an environment where more people are allowed to create them than there are to manage the environment, you might want some additional tools.
You could use RVTools (by Rob de Veij) and manually run it on a weekly or bi-weekly schedule; RVTools is a great tool to check your environment, with the results presented in an "Excel" style. I use it a lot when I need to assess a customer's VMware environment prior to the start of a project. You could also use PowerGUI with the community PowerPack, which includes a script to report snapshots older than X days.
But this way it remains a manually repeated task for the VMware admins, and with the results you have two options. You either contact the creator of the snapshot (if you have this information) and ask them to remove it when it is no longer needed, or have them remove it directly because it is very old or very big; or you contact the creator and ask what you should do with those snapshots. Either way, you will end up doing a lot of work managing the snapshots created by others.
By the way, the only way to know who created a snapshot (if the creator did not mention it in the snapshot description) is by getting the info from the vCenter event log.
As an extra challenge, since the introduction of vSphere 5.1 it is possible to Storage vMotion a VM with an active snapshot, which is a good thing of course. But the downside is that when you Storage vMotion VMs (for, let's say, maintenance reasons on a storage device), the process will consolidate all snapshots of the VMs on the target datastore; in other words, the snapshots will be deleted!
So why not automate this process: have the creator receive an email with all the needed info about the snapshot that is overdue (older than X days), and receive an overview of these snapshots and their creators yourself as a reminder. If the creator ignores the email and decides to keep the snapshot, he (or she) will receive the same message again on the next run. You could even create two separate triggers: the first as a reminder of the snapshot, and a second one more as a warning.
Below you will find a PowerCLI script which retrieves the creators of the snapshots older than X days and looks up their email addresses in Active Directory. It then sends each creator an email specifying the details of the snapshot and the VM it belongs to. It will also send an overview email to one account (VMware administration, usually) so you can keep track of the active snapshots in your environment.
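A sketch of such a script follows. The mail server, sender addresses, age threshold and the Quest AD lookup (Get-QADUser) are assumptions; the creator lookup relies on the "Create virtual machine snapshot" task message in the vCenter event log, so the wording match may need tuning for your locale and vCenter version.

```powershell
# Sketch: mail the creators of snapshots older than $maxAge days, plus one overview mail.
$maxAge = 14                                   # days; adjust to taste
$smtp   = "smtp.example.local"                 # assumed mail relay
$from   = "vmware@example.local"               # assumed sender address
$report = @()

$oldSnaps = Get-VM | Get-Snapshot |
            Where-Object { $_.Created -lt (Get-Date).AddDays(-$maxAge) }

foreach ($snap in $oldSnaps) {
    # The creator is only recorded in the vCenter task/event log
    $event = Get-VIEvent -Entity $snap.VM -Types Info `
                 -Start $snap.Created.AddMinutes(-5) -Finish $snap.Created.AddMinutes(5) |
             Where-Object { $_.FullFormattedMessage -match "Create virtual machine snapshot" } |
             Select-Object -First 1
    $user = ($event.UserName -split "\\")[-1]
    $mail = (Get-QADUser -SamAccountName $user).Email   # Quest AD cmdlets, see notes

    if ($mail) {
        Send-MailMessage -SmtpServer $smtp -From $from -To $mail `
            -Subject "Old snapshot on $($snap.VM)" `
            -Body "Snapshot '$($snap.Name)' on VM $($snap.VM) dates from $($snap.Created); please clean it up if no longer needed."
    }
    $report += "{0}`t{1}`t{2}`t{3}" -f $snap.VM, $snap.Name, $snap.Created, $user
}

# One overview mail to the VMware admins
Send-MailMessage -SmtpServer $smtp -From $from -To "vmware-admins@example.local" `
    -Subject "Active snapshots older than $maxAge days" -Body ($report -join "`n")
```

Run it from a scheduled task so the reminders repeat automatically until the snapshot is removed.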

Hopefully the reminders about active snapshots will make the creators more aware, and they will do a better job of cleaning up snapshots that are no longer needed. After all, the weakest link in IT is the object between the chair and the desk....

Are there still use cases for custom attributes within a vSphere 5.1 environment?

VMware introduced "tags" with the release of vCenter 5.1; during a "what's new" presentation at VMworld, one of the speakers mentioned that these tags could be seen as a 2.0 replacement for custom attributes. Tags do offer a lot more than custom attributes: for instance, they are indexed by the search feature of the vSphere web client and they can be used on (almost) all objects in a vSphere environment. Custom attributes cannot be searched within the vSphere (web) client and can only be used on virtual machine objects.
But is the statement also true when looking at VMware vSphere 5.1 environments currently running in production at VMware customers? Probably not at the moment. I believe custom attributes will disappear eventually, but right now there are too many (legacy) 3rd-party applications and (PowerCLI) scripts that still use them.
A good example of the use of custom attributes by 3rd-party applications are backup applications: NetBackup and also Veeam can write the last backup result of a VM to a custom attribute value.
If wanted / needed, you can export these values with PowerCLI to create a report.
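Such an export can be a one-liner. The attribute name "Last Backup" below is an assumption; use whatever name your backup product writes.

```powershell
# Export each VM's backup-status custom attribute to a CSV report.
# "Last Backup" is an example attribute name; adjust to your backup product.
Get-VM |
    Select-Object Name,
        @{N = "LastBackup"; E = {
            (Get-Annotation -Entity $_ -CustomAttribute "Last Backup" `
                 -ErrorAction SilentlyContinue).Value }} |
    Export-Csv -Path C:\Reports\vm-backup-status.csv -NoTypeInformation
```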

During a recent project a customer was changing their backup application and procedures; they formerly used NetBackup in a disk-to-disk-to-tape setup. The new solution for this customer became NetApp SnapManager, which provides very fast, full backups at the storage level. The only downside for the VM admins is that they were no longer able to "see" backup details, such as the last backup time and result. These used to be written by NetBackup into a custom attribute value per VM, and also exported to a CSV file to update a status webpage.
The NetApp solution does not provide such a thing out-of-the-box, so it was PowerCLI to the "rescue". I started to work on a script that would add a custom attribute to every VM with the timestamp of the last successful backup. Note that SnapManager uses a VMware snapshot in its backup procedure, so I only needed to find the removal of the snapshot created by the NetApp service account in the event log per VM to retrieve this data. After some testing and searching the internet for some extra info, I stumbled upon the "Who created that VM?" script by Alan Renouf.
This script retrieves the user account which created the VM and the timestamp of the creation action. The username is then used to retrieve the full name from Active Directory; the full name and timestamp are then written to two custom attributes. I saw the potential of knowing who created a VM and when, especially if you want to report on the growth of your environment (I will write about this in a later post). So I took the main part of Alan's script, modified it to suit the customer's needs and also added my last-backup part to that script.

The VM admins can now easily see from within the vSphere client when a VM was created and by whom, and when the last successful backup was made; the script also creates the CSV file needed to update their internal status webpage.
So in this particular case there is still a very good use case for custom attributes. In a later post I will go into more detail on how to use the value from the "CreatedOn" custom attribute to create a report which shows by how many VMs your environment is growing per month.
For those who want to use my script or parts of it: please do, but there are some points of attention:
1. To get the full name from AD, the script uses the Quest AD cmdlets to resolve the username to a full name.
2. If a VM is removed from the vCenter inventory and later added again, the creator name will be the person who re-added the VM, not the original creator.
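The core of the combined script can be sketched as follows. The custom attribute names, the backup service account and the event-message match are assumptions (and the attributes must already exist, e.g. created with New-CustomAttribute); the creation lookup follows the approach of Alan Renouf's script.

```powershell
# Sketch: write creator, creation time and last successful backup into custom attributes.
$backupAccount = "DOMAIN\svc-snapmanager"      # assumed NetApp service account

foreach ($vm in Get-VM) {
    $events = Get-VIEvent -Entity $vm -MaxSamples ([int]::MaxValue)

    # Creation event (may have rolled out of the event database for old VMs)
    $created = $events |
               Where-Object { $_ -is [VMware.Vim.VmCreatedEvent] -or
                              $_ -is [VMware.Vim.VmDeployedEvent] -or
                              $_ -is [VMware.Vim.VmClonedEvent] } |
               Select-Object -First 1
    if ($created) {
        $account  = ($created.UserName -split "\\")[-1]
        $fullName = (Get-QADUser -SamAccountName $account).DisplayName   # Quest AD cmdlets
        $vm | Set-Annotation -CustomAttribute "CreatedBy" -Value $fullName        | Out-Null
        $vm | Set-Annotation -CustomAttribute "CreatedOn" -Value $created.CreatedTime | Out-Null
    }

    # Last backup = most recent snapshot removal performed by the backup account
    $backup = $events |
              Where-Object { $_.UserName -eq $backupAccount -and
                             $_.FullFormattedMessage -match "Remove snapshot" } |
              Sort-Object CreatedTime -Descending | Select-Object -First 1
    if ($backup) {
        $vm | Set-Annotation -CustomAttribute "LastBackup" -Value $backup.CreatedTime | Out-Null
    }
}
```

From here, exporting the same values to the CSV for the status webpage is a matter of piping the VMs through Select-Object and Export-Csv.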

In general, when you want to add / attach informational data to a VM through (PowerCLI) scripting, you will still want to use custom attributes instead of the newer tags. But keep in mind that custom attributes can only be made visible in the vSphere web client if you run them through the "migrate to tag" wizard.

06 May, 2013

VMware Tools and their Upgrade Policy

With the introduction of vSphere 5.1, VMware has made some changes to the way VMware Tools upgrades work in your VMs. Prior to version 5.1, a reboot of the guest OS was mandatory to complete a VMware Tools upgrade, but as of version 5.1 this is no longer required.
Keep in mind that you need to run the latest VM hardware version, and the VMware Tools version that comes with vSphere 5.1, to be able to do an upgrade without a reboot.
All of this is great, but some questions remain: how do you easily get all your VMs up to the correct VMware Tools version, how can you automate upgrading once VMware releases a new VMware Tools version, and how do you make sure you run the latest version of VMware Tools on imported VMs or appliances?

vSphere has VMware Tools options / features within the VM properties which you can set; one of them is called "Check and Upgrade Tools during power cycling". This option has been in vSphere since at least version 4.1, but to my knowledge it is not used a lot.
When you check this option box, the VM will check for a newer version of VMware Tools on every reboot and upgrade if one is available. Looking at VMs with a Microsoft OS, those need to reboot on a regular basis anyway to complete their patches and updates. So why not enable this option in your VMs?
I find keeping your VMs up-to-date with (security) patches important, and VMware Tools is something a lot of VM admins tend to forget in this process. But if you run a large vSphere environment, you don't want to open the properties of every single VM to enable this feature!
That's where PowerCLI comes in handy, because you can script enabling (or disabling) this option.

In the first example script you can set the option on a per-cluster basis.
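The per-cluster version might look like this (a sketch; the cluster name is an example). It sets the tools upgrade policy through the vSphere API's VirtualMachineConfigSpec:

```powershell
# Enable "Check and Upgrade Tools during power cycling" for every VM in one cluster.
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.Tools = New-Object VMware.Vim.ToolsConfigInfo
$spec.Tools.ToolsUpgradePolicy = "UpgradeAtPowerCycle"   # "manual" would disable it again

foreach ($vm in Get-Cluster "Cluster01" | Get-VM) {
    ($vm | Get-View).ReconfigVM($spec)
}
```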

You can change this script so it suits your needs; for instance, you could use it to set this option only on a selected number of VMs gathered in a CSV file, as in the example script below.
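The CSV-driven variant differs only in how the VMs are selected. The file path and its "Name" column are assumptions:

```powershell
# Same reconfiguration, but only for the VMs listed in a CSV file with a "Name" column.
$spec = New-Object VMware.Vim.VirtualMachineConfigSpec
$spec.Tools = New-Object VMware.Vim.ToolsConfigInfo
$spec.Tools.ToolsUpgradePolicy = "UpgradeAtPowerCycle"

Import-Csv -Path C:\Scripts\vms.csv | ForEach-Object {
    (Get-VM -Name $_.Name | Get-View).ReconfigVM($spec)
}
```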

After all, changing such a small script is done a lot quicker than going through all of your VMs one by one to set the option; just imagine having 1000+ VMs in your vSphere environment!
This option is still valuable in vSphere 5.1: although you no longer need a reboot when upgrading VMware Tools, it is still a task that is easily forgotten. Another way of making sure you don't forget is to use VUM (VMware Update Manager), but that takes some time to set up a baseline and baseline groups; that is, if you already have VUM installed in vCenter.