
19 June, 2015

Installing vSphere 6 gotcha

The last couple of weeks I have been involved in the deployment of a vSphere 6.0 environment. This was my first vSphere 6.0 customer deployment.
I wrote most of the implementation plan myself, and as this was my first vSphere 6.0 customer deployment I had to rework the default plan I had used for vSphere 5.5 implementations. To write this implementation plan I used the official VMware installation and configuration documentation as a guideline. At the time of writing all implementation steps seemed logical, but when we came to the actual implementation we ran into an issue when we tried to add Identity Sources to the PSC (Platform Services Controller).
This customer's vSphere environment is a large, multi-tenant and multi-site environment, and for this reason, together with the need for flexibility and scalability, the design choice was made to use external PSCs. During the design sessions with the customer the decision had also been made to use the vCSA (vCenter Server Appliance) instead of the Windows-based vCenter that was the common choice up to vSphere 5.5.
The issue we ran into has to do with the order in which you perform the configuration steps after the PSC is deployed. At some point you want to join the PSC to the Active Directory domain to be able to use AD-integrated authentication. When using external PSCs, as in this particular case, you don't join vCenter (the vCSA) to the domain but the PSC. The reason for this is simple: SSO (Single Sign-On) and authentication services are provided by the PSC, not by vCenter.
The implementation plan followed the exact same steps for this part of the configuration as the VMware documentation, and we followed each and every step as documented in the implementation plan, but for some reason we still got an error message.
After verifying that I had used the most recent version of the VMware documentation, I did a quick check on the VMware Communities to see if anyone else had run into the same issue.
There was a post dated 27 March 2015 called "Wrong information in VMware 6.0 documentation vCSA 6.0" which described the exact issue, and the writer came to the same conclusion: the VMware documentation has the configuration steps in the wrong order.
That community post was two and a half months old at that point; clearly VMware had not yet been able to update their online documentation.
Below you will find, in short, the correct steps and order to join a PSC (internal or external) to an Active Directory domain, followed by adding Identity Sources to the PSC. For more details please have a look at the community post.


  1. Use the vSphere Web Client to log in as administrator@your_domain_name to the vCenter Server instance.
  2. Under Deployment, click System Configuration.
  3. Under System Configuration, click Nodes.
  4. Under Nodes, select a node and click the Manage tab.
  5. Under Advanced, select Active Directory, and click Join.
  6. Type the Active Directory details (use the administrator@your_domain_name syntax for the user).
  7. Reboot the appliance and, after the reboot, log in as described in step 1.
  8. Navigate to Administration > Single Sign-On > Configuration.
  9. On the Identity Sources tab, click the Add Identity Source icon.
  10. Add the Active Directory domain as an identity source, enter the identity source settings, and click OK.
  11. Finished; you can now continue with the rest of the configuration, such as adding AD security groups to the roles defined in Global Permissions.

I am sure that VMware will eventually update their online documentation. In the meantime, if you follow the steps described above the configuration should be easy. Alternatively you can still follow the steps as described in the VMware documentation, but then you will run into this issue; the error message contains a link to the correct webpage where you can first join your PSC to an Active Directory domain, and after a reboot you can then continue your configuration. Although the second option will take more time, the end result will be the same.

24 September, 2014

vCenter Orchestrator loses VM networks after host renaming

Last week I was asked to have a look at a VCO workflow issue. There was an issue with workflows used to deploy VMs: the workflows would fail on specific hosts. One of the customer's VMware administrators found that the workflow stopped at the point where a VM had to be linked to a specific VM port group.
This happened with any selected VM port group (VLAN) available within the workflow. These workflows use automatic host selection, but a manual selection can also be made. After running workflows with manual host selection, we found some hosts on which the workflow did complete successfully.
When verifying what the difference was between the hosts, it became clear that the hosts that failed the workflow had recently been renamed.
The customer uses a dvSwitch for all network traffic across all hosts within the HA clusters. During the renaming you have to disconnect the ESXi host from vCenter and reconnect it afterwards; a PowerCLI script was used to automate the renaming process, a similar script can be found here.
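For reference, below is a rough sketch of the kind of PowerCLI flow such a renaming script follows; all names ($vc, $oldName, $newName, $newShortName, $domain, $clusterName, $rootPassword) are placeholders and this is not the exact script the customer used.

# Disconnect the host from vCenter and remove it from the inventory
Connect-VIServer -Server $vc
$vmhost = Get-VMHost -Name $oldName
Set-VMHost -VMHost $vmhost -State Maintenance | Out-Null
Set-VMHost -VMHost $vmhost -State Disconnected | Out-Null
Remove-VMHost -VMHost $vmhost -Confirm:$false

# Connect directly to the host and change its hostname
$esx = Connect-VIServer -Server $oldName -User root -Password $rootPassword
Get-VMHostNetwork -VMHost (Get-VMHost -Server $esx) | Set-VMHostNetwork -HostName $newShortName -DomainName $domain
Disconnect-VIServer -Server $esx -Confirm:$false

# Re-add the host to its cluster under the new name (DNS must already resolve it)
Add-VMHost -Name $newName -Location (Get-Cluster -Name $clusterName) -User root -Password $rootPassword -Force
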
During the renaming there had been an issue with the hosts upon reconnecting to vCenter: after renaming, the hosts reconnected with a dvSwitch error message, and to get rid of this error you manually re-add the host to the dvSwitch. Afterwards the host networking looked OK; nevertheless this was a good reason to take a closer look at the network configuration of those renamed hosts.
One detail that stood out was the colour of the dvUplink interfaces. When all is fine they are coloured green, but when, for instance, the physical NIC used by the uplink is disconnected, the colour turns white, as shown in the picture below for dvUplink2.


Now with the renamed hosts it was not one dvUplink, but all 4 dvUplinks that were coloured white. Strangely enough the VMs hosted on these hosts had a fully functional network connection, so as expected none of the physical NICs was disconnected.
One of the VMware administrators tried to get all dvUplinks "green" again by simply removing and re-adding the vmnic from the dvUplink. This seemed to work: all dvUplinks came back "green" again. Unfortunately the Orchestrator workflow issue persisted after the actions above, and none of the VMware administrators (me included) had any ideas on how to solve this issue, so a support case was opened with GSS.
After the usual "please upload logfiles" steps, the problem was quickly solved during a WebEx session. The solution was to force an update of the dvSwitch configuration across all hosts connected to this dvSwitch.
So how do you push, or forcefully update, the dvSwitch configuration on the ESXi hosts? Simple: just add a temporary new dvPortgroup to the dvSwitch. By adding a dvPortgroup, all connected ESXi hosts receive an updated dvSwitch configuration.
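If you prefer to do this from PowerCLI instead of the vSphere client, something along the lines of the sketch below should work; it assumes the distributed switch cmdlets of a recent PowerCLI release, and the switch and portgroup names are placeholders.

$vds = Get-VDSwitch -Name "dvSwitch01"
# Adding a portgroup pushes an updated dvSwitch configuration to every connected host
New-VDPortgroup -VDSwitch $vds -Name "temp-sync-pg" -NumPorts 8 | Out-Null
# Once the hosts have picked up the change, the temporary portgroup can be removed again
Get-VDPortgroup -VDSwitch $vds -Name "temp-sync-pg" | Remove-VDPortgroup -Confirm:$false
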
This finally solved the Orchestrator workflow issues. I can imagine that updating the dvSwitch configuration this way could also be of help in other dvSwitch "out of sync" kinds of issues.
I will try it the next time I run into such an issue.

18 June, 2014

Improve vSphere Webclient performance

Several other blogs have posted ways to improve the speed and responsiveness of the vSphere Web Client. While writing a migration plan, I was browsing through the different blog posts in addition to the official VMware installation and migration documentation and Knowledge Base articles.
This migration plan must guide a customer through the replacement of vCenter and the upgrade of the ESXi hosts; both need to go to vSphere 5.5.
Gathering the different adjustments to improve the speed and responsiveness of the vSphere Web Client was not that difficult, but it became clear that there is no single place that has the complete list (at least I didn't find one).
So I thought I would write a post with all the changes I know of that improve the "look and feel" of the vSphere Web Client.
Let's get started: the changes to the JVM settings of the various vCenter components are the ones that make the biggest improvement.

VirtualCenter Management WebServices
Configuration file location:
installation_directory\VMware\Infrastructure\tomcat\conf\wrapper.conf
Heap size parameter:
wrapper.java.additional.9="-Xmx3072M"

vCenter Inventory Service
Configuration file location:
installation_directory\VMware\Infrastructure\Inventory Service\conf\wrapper.conf
Heap size parameter:
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=12288

vSphere Profile-Driven Storage
Configuration file location:
installation_directory\VMware\Infrastructure\Profile-Driven Storage\conf\wrapper.conf
Heap size parameter:
# Maximum Java Heap Size (in MB)
wrapper.java.maxmemory=2048

vSphere Web Client
Configuration file location:
installation_directory\VMware\Infrastructure\vSphereWebClient\server\bin\service\conf\wrapper.conf
Heap size parameter:
########
# JVM Memory
########
wrapper.java.maxmemory=3072

There are some other changes to the vSphere Web Client that improve its usability, for instance changing the page timeout and disabling the animations within the Web Client.
You will find the list of things I would change below, but please keep in mind that the appropriate settings may differ for your or your customer's environment. Please adjust accordingly!

vSphere Web Client
Configuration file location:
%ALLUSERSPROFILE%\VMware\vSphere Web Client\webclient.properties
session.timeout = 0
navigator.disableAnimation = true
refresh.rate = 600
feature.facetedSearch.enabled = true

If you have read this post and find that an update or modification is missing, please leave a comment and I will review it and update the post.

29 April, 2014

Removing orphaned replica VM's

I have been working on a SAN replacement project with a customer; for me this also meant moving all VM workloads from the old SAN to the new SAN. Usually I work on VMware vSphere environments (Data Center Virtualisation), but this project also involved moving their VMware View 5 environment and View workloads.
With this View environment not using linked clones, the storage migration was pretty straightforward, even for a DCV guy like me. At least until I came across VMs called Replica-GUID with which I could not do much, as these VMs no longer existed within the View environment (assuming that they were leftovers from a linked clone experiment) but were still registered with vCenter.
So I thought this should be easy: just right-click and "Remove from inventory" or "Delete from disk". But both of these "solutions" were greyed out.
When I searched the VMware KB I came across a very detailed article on how to manually remove replica virtual machines: KB1008704.
Before I could start removing the replicas I wanted to double-check that the VMs were really obsolete. The View admin thought / assumed that they were leftovers from an experiment or on-site training, and as we all know, assumptions are the mother of all #$&@.
Given the fact that they were powered off, I figured why not check how long they had been in this state. By using the datastore browser to locate the files corresponding to the replica VMs and checking the "Modified" column, I found that these replica VMs had not been altered or powered on for over a year. This is what I was expecting and it aligns with the View admin's thoughts.
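Checking this for a whole list of files is quicker with the PowerCLI datastore provider; a small sketch, in which the datastore name and the "replica-*" filter are placeholders.

$ds = Get-Datastore -Name "Datastore01"
New-PSDrive -Name ds -PSProvider VimDatastore -Root "\" -Location $ds | Out-Null
# List the replica files together with their last modification date
Get-ChildItem -Path ds:\ -Recurse | Where-Object { $_.Name -like "replica-*" } | Select-Object Name, LastWriteTime
Remove-PSDrive -Name ds
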
I started manually deleting the replica VMs by following the steps outlined in KB 1008704. Although this is pretty straightforward, please do note that the SviConfig command used is case-sensitive.

At first I had absolutely no luck removing the replica VMs; it looked like there was a permission / rights issue somewhere. But I used a user account with full Administrator access to vCenter and the correct SQL user and password to access the View database, and still no luck.
As it turns out, the vCenter user account you use not only needs full Administrator privileges on vCenter, it also needs Administrator privileges on the View environment (Composer, broker).

When searching for a solution to SviConfig not working at first, I came across a blog post by Terence Luk in which he explains in good detail how to use the SviConfig command. He even provides some extra information on top of the VMware KB article.


Change the name of a ESXi host

Recently I needed to rename a considerable number of ESXi 5.x hosts. VMware has published a KB article, KB1010821, that describes the various ways of doing this very well. But for me there are two things missing in this information. The first is that there is nothing written specifically regarding the consequences of a host renaming action on the distributed vSwitch(es) in which the host's physical adapters were used as dvUplinks.
When I followed the manual steps of the KB to test-run the procedure, I got errors regarding the adapters on the dvUplinks when I removed the host from vCenter (after I first put the host in maintenance mode and disconnected it).
These errors came back after the renaming was done and I added the host to vCenter again. The host was added successfully to the HA cluster it was previously part of, but it failed to reconnect its management, NFS and VM networking through the distributed vSwitches. I had to manually run the "add host" procedure to add the physical adapters to the correct distributed vSwitches; the physical adapters used for VMkernel ports were pre-selected, but for the VM networks I had to select the physical adapters I wanted to use myself. With these additional steps the procedure was successful.
The second thing that is not mentioned, although understandable, is that you lose all historical (performance, event, task, etc.) data of the renamed host because of the "remove from vCenter" step mentioned in the KB.

I then read a blog post by Reuben Stump on the Virtuin blog called "Rename ESXi Hosts in vCenter (Without Losing Historical Data)", which describes a way of renaming a host without removing it from vCenter, letting you keep the historical data. I started thinking it could also work around the distributed vSwitch issue of having to re-add the uplinks. The way Reuben describes it uses a Perl script. Running a Perl script can be done in different ways; one I like is to have VMware vCLI (VMware vSphere Command-Line Interface) installed on a Windows computer, especially on the same computer where you have PowerCLI installed, because you can then easily use PowerCLI scripting to invoke a Perl script. Please take a look at the blog post by Robert van den Nieuwendijk, How to run VMware vSphere CLI perl scripts from PowerCLI, about a PowerCLI function he has written to do this.
With the host no longer being disconnected during the renaming process, you not only keep the historical data, you also do not have to re-add the uplinks to the distributed vSwitches.

Of course you will need to take care that the DNS records are also updated so they reflect the new host name. vCenter will try to resolve the DNS name upon adding it to the inventory, so make sure that the DNS records on the vCenter server are refreshed / updated before you run the script.

If you don't want to resort to using Perl, please have a look at the blog post Rename an ESXi 5.x Host by Luc Dekens. As always he has a PowerCLI solution for almost everything, although this script does remove and re-add the host from vCenter.

28 April, 2014

vMotion fails on MAC-address of virtual NIC

During one of my recent projects (replacing ESXi hosts, from rack servers to blades) there was also a second project ongoing that touched the VMware environment. The current EMC SAN solution was being replaced by a new EMC SAN solution comprised of VPLEX and VMAX components.
One of the inevitable tasks involved is moving VMs and templates to datastores that reside on the new SAN. After all VMs of a particular datacenter had been moved successfully, it was time to move the templates.
As templates cannot be moved by Storage vMotion, the customer first converted them to normal VMs. In this way they could leverage the ease of migrating them with Storage vMotion. Well, so much for the idea: about 80% of the former template VMs failed the storage migration task. They failed at 99% with an "invalid configuration for device 12" error.
When I first looked at this issue I had no idea what could be the cause, although it looked like it had something to do with the VM virtual hardware. I took a look at the former template VMs that did go through a successful storage migration and compared the virtual hardware to the ones that failed. There was no difference between the two. The only thing different was the OS used, which was also pointed out by the customer. Now the difference in OS is not what is important, but the point in time the template was created is!
It stood out that the former template VMs with the older OSes were failing, so I asked the customer if he knew when these templates were created and, more importantly, on which version of vSphere.
As you might know, the MAC address of a virtual NIC has a relation to the vCenter which is managing the virtual environment; I don't know the exact details, but there is a relation. And I remembered reading an old blog post about an invalid configuration for virtual hardware device 12, which related device 12 to the virtual NIC of the VM. The templates were originally created on a vSphere 4.1 environment, of which the vCenter was decommissioned instead of upgraded along with the rest of the environment. When you put this information (or these assumptions) together, it could very well be that the MAC address of the virtual NIC was not in a "good" relation with the current vCenter and that this resulted in failing Storage vMotion tasks. I know it was a bit far-fetched, but still I gave it a go and removed the current vNIC from one of the failed VMs and added a new vNIC. I checked, and the replacement changed the MAC address of the vNIC.
After the replacement I retried the Storage vMotion and this time it succeeded! I did the same replacement on the remaining failed VMs, and they could all now be migrated successfully to the new datastores.
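Replacing the vNIC can of course be done from the vSphere client, but with roughly 80% of the former template VMs failing it is faster to script it. A minimal sketch, assuming a powered-off VM with a single adapter on a standard vSwitch portgroup; the VM name and adapter type are placeholders.

$vm = Get-VM -Name "FormerTemplateVM"
$nic = Get-NetworkAdapter -VM $vm
$portgroup = $nic.NetworkName
# Remove the old adapter (and with it the old MAC address) ...
Remove-NetworkAdapter -NetworkAdapter $nic -Confirm:$false
# ... and add a new one on the same portgroup; it gets a freshly generated MAC address
# (for a dvPortgroup, use the -Portgroup parameter instead of -NetworkName)
New-NetworkAdapter -VM $vm -NetworkName $portgroup -Type Vmxnet3 -StartConnected
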
So for some reason, when doing a Storage vMotion, vCenter needs the VM to have a "compatible" MAC address for it to succeed.
In short: if you ever run into an "invalid configuration for device 12" error when trying to perform a Storage vMotion, check whether the MAC address of this VM "aligns" with the MAC addresses of VMs that can be Storage vMotioned.
If it doesn't, replacing the virtual NIC might solve your issue.

04 April, 2014

Adding new datastores to an existing vSphere environment

Today I was asked by a VMware admin at a customer how he could prevent, or maybe schedule, storage rescans.
He asked me this because he was adding 25 new datastores to 12 ESXi 5.1 hosts in an existing cluster, and every time he added a datastore a rescan of the HBA adapters was automatically initiated. As the cluster was already under a pretty heavy workload, the "rescan storm" started by his actions was having an impact on the performance of most of the VMs running in the cluster.
As far as I know it is not possible to schedule storage rescans; I don't see any added value in such a feature anyway.
But what is possible is disabling the automatic host rescan of the HBA adapters. This is done at vCenter level with the advanced setting "config.vpxd.filter.hostRescanFilter" set to the value "False".


VMware has a KB article about this, so if you want to have a reference or want to know how to make this advanced setting from the webclient please have a look at KB1016873
One very important thing not to forget: change the value of the advanced setting back to "True" as soon as you have finished adding the datastores!
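The same setting can also be toggled from PowerCLI while connected to vCenter; a sketch (treat it as an illustration, I normally make this change from the client):

# Disable the automatic rescan before adding the datastores ...
New-AdvancedSetting -Entity $global:DefaultVIServer -Name "config.vpxd.filter.hostRescanFilter" -Type VIServer -Value $false -Confirm:$false
# ... and switch it back to True as soon as you are done
Get-AdvancedSetting -Entity $global:DefaultVIServer -Name "config.vpxd.filter.hostRescanFilter" | Set-AdvancedSetting -Value $true -Confirm:$false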

19 July, 2013

Virtual Machine Disk Consolidation is needed !?

vSphere 5 introduced a new feature to clean up VM snapshot "left-overs", which can be the result of a snapshot removal action where the consolidation step has gone bad. In that situation the Snapshot Manager interface tells you there are no more snapshots present, but at datastore level they still exist and could even still be in use and growing. This can cause all kinds of problems: first of all VM performance issues, as the VM is still running in snapshot mode; secondly you are not able to alter the virtual disks of this VM; and in the long run you could potentially run out of space on your datastore because the snapshot keeps on growing.
Prior to vSphere 5 it was possible to fix such a situation through the CLI; with vSphere 5 you get an extra option in the VM > Snapshot menu called "Consolidate". This feature should clean up any discrepancies between the Snapshot Manager interface and the actual situation at datastore level.
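The same check and action can be done from PowerCLI; a small sketch that lists the VMs reporting "consolidation needed" and triggers a consolidate on them (use with care, consolidation generates extra storage I/O):

Get-VM | Where-Object { $_.ExtensionData.Runtime.ConsolidationNeeded } | ForEach-Object {
    Write-Host "Consolidating disks of $($_.Name)"
    # Same action as the "Consolidate" option in the Snapshot menu
    $_.ExtensionData.ConsolidateVMDisks()
}
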
I'm always a little reluctant when I'm at a customer and they use a lot of snapshotting. It is a very helpful and useful tool, but you have to use it with caution, otherwise it can cause big problems. Usually the problems start if snapshots are kept for a longer period of time or if snapshots are layered on top of each other, but even if you are aware of the problems it can cause when used wrongly, it can still happen that you run into issues when using snapshots.
That being said, when we look at features offered by the various vendors of storage devices, I'm pointing at the VM backup solutions they offer. When a VM is running while being backed up, they all use a vSphere snapshot (with or without the "quiesce the guest file system" option). Basically, if your company uses a SAN that leverages this functionality and it's configured to back up your VMs on a daily basis, you have an environment that uses snapshotting a lot (on a daily basis), and therefore you could run into more snapshot / consolidation issues than when you would not have a SAN with this functionality (nobody snapshots all of their VMs manually on a daily basis, I hope).
I was recently at a large customer (2000+ VMs) that uses their storage device's feature to back up complete datastores daily and also uses vSphere snapshots to get a consistent backup of running VMs.
For some reason they run into snapshot / consolidation issues pretty often, and they explained to me that the Consolidate feature did work OK on VMs with a Linux guest OS, but they almost always had a problem when trying to consolidate a VM with a Windows guest OS: it would simply fail with an error.
So I had a look at one of their VMs that could not consolidate, although the vCenter client was telling me it needed it.

When looking at the VM properties I could see that it still had a "snapshot" (delta file) as a virtual disk, as it had a -000001.vmdk as virtual disk instead of a "normal" .vmdk.


When a VM is in this state and still operational there is an issue, but the uptime is not directly affected. Most of the time, however, the VM will be down and will not power on again because of the issue; it will simply report an error about a missing snapshot on which the disk depends.
The way I solved it at this customer (multiple times) is by editing the VM's configuration (.vmx) file, re-registering the VM with vCenter and afterwards manually cleaning up the remaining snapshot files. Please note that if the VM was running in snapshot mode, all changes written to the snapshot will be lost with this procedure; in other words, the VM will return to its "pre-snapshot" situation. For this particular customer this was not an issue, because the failed snapshots were initiated for backup purposes, so no changes were made to the VM while it ran in snapshot mode.

So if you run into this issue and you know that no changes were made to the VM, or losing the changes is an acceptable loss, you can solve it with these steps.
  1. Download the .vmx file and open it with a text editor (I prefer Notepad++ for this kind of work). Find the line that has the virtual disk file configured, scsi0:0.fileName = "virtual-machine-000001.vmdk", and remove "-000001" so you are left with scsi0:0.fileName = "virtual-machine.vmdk". Save the file.
  2. Rename the .vmx file on the datastore to .old and upload the edited .vmx file.
  3. Either reload the VM by using PowerCLI** or remove the VM from the Inventory and re-add it again.
  4. Power on the VM
  5. If you get the "Virtual Disk Consolidation needed" message, go to the Snapshot menu and click "Consolidate" it should run correctly now and remove the message.
  6. Manually remove the unused files (the .old file and the -000001.vmdk and -000001-delta.vmdk snapshot files) from the datastore (I use WinSCP to do this kind of work on a datastore)
** you could do this by using the following one-liner:

Get-View -ViewType VirtualMachine -Filter @{"Name" = "VM name"} | ForEach-Object { $_.Reload() }


05 July, 2013

New Fling from VMwareLABS called "VisualEsxtop"

Most of the Flings coming from VMwareLABS are worth trying, and every once in a while there is a Fling that is really cool and above all useful (like InventorySnapshot, which I wrote a post on a while back). Just a couple of days ago some of the engineers from the VMware performance group released VisualEsxtop.
The name says it all: it is a graphical version of esxtop which can be run on Windows, Linux and Mac OS**, so it is really "cross-platform".
It works remotely and can be run on any computer which has network access to an ESX(i) host or vCenter (although I haven't been able to connect it to a vCenter successfully).

When you run it, it looks like an enhanced version of esxtop.


It will colour-code important counters and issues automatically. Furthermore, it has the ability to record and play back batch output, it can create line charts for selected counters, and it shows counter descriptions when you mouse over them.



To download this Fling, which I recommend as it is a very useful tool to have, please go to VMwareLABS.

** As said before it is cross-platform, but to get it to run on Mac OS you need to take some extra steps. For details on how to do this please read How to Run VMware's New Fling VisualEsxtop on Mac OS X from the virtuallyGhetto blog of William Lam



05 June, 2013

vSphere 5.X Storage vMotion

VMware has addressed a lot of bugs with Update 1a for vSphere (ESXi and vCenter), one of them being the long-awaited "renaming" feature when Storage vMotioning a VM. This "feature" slipped into vSphere somewhere in version 4.x as an undocumented feature, and as it turned out it was pretty useful for the online renaming of VMs. The team responsible for Storage vMotion thought differently and reported it as a bug that needed to be fixed, which they did with the introduction of vSphere 5.0.
After a lot of "complaints" from VMware customers around the globe, they re-introduced the bug / feature with Update 2 for vSphere 5.0, but it acted differently than before. The feature would now only rename the folder of the VM on the datastore, but would not rename the files that make up the VM.
Now with vSphere 5.1 and Update 1a the latter is possible again, but not out of the box. You will need to add an advanced setting to the vCenter settings for it to work again. So now you have a choice whether you want to use this renaming feature or not, which is a nice gesture, but why leave it disabled by default? In my opinion it would have been better to have it enabled by default with the option to disable it. For sure VMware will have a good reason why they didn't do it.
Anyway, if you want to use the feature and keep your VM names consistent with the corresponding folders and files on the datastore, you will have to add the key "provisioning.relocate.enableRename" to the "Advanced Settings" in "vCenter Server Settings" and give it the value "true".
After adding the key it will show up as "config.provisioning.relocate.enableRename".
And after closing "vCenter Server Settings" the renaming of VM files during a SvMotion should work.
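Adding the key can presumably also be done with PowerCLI while connected to vCenter; a sketch, assuming New-AdvancedSetting accepts the key in the same form as the GUI does (it should then show up with the config. prefix as well):

New-AdvancedSetting -Entity $global:DefaultVIServer -Name "provisioning.relocate.enableRename" -Type VIServer -Value "true" -Confirm:$false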

30 May, 2013

Where is my VM ?

I think every VM admin has experienced the following situation: for some reason, planned or unplanned, your vCenter Server is down. No real issue, because it will not affect the VMs running in your vSphere environment; even HA will continue to work.
But how do you know where a specific VM is located (on which host) when you don't have vCenter?
I already had a small script which allowed me to search for a VM on a selection of hosts or on all active hosts of an environment, because if your vCenter is down unplanned, there might be other servers / VMs that are also down with the same root cause. If this is the case, you will not only have to worry about getting your vCenter back up and running, you will also get a lot of questions on how to find and access the other unresponsive VMs. This will be a challenge for any admin who relies solely on the vSphere (web) client, because this admin will have to manually and separately log on to every host with the vSphere (web) client and search the inventory per host to find the VM he is looking for.
This is not a big problem if the environment has 5 or 6 hosts, but it gets bigger as the environment gets bigger; just imagine the amount of work and time it will take when you have to search through 50 hosts!
I had a "vCenter down" situation recently with a customer, and some other VMs became unresponsive. These VMs did not respond to RDP, and because vCenter was down there was no direct way to know on which of the 58 (!) hosts the VMs were running. So I got out my old small script to look up some of these VMs, and one of the admins saw this and asked if he could also run it to find some VMs. This admin had little experience with vSphere, so he had some trouble providing the needed input to the script before it would work (the ESXi host IP or DNS name to connect to, and the name of the VM as it is registered in vSphere), but with a little help he managed.
After this event I thought I would simplify the script by using pre-created files with host information of the vSphere environment and a selection menu to choose from; the only info you need to provide is the full name of the VM.
Of course you will need to create the files, but you can do this when vCenter is up and it is easy to retrieve this kind of information.

The files you need to create are one file per search selection; in my script I have one file for every cluster and one file for the complete environment. The file is a plain text file with one host IP address or DNS name per line.
For me this is a script (with or without the menu) that you want to have available in case of an emergency; it takes little time to set up and will help you big time in case of an issue.
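To give an idea, below is a simplified sketch of what the script boils down to, without the selection menu; the file path and credentials are placeholders.

$vmName  = Read-Host "Full name of the VM"
# root (or equivalent) credentials for the ESXi hosts
$esxCred = Get-Credential
# Plain text file with one host IP address or DNS name per line, created while vCenter was still up
$esxHosts = Get-Content -Path "C:\Scripts\Cluster01-hosts.txt"

foreach ($esx in $esxHosts) {
    $conn = Connect-VIServer -Server $esx -Credential $esxCred -ErrorAction SilentlyContinue
    if ($conn) {
        $vm = Get-VM -Name $vmName -Server $conn -ErrorAction SilentlyContinue
        if ($vm) { Write-Host "$vmName is registered on host $esx" }
        Disconnect-VIServer -Server $conn -Confirm:$false
    }
}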

27 May, 2013

VMs grayed out (Status Unknown) after an APD (All Paths Down) event on NFS datastores

Last week, during a change on one of the core switches of the NFS storage network at a customer, we ran into a big problem causing an outage of 50% of all VMs for around 4 hours.
The problem started with a network-related error, on which I will not elaborate other than that the result was an unstable NFS network causing randomly disconnected NFS datastores on the majority of the customer's ESXi hosts. On top of that it also caused latencies, which triggered vMotion actions that ramped up the latencies even more, resulting in a storm of failing vMotions.
In theory this should never have happened, as the NFS network of the customer is completely redundant, but in real life it turned out completely differently in this particular case.
After putting DRS into "partially automated" the vMotion storm stopped, but the latency continued on the NFS network and this also had its effect on the responsiveness of the ESXi hosts. Only after powering down the core switch (the one which had the change) did everything return to normal: datastores were connected to the ESXi hosts again and the latency disappeared. When looking into the vSphere client I found lots and lots of VMs that had an inaccessible or invalid status. When trying to power on such a VM it would not work and you would get an "action not allowed in this state" message. The only way I knew to get them accessible again at the time was to unregister the VMs from vCenter (Remove from Inventory) and add them again by browsing to the .vmx file with the Datastore Browser and selecting "Add to Inventory". This was time-consuming and tedious work, but the only quick fix for getting those VMs back into vCenter. Mind you, most of the VMs were still up and running, but in no way manageable through vCenter.
By the time I had all VMs registered again (some also needed a reboot, as their OS had crashed due to the high disk latencies), I was contacted by the vCloud admin: he had also lost around 100 VMs from his vCloud environment. It looked to be another long task of getting those VMs back, but we faced an extra problem. vCloud relies heavily on MoRef IDs for the identification of VMs; in other words, if the MoRef ID changes, vCloud will no longer recognise the VM as it cannot match it to anything in its database.
But removing a VM from the inventory and re-adding it changes / updates its MoRef ID, so even if we wanted to, the quick fix I had could not be used on the VMs in vCloud. Luckily the vCloud admin found VMware KB1026043. It looked like VMware had the solution to our problem, but for some reason this solution was not working for us, and it needed the host of the affected VMs to be in maintenance mode. It did help us with the search for a working solution, which was found quickly afterwards by the vCloud admin on www.hypervisor.fr, a French VMware-related blog by Raphael Schitz. He wrote an article "Reload du vmx en Powershell" (Reload a vmx with PowerShell) on how to reload VMs into the inventory without the need for maintenance mode on your host(s); it all comes down to a PowerCLI one-liner that does the trick. You can alter the command to run it against an entire datacenter or just a cluster.
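In the spirit of that article, the one-liner comes down to something like the line below, which reloads every VM that vCenter reports as invalid or inaccessible (scope Get-View with -SearchRoot if you only want a cluster or datacenter); this is my paraphrase, not Raphael's exact command.

Get-View -ViewType VirtualMachine | Where-Object { "invalid","inaccessible" -contains $_.Runtime.ConnectionState } | ForEach-Object { $_.Reload() }
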
In the end it saved our day by reloading all inaccessible and invalid VMs within just 5 minutes. This is a very useful one-liner, as NFS is getting used more and more as preferred storage.

17 May, 2013

Database redundancy for your vCenter database(s)

The most important database within a vSphere environment is without a doubt the vCenter database. VMware therefore has included detailed instructions on how to set up and configure this database; there is a guide for every supported database type. Recently I ran into a situation which made me believe that VMware "forgot" some details in this database configuration guide, at least when you have your vCenter database running on Oracle.
A customer had chosen to put their vCenter database on Oracle, as this was their preferred database knowledge-wise. They also set it up to be resilient; the way they achieved this was by having an active and a standby database placed on two different database servers in separate datacenters. To me it looked like a very solid solution. On the vCenter side they modified the TNSNames.ora in such a way that it now included both database server addresses and also contained the parameters for connect-time failover and load balancing.
By doing this they made sure that vCenter could (almost) always connect to one of the two database servers; it would simply fail over when the connection attempt timed out. In this case the failover would not be quick enough to keep vCenter up and running, so it would need a reboot (or at least a restart of the services) to get a connection again. But this would not affect the running VMs at all.
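For those not familiar with it, a TNSNames.ora entry along these lines achieves that; the server names and service name below are made up for the example.

VCDB =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = dbserver1.example.local)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = dbserver2.example.local)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = VCDB)
    )
  )
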
For maintenance on the database servers, we had to switch from the active server to the backup server. As this was a planned action, we could first gracefully stop the vCenter services and then switch to the standby database server. After the switch all vCenter services were started again and vCenter came up and ran like it was supposed to.
One issue that occurred during this database server switch was that VMware Orchestrator, which was installed on a separate server, stopped working, logging all kinds of database-related errors. With a quick look at the database configuration of Orchestrator I remembered that it cannot cope with multiple database server addresses and was set to connect to the database server that had now become the standby. By changing the database server and starting the Orchestrator services again, this problem was solved.
At least until the next day, when I took a look at the vCenter Operations dashboard and found that the health of vCenter was 0.

When I looked in more detail at what caused this, I found the message "VMware vCenter Storage Monitoring Service - Service initialization failed". The only thing I found that could link this alert to the database failover was the timestamp: it was recorded right at the time the failover had happened.


Not really knowing where to start investigating on the vCenter server, I first tried to find some information in the VMware KB, and the first article that came up described the exact same error message. When reading KB2016472 I quickly found confirmation that this issue was related to the database failover, although it refers to vCenter 4.x and 5.0 with the use of a SQL database instead of vCenter 5.1 with an Oracle database.
It appears that this vCenter Storage Monitoring Service does not use the TNSNames.ora for its database connection; it has its own configuration / connection file called vcdb.properties. This file contained only the first of the two database server addresses.
Through the information in the KB article I knew what to change to get the connection set to the backup database server, and after a restart of the vCenter Server service the vCenter Storage Monitoring Service initialized OK and started without any error.
So my conclusion is that even when you have redundancy or failover set up at vCenter database level, there are still some vCenter-related products and services that need manual action to continue to work in case of a (planned) database failover.

26 April, 2013

vCloud loses sync with the vCenter Inventory Service

Yesterday the vCloud admin of a customer I am working for on another project had a strange problem. He told me that it had become impossible to deploy new vApps from the catalog; the process would stop with all kinds of errors.
A few days before, when we were testing the deployment of vApps on NFS datastores that were on new storage devices, he had also run into a strange problem which looked quite similar. When deploying, vCloud would first generate another set of "shadow VMs" before actually deploying the vApp. This was strange because the vApp already had a set of running "shadow VMs" and it should have been using those.
Because the issue of yesterday had stopped production on the vCloud environment, the vCloud admin opened a support request with VMware GSS.
Once they had a look at the issue, it quickly became clear what was causing these strange problems: the vCloud Director cell had lost sync with the vCenter Inventory Service. This is not uncommon and you can find several "solutions" to this problem when searching through some blogs.
In short, these are the steps you need to take to restart the syncing process (if you are running a multi-cell environment):


1. First disable the cell and pass the active jobs to the other cells.

2. Display the current state of the cell to view any active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --status

3. Then Quiesce the active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --quiesce true

4. Confirm the cell isn’t processing any active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --status

5. Now shut the cell down to prevent any other jobs from becoming active on the cell.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --shutdown

6. Then restart the services.
# service vmware-vcd restart

If you are not running a multi-cell environment, you can just restart the services, but this will mean a loss of service for a couple of minutes. If you want to keep the loss of service to a minimum, you can monitor or tail the cell.log file (when it reads 100%, it's done):
# tail -f /opt/vmware/vcloud-director/logs/cell.log

In case the above does not work, you can also reboot the complete cell (in a multi-cell environment, first pass all active tasks to other cells). Upon reboot the vCloud cell will reconnect and sync again.

OK, back to the issue: in this case this did not work, vCloud did not start the sync again either way we tried.
The support engineer wanted to restart all vCenter services to make sure they were all running OK. Unfortunately this did not help. But in this specific environment the Inventory Service runs on a separate server (VM), and after restarting the Inventory Service and another restart of the cell services, vCloud did sync upon starting.
Afterwards, when talking to the vCloud admin, he told me that he had found other minor issues that could probably be interpreted as signs that vCloud and the Inventory Service were getting out of sync again.
He found that some VMs were present in vCloud but not in vCenter, so you could not find them when using the search feature (which is driven by the Inventory Service) in the vSphere client. And he found that the "VMs and Clusters" view of the vSphere client had become very slow, even unresponsive. All other views in the client were working as usual.
As this issue can occur again, we decided to keep an eye out for either of these "signs", and when we detect them, to restart the Inventory Service ASAP.

Better to be safe than sorry.



22 April, 2013

Great (new) fling from VMware Labs

So the title reads that it is a new Fling, but this is not completely true: it was launched in its current version over a year ago. At the time I did not have the time to check it out; recently, when searching for something else, I stumbled upon this Fling again and thought why not give it a go.
So what does this InventorySnapshot Fling do exactly? Well, it gives you the possibility to "snapshot" a vCenter inventory and use this to reproduce that inventory. This can be used for backup and restore, or you can use the snapshot as a template inventory. Another purpose I came up with is when you run a home lab and use the 60-day evaluation licenses for the VMware part of it, which is surely done a lot. You could build your home lab, configure it as you want / need it and take a "snapshot" with this Fling. All your rebuilds afterwards will be a lot quicker, as you can restore your complete vCenter inventory very easily and quickly; this will save you a lot of configuration time.
The snapshot can be taken at any level within the vCenter inventory.
To be able to run this fling on your computer you need two things installed, first PowerCLI (duh!!) and second Java.
You can download the binaries in a zip package through this link.
After downloading it, unpack it and run "InventorySnapshot.bat"; if it doesn't work, run it from a command line and see if there is an error. I got an error the first time round about Java: it could not be found. After editing "InventorySnapshot.bat" to add the full path to Java, it started working.
The usage of the tool is pretty straight forward and all is documented on the VMware Lab website, there is even a video that shows you how to use the tool.
So how does it work? The tool creates a PowerCLI script crafted from the vCenter inventory it is connected to, and once you run this script it will re-create the vCenter inventory.
You can restore the complete vCenter inventory or you can select parts of the inventory that you want to restore.

13 December, 2012

Protecting against Denial Of Service attacks with new VDS feature

One of the VMware Distributed Switch (VDS) enhancements is the BPDU filter. BPDU stands for Bridge Protocol Data Unit; these packets are exchanged between physical switches as part of the Spanning Tree Protocol (STP). STP is used to prevent network loops and is used on physical switches.
Based on the BPDU exchange, a physical switch determines whether a specific port should be in a forwarding or a blocking state.
VMware's virtual switches do not support STP and therefore do not exchange BPDU packets; the VDS will simply drop them.
A best practice for VMware host-facing ports is to enable PortFast and BPDU guard. With this best practice, the following scenario could cause a complete uplink to fail: if a VM is compromised in some way and starts to generate BPDU packets, these will travel to the physical switch, which will block the port as a result of its BPDU guard setting.
The result is an uplink down, and the vSphere host will try to move the VM to another uplink, which will result in another uplink down; in a worst-case scenario it could cause a cluster-wide failure.
The BPDU filter feature will make the VDS (and the VSS as well) drop the BPDU packets coming from the VM.
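For reference, the BPDU filter is controlled per host through the Net.BlockGuestBPDU advanced setting (introduced with vSphere 5.1 and disabled by default); a sketch of enabling it on all hosts with PowerCLI:

foreach ($esx in Get-VMHost) {
    # 1 = drop BPDU frames generated by VMs, 0 = default behaviour
    Get-AdvancedSetting -Entity $esx -Name "Net.BlockGuestBPDU" | Set-AdvancedSetting -Value 1 -Confirm:$false
}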

For more details and how to configure it please visit the VMware blog