07 December, 2015

Updates failing on VSAN hosts

A while back, one of my customers ran into an issue when they wanted to install Update 1 for ESXi 6.0. Most of the ESXi hosts updated as expected, but five hosts did not.
The first difference with these five hosts was that the hypervisor is installed on rack servers instead of blade servers. The main reason for this was that these hosts needed to accommodate local storage, which is used for VMware Virtual SAN.
The five hosts were in the same VSAN cluster, and this cluster is used as the management cluster for the customer's entire vSphere environment.
So these five hosts needed to be updated with Update 1 for ESXi 6.0, but VUM (VMware Update Manager) failed to install the update with a somewhat strange error message. With the help of the VMware Knowledge Base it became clear that it had something to do with staging the update before installing it on the ESXi host.
Because this customer runs ESXi from a USB flash device, the scratch location is redirected to shared storage. This configuration is the same for both the blade and rack servers. My first thought was that something was wrong with this redirection on these five hosts. But after reviewing the advanced settings and verifying the date and time stamps on the hosts' log files in the .locker folders on shared storage, everything looked fine.
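The timestamp check can be scripted. The sketch below is only an illustration, not the procedure I used at the time: it reports how long ago the newest file under a log folder was written. The function name and the folder you point it at are assumptions. If a host is really logging to its redirected scratch location, the result should be small.

```python
import os
import time


def newest_log_age_seconds(log_dir, now=None):
    """Return the age in seconds of the most recently modified file
    under log_dir, or None when the folder contains no files."""
    now = time.time() if now is None else now
    newest = None
    for dirpath, _dirnames, filenames in os.walk(log_dir):
        for name in filenames:
            mtime = os.path.getmtime(os.path.join(dirpath, name))
            if newest is None or mtime > newest:
                newest = mtime
    return None if newest is None else now - newest
```

Calling `newest_log_age_seconds()` on a host's .locker folder (the exact path depends on how the redirection was set up) and getting back a value of a few minutes at most suggests the redirection is active. A value of days, or `None`, suggests it is not.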
As far as I understood, the update and patch staging location had moved along with the log file location when the scratch location was changed. Cormac Hogan has written an excellent blog post on log file redirection and VSAN considerations when booting from a USB flash device.
So what do VSAN, ESXi hosts booting from a USB flash device and scratch folder redirection have to do with failing updates? I will come to that. Please bear with me as I first give you some background on the intended use and deployment method of the hosts concerned.
Because these hosts provide the resources for the management cluster, they were the first ESXi hosts to be deployed in the customer's data center. No shared storage was available at the time of deployment. This poses a challenge: to use VSAN you need vCenter, and to deploy vCenter (as an appliance or Windows-based) you need storage accessible by one or more ESXi hosts.
A solution is to bootstrap vCenter on a single VSAN node. Yes, a single ESXi host that runs VSAN with only its own local storage!
If you want more information on how to bootstrap vCenter on a single VSAN node, please have a look at this two-part blog post by William Lam on his VirtuallyGhetto blog.
By bootstrapping vCenter onto a single ESXi host, using its local storage to create the VSAN datastore, I could build the VSAN cluster and add the remaining four ESXi hosts to complete it.
When the shared NFS storage became available later in the project, the scratch folder was redirected, just as on all of the customer's other ESXi hosts.
And here is the catch: for as long as the ESXi hosts ran without the redirection, they were logging to the local flash device. Usually this is not a big issue, other than the risks mentioned in Cormac's blog post. But when you use VSAN, additional log or trace files are written (vsantraces). In this customer's case, VSAN Observer files had also been written to the local flash device. VSAN Observer is a tool used to monitor and capture performance statistics, originally used only by the VMware VSAN engineering team. More information on VSAN Observer can be found here.
Vsantrace files can quickly grow to 500 MB, and VSAN Observer trace files can grow even larger. As I explained previously, the scratch folder redirection was done some time (days) after the VSAN cluster became operational. When the redirection is done, the trace files already on the local flash device are NOT removed, and these files take up a considerable amount of space. In fact, they took up so much space that there was not enough left to stage Update 1 for ESXi 6.0 on the host.
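To get a feel for how much space such leftovers occupy, a small script can walk a folder and total the files that look like VSAN traces. This is a minimal sketch under assumptions: the name patterns in `TRACE_PATTERNS` and the function names are hypothetical and need adjusting to what is actually on the device, and it has to run somewhere a Python interpreter can read the files.

```python
import fnmatch
import os

# Hypothetical name patterns (an assumption, not taken from the hosts);
# adjust them to the files actually present on the flash device.
TRACE_PATTERNS = ("vsantraces*", "vsanObserver*")


def find_trace_files(root, patterns=TRACE_PATTERNS):
    """Return (path, size_in_bytes) pairs, largest first, for files under
    root whose own name or parent folder name matches a pattern."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        folder = os.path.basename(dirpath)
        folder_hit = any(fnmatch.fnmatch(folder, p) for p in patterns)
        for name in filenames:
            if folder_hit or any(fnmatch.fnmatch(name, p) for p in patterns):
                path = os.path.join(dirpath, name)
                try:
                    matches.append((path, os.path.getsize(path)))
                except OSError:
                    pass  # file vanished or is unreadable; skip it
    return sorted(matches, key=lambda item: item[1], reverse=True)


def total_megabytes(files):
    """Sum the sizes of (path, size) pairs and convert to MB (10**6 bytes)."""
    return sum(size for _path, size in files) / 1e6
```

Comparing `total_megabytes(find_trace_files(root))` with the free space left on the device gives a quick indication of whether old trace files are what is blocking the staging of an update.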
Manually removing the old VSAN-related trace files from the local flash devices solved the VUM issue. After the files were deleted, remediation of the VSAN ESXi hosts ran without any issue.
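I removed the files by hand, but if you would rather script the cleanup, a dry run first is a sensible safeguard. The helper below is a hypothetical sketch, not a VMware tool: it takes a list of file paths you have already reviewed and only deletes them when `dry_run` is set to `False`.

```python
import os


def remove_files(paths, dry_run=True):
    """Delete the given files and return the number of bytes freed.

    With dry_run=True nothing is deleted; the return value is the
    space that would be reclaimed."""
    freed = 0
    for path in paths:
        if not os.path.isfile(path):
            continue  # skip folders and paths that no longer exist
        size = os.path.getsize(path)
        if not dry_run:
            os.remove(path)
        freed += size
    return freed
```

Run it once with the default `dry_run=True`, check that the reported number of bytes matches your expectation, and only then call it again with `dry_run=False`.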
