Last week, during a change on one of the core switches of a customer's NFS storage network, we ran into a big problem that caused an outage of 50% of all VMs for around 4 hours.
The problem started with a network-related error, on which I will not elaborate other than to say that the result was an unstable NFS network, causing NFS datastores to disconnect at random on the majority of the customer's ESXi hosts. On top of that it caused latency, which triggered vMotion actions that ramped up the latency even more and resulted in a storm of failing vMotions.
In theory this should never have happened, as the customer's NFS network is completely redundant, but in real life it turned out completely different in this particular case.
After setting DRS to "Partially Automated" the vMotion storm stopped, but the latency on the NFS network continued, and this also affected the responsiveness of the ESXi hosts. Only after powering down the core switch (the one that had the change) did everything return to normal: the datastores reconnected to the ESXi hosts and the latency disappeared. When looking in the vSphere Client I found lots and lots of VMs with an inaccessible or invalid status. Trying to power on such a VM would not work and you would get an "action not allowed in this state" message. The only way I knew at the time to get them accessible again was to unregister the VMs from vCenter (Remove from Inventory) and add them again by browsing to the .vmx file with the Datastore Browser and selecting "Add to Inventory". This was time-consuming and tedious work, but it was the only quick fix for getting those VMs back into vCenter. Mind you, most of the VMs were still up and running, but in no way manageable through vCenter.
By the time I had all VMs registered again (some also needed a reboot, as their OS had crashed due to the high disk latencies), I was contacted by the vCloud admin, who had also lost around 100 VMs from his vCloud environment. It looked like another long task of getting those VMs back, but we faced an extra problem. vCloud relies heavily on MoRef IDs to identify VMs; in other words, if the MoRef ID changes, vCloud will no longer recognise the VM, as it cannot match it to anything in its database.
But removing a VM from Inventory and re-adding it changes its MoRef ID, so the quick fix I had could not be used on the VMs in vCloud, even if we had wanted to. Luckily the vCloud admin found VMware KB 1026043. It looked like VMware had the solution to our problem, but for some reason it was not working for us, and it required the host of the affected VMs to be in maintenance mode. It did help us in the search for a working solution, which the vCloud admin found shortly afterwards on www.hypervisor.fr, the French VMware-related blog of Raphael Schitz. He wrote an article, "Reload du vmx en Powershell" (Reload a vmx with PowerShell), on how to reload VMs into Inventory without needing maintenance mode on your host(s). It all comes down to a PowerCLI one-liner that does the trick. You can alter the command to run it against an entire Datacenter or just a Cluster.
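The PowerCLI one-liner itself is not reproduced here, so what follows is only a rough sketch of the same idea in Python, not the code from the hypervisor.fr article. It assumes VM objects shaped like pyVmomi's `vim.VirtualMachine` (a `name`, a `runtime.connectionState` and a `Reload()` method). The names `reload_broken_vms` and `make_stub_vm` are made up for this illustration, and the stub objects exist only so the sketch can be dry-run without a vCenter.

```python
from types import SimpleNamespace

# Connection states that showed up as unmanageable VMs in vCenter.
BROKEN_STATES = {"inaccessible", "invalid"}


def reload_broken_vms(vms):
    """Call Reload() on every VM whose connection state is broken.

    `vms` is any iterable of objects shaped like pyVmomi's
    vim.VirtualMachine (assumption: name, runtime.connectionState, Reload()).
    Returns the names of the VMs that were reloaded.
    """
    reloaded = []
    for vm in vms:
        if str(vm.runtime.connectionState) in BROKEN_STATES:
            # Reload refreshes the VM's state in place; the VM is not
            # unregistered, which is the point of avoiding Remove/Add.
            vm.Reload()
            reloaded.append(vm.name)
    return reloaded


def make_stub_vm(name, state):
    """Hypothetical stand-in for a real VM object, for a dry run only."""
    vm = SimpleNamespace(
        name=name,
        runtime=SimpleNamespace(connectionState=state),
    )
    # The stub simply flips the state; a real Reload() talks to the host.
    vm.Reload = lambda: setattr(vm.runtime, "connectionState", "connected")
    return vm


if __name__ == "__main__":
    inventory = [
        make_stub_vm("web01", "connected"),
        make_stub_vm("db01", "inaccessible"),
        make_stub_vm("app01", "invalid"),
    ]
    for name in reload_broken_vms(inventory):
        print("reloaded", name)
```

Against a real vCenter, the `vms` list would typically come from a pyVmomi container view on `vim.VirtualMachine`. Rooting that view at a Datacenter or a Cluster instead of the root folder gives the same scoping described above. Try it on a single VM first before letting it loose on a whole cluster.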
In the end it saved our day by reloading all inaccessible and invalid VMs within just 5 minutes. This is a very useful one-liner to have around, as NFS is increasingly used as the preferred storage.
 