17 April, 2013

Is rescanning storage adapters something you can do during production hours?

First off, why would you initiate a rescan action at all? Usually after you have made modifications to the storage / datastore part of your vSphere environment, like adding a new datastore. In general we can presume that a "Rescan All" can be done during production hours: in theory the action does not disrupt your environment, so it should be acceptable. But in reality this is not always the case, so as with many decisions and actions on vSphere (and vSphere design), the correct answer to the question is: it depends... on numerous factors.
I will try to clarify this statement with a use case, combined with information gathered from a real-life incident at one of the customers I worked for.
Everything in this article relates to vSphere 5.1 and up. Why not the older versions? For two reasons: first, I did not investigate this on older versions, and second, version 5.1 introduced the new Inventory Service. This service is basically there to make vCenter perform better (in several areas, for instance the search feature), as it has its own database which the search feature uses. The Inventory Service is closely linked to Profile-Driven Storage, which manages the feature known as "Storage Profiles". If the Inventory Service is not working, then Storage Profiles will not be usable; please keep this in mind.
Back to rescanning storage adapters: this is usually not something you do on a daily basis. It concentrates around changes to the storage area of your vSphere environment, as mentioned before. You can imagine it gets used heavily when you move from one storage device to another. In an action I recently worked on, the customer needed to replace their SAN for performance and stability reasons. They decided to move away from FC storage and start using NFS storage instead, which resulted in a large migration of all VMs in their vSphere environment.
When all VMs were moved to the new NFS-backed datastores, the next step was to remove all FC-backed datastores. One of the VM admins started deleting the FC datastores from the vSphere client, host by host; what happened was that every removal automatically initiated a rescan on that host.
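As a side note: the graceful way to take an FC datastore out of service on ESXi 5.x is to unmount the volume and then detach the backing device, before unpresenting the LUN on the array and doing a single rescan at the end. Below is a minimal sketch of what that looks like from the ESXi command line; the datastore label and the naa identifier are placeholders for your own environment.

Example:
esxcli storage filesystem unmount -l FC_datastore_01
esxcli storage core device set --state=off -d naa.xxxxxxxxxxxxxxxx
The first command unmounts the datastore by its label, the second detaches the backing device so the host no longer tries to access it.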
During this time the vCloud environment (which runs on the same backend and vCenter as the "normal" vSphere environment) started to show very strange behaviour. It was not possible to deploy a vApp from the catalog, and adding a VM to a vApp did not work either. On the vSphere / vCenter side the search feature stopped working, and deploying a new VM failed on most occasions.
With the search feature not working, something had to be wrong with the Inventory Service. I opened up my vSphere Web Client to see if I could use search from there and was presented with an error notification: "Connection lost to Inventory Service". I checked the services on the server running the Inventory Service, and the service was showing "started". After a reboot of the server (the Inventory Service ran on a separate VM because of the size of the environment) and a long wait before the Inventory Service went from "starting" to "started" (about 40 min.), the search feature started working again. But vCloud still did not operate normally, and all Storage Profiles were gone from vCloud as well as from vCenter. After the Profile-Driven Storage service was restarted they re-appeared and vCloud started behaving normally again.
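For reference: on a Windows-based vCenter 5.1 you can restart this service from an elevated command prompt as well as from services.msc. A minimal sketch, assuming the default display name of the Profile-Driven Storage service (verify the exact name on your own installation, it may differ per version):

Example:
rem service display name assumed, check services.msc on your version
net stop "VMware vSphere Profile-Driven Storage Service"
net start "VMware vSphere Profile-Driven Storage Service"
This restarts only the Profile-Driven Storage service, without touching vCenter itself.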
Every admin returned to his normal work after this incident, so one of the VM admins went back to removing the FC datastores, and within a short time the strange behaviour in vCloud and vCenter appeared again. After opening a support request with VMware GSS and talking to one of the support engineers, it became clear that the rescan actions were putting too much stress on both the Inventory Service and Profile-Driven Storage, to the point where all other tasks slowed down and the Inventory Service eventually crashed completely.
I had never seen or heard of such an incident from other customers or colleagues, so I started investigating which combination of factors was responsible. It soon became clear and started to make sense.
The customer uses Storage Profiles in both vCloud and vCenter (for all VMs, +/- 1700), and it is known that vCloud by itself can create a fair amount of load on the Inventory Service. Rescanning puts an extra load on the Inventory Service and also puts a strain on the host agents; combined, this generates enough load to make the service fail / crash.
The advice from VMware GSS was to reduce rescanning and datastore manipulation to a minimum during production hours, and preferably to execute these actions during the quietest hours of the day, or better still during a maintenance window.
And if rescanning is needed, to do it in a more localised way instead of host-, cluster- or even datacenter-wide. This last part is not possible through the vSphere client, but from the ESXi command line (the ESXi Shell, reachable via the DCUI or SSH) you can run a rescan per host and even per storage adapter. This reduces the load considerably and makes it possible to run rescan actions during production hours without disrupting your vCloud and vSphere environment.

Example:
esxcli storage core adapter rescan -A vmhba0 
This will rescan vmhba0 on this specific host, leaving all other adapters (and hosts) alone.
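If you are not sure which adapter names exist on a host, you can list them first:

Example:
esxcli storage core adapter list
This prints the HBA name, driver and description for every adapter on the host, so you can pick the right one. And should you need to rescan all adapters on just that one host, "esxcli storage core adapter rescan --all" does the same as a host-level "Rescan All" without touching any other host in the cluster.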

So yes, you can do a rescan action during production hours, but never without thinking it over and without knowing your environment. Please use it with caution, and only execute it during production hours when it is absolutely needed.

More information on rescanning can be found in VMware KB 1003988.
