26 April, 2013

vCloud loses sync with the vCenter Inventory Service

Yesterday the vCloud admin of a customer I am working for on another project had a strange problem. He told me that it had become impossible to deploy new vApps from the catalog; the process would stop with all kinds of errors.
A few days earlier, when we were testing the deployment of vApps on NFS datastores that were on new storage devices, he also ran into a strange problem that looked quite similar. When deploying, vCloud would first generate another set of "shadow VMs" before actually deploying the vApp. This was strange because the vApp already had a set of running "shadow VMs" and it should have been using these.
Because yesterday's issue had stopped production on the vCloud environment, the vCloud admin opened a support request with VMware GSS.
Once they had a look at the issue, it quickly became clear what was causing these strange problems. It looked like the vCloud Director cell had lost sync with the vCenter Inventory Service. This is not uncommon, and you can find several "solutions" to this problem when searching through blogs.
In short, these are the steps you need to take to restart the syncing process (if you are running a multi-cell environment):


1. First disable the cell and pass the active jobs to the other cells.

2. Display the current state of the cell to view any active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --status

3. Then quiesce the active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --quiesce true

4. Confirm the cell isn’t processing any active jobs.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --status

5. Now shut the cell down to prevent any other jobs from becoming active on the cell.
#  /opt/vmware/vcloud-director/bin/cell-management-tool -u <username> cell --shutdown

6. Then restart the services.
# service vmware-vcd restart
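
For anyone who prefers to run the steps above in one go, they could be wrapped in a small shell function along these lines. This is only a sketch: the function name, the `CMT` and `POLL_SECONDS` variables and the user argument are made up for this example, and the loop assumes that the `--status` output contains a line such as "Job count = 0" once the cell is idle, so check that against what your own cell prints. Expect the tool to ask for the password on every call.

```shell
#!/bin/sh
# Sketch only: chains the commands from the steps above.
# CMT and POLL_SECONDS are hypothetical names introduced for this example.
CMT="${CMT:-/opt/vmware/vcloud-director/bin/cell-management-tool}"

drain_and_restart_cell() {
    vcd_user="$1"

    # Show the current state, then stop the cell from picking up new jobs.
    "$CMT" -u "$vcd_user" cell --status || return 1
    "$CMT" -u "$vcd_user" cell --quiesce true || return 1

    # Assumption: the status output has a "Job count = N" line.
    # Keep polling until it reports zero active jobs.
    while ! "$CMT" -u "$vcd_user" cell --status |
            grep -Eq 'Job count = 0[[:space:]]*$'; do
        sleep "${POLL_SECONDS:-10}"
    done

    # Shut the cell down so nothing new becomes active, then restart.
    "$CMT" -u "$vcd_user" cell --shutdown || return 1
    service vmware-vcd restart
}
```

Calling `drain_and_restart_cell <username>` as root on the cell then runs the whole sequence; the function stops early if one of the tool calls fails, so a cell that refuses to quiesce is not restarted anyway.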

If you are not running a multi-cell environment, you can just restart the services, but this will cause a loss of service for a couple of minutes. If you want to keep the loss of service to a minimum, you can "monitor" the cell.log file with tail (when it reads 100%, the cell is done starting):
# tail -f /opt/vmware/vcloud-director/logs/cell.log
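
If you would rather not stare at the log yourself, a small helper can poll it for you. The sketch below is based on nothing more than the "reads 100%" rule of thumb above: the function names and `POLL_SECONDS` are made up for this example, and it simply looks for the text "100%" anywhere in the file, so it may report too early if the log still holds lines from a previous start.

```shell
#!/bin/sh
# Sketch only: init_complete and wait_for_cell are hypothetical helpers.

# Returns 0 once the given log file contains a line with "100%".
init_complete() {
    grep -q '100%' "$1" 2>/dev/null
}

# Polls the log (default: the cell.log path used above) until it shows 100%.
wait_for_cell() {
    log="${1:-/opt/vmware/vcloud-director/logs/cell.log}"
    until init_complete "$log"; do
        sleep "${POLL_SECONDS:-5}"
    done
    echo "cell initialization reported 100%"
}
```

Running `wait_for_cell` right after the service restart blocks until the log shows 100%, which makes it easy to chain a notification or the next command behind it.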

If the above does not work, you can also reboot the complete cell (in a multi-cell environment, first pass all active tasks to the other cells). Upon reboot the vCloud cell will reconnect and sync again.

OK, back to the issue. In this case none of this worked: vCloud did not start the sync again with either method.
The support engineer wanted to restart all vCenter services to make sure that they were all running OK. Unfortunately this did not help either. In this specific environment, however, the Inventory Service runs on a separate server (VM), and after restarting the Inventory Service and another restart of the cell services, vCloud did sync on startup.
Afterwards, when talking to the vCloud admin, he told me that he had found other minor issues that could probably be interpreted as signs that vCloud and the Inventory Service were getting out of sync again.
He found that some VMs were present in vCloud but not in vCenter, hence you could not find them when using the search feature (which is driven by the Inventory Service) in the vSphere client. He also found that the "VMs and Clusters" view of the vSphere client had become very slow, up to the point of being unresponsive. All other views in the client were working as usual.
As this issue can occur again, we decided to keep an eye out for either of these "signs" and, as soon as we spot one, to restart the Inventory Service ASAP.

Better to be safe than sorry.


