21 February, 2013

VMware Inventory Service causes vCenter server to run out of disk space

Recently I ran into an issue with a vCenter server: it had apparently crashed because its system disk ran out of space. This particular vCenter server was running on physical hardware. When the system disk was investigated in an attempt to find out what had consumed 18 GB of free space in less than 48 hours, 4 large log files were found in the Inventory Service database logs folder. On closer inspection, they had all been created within 20 minutes, and for some reason the log files had not been flushed to the database. Usually a new log file is created every 10 minutes or so, and/or when the file size reaches around 4 GB.
In search of an explanation and a solution, a good look through the VMware vCenter documentation turned up a section in the "After you install the vCenter server" chapter called "Back up the Inventory Service Database on Windows", and this section mentions a "scripts" folder. One of the scripts is intended to back up the Inventory Service database.
I presume everyone who works with vSphere knows how important the vCenter database is, and I hope everyone backs it up accordingly. But I personally have never read an article or best practice stating that the Inventory Service database should be backed up separately as well; this includes the VMware documentation that is provided as a guideline for upgrading from vSphere 5.0 to 5.1.

Additional information:
After another day of testing, the excessive growth of the logs could be traced back to vCloud deploying vApps. As a test we deployed 5 vApps and monitored the xhive log files; during these deployments they grew at a rate of 2 GB per minute.
Whether this can be considered "normal" behaviour... I am not sure, and neither is GSS. The case investigating this is still open at the time of writing.
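A growth rate like the one above can be measured by sampling the size of the log folder at a fixed interval. The following is only a minimal sketch of that idea, not the method used during the actual tests: the function names are made up for illustration, and the folder path has to be supplied by the reader, since the exact location of the xhive logs depends on the installation.

```python
import os
import time


def folder_size_bytes(folder):
    """Sum the sizes of all regular files directly inside `folder`."""
    total = 0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                total += entry.stat().st_size
    return total


def growth_gb_per_minute(bytes_before, bytes_after, seconds):
    """Convert a size difference measured over `seconds` into GB per minute."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (bytes_after - bytes_before) / 1024**3 * (60 / seconds)


def watch(folder, interval_seconds=60, samples=5):
    """Print the growth rate of `folder` once per interval (hypothetical helper)."""
    previous = folder_size_bytes(folder)
    for _ in range(samples):
        time.sleep(interval_seconds)
        current = folder_size_bytes(folder)
        rate = growth_gb_per_minute(previous, current, interval_seconds)
        print(f"{folder}: {rate:.2f} GB/min")
        previous = current
```

Calling `watch(log_folder)` during a vApp deployment, with `log_folder` pointing at the Inventory Service database logs folder, would print one reading per minute. A negative value simply means log files were removed between two samples.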

So the big question is / was: what caused the log files to grow this fast, and why weren't these logs written to the database as they normally are? For now the only explanation is a hardware malfunction.
This particular vCenter server was a physical server with a malfunctioning disk controller cache battery, and the controller had therefore disabled the write cache to prevent data corruption.
This reduced the write performance from the usual 100 MB/s to only 5 MB/s; the server simply could not keep up with writing log entries during the vCloud deployments.
The battery has been replaced. vCloud still generates a huge amount of log entries during deploy actions, but for now the server is able to keep up and write them to the database in a timely fashion.
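To put the numbers side by side: 2 GB per minute works out to roughly 34 MB/s, which fits within 100 MB/s but not within 5 MB/s. The small sketch below does that conversion. It assumes 1 GB = 1024 MB and treats the quoted throughput figures as sustained rates; it ignores the database writes that compete for the same disk, so it is a rough comparison rather than a model of the Inventory Service. The function names are made up for illustration.

```python
def gb_per_min_to_mb_per_s(gb_per_min):
    """Convert GB per minute to MB per second (assuming 1 GB = 1024 MB)."""
    return gb_per_min * 1024 / 60


def can_keep_up(log_rate_gb_per_min, disk_mb_per_s):
    """True if the disk throughput covers the log generation rate."""
    return disk_mb_per_s >= gb_per_min_to_mb_per_s(log_rate_gb_per_min)


if __name__ == "__main__":
    log_rate = 2  # GB per minute, as observed during the vApp deployments
    required = gb_per_min_to_mb_per_s(log_rate)
    print(f"required: {required:.1f} MB/s")
    for label, throughput in (("write cache enabled", 100),
                              ("write cache disabled", 5)):
        verdict = "keeps up" if can_keep_up(log_rate, throughput) else "falls behind"
        print(f"{label} ({throughput} MB/s): {verdict}")
```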