Recently I ran into an issue while doing some work on an ESXi 5.1 cluster: I needed to put the hosts into maintenance mode one by one. When I put the first host into maintenance mode I assumed the host would be evacuated by migrating all VMs with vMotion, as Enterprise Plus licenses were in place. But when the progress bar hit 13% the vMotion process stopped with an error referring to "ramdisk (root) is full". When I checked with the customer they told me this started happening after they configured SNMP. They found that it would, for some reason, fill up the disk containing /var, and they also found that those hosts sometimes became unresponsive to SSH and/or DCUI.
After looking up the error message it quickly became clear what the relation was between SNMP and "ramdisk (root) is full": the SNMP service generated a .TRP file for every SNMP trap sent. I believe this is not normal behavior. When I checked the functionality of SNMP with "esxcli system snmp test" it reported an error: "Agent not responding, connect uds socket(/var/run/snmp.ctl) failed 2, err= No such file or directory", which proved my assumption was right (a successful test should result in "Comments: There is 1 target configured, send warmStart requested, test completed normally."). These .TRP files are stored in /var/spool/snmp, and this location was on non-persistent storage; in fact it was located on a 4 GB ramdisk. Please check VMware KB2042772 for details on the error related to the scratch location.
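As a side note: if you want a quick overview of which hosts have SNMP enabled and how it is configured, without opening an SSH session to every host, a few lines of PowerCLI can do it. This is just a minimal sketch, assuming an existing Connect-VIServer session; "MyCluster" is a placeholder cluster name.

# Print the SNMP agent configuration of every host in a cluster
# (assumes you are already connected to vCenter; "MyCluster" is a placeholder)
foreach ($vmhost in Get-Cluster "MyCluster" | Get-VMHost) {
    Write-Host "==== $($vmhost.Name) ===="
    (Get-EsxCli -VMHost $vmhost).system.snmp.get()
}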
The SNMP service will write a maximum of 8191 .TRP files; if the /var/spool/snmp location runs out of space before hitting this number you will have a host which is no longer able to vMotion, and it can also become disconnected / unresponsive. In some cases you are not even able to start DCUI because there are no free inodes left on the host. In that case connect to the host console (iLO, DRAC, ...) and make sure you can log in, then stop the vpxa service ("/etc/init.d/vpxa stop"); this frees up an inode and you will be able to start DCUI from the host's console (Troubleshooting options).
Now you need to remove the files that fill up the ramdisk, but first make sure that SNMP is actually the cause of the issue by checking the file count in /var/spool/snmp with "ls /var/spool/snmp | wc -l"; if the result is above 2000 files, SNMP is most likely the cause.
To remove the files you can go two ways: move to the /var/spool/snmp directory and remove all .TRP files with "for i in $(ls | grep trp); do rm -f $i; done", but I also found that disabling the SNMP service with "esxcli system snmp set --enable false" clears the directory most of the time.
Once the files are removed the host will start responding normally again, and you will be able to start the vpxa service with "/etc/init.d/vpxa start" if you had to stop it previously.
To permanently fix this issue you need, as stated earlier, to change the scratch location, preferably to a local or shared datastore (VMFS or NFS). You can do this by editing the advanced settings (Software) of the host, "ScratchConfig -> ScratchConfig.ConfiguredLocation"; after changing it a reboot is required to apply the change.
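For a single host this can also be done from PowerCLI instead of the vSphere client. A minimal sketch, assuming a connected vCenter session; the host name and datastore path are placeholders, and the target folder must already exist on the datastore.

# Show the current scratch location of one host (host name is a placeholder)
$vmhost = Get-VMHost "esx01.local"
Get-AdvancedSetting -Entity $vmhost -Name "ScratchConfig.ConfiguredLocation"

# Point it to a folder on a persistent datastore (path is a placeholder, the folder must exist),
# then reboot the host to apply the change
Get-AdvancedSetting -Entity $vmhost -Name "ScratchConfig.ConfiguredLocation" |
    Set-AdvancedSetting -Value "/vmfs/volumes/datastore1/.locker-esx01" -Confirm:$false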
If you have to go through a number of hosts, you might want to do this using PowerCLI. If you're lucky and the naming convention of the (local) datastores is uniform, you will be able to automate all actions. If not (like in my particular case) you can either go host by host with the vSphere (web) client, or use a small script that looks up all datastores, lets you select the (local) datastore and updates the advanced setting for you. A sample script is below.
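The following is a sketch of how such a script could look rather than a polished tool: for each host in a cluster it lists the datastores the host can see, asks which one to use, creates a .locker folder on it and points ScratchConfig.ConfiguredLocation to that folder. "vcenter01.local" and "MyCluster" are placeholders, and the folder creation through the VimDatastore PSDrive is an assumption on my part, so test it in your own environment first.

# Connect to vCenter (server name is a placeholder)
Connect-VIServer -Server "vcenter01.local"

foreach ($vmhost in Get-Cluster "MyCluster" | Get-VMHost) {
    # List the datastores this host can see, with an index number to pick from
    $datastores = @($vmhost | Get-Datastore | Sort-Object Name)
    Write-Host ""
    Write-Host "Host: $($vmhost.Name)"
    for ($i = 0; $i -lt $datastores.Count; $i++) {
        Write-Host ("  [{0}] {1}" -f $i, $datastores[$i].Name)
    }

    # Let the user pick the datastore that will hold the scratch folder
    $choice = Read-Host "Select the datastore to use as scratch location for $($vmhost.Name)"
    $ds = $datastores[[int]$choice]

    # Create a per-host folder on the chosen datastore (the scratch location must exist)
    $folder = ".locker-" + $vmhost.Name.Split('.')[0]
    New-PSDrive -Name scratchds -PSProvider VimDatastore -Root "\" -Location $ds | Out-Null
    if (-not (Test-Path "scratchds:\$folder")) {
        New-Item "scratchds:\$folder" -ItemType Directory | Out-Null
    }
    Remove-PSDrive -Name scratchds

    # Update ScratchConfig.ConfiguredLocation; a reboot of the host is still required
    $scratchPath = "/vmfs/volumes/{0}/{1}" -f $ds.Name, $folder
    Get-AdvancedSetting -Entity $vmhost -Name "ScratchConfig.ConfiguredLocation" |
        Set-AdvancedSetting -Value $scratchPath -Confirm:$false
}

Note that if the datastore is shared between hosts, each host needs its own folder, hence the per-host .locker-<hostname> directory.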
VMware KBs used as reference: KB2001550, KB2040707, KB1010837