A couple of days ago I wrote about changing the scratch location on ESXi hosts in relation to the use of SNMP, in the post "Enabling SNMP on ESXI 5.1 host results in "The ramdisk 'root' is full" events".
In addition to this, the same customer ran into another issue which also relates back to having the scratch location on non-persistent storage.
For hardware maintenance, hosts were put into maintenance mode and "handed over" to the datacenter engineers, who needed to update the firmware and BIOS of these hosts. When they finished the first host, they powered it on and it booted ESXi as normal.
When they came by to check whether all was OK with this host, the VM admin looked it up and found it in a fully operational state, not in maintenance mode as expected! Looking at the host's tasks and events, it appeared that "system" had taken the host out of maintenance mode after it came back online in vCenter. This could cause some serious issues if the host is a member of a cluster with HA and DRS (fully automated) enabled but lacks the network uplinks that provide the VM network(s): VMs would be vMotioned to this host and would then lose their network connection!
For this to happen, the host has to be in an operational state before it reconnects to vCenter. So why did this host "forget" it was in maintenance mode during the hardware maintenance?
When I saw this happening I remembered that in the past, when patching ESXi 4.1 hosts, similar events happened because the scratch location (/tmp/scratch) got damaged or became inaccessible during the patching. When this happened, the host booted normally, reconnected to vCenter and was taken out of maintenance mode by the system account.
A quick check showed that the host we were working on had its scratch location on non-persistent storage. Next I checked how the datacenter engineers shut down or reboot an ESXi host. I learned that, as they don't have the rights to shut down or reboot a host through the vSphere Client (either connected directly to the host or connected to vCenter), they use the out-of-band management (iLO, DRAC, iRMC etc.). They always try to do a graceful shutdown or reboot first, but this requires agents to be installed on the host. As this is not the case, that does not work for them, and the other option is a hard reset or power cycle. Let's be clear: this is not a good thing. In my opinion you always need to do a clean and graceful shutdown or reboot! This is especially true if the host concerned has its scratch location on non-persistent storage (read: ramdisk), because for this type of storage a hard reset acts like a power failure: nothing gets written to a persistent location, as it would be during a clean reboot or shutdown.
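The "quick check" of the scratch location can be scripted if you have many hosts to go through. Below is a minimal sketch in Python, not a definitive tool. It assumes you have already collected each host's current scratch path (for example the value of the ScratchConfig.CurrentScratchLocation advanced setting) and it assumes that a path under /tmp, such as /tmp/scratch, means ramdisk while a path under /vmfs/volumes/ means persistent storage. The function name classify_scratch and the host names and paths in the usage part are made up for illustration.

```python
import posixpath

# Assumption: scratch paths under /tmp (e.g. /tmp/scratch) live on the ramdisk,
# and paths under /vmfs/volumes/ live on a datastore or other persistent volume.
RAMDISK_PREFIX = "/tmp"
PERSISTENT_PREFIX = "/vmfs/volumes"


def _is_under(path, prefix):
    """Return True if path equals prefix or is located below it."""
    return path == prefix or path.startswith(prefix + "/")


def classify_scratch(location):
    """Classify a scratch path as 'ramdisk', 'persistent' or 'unknown'.

    classify_scratch is a hypothetical helper, not part of any VMware tool.
    """
    stripped = location.strip()
    if not stripped:
        return "unknown"
    # Normalise trailing slashes and duplicate separators before comparing.
    path = posixpath.normpath(stripped)
    if _is_under(path, RAMDISK_PREFIX):
        return "ramdisk"
    # Require at least one path component below /vmfs/volumes.
    if path.startswith(PERSISTENT_PREFIX + "/"):
        return "persistent"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical host names and scratch paths, for illustration only.
    hosts = {
        "esx01": "/tmp/scratch",
        "esx02": "/vmfs/volumes/datastore1/.locker-esx02",
    }
    for name, location in sorted(hosts.items()):
        print(name, location, classify_scratch(location))
```

Any host that comes back as "ramdisk" is a candidate for moving the scratch location before it is handed over for hardware maintenance.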
VMware KB on changing the scratch location: KB1033696