29 October, 2013

20th Belgium VMUG meeting (agenda updated !)

VMUG meetings are always interesting, and I think the Belgium VMUG has very good sessions scheduled for its 20th meeting.
Please check out the details below ! For those who are attending this VMUG, see you on December 5th !

If you are a VMUG member you can register here for this free meeting. You can also find the location details on that page.

Belgium VMUG 20th meeting agenda:

9:00   Erik Schils: Welcome
9:15   Chad Sakac: Software Defined Storage – State of the Art, and the Art of the Possible (EMC Gold Sponsor)
10:00  Lee Dilworth: Stretched Storage Clusters, Best Practices
       Seb Hakiel: User Case: A 4K VDI
11:00  Dheeraj Pandey: "Bringing Google-like Infrastructure to the Enterprise" (Nutanix Gold Sponsor)
11:45  Chris Van Den Abbeele: Turning the Tables on Cyber Attacks (Trend Micro Silver Sponsor)
       Gilles Chekroun: Application Centric Infrastructure: Redefine the Power of IT (Cisco Silver Sponsor)
13:45  Joe Baguley: "Thoughts from a CTO - Where's it all going?" (VMware Gold Sponsor)
14:30  Luc Dekens: Manage and Automate VMware View with PowerCLI
       Viktor van den Berg and Arnim van Lieshout: vCac
15:30  Vaughn Stewart: Accelerate vSphere with Flash Storage Technologies
       Bouke Groenescheij: Practical Vscsistats Result Analysis
16:15  Hans & Guests: Panel Discussion: "How Does the Future of Enterprise IT Impact Us?" (Pure Storage Silver Sponsor)
       Kris Vandermeulen: Everything you always wanted to know about Microsoft licensing on VMware but were afraid to ask
16:45  Cormac Hogan: vSAN
       Peter vandermeulen: User Case: Datacenters, Branch Office en Process Controle MIZ with VSA and How We Want to Integrate

24 October, 2013

VCAP5 exam experiences

I have been looking into getting VCAP5-DCA and VCAP5-DCD certified for some time now, and when I was making plans to attend VMworld 2013 in Barcelona the two came together. I was lucky enough to get some study time alongside my work for a couple of weeks just before VMworld.
For the DCA exam I thought I did not have to study that much, so I just brushed up on my vMA "skills" (I don't use it regularly) and on configuring SATP rules from the CLI.
On the other hand, I did study a fair amount for the DCD exam, as I knew it would be a real challenge.
I planned my DCA exam for Monday afternoon (Partner / TAM day), because in the morning I was still travelling to Barcelona. I planned my DCD exam for Tuesday morning, to be sure I could spend the rest of my time at VMworld attending sessions (and going to the various parties).
When I sat for my DCA exam I found it a very good exam; I would almost say I enjoyed taking it. I think most enterprise admins will feel the same, as it is within their comfort zone. Nevertheless it is a challenging exam in which your hands-on knowledge is really tested. I did manage to complete all questions / tasks within the given time. I expected not to receive the results immediately after completing the last question, as it could take up to 15 days before you got the results back by email. But VMware has recently changed this and they will now get you the results within 8 hours ! As I did not know this, I was surprised to find an email "Notice of VMware Technical Certification" the next morning, and the best part: I passed !
That day started very well, and with a boost in confidence I sat for my DCD exam. Now this is a whole different kind of exam, and I didn't like it as much as the DCA exam, for mainly two reasons. The first is that you need to answer the design questions in an environment somewhat similar to MS Visio, but it feels like it is in beta. It just isn't user friendly, especially if you need to make modifications along the way; I was struggling more with the user interface than with the actual content of the question itself.
The second reason is that a lot of questions have vague descriptions. I understand that when you are designing in real life there will also be points that are unclear or vague, but in that case you can ask for clarification. This is not an option during the exam, of course.
I have to say that the DCD exam is a difficult exam, in which time is your biggest enemy. It's just a huge number of questions that you need to answer, and the questions do cover every objective in the blueprint !
When I got to the last question I had 5 minutes left; luckily I was able to answer it in the blink of an eye. After answering the last question I had some exciting seconds before the results were shown: I passed this exam as well.
Needless to say that this has been the best VMworld ever for me !

Both exams are difficult and need some sort of exam strategy if you ask me, but both strategies are very different. If you are pursuing either or both of them, please take a look at my VCAP5-DCA and VCAP5-DCD study pages for useful links and tips.

11 October, 2013

vCenter 5.1 takes ESXi 5.1 host automatically out of maintenance mode

A couple of days ago I wrote about changing the scratch location on ESXi hosts in relation to the use of SNMP, in the post "Enabling SNMP on ESXi 5.1 host results in "The ramdisk 'root' is full" events".
In addition to this, the same customer ran into another issue which also relates back to having the scratch location on non-persistent storage.
For hardware maintenance, hosts were put into maintenance mode and "handed over" to the datacenter engineers, who needed to update the firmware and BIOS of these hosts. When they finished the first host, they powered it on and it booted ESXi as normal.
When they came by to find out if all was OK with this host, the VM admin looked up the host and found it in a fully operational state, and not in maintenance mode as expected ! Looking at the host's tasks and events, it looked like "system" had taken the host out of maintenance mode after it came back online in vCenter. This could cause some serious issues if the host is a member of a cluster with HA and DRS (fully automated) enabled but lacks the network uplinks that provide the VM network(s): VMs would be vMotioned to this host and would lose their network connection !
For this to happen, the host has to be in an operational state before it reconnects to vCenter. So why did this host "forget" it was in maintenance mode during the hardware maintenance ?
When I saw this happening I remembered that in the past similar events happened when patching ESXi 4.1 hosts, caused by the scratch location (/tmp/scratch) getting damaged or becoming inaccessible during patching. When this happened the host booted normally, reconnected to vCenter and got taken out of maintenance mode by the system account.
So a quick check showed that the host we were working on had its scratch location on non-persistent storage. Next I checked how the datacenter engineers shut down or reboot an ESXi host. I learned that, as they don't have the rights to shut down or reboot a host through the vSphere client (either connected directly to the host or connected to vCenter), they used the out-of-band management (iLO, DRAC, iRMC etc.). They always tried to do a graceful shutdown or reboot, but for this, agents need to be installed on the host. As this is not the case, this does not work for them, and the other option is a hard reset or power cycle. Let's be clear, this is not a good thing: in my opinion you always need to do a clean and graceful shutdown or reboot ! This is especially true if the host concerned has its scratch location on non-persistent storage (read: ramdisk), as for this type of storage a hard reset is the same as a power failure: nothing is written to a persistent location, as it would be during a clean reboot or shutdown.

VMware KB on changing the scratch location: KB1033696


07 October, 2013

Enabling SNMP on ESXi 5.1 host results in "The ramdisk 'root' is full" events

Recently I ran into an issue when doing some work on an ESXi 5.1 cluster where I needed to put the hosts in maintenance mode one by one. When I put the first host in maintenance mode I assumed that the host would be evacuated by migrating all VMs with vMotion, as Enterprise Plus licenses were in place. But when the progress bar hit 13% the vMotion process stopped with an error. The error referred to "ramdisk (root) is full". When I checked with the customer they told me that this started happening after they configured SNMP. They found that it would for some reason fill up the disk containing /var, and they also found that sometimes those hosts became unresponsive to SSH and / or DCUI.
After looking up the error message it quickly became clear what the relation was between SNMP and "ramdisk (root) is full": the SNMP service generated a .trp file for every SNMP trap sent. I believe this is not normal behavior. When I checked the functionality of SNMP with "esxcli system snmp test" it reported the error "Agent not responding, connect uds socket(/var/run/snmp.ctl) failed 2, err= No such file or directory", which supported my assumption (a successful test should result in "Comments: There is 1 target configured, send warmStart requested, test completed normally."). These files are stored in /var/spool/snmp, and this location was on non-persistent storage; in fact it was on a 4GB ramdisk. Please check VMware KB2042772 for details on the error related to the scratch location.
The SNMP service will write a maximum of 8191 .trp files. If the /var/spool/snmp location runs out of space before hitting this number, you will have a host which is no longer able to vMotion, and it can also become disconnected / unresponsive. In some cases you are not able to start DCUI, as there are no free inodes on the host. In this case connect to the host console (iLO, DRAC, ...) and make sure you can log in, then stop the vpxa service. This will free up an inode and you will be able to start DCUI from the host's console (Troubleshooting options).
Now you need to remove the files that fill up the ramdisk, but first be sure that SNMP is the cause of the issue by checking the file count in /var/spool/snmp with "ls /var/spool/snmp | wc -l". If the result is above 2000 files, SNMP is most likely the cause.
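The check above can also be sketched in Python, for example when the spool directory has been copied off the host or when Python is available on the host itself. This is only an illustration under assumptions: the function names are made up for this example, the ".trp" suffix is taken from the grep pattern used in this post, and unlike "ls | wc -l" it counts only trap files rather than every entry in the directory.

```python
import os

TRAP_SUFFIX = ".trp"      # assumed trap-file extension (matched case-insensitively)
SUSPECT_THRESHOLD = 2000  # rule of thumb from the text above


def count_trap_files(spool_dir):
    """Return the number of trap files directly inside spool_dir."""
    return sum(
        1 for name in os.listdir(spool_dir)
        if name.lower().endswith(TRAP_SUFFIX)
    )


def snmp_is_likely_cause(spool_dir, threshold=SUSPECT_THRESHOLD):
    """True when the trap-file count is above the threshold."""
    return count_trap_files(spool_dir) > threshold


if __name__ == "__main__":
    spool = "/var/spool/snmp"
    if os.path.isdir(spool):
        print(count_trap_files(spool), snmp_is_likely_cause(spool))
```

The threshold is a parameter because 2000 is a rule of thumb, not a hard limit.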
To remove the files you can go two ways: change to the /var/spool/snmp directory and remove all .trp files with "for i in $(ls | grep trp); do rm -f $i; done", but I also found that stopping the SNMP service with "esxcli system snmp set --enable false" also clears the directory most of the time.
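For comparison, here is a minimal Python sketch of the same cleanup. It is not the method from the VMware KB: the function name is hypothetical, and it is stricter than the shell loop because it only removes regular files whose name ends in ".trp" (the loop above matches "trp" anywhere in the name).

```python
import os


def remove_trap_files(spool_dir, suffix=".trp"):
    """Delete regular files in spool_dir whose name ends with suffix.

    Returns the number of files removed. Subdirectories and files
    with other names are left untouched.
    """
    removed = 0
    for name in os.listdir(spool_dir):
        path = os.path.join(spool_dir, name)
        # Match the suffix case-insensitively and skip anything
        # that is not a regular file.
        if name.lower().endswith(suffix) and os.path.isfile(path):
            os.remove(path)
            removed += 1
    return removed
```

Returning the count makes it easy to log how many files were cleared before starting vpxa again.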
When the files are removed the host will start responding normally again, and you will be able to start the vpxa service again with "/etc/init.d/vpxa start" if you had to stop it previously.
To permanently fix this issue you need, as stated earlier, to change the scratch location, preferably to a local or shared datastore (VMFS or NFS). You can do this by editing the advanced settings (software) of the host, "ScratchConfig -> ScratchConfig.ConfiguredLocation"; after changing it, a reboot is mandatory to apply the change.
If you have to go through a number of hosts, you might want to do this by using PowerCLI. If you're lucky and the naming convention of the (local) datastores is uniform, you will be able to automate all actions. If not (like in my particular case), you can either do it host by host with the vSphere (web) client, or use a small script that looks up all datastores, lets you select the (local) datastore and updates the advanced settings for you. Sample script below.

VMware KBs used as reference: KB2001550, KB2040707, KB1010837