29 August, 2013

vSphere 5.1 All Paths Down (APD) warning messages Part-2

In my previous post I wrote about troubleshooting APD and finding the root cause. At that time the root cause turned out to be an IP address conflict that bound two different NFS networks to one VMkernel interface, causing random APD events on all hosts connected to either of the NFS networks.
Besides these random APD events, the same customer also had APD events that appeared at set times, mainly around the time the NetApp backups were running.
For these APD events it looked like the NAS head became overloaded when it had to run a backup task on top of the normal load. When we talked to the storage admins, they wanted to see if any other background process (like deduplication) could trigger the same issue.
When looking into this we found that deduplication could indeed trigger APD events on vSphere and write "NFS slow" events in the NetApp logs. As we were not sure why this happened, a support case was opened with NetApp. Within this case all known performance counters were looked at, and from the perfstat captures they could tell there were misaligned VMs. The only way to measure the effect of misaligned I/O on your NetApp is by looking at the pw.over_limit counter. This counter is only available in the priv set advanced command-line mode.
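If you want to check this counter yourself, you can also pull it over SSH instead of a console session. Below is a minimal sketch, assuming a 7-mode filer with SSH enabled and assuming the counter appears in the output of the advanced-privilege command wafl_susp -w (verify both against your Data ONTAP version); the hostname and credentials are placeholders:

```python
# Sketch: read the pw.over_limit counter from a 7-mode filer over SSH.
# Assumptions: paramiko is installed, SSH is enabled on the filer, and the
# counter appears in 'wafl_susp -w' output -- verify for your ONTAP version.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("filer-a.example.com", username="root", password="secret")

# pw.over_limit is only visible in advanced privilege mode, hence 'priv set'.
stdin, stdout, stderr = client.exec_command("priv set -q advanced; wafl_susp -w")
for line in stdout:
    if "pw.over_limit" in line:
        print(line.strip())

client.close()
```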
So we ran the Scan Manager from the NetApp plugin for vCenter to see how many VMs were misaligned, and we found there were a lot, due to the vCloud environment containing many misaligned base VMs with multiple linked clones (which are then automatically misaligned as well).
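The arithmetic behind misalignment is simple: WAFL stores data in 4 KB blocks, so a guest partition that does not start on a 4 KB boundary makes every guest I/O straddle two WAFL blocks, roughly doubling the work on the filer. A minimal check, given a partition start offset in 512-byte sectors (as reported by fdisk inside the guest):

```python
# Sketch: check whether a guest partition start offset is aligned to the
# 4 KB WAFL block size. Offsets are in 512-byte sectors, as fdisk reports them.
SECTOR_SIZE = 512
WAFL_BLOCK = 4096

def is_aligned(start_sector):
    """True if the partition starts on a 4 KB boundary."""
    return (start_sector * SECTOR_SIZE) % WAFL_BLOCK == 0

# Classic cases: old Windows/DOS-style partitions start at sector 63
# (misaligned); modern OSes start at sector 2048 (aligned).
print(is_aligned(63))    # False -> every guest I/O straddles two WAFL blocks
print(is_aligned(2048))  # True
```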
During the search one of the storage admins found a relation between a deduplication task on NAS head "A" in Datacenter A and load in both datacenters: the task caused load not only on NAS head "A" in Datacenter A but also on NAS head "B" in Datacenter B. This was caused by a feature called "alternated write", which, combined with the storage design in use, had a negative effect on the load.
All these factors turned a more than capable storage system into a stressed-out, overloaded one. Like they say, "the devil is in the details".
As you can imagine, the vSphere environment suffered massively from these storage performance issues. This specific customer had Enterprise+ licenses and had SIOC (Storage I/O Control) enabled on all datastores, but even with SIOC they still experienced unresponsive and crashing VMs.
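That is no surprise in hindsight: SIOC throttles host device queues once datastore latency crosses the congestion threshold, which helps against a noisy neighbour but cannot save you when the array itself is overloaded. For completeness, here is a minimal pyVmomi sketch for enabling SIOC on a datastore programmatically; the vCenter address, credentials and datastore name are placeholders, and you should verify the IORM spec fields against your vSphere API version:

```python
# Sketch: enable Storage I/O Control (SIOC) on one datastore via pyVmomi.
# Placeholder connection details; lab-style SSL handling.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skips certificate checks
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

# Locate the datastore by name.
view = content.viewManager.CreateContainerView(content.rootFolder, [vim.Datastore], True)
ds = next(d for d in view.view if d.name == "nfs_datastore_01")
view.Destroy()

# Enable SIOC; congestionThreshold is in milliseconds (30 ms was the 5.1 default).
spec = vim.StorageResourceManager.IORMConfigSpec(enabled=True, congestionThreshold=30)
task = content.storageResourceManager.ConfigureDatastoreIORM_Task(datastore=ds, spec=spec)

Disconnect(si)
```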

12 August, 2013

vSphere 5.1 All Paths Down (APD) warning messages Part-1

As you might know, there have been some changes to APD and PDL behaviour in vSphere version 5.1.
APD got a new way of handling I/Os during APD scenarios, and by setting an advanced option you can now even choose between the old behaviour and the new "fast fail" behaviour.
In short, the old behaviour attempted to retry failed I/Os for a period of 140 seconds and then stopped; the new behaviour is to fail all I/Os immediately. In some situations the old behaviour could cause the host(s) to disconnect from vCenter or even become unresponsive, which is something you want to prevent, and the new behaviour prevents exactly these issues.
If you want to know more about APD and PDL behaviour you should read the different articles on these subjects on Duncan Epping's blog Yellow Bricks, or to be more specific, start with this blog post.
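For reference, the knobs in question on 5.1 are the host advanced settings Misc.APDHandlingEnable (1 = the new fast-fail behaviour, 0 = the old retry behaviour) and Misc.APDTimeout (140 seconds by default). A minimal sketch for setting them across all hosts with pyVmomi follows; the vCenter address and credentials are placeholders, and the same change can be made per host with esxcli:

```python
# Sketch: set the vSphere 5.1 APD advanced options on all hosts via pyVmomi.
# Shell equivalent: esxcli system settings advanced set -o /Misc/APDHandlingEnable -i 1
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skips certificate checks
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)
for host in view.view:
    # Both options are integers; some builds expect a long-typed value here.
    host.configManager.advancedOption.UpdateOptions(changedValue=[
        vim.option.OptionValue(key="Misc.APDHandlingEnable", value=1),  # 1 = fast fail
        vim.option.OptionValue(key="Misc.APDTimeout", value=140),       # seconds (default)
    ])
view.Destroy()
Disconnect(si)
```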

So why am I writing a post about this subject when there is already a lot of good information out there? A couple of days ago I was asked to troubleshoot APD warning events in the vCenter log of a customer, and I found it very difficult to pinpoint the cause of these APD messages.

There were multiple factors that made the troubleshooting difficult: one being that the customer has a stretched metro-cluster setup, another being that part of the messages would appear at recurring times and the other part at "random" times.
The origin of the APD messages at recurring times was quickly found, as these would only occur when "extra" background processes were running on the NetApp NAS heads. With background processes I mean processes like backup or deduplication tasks. The timeslot in which these APD events occurred made it a not-so-urgent issue. On how the APD events related to the background processes were solved, I will write another blog post as soon as everything is double-checked and confirmed by both VMware and NetApp.
The "random" APD events, on the other hand, were a lot more difficult to pinpoint, and the issue was far more high-profile, as customers were complaining about slowness and unresponsiveness of their VMs and vApps during the APD events. The customer used HP blades and Flex 10 modules to connect the enclosures to the core network and the NFS network. After troubleshooting and ruling out all enclosure- and network-related possible causes, only the NetApp NAS heads or the ESXi hosts could be the root cause of the APD events. These APD events occurred at random times, and at most of these times the NetApp NAS heads didn't have any background processes running, nor did we find any information in the NetApp system logs pointing to the cause.

The last place to look was the ESXi hosts. First I checked all physical NICs (which are actually virtual NICs, as they are presented to the blade by the Flex 10 module); no issues there. Next I checked the network config of all hosts, and luckily one host within an HA cluster assigned to a vCloud environment wrote warning messages about a duplicate IP address being used on one of its VMkernel interfaces.
When I checked the network config of this host I saw nothing strange, so I started checking all other hosts within the same cluster, finding nothing... I continued with another HA cluster assigned to the same vCloud environment, and there I finally found another host with a VMkernel interface configured with the same IP address. Both VMkernel interfaces were used for NFS, but the conflicting IP address was not in the subnet the NetApp NAS heads were in. It belonged to a separate subnet connecting another NFS NAS, which was used by only one of the two hosts; on the host where it was not used, however, it was still configured on the same dvSwitch as the NFS network to the NetApp NAS heads.
I updated the network config for the unused VMkernel interface and the "random" APD events disappeared. So I guess an IP address conflict on an unused interface that shares a (d)vSwitch with an interface that is being used for NFS can cause APD events for multiple hosts and even multiple HA clusters. In fact it even affected hosts outside the vCloud environment; the only thing they all had in common was that they were connected to the same NetApp NAS heads.
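Had I known where to look up front, a scripted sweep would have found this conflict in minutes. Here is a minimal pyVmomi sketch that lists every VMkernel IP address in a vCenter and flags duplicates; the connection details are placeholders:

```python
# Sketch: sweep all hosts in a vCenter for duplicate VMkernel IP addresses.
# Requires pyVmomi (pip install pyvmomi); placeholder connection details.
import ssl
from collections import defaultdict
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only: skips certificate checks
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

view = content.viewManager.CreateContainerView(content.rootFolder, [vim.HostSystem], True)

seen = defaultdict(list)  # ip -> [(host name, vmk device), ...]
for host in view.view:
    for vnic in host.config.network.vnic:
        seen[vnic.spec.ip.ipAddress].append((host.name, vnic.device))

for ip, users in seen.items():
    if len(users) > 1:
        print("Conflict: %s is configured on %s" % (ip, users))

view.Destroy()
Disconnect(si)
```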