12 August, 2013

vSphere 5.1 All Paths Down (APD) warning messages Part-1

As you might know, there have been some changes to APD and PDL behaviour in vSphere 5.1.
APD got a new way of handling I/Os during APD scenarios, and by setting an advanced option you can now even choose between the old way and the new "fast fail" way.
In short, the old way kept retrying failed I/Os for as long as the APD condition lasted. The new way retries for a period of 140 seconds (the default APD timeout) and then stops, fast-failing further I/Os to the affected storage. There are situations in which the old way could cause the host(s) to disconnect from vCenter or even become unresponsive, which is something you want to prevent. The new way prevents these issues.
If you want to know more about APD and PDL behaviour, you should read the articles on these subjects on Duncan Epping's blog Yellow Bricks, or to be more specific, start with this blog post.
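To make the advanced option mentioned above a bit more concrete, here is a minimal sketch in Python that only builds the esxcli command lines for switching between the two behaviours. It assumes the settings involved are Misc.APDHandlingEnable and Misc.APDTimeout, which is how I understand them to be named in vSphere 5.1. The helper name apd_commands is made up for this post, and nothing is executed against a host.

```python
# Advanced option paths as I understand them for vSphere 5.1 (assumption,
# please verify against your own hosts before using them).
APD_HANDLING_OPTION = "/Misc/APDHandlingEnable"
APD_TIMEOUT_OPTION = "/Misc/APDTimeout"

ESXCLI_SET = ["esxcli", "system", "settings", "advanced", "set"]


def apd_commands(fast_fail=True, timeout_seconds=140):
    """Return the esxcli command lines (as strings) for the chosen APD handling.

    fast_fail=True  -> new behaviour: retry until the timeout, then fast-fail.
    fast_fail=False -> old behaviour: keep retrying.
    This hypothetical helper only builds strings; it does not run anything.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be a positive number of seconds")

    commands = [
        ESXCLI_SET + ["-o", APD_HANDLING_OPTION, "-i", "1" if fast_fail else "0"]
    ]
    # The timeout only matters when the new APD handling is enabled.
    if fast_fail:
        commands.append(
            ESXCLI_SET + ["-o", APD_TIMEOUT_OPTION, "-i", str(timeout_seconds)]
        )
    return [" ".join(command) for command in commands]


if __name__ == "__main__":
    for line in apd_commands():
        print(line)
```

Printing the commands instead of running them is deliberate: you can review them first and then paste them into an SSH session on the host, or set the same options through the advanced settings in the vSphere Client.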

So why am I writing a post about this subject when there is already a lot of good information out there? A couple of days ago I was asked to troubleshoot APD warning events in a customer's vCenter log, and I found it very difficult to pinpoint the cause of these APD messages.

There were multiple factors that made the troubleshooting difficult, one being that the customer has a stretched metro-cluster setup and another being that some of the messages appeared at recurring times and the rest at "random" times.
The origin of the recurring APD messages was quickly found, as these only occurred when there were "extra" background processes running on the NetApp NAS heads. By background processes I mean processes like backup or de-duplication tasks. The time slot in which these APD events occurred made it a not so urgent issue. I will write another blog post on how the APD events related to the background processes were solved, as soon as everything is double-checked and confirmed by both VMware and NetApp.
On the other hand, the "random" APD events were a lot more difficult to pinpoint, and the issue was a lot more high-profile, as customers were complaining about slowness and unresponsiveness of their VMs and vApps during the APD events. The customer used HP blades and Flex-10 modules for connecting the enclosures to the core network and the NFS network. After troubleshooting and ruling out all possible enclosure and network related causes, only the NetApp NAS heads or the ESXi hosts could be the root cause of the APD events. These APD events occurred at random times, and most of these times the NetApp NAS heads didn't have any background processes running, nor did we find any information in the NetApp system logs pointing to the cause.
The last place to look was the ESXi hosts. First I checked all physical NICs (which are actually virtual NICs, as they are presented to the blade by the Flex-10 module): no issues there. Next I checked the network config of all hosts. Luckily, one host within an HA cluster assigned to a vCloud environment logged warning messages about a duplicate IP address being used on one of its VMkernel interfaces.
When I checked the network config of this host I saw nothing strange, so I started checking all other hosts within the same cluster, finding nothing... I continued by checking another HA cluster assigned to the same vCloud environment, and finally I found another host which had a VMkernel interface configured with the same IP address. Both VMkernel interfaces were used for NFS, but this IP address was not from the same subnet the NetApp NAS heads were in. They were in a separate subnet to which another NFS NAS was connected, and this NAS was used by only one of the two hosts. But on the host where it was not used, the interface was configured on the same dvSwitch as the NFS network to the NetApp NAS heads.
I updated the network config for the unused VMkernel interface and the "random" APD events disappeared. So I guess that an IP address conflict on an interface that is not used, but sits on the same dvSwitch as an interface that is being used for NFS, can cause APD events for multiple hosts and even multiple HA clusters. In fact, it even affected hosts outside the vCloud environment. The only thing they had in common was that they were all connected to the same NetApp NAS heads.
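Checking host after host by hand is slow, so here is a minimal sketch in Python of how the same check could be done once the VMkernel interface addresses of all hosts have been collected. The inventory layout, the host names and the IP addresses are made up for illustration (the addresses come from a documentation range), and how you collect the data from your hosts is left open.

```python
from collections import defaultdict


def find_duplicate_vmk_ips(inventory):
    """Return {ip: [(host, vmk), ...]} for every IP configured more than once.

    inventory is assumed to look like {host_name: {vmk_name: ip_address}}.
    """
    owners_by_ip = defaultdict(list)
    for host, vmks in inventory.items():
        for vmk, ip in vmks.items():
            owners_by_ip[ip].append((host, vmk))
    # Keep only the addresses that show up on more than one interface.
    return {
        ip: sorted(owners)
        for ip, owners in owners_by_ip.items()
        if len(owners) > 1
    }


if __name__ == "__main__":
    # Hypothetical example data: two hosts in different clusters share an
    # address on a VMkernel interface that only one of them actually uses.
    example_inventory = {
        "esx-a01": {"vmk0": "192.0.2.11", "vmk2": "192.0.2.50"},
        "esx-a02": {"vmk0": "192.0.2.12"},
        "esx-b01": {"vmk0": "192.0.2.21", "vmk2": "192.0.2.50"},
    }
    for ip, owners in find_duplicate_vmk_ips(example_inventory).items():
        print(ip, "is configured on", owners)
```

The important part is that the comparison runs across all hosts and all clusters at once, because in my case the conflicting interface lived in a different HA cluster than the host that logged the warning.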
