When resolving issues with on-premises Exchange, sometimes the issue lies directly within Exchange, and other times the root cause lies outside it. Depending on the exact nature of the case, we may have to investigate network switches, load balancers or storage. When Exchange is virtualized, the hypervisor and its configuration may also require attention.
This was the case with a recent customer engagement. Initially the focus was on Exchange, with symptoms including Exchange servers dropping out of the DAG, databases failing over and poor performance for users. As with most cases that get escalated to me, there is rarely only a single issue in play and multiple items have to be addressed. The customer was using ESX 5 update 1 as a hypervisor solution, and Exchange 2010 SP3. Exchange was deployed in a standard enterprise configuration with a DAG, a CAS array and a third party load balancer.
In this case, one of the biggest issues was the hypervisor discarding valid packets. Within this environment, an Exchange DAG server that was restarted had discarded roughly 35,000 packets during the restart. Exchange servers that had been running for a couple of days had discarded 500,000 packets. That's a whole lot of packets to lose. This was the cause of servers dropping out of the cluster and generating EventID 1135 errors. This issue is discussed in detail in this previous post, which also contains a PowerShell script that will easily retrieve the performance monitor counter from multiple servers. The script allows you to monitor and track the impact of the issue easily.
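Because the discard counter is cumulative since the last restart, the trend between readings tells you more than any single value. Below is a minimal sketch in Python, assuming two readings of the counter have already been collected, for example with the script mentioned above. The function name and the timestamps are hypothetical, and the two counts simply reuse the figures quoted above for scale; they are not a real series from one server.

```python
from datetime import datetime

def discards_per_hour(earlier, later):
    """Return the average discard rate between two (timestamp, count) readings."""
    (t0, c0), (t1, c1) = earlier, later
    hours = (t1 - t0).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("readings must be in chronological order")
    if c1 < c0:
        # The counter went backwards, so the server restarted in between
        raise ValueError("counter reset between readings")
    return (c1 - c0) / hours

# Hypothetical readings taken two days apart (illustrative values only)
first_reading = (datetime(2020, 1, 1, 9, 0), 35_000)
second_reading = (datetime(2020, 1, 3, 9, 0), 500_000)

rate = discards_per_hour(first_reading, second_reading)
print(f"{rate:.1f} packets discarded per hour")
```

A rate that stays flat at zero after a change is the evidence you want before declaring the issue fixed.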
Yay – we found the issue and all was well. Time to close the case? NO!
There were multiple other issues involved here, and not all of them were immediately obvious when troubleshooting, so I wanted to share these notes for awareness purposes. All software needs maintenance and Exchange itself is no exception. It is critical to keep code maintained with the vendors' updates. This ensures that you address known issues, and proactively maintain the system. As always, this must be tempered with adequately testing any update in your lab prior to deploying it in production.
This post is only to raise awareness of the below issues and is not intended to be negative to the hypervisor in question. As stated above Exchange, Windows and Hyper-V all require updates. Hyper-V experienced network connectivity issues previously and required an update.
Duplicate DAG IP
The customer reported that the DAG IP address was causing conflicts on the network. The typical cause for this is an administrator manually adding the DAG IP to one or more cluster nodes. The DAG IP can be bound to any node, and the cluster service performs the required steps. The administrator should only add it as a DAG IP address and do no more. In this case the DAG was correctly configured and the servers only had their unique host IP addresses assigned.
Initially there seemed to be a correlation with the duplicate DAG IP address and backups. However this was quickly discarded as the duplicate IP issue would only happen once every several weeks and could not be reproduced on demand by initiating a backup.
There is an issue documented in KB 1028373 – False duplicate IP address detected on Microsoft Windows Vista and later virtual machines on ESX/ESXi when using Cisco devices on the environment. This issue occurs when the Cisco switch has gratuitous ARPs enabled or the ArpProxySvc replied to all ARP requests incorrectly.
Large Guest OS Packet Loss
This was the initial issue discussed above and is covered here.
It is always prudent to keep working an issue until it is proven that the root cause has been addressed. In this case additional research was done to investigate networking issues on the hypervisor and the below links are included for reference.
Reporting On EventID 1135
The symptom of large guest OS packet loss can include servers being dropped from the cluster. When a node is removed from cluster membership, EventID 1135 is logged into the system event log.
To report on such errors, I wrote a script to enumerate the instances of this EventID. Please see this post for details on the script.
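As a rough illustration of this kind of report (not the script from the linked post), here is a minimal Python sketch that tallies EventID 1135 entries per logging server. It assumes the System event log has already been exported to CSV, and that the export has columns named MachineName and Id. Those column names, the function name and the sample rows are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

def count_1135_by_server(csv_text):
    """Count EventID 1135 rows per server in a CSV export of the System log."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Column names are assumptions; adjust them to match your export
        if row["Id"].strip() == "1135":
            counts[row["MachineName"].strip()] += 1
    return counts

# Made-up sample rows, not data from the customer environment
sample = """MachineName,Id,TimeCreated
EXCH01,1135,2020-01-01 02:14:07
EXCH02,1135,2020-01-01 02:14:09
EXCH01,7036,2020-01-01 02:15:00
EXCH01,1135,2020-01-02 11:40:31
"""

summary = count_1135_by_server(sample)
for server, total in summary.most_common():
    print(f"{server}: {total}")
```

Comparing these totals before and after a hypervisor update gives a simple way to confirm whether nodes are still being removed from cluster membership.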
KB 2055853 – VMXNET3 resets frequently when RSS is enabled in a Windows virtual machine
Disabling RSS within the guest OS is not ideal for high volume machines as this could lead to CPU contention on the first core. Please work to install the requisite update for the hypervisor.
Virtual NIC E1000E Potential Data Corruption Issue
KB 2058692 – Possible data corruption after a Windows 2012 virtual machine network transfer
Modern versions of Windows will typically not be using this virtual NIC – currently they will typically use VMXNet3. However be aware of the other issues on this page affecting VMXNet3 vNICs.
vShield Filter Driver
When installing the VMware Tools in ESXi 5, selecting the FULL installation option will also install the vShield filter driver. There is a known issue with this filter driver that is discussed in KB 2034490 – Windows network file copy performance after full ESXi 5 VMware Tools installation.
Starting with ESXi 5.0, VMware Tools ships with the vShield Endpoint filter driver. This driver is automatically loaded when VMware Tools is installed using the Full option, rather than the Typical default.
VMware And Windows NLB
I also saw this TechNet forum post describing an issue related to what was observed onsite. Servers would discard a very high number of packets, which would severely impact the application users were trying to access.
There are some important items to review when configuring NLB on VMware.
KB 1556 – Microsoft NLB not working properly in Unicast Mode
KB 1006558 – NLB on VMware: Example configuration of NLB Multicast Mode
KB 1006778 – NLB Unicast example (though VMware typically recommends multicast)
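When discussing unicast versus multicast with the network team, it helps to know which cluster MAC address they will see on the wire, for example when a static ARP entry is needed for the VIP. The Python sketch below derives it from the VIP using the default NLB scheme as I understand it: 02-BF plus the four VIP octets for unicast, 03-BF plus the four octets for multicast, and 01-00-5E-7F plus the last two octets for IGMP multicast. The function name and the sample VIP are hypothetical, and the result should always be checked against the value shown in the cluster properties in NLB Manager.

```python
import ipaddress

# Default prefixes placed in front of the four VIP octets, per operating mode
_PREFIXES = {"unicast": (0x02, 0xBF), "multicast": (0x03, 0xBF)}

def nlb_cluster_mac(vip, mode):
    """Return the default NLB cluster MAC for an IPv4 VIP and operating mode."""
    octets = ipaddress.IPv4Address(vip).packed  # the four address bytes
    if mode == "igmp":
        # IGMP multicast keeps only the last two octets of the VIP
        parts = (0x01, 0x00, 0x5E, 0x7F, octets[2], octets[3])
    elif mode in _PREFIXES:
        parts = _PREFIXES[mode] + tuple(octets)
    else:
        raise ValueError(f"unknown NLB mode: {mode!r}")
    return "-".join(f"{b:02X}" for b in parts)

# Hypothetical VIP used only for illustration
for mode in ("unicast", "multicast", "igmp"):
    print(mode, nlb_cluster_mac("10.0.0.50", mode))
```

Having the expected MAC to hand makes the conversation about switch flooding or static ARP entries far more concrete.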
It is critical to discuss the NLB implementation with the hypervisor team and also the network team. Be very specific with what is being implemented and what is expected of both of these teams. Some network teams do not like NLB unicast as it leads to switch flooding, whilst others do not appreciate having to load static ARP entries into routers to ensure remote users can access the NLB VIP. Cisco has Catalyst NLB documentation here. Avaya has some interesting documentation on this page.
For this and other reasons Exchange recommends the use of a third party load balancer. This could be a physical box in a rack or a VM which can run inside Hyper-V or ESX. Please consult with your load balancer vendor so they can best meet your business, technical and price requirements.