This is an interesting tale, and is something I have never seen done before in all my time working with Exchange. The client had managed to tweak the configuration of Exchange and modify the CAS configuration so significantly that Exchange was absolutely confused. Specifically, the CAS Array was modified so much that Autodiscover was no longer handing out the CASArray name to clients. This was only one of the issues; there were many others. I must give them full marks for creating an interesting case for me to look at! It also won me a free plane ticket to go and fix it as a critsit. Never a dull day in this job!
In case you were wondering about the title of this post, it relates to where I used to work 16 years ago. The protocol used to indicate that you were a numpty who had left your mobile phone at home was to send an email with the subject of "I am Zorg – Forgettor of Mobile Phones". That way everyone knew not to bother calling your mobile that day.
The tale begins with some Exchange servers, a firewall and some connectivity issues.
In this environment, all the Exchange servers initially had been deployed into a single AD site. This site also had dedicated domain controllers for the Exchange servers. Clients were on multiple remote subnets. This picture is like most large-scale enterprise deployments which have a consolidated administration model and distributed clients. So far so good. What was not so good was that their security team was now forcing deployment into tiered zones. This means that between Exchange and DCs there were now firewalls blocking network traffic. Between DCs in different subnets there were firewalls blocking network traffic. Between clients and DCs there were firewalls blocking network traffic.
This all sort of worked until a major network issue hit, and the network configuration was changed so that network connectivity was severely impacted and further restricted. Exchange at this point experienced multiple issues. The network team was not able to resolve this in a timely manner, so the decision was made to modify the security zones. To effect the proposed solution, the Exchange servers would change AD sites. Unfortunately, the Exchange move was not fully completed, and this initiates the tale of woe I will now bestow upon you….
AD Site Changes Issues
The belief was that Exchange now had the required access to AD after changing AD site membership. Further to this, the thought was that simply changing the IP addresses assigned to the Exchange servers would be enough, and all would be fine. This is not the case with the Exchange CAS role. Manual intervention is required to complete the AD site move. There were two distinct misses:
Not updating the Autodiscover AutoDiscoverSiteScope attribute
Not updating CASArray AD site assignment
When the Exchange CAS role is installed, it will write into AD the value of the current AD site. This value is not automatically updated if the Exchange server is then moved to a different AD site. One common scenario where this is noted is when installing Exchange into a separate install AD site so that clients do not contact the server until the CAS URLs and certificates have been updated. Once the CAS certificate and URL configuration has been updated, the server can be moved to the production AD site. This removes the risk of users getting certificate errors from the default self-signed certificates.
The Exchange 2010 CASArray is bound to a specific AD site. There can be only one CASArray per AD site. The Name, FQDN and AD Site are specified at creation time.
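Both misses can be spotted by comparing the AD site each server currently sits in against the values stored on the CAS objects. The below is a minimal sketch only, assuming an Exchange 2010 Management Shell session. The property selection is my illustration, not the exact commands run at the customer.

```powershell
# AD site that each Exchange server currently belongs to
Get-ExchangeServer | Select-Object Name, Site

# AD sites that each CAS server's Autodiscover SCP claims to serve
Get-ClientAccessServer | Select-Object Name, AutoDiscoverSiteScope

# AD site that each CASArray is bound to
Get-ClientAccessArray | Select-Object Name, Fqdn, Site
```

If the Site column from the first command does not appear in the output of the other two, the CAS configuration has most likely been left behind in the old AD site.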
Despite these two items not being updated, Exchange still continued to function. Outlook 2010 was deployed on the clients. Everything seemed fine until there was an outage at the DR location. DR for this customer was considered cold DR, and no active users or databases were to be running in the DR location. However, what happened was totally unexpected. Half of the 30,000 users immediately lost connectivity to Exchange when the DR site went offline. This was a little puzzling as all the databases were mounted in the production site and all production site CAS servers were running just fine. There were no DNS or load balancer issues either. Once the power outage was corrected in the DR site, the affected users immediately regained access to Exchange.
During this outage, there was a focus upon messaging services due to the volume of helpdesk tickets. It was noted that user satisfaction with the environment had previously started to wane, and users were creating many more helpdesk tickets. Incident management started to pay more attention to the tickets being created. It was noted that Exchange performance was severely impacted, and prior existing issues were brought up as people were now creating tickets to escalate all issues. The environment had been designed so that all users could run out of either datacenter and still afford to lose one mailbox server. This was not the observed behavior, and it was realized there was a serious performance issue. And before you ask, there was no monitoring in place. All of this was discovered in a reactive manner…
Exchange Performance Issues
All the production servers were running, and only half the users were connected. Yet Exchange and Outlook performance was impacted. Remember that the other half of the users went offline when the "cold" DR site became unavailable. In this environment Exchange was deployed on VMs. To troubleshoot the performance issues the Exchange VMs were moved from the shared production pool, to dedicated hypervisor hosts. That way there would be no contention for resources and therefore no performance issues, right?
Hypervisor Performance Issues
Well no. The exact opposite happened. The Exchange 2010 servers were always statically assigned 64 GB of RAM. This is great as it prevents issues with other applications or the hypervisor taking memory away from Exchange. However, the VMs were moved to a host which had 64 GB of RAM. Since we cannae * change the laws of physics, where exactly do you think the hypervisor code is going to run? Holy memory pressure, Batman!
In fact, this was the environment which prompted writing several scripts which were previously published:
As you can imagine, there were quite a few issues going on which necessitated the above automation.
The hypervisor had no memory to operate, which created tremendous packet loss. As a direct result, DAG Mailbox servers were dropping in and out of the cluster dozens of times a day since heartbeats were not received due to packet loss. The cluster heartbeat mechanism was operating as expected, as a Windows Cluster will only tolerate a certain amount of lost connectivity before it pronounces a node as unavailable. The hypervisor packet loss was caused by the customer's virtualization team. In the executive meeting to review the onsite visit, when explaining the 64 GB VM on a 64 GB host, the lead for the virtualization team did a facepalm, as one of their junior team members had made that change without consulting the team lead and had not fully considered all aspects.
Various issues have occurred with virtualized network cards. Reviewing the Packets Received Discarded performance counter immediately showed that there was tremendous packet loss experienced by the Exchange servers. Some servers had discarded 3.5 million valid packets in the week which they had been running.
Whilst the script is keyed to look for Exchange servers, it is trivial to modify the collection to be a list of servers or pulled in from a CSV file. We used the latter to review the stats on all the customer's virtualized servers. Most non-Exchange systems also had issues.
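The published script is not reproduced here. As a rough sketch of the same idea, the counter can be sampled with Get-Counter. The servers.csv file and its Name column are hypothetical and only there for illustration.

```powershell
# Hypothetical input: servers.csv with a "Name" column (not from the published script)
$servers = Import-Csv .\servers.csv | Select-Object -ExpandProperty Name

# Sample the discard counter for every network interface on each server
# and list the interfaces that have dropped packets, worst first
Get-Counter -ComputerName $servers -Counter '\Network Interface(*)\Packets Received Discarded' |
    Select-Object -ExpandProperty CounterSamples |
    Where-Object { $_.CookedValue -gt 0 } |
    Sort-Object CookedValue -Descending |
    Select-Object Path, CookedValue
```

A healthy server would normally be expected to show a value at or very near zero, so any large number here is worth a closer look.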
There was an update from the hypervisor vendor to help resolve some of the networking issues. To allow the hypervisor memory to function, the Exchange servers were moved back to the standard production pool after the below CPU issues were resolved. Moving the servers back to the regular production pool reduced the number of packets being discarded to a manageable level.
Performance issues are like fighting the hydra. There are multiple heads, and when one is removed there will be another to deal with.
After getting the packet loss to a manageable level, the next performance issue revealed was excessive CPU load on all the Exchange servers.
Exchange CPU Performance Issues
Exchange CPU usage was almost 50% higher than what the Exchange calculator indicated. The Exchange calculator was initially blamed. Poor calculator!
We checked the number of messages sent/received and average message size. The numbers which had been entered into the calculator did not match the reality of the environment. I am always suspicious when I see an average message size of 75 KB, as that is the calculator's default value and is most likely not correct. A separate customer sized their environment using this value and immediately ran into issues. When we verified what the average message size was over the last month, their result was 300 KB. A slight increase….
In this customer's case, they were off on the average message size. Incorrect data had been entered. Though that would not account for the 50% increase in CPU.
What had caused the performance degradation which had initially prompted the move to different hypervisor hosts though? The high CPU issue had started prior to the move to the new hosts. Unfortunately, the cause was one which we have seen before and greatly pains Exchange administrators and Microsoft support personnel. The file-system antivirus exclusions were not correctly configured. In fact, there was a huge misalignment with the exclusions. The issue was not with the 3rd party antivirus tool. The issues were due to the customer's antivirus team.
In this case the customer's Exchange team was aware of the requirement to exclude Exchange from being scanned by file-system antivirus. The paths to the Exchange databases and other content locations had been defined, but the customer's antivirus team then implemented a feature that made all that configuration irrelevant. The customer's antivirus team did not understand how their tool should be configured. There were two main issues:
Mount point exclusions
Low risk process scanning was enabled, which also enables high risk process scanning. Exclusions were not added to the correct process level. In effect, all the required file system antivirus exclusions were missing, and all Exchange content was being scanned by antivirus. This included all databases and transaction logs.
Mount points were not correctly excluded due to the customer's antivirus team again not understanding how their tool should be configured. This was rapidly addressed with a conference call to the antivirus vendor with all the relevant parties on the call. The support we received from the vendor was top-notch, and they quickly educated the client on what should be done within the antivirus tool.
The necessary configuration changes were made to the file system AV policies. Note that Microsoft states WHAT should be excluded, not HOW it should be implemented. This is because every antivirus product has its own implementation and recommended practices. That guidance must come from the 3rd party vendor, not from Microsoft.
Unfortunately, damage had been done, and the mailbox databases had been scanned by file system antivirus. Multiple changes could have been made to the databases outside of Exchange. In order to fully correct this, new mailbox databases must be created (with the required exclusions) and then all mailboxes moved to these new databases. This is a very time consuming task, and is something that could have been so easily avoided.
After correcting the antivirus configuration and restarting the servers, the CPU load dropped by 50%. Who knew?
Outlook Disconnected When DR Offline
Now that the CPU and networking issues had been mitigated, attention was then turned to why users were disconnected when the cold DR site went offline. As the name implies this was a cold DR location, and users should only be connected to that infrastructure after the decision had been taken to manually activate the DR location.
On one of the client machines in question it was noted that Outlook was indeed connecting directly to a DR server. We could see this easily in netstat using the required options. The TCP connections were made from the local Outlook process to that specific destination CAS server. Checking the Outlook profile showed a server FQDN as the RPC endpoint instead of the expected CASArray name.
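For anyone wanting to repeat the check, here is one way it could be done on a client. This is a sketch under assumptions: the -ano switches are my choice of netstat options, and the filtering logic is illustrative rather than what was typed on the day.

```powershell
# Find the Outlook process ID(s) on the client
$outlookPids = Get-Process -Name OUTLOOK -ErrorAction SilentlyContinue |
    Select-Object -ExpandProperty Id

# netstat -ano: all connections, numeric addresses, owning PID in the last column
# Keep only the established connections that are owned by Outlook
netstat -ano | Select-String 'ESTABLISHED' | Where-Object {
    $owner = ($_.Line.Trim() -split '\s+')[-1]
    $outlookPids -contains [int]$owner
}
```

The foreign address column then shows which servers Outlook is actually talking to, which can be compared against the load balancer VIP for the CASArray.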
Initially all of the CAS namespaces were checked. The OWA, EWS, ECP, EAS, OAB and Autodiscover namespaces were all set as expected. None of the CAS namespaces were set to server FQDNs. The CASArray name was also set as expected for both the name and FQDN attributes. However, the Test Email AutoConfiguration tool on the Outlook client also returned the server FQDN. Hmm, why are we not getting the expected CASArray name back as the RPC endpoint? We can check the Name and FQDN associated to the CASArray using the below PowerShell:
Get-ClientAccessArray | Select Name, FQDN
Why would Exchange be returning server FQDNs and not the expected CASArray value? Time to go back and pick through the CASArray output again. As opposed to just looking for the Name and FQDN attributes, the cmdlet was executed without any filtering. This provided the necessary data. The CASArray was listed with no members. Note that this is separate from the Load Balancer configuration. This is the logical configuration inside of Exchange. Exchange knows all of the CAS servers deployed into a given AD site, and automatically lists them as members of the CASArray. That was not the case in this scenario, and was the clue as to why the clients had been configured the way they had. It also explained the behavior noted when DR went offline.
Since there were no members present in the CASArray, Exchange was unable to pick servers out of the list of CASArray members to include as the RPC endpoint in Autodiscover responses. It appears that "out of site" CAS servers were then used. These were servers not listed as members of the CASArray, though they should have been. This list included the production and DR CAS servers, and the individual server FQDNs were included in the Autodiscover XML response. This is why Outlook was instructed to connect to the individual DR servers. Since only the DR server's FQDN was present, there was no high availability or failover. Outlook could resolve the FQDN to an IP address. This meant that the full Autodiscover process did not kick in. Even if it did, there was a very high chance the client would again be given a DR server FQDN. Please see this post for Autodiscover details.
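As a sketch of what "without any filtering" looks like in practice, the properties of interest can be listed like so. Selecting just these four is my shorthand; piping to Format-List with no property names shows everything.

```powershell
# Show the AD site binding and the automatically populated member list
Get-ClientAccessArray | Format-List Name, Fqdn, Site, Members
```

An empty Members value alongside a Site that no longer contains any CAS servers is the pattern to look for.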
To correct this issue, the CASArray AD site assignment had to be corrected. Once the correct AD subnets had been updated/confirmed as necessary the next stage was to update the Exchange CAS configuration. The AutoDiscoverSiteScope coverage was updated so that the relevant AD sites were now listed. The CASArray configuration was also corrected so that there was now a CASArray in the AD site the Exchange servers were in.
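The corrections map onto two cmdlets. The below is a hedged sketch: the server, array and AD site names are placeholders I have made up, and the real values depend entirely on the environment.

```powershell
# Placeholder names only: replace with the real AD site, CAS server and CASArray
$newSite = 'Production-Site'

# Update the AD site(s) that this CAS server's Autodiscover SCP covers
Set-ClientAccessServer -Identity 'CAS01' -AutoDiscoverSiteScope $newSite

# Re-bind the CASArray to the AD site the CAS servers now live in
Set-ClientAccessArray -Identity 'outlook.contoso.com' -Site $newSite
```

AutoDiscoverSiteScope is multi-valued, so a comma-separated list of site names can be supplied where one CAS server should cover several AD sites.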
AD DS replication was forced throughout the forest, and replication verified on all DCs. This was to ensure that Exchange would then pick up the change. On each Exchange server, the NetLogon service was restarted, followed by an IISReset to flush cached values. All Exchange servers were then restarted.
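The post-change steps could be scripted along these lines. This is a sketch only: the exact commands and switches used on site are not recorded here, so treat the options below as my assumptions.

```powershell
# From a domain controller: force replication of all partitions,
# across sites, pushing changes outward from this DC
repadmin /syncall /AdeP

# Summarize replication health to confirm there are no failures
repadmin /replsummary

# Then on each Exchange server: refresh site awareness and flush IIS caches
Restart-Service -Name Netlogon
iisreset /noforce
```

Restarting the servers afterwards, as was done here, removes any doubt about stale cached values.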
Upon server restart the expected values were now returned from Autodiscover. The RPC endpoint was now the CASArray for internal clients. Outlook should process the fact that there is an updated endpoint, and the Outlook profile should be updated. Note that this says 'should'; in some cases this may not happen, and either an Outlook profile repair or profile rebuild is required.
Almost out of the woods, but not yet. Correcting the CASArray exposed yet another issue….
Load Balancer – Not Able to Load Balance
After verifying that Autodiscover was indeed returning the CASArray name, I thought it was almost time to fly home. Alas no. There was yet another issue.
Now that the Outlook clients were connecting to the CASArray endpoint, this meant that they were using the load balancer VIP. While this is the expected and desired behavior, it was noticed that the CPU load was significantly higher on three of the CAS servers.
The AV configuration was checked – no unwanted changes had been made. All servers had the same correct configuration. What was driving the CPU load on these servers? In order to verify that the load was equally distributed, I wrote a quick PowerShell script to report on the active CAS connections. This was previously published to TechNet:
It was immediately apparent that the three servers with the high CPU load had the vast majority of the client connections. All other CAS servers had minimal connection counts.
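The published script is not reproduced here. A rough equivalent, assuming the MSExchange RpcClientAccess performance counters are present on the Exchange 2010 CAS servers, could look like this:

```powershell
# All CAS servers in the organization
$cas = Get-ClientAccessServer | Select-Object -ExpandProperty Name

# Number of users connected to the RPC Client Access service on each server
Get-Counter -ComputerName $cas -Counter '\MSExchange RpcClientAccess\User Count' |
    Select-Object -ExpandProperty CounterSamples |
    Sort-Object CookedValue -Descending |
    Select-Object Path, CookedValue
```

With a working load balancer the values should be roughly even across the pool. A handful of servers carrying most of the users points at the distribution mechanism rather than Exchange.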
Since it is the responsibility of the load balancer to balance the load (hence its name), it was time to call the network team. The weighting and configuration of the pool was checked. All the configuration was as expected, and none of the nodes were weighted differently.
The networking team wanted to know what Microsoft was doing to fix the issue. My reply was "nothing". Since the load balancer is responsible for the load distribution, please contact your vendor. We are more than happy to assist with the investigation, but the load balancer vendor must drive the issue. For reference, there were no clients clumped behind NAT devices, and all clients were routing to the load balancer on internal subnets. The load balancer was able to see all of the client IP addresses which were connecting. The source IP affinity should have been working, and the load balancer vendor needed to review why this was not the case.
Leaving on A Jet Plane
As you can imagine the above was a lot of work to investigate and correct. This was a full 5 days of work, where we would work from 09:00 until 21:00. There were still some other minor issues and items to address. Since they were not business impacting that work was scheduled for a separate engagement.
*– That is Scottish. Spellcheckers will not understand the implicit subtleties in the language. Also a nod to a well known "Scottish" Canadian, who will be born in Linlithgow in 2222.