At most enterprise customers there is a segregation of duties between multiple teams. This could be networking and desktop. Or Windows Server platform and messaging. It was the split in these roles, and especially a dearth of communication between them, which led to this tale of woe with TLS 1.2 and Exchange.
The reasons for moving to TLS 1.2 and avoiding SSL2, SSL3, TLS 1.0 and TLS 1.1 should be well understood by now. This applies to cloud services and also on-premises. TLS 1.3 has been ratified as a standard though is not supported on all fronts at this time. That will change.
In order to help with the security posture for their on-premises server assets, the security team rolled out an updated Schannel configuration which modified the TLS configuration on all of their Windows Server 2016 machines. Specifically they wanted to disable TLS 1.0 and TLS 1.1 from being used as either a client or server. In other words, completely disable TLS 1.0 and TLS 1.1 on all of those servers.
So the registry keys were rolled out and the fun ensued. The first issue was that the servers were not restarted at that time. In order to process Schannel configuration changes, the server must be restarted. Since this was not done, it made troubleshooting harder, as any issues related to the TLS change would only manifest themselves after a reboot. Typically the last action performed gets assigned the blame for causing the issue. Initially Windows updates were blamed as the server was restarted and the application failed on restart. On the Exchange servers, Exchange 2016 CU18 was blamed as the server was restarted after the CU was installed.
The customer had a documented maintenance process with detailed verification steps to ensure the server was healthy after an update was installed. As part of this, they run the Get-ServerHealth cmdlet to ensure that Managed Availability is OK.
Normally things are pretty good, but they were greeted with output like this:
As you can see in the repro in my lab, Managed Availability is not happy.
Note that we use the handy filtering as mentioned in this post.
Exchange CU18 "Broke" The Server
As noted above, the last action to be performed gets tarred as the root cause. At least initially. So much for being innocent until proven guilty...
Looking into the Managed Availability logs we can see that there are multiple probes which are failing.
As an example of one of these errors, this is one of the instances where ActiveSync was reported with an issue.
The error logged was: "The underlying connection was closed: An unexpected error occurred on a receive."
The detailed event message is shown below:
System.ApplicationException: The underlying connection was closed: An unexpected error occurred on a receive. at Microsoft.Exchange.Monitoring.ActiveMonitoring.ClientAccess.CafeLocalProbe.DoWork (CancellationToken cancellationToken) at System.Threading.Tasks.Task.Execute() --- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at
Nothing in the error message they initially looked at specifically noted that there was a TLS issue, just that connectivity was failing.
What was especially annoying whilst troubleshooting was that connections made on the server to OWA and EAC worked just fine in Internet Explorer. More on that later.
Blaming The Guilty
Tracking the true root cause was made easier with the customer's detailed change control records. I asked for the details and dates of the last changes on the Exchange environment. This search brought up the record of the TLS configuration change made two weeks prior.
Operating an enterprise IT environment is about three main things: People. Process. Technology.
It was the second item that helped to pinpoint the issue as we had the records for previous changes. After consulting with the various teams and reviewing the change details it became clear what had gone awry.
The Devil Is In The Protocol Details
In the previous change the customer's security team implemented the registry keys to disable TLS 1.0 and TLS 1.1 client and server protocols. They set the DisabledByDefault = 1 and Enabled = 0 values in the registry for the below keys:
These keys reflect the main OS level configuration for TLS and SSL on a Windows server. Note that this assumes the application uses Schannel and not other components which may be application specific.
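The key list itself is not reproduced in this text. As a rough sketch, assuming the standard Schannel layout under HKLM\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols, a change like this would touch four keys: Client and Server for each of TLS 1.0 and TLS 1.1. The helper name schannel_disable_entries is hypothetical, and the Python below only enumerates the values. It does not write anything to the registry.

```python
# Sketch only: enumerates the Schannel values that a "disable TLS 1.0/1.1"
# change would set. The base path is assumed to be the standard Schannel
# location; the helper name is hypothetical and nothing is written anywhere.
SCHANNEL_BASE = (
    r"HKLM\SYSTEM\CurrentControlSet\Control"
    r"\SecurityProviders\SCHANNEL\Protocols"
)


def schannel_disable_entries(protocols=("TLS 1.0", "TLS 1.1")):
    """Return (key_path, value_name, dword) tuples that disable each
    protocol for both the Client and the Server role."""
    entries = []
    for protocol in protocols:
        for role in ("Client", "Server"):
            key = "\\".join((SCHANNEL_BASE, protocol, role))
            entries.append((key, "DisabledByDefault", 1))
            entries.append((key, "Enabled", 0))
    return entries


if __name__ == "__main__":
    for key, name, value in schannel_disable_entries():
        print(f"{key}  {name} = {value}")
```

Listing the values this way makes it easy to compare what the change record says against what is actually present on a server.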
In the post to enable TLS 1.2 on Exchange servers, the Exchange team certainly does require the above Schannel keys to be set. That is not disputed (well maybe just a comment about the DisabledByDefault). But that alone is not enough. Since Exchange now relies on the .NET Framework, the article also states that .NET must be set to use TLS 1.2 and that is done by another registry setting. This critical step was missed...
From the Exchange team post:
Enable TLS 1.2 for .NET 4.x
This step is only required for Exchange Server 2013 or later installations where .NET 4.x is relied upon. The SystemDefaultTlsVersions registry value defines which security protocol version defaults will be used by .NET Framework 4.x. If the value is set to 1, then .NET Framework 4.x will inherit its defaults from the Windows Schannel DisabledByDefault registry values. If the value is undefined, it will behave as if the value is set to 0. By configuring .NET Framework 4.x to inherit its values from Schannel we gain the ability to use the latest versions of TLS supported by the OS, including TLS 1.2.
Resolving The Issue
Now that we know that some of the required TLS 1.2 keys were not set, we can either back out the previous TLS change or go forward and complete all required steps for TLS 1.2 in the environment. The latter was chosen as the ultimate goal was to disable legacy TLS versions.
The SystemDefaultTlsVersions value was set to 1 in both of these locations:
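The two locations are not reproduced in this text. As a hedged sketch, assuming they are the 64-bit and 32-bit (Wow6432Node) .NET Framework v4.0.30319 keys used in the Exchange team's guidance, a .reg fragment for the change could be rendered as below. The function name render_reg is hypothetical, and the paths should be checked against the Exchange team post before anything is imported.

```python
# Sketch only: renders a .reg fragment that sets SystemDefaultTlsVersions = 1.
# The two key paths are assumptions (64-bit and 32-bit .NET Framework 4.x);
# confirm them against the Exchange team's guidance before importing.
NET4_KEYS = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\.NETFramework\v4.0.30319",
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\Microsoft\.NETFramework\v4.0.30319",
)


def render_reg(keys=NET4_KEYS, value_name="SystemDefaultTlsVersions", dword=1):
    """Return the text of a .reg file that sets one DWORD under each key."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for key in keys:
        lines.append(f"[{key}]")
        lines.append(f'"{value_name}"=dword:{dword:08x}')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_reg())
```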
After the server was rebooted, Managed Availability was happy. No Unhealthy items were reported:
There was a comment above about IE continuing to work with no issues and that we would get back to that.
We saw different behaviour for IE and the automated Exchange probes. Why? Well, IE was picking up and using the Schannel settings for the OS. Unfortunately .NET was not configured to use the OS settings and this caused the mismatch. The OS was blocking legacy SSL and TLS protocols, but .NET was still attempting to use those protocols because its default settings did not permit it to use TLS 1.2.
Setting .NET Framework to inherit the OS settings overrode those defaults and allowed .NET to negotiate an allowed TLS version.
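The mismatch can be pictured with a toy model. This is not how Schannel or .NET are implemented. It only illustrates that a connection needs at least one protocol version that both the client library offers and the OS permits. The set of legacy .NET defaults used here is an assumption for illustration, as are all of the names.

```python
# Toy model, not real Schannel or .NET behaviour: a handshake can succeed only
# if the versions the client library offers overlap with what the OS permits.
PROTOCOL_ORDER = ["SSL 3.0", "TLS 1.0", "TLS 1.1", "TLS 1.2"]

# What the OS permits after TLS 1.0 and TLS 1.1 were disabled in Schannel.
OS_ALLOWED = {"TLS 1.2"}

# Assumption for illustration: legacy .NET defaults that exclude TLS 1.2.
DOTNET_LEGACY_DEFAULTS = {"TLS 1.0"}


def offered_versions(system_default_tls_versions=0):
    """With the flag set to 1, .NET inherits the OS settings; an undefined
    value behaves like 0, so .NET keeps its own defaults."""
    if system_default_tls_versions == 1:
        return set(OS_ALLOWED)
    return set(DOTNET_LEGACY_DEFAULTS)


def negotiate(offered, allowed):
    """Return the highest version common to both sets, or None."""
    common = offered & allowed
    if not common:
        return None
    return max(common, key=PROTOCOL_ORDER.index)


if __name__ == "__main__":
    print("before:", negotiate(offered_versions(0), OS_ALLOWED))
    print("after: ", negotiate(offered_versions(1), OS_ALLOWED))
```

In this model the probes fail before the change because the two sets share nothing, while IE is unaffected because it reads the OS settings directly.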