Kris Waters, one of my great colleagues from the US, originally posted a really neat list of items which can mitigate issues found in a lot of large-scale Exchange deployments. Please take the time to review her post here. There are a lot of valuable pointers in her blog, so definitely check it out!
In addition there are a couple of other items listed below that you may also want to review.
As Kris states, please ensure that you carefully review and test any items mentioned here prior to placing them into production! As I like to say, some of these items follow the Captain Jack Sparrow words of wisdom – the pirate’s code is more what you’d call guidelines than actual rules… In other words, carefully consider each on its own merits and how it relates to your organisation!
Update 27-11-2013: Added Logging section
Update 8-1-2014: Added RPC Client Access detail to Logging section
Update 21-6-2014: Added additional cluster hotfix information
Cluster Hotfixes
This hotfix is strongly recommended for DAG servers, and has been for some time now. It resolves several issues in Windows 2008 R2 SP1. Exchange 2010 SP3 setup will prompt you to install this update if it is not installed. The prompt appears in GUI setup, and the same message is also displayed by command line setup.
Most Exchange admins will be aware of this issue, but what is sometimes then missed is the other base Cluster Hotfixes that are recommended by the cluster team. For example:
Recommended hotfixes for Windows Server 2008-based server clusters
Recommended hotfixes and updates for Windows Server 2008 R2 SP1 Failover Clusters
Recommended hotfixes and updates for Windows Server 2012-based failover clusters
Exchange 2010 is typically installed onto Windows 2008 R2 (at least that is what most of the customers I visit do), so let's look at the 2008 R2 cluster updates in detail. In the general section there is an “interesting” hotfix contained in KB 2524478: The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2. This causes the Windows Firewall to block traffic, and is something that you can find a previous post on here. On a recent case I also saw this change happen once the server had been running for a while. This customer had various network issues that seemed to exacerbate the problem.
Networking
Make sure that the network card drivers and firmware are at the correct build level. This can be a tricky one, as you do not necessarily want to install the latest available driver simply because it was just released. By carefully testing and evaluating releases you can determine the appropriate builds in conjunction with your hardware vendor of choice.
One other item that is now critical is the blade chassis. Make sure that its firmware and management components are also at the correct build level.
Networking – Sleepy NIC
There are also issues with NICs reverting to a power save state and dropping traffic.
Please see the original post here.
OS Updates
In addition to ensuring that the monthly security updates are installed we sometimes see issues with the following items and it can pay off to keep them in mind when troubleshooting:
- TCPIP.sys
- AFD.sys
- Ntoskrnl.exe
- Storport.sys
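When troubleshooting, it can help to record which builds of these files a server is actually running. The snippet below is only a minimal sketch: it assumes the files live in their usual locations under the Windows System32 folder and simply prints the product version of each one.

```powershell
# Assumed default locations, relative to %SystemRoot%\System32.
$files = 'drivers\tcpip.sys', 'drivers\afd.sys', 'ntoskrnl.exe', 'drivers\storport.sys'

foreach ($f in $files) {
    # Build the full path and read the version resource of the file.
    $item = Get-Item (Join-Path $env:SystemRoot "System32\$f")
    '{0,-14} {1}' -f $item.Name, $item.VersionInfo.ProductVersion
}
```

Run it on each DAG member and compare the output to spot servers that have drifted apart.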
An issue with modern Exchange servers installed on blades is that the blade can have 256GB of memory but only 146GB drives installed locally, so how should the pagefile be configured? The pagefile still needs to be RAM + 10 MB for performance and dump file reasons. Where can it be placed while preserving the ability to capture complete memory dumps?
Traditionally if you select the Complete memory dump option, you must have a paging file on the boot volume that is sufficient to hold all the physical RAM plus 1 megabyte (MB). That does not work in the scenario above!
In Windows Vista, in Windows 7, in Windows Server 2008, and in Windows Server 2008 R2, this paging file can be on a partition that differs from the partition on which the operating system is installed as discussed in Overview of memory dump file options. There is also another hotfix available that allows you to create a dump file even if you have no pagefile configured at all! No Exchange admin should be doing this as Exchange requires the pagefile configuration mentioned above!
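To put numbers on the blade example, here is a rough sketch of the RAM + 10 MB arithmetic. The function name Get-RequiredPagefileMB is made up for this illustration, and the commented WMI lines assume you run them locally on the server in question.

```powershell
# Hypothetical helper: Exchange 2010 pagefile guidance of physical RAM + 10 MB.
function Get-RequiredPagefileMB {
    param([Parameter(Mandatory = $true)][long]$PhysicalMemoryBytes)
    # Convert bytes to whole MB, then add the 10 MB from the guidance.
    [long][math]::Ceiling($PhysicalMemoryBytes / 1MB) + 10
}

# The blade from the example: 256GB of RAM, 146GB local drives.
$requiredMB = Get-RequiredPagefileMB -PhysicalMemoryBytes 256GB
$driveMB    = 146GB / 1MB

"Pagefile needed: $requiredMB MB; local drive size: $driveMB MB"
if ($requiredMB -gt $driveMB) {
    "The pagefile cannot fit on the local drive, so it has to live on another volume."
}

# On a live server the RAM figure could come from WMI instead:
# $ram = (Get-WmiObject Win32_ComputerSystem).TotalPhysicalMemory
# Get-RequiredPagefileMB -PhysicalMemoryBytes $ram
```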
One note on storport that needs to be called out. You must check with the storage vendor, especially when SAN storage is used, to ensure the storage vendor supports the version of storport. The last thing you want is to have performance issues, call them for support and be told that you are in an unsupported position. That will spoil your day in a hurry!
One interesting issue I did see was around very slow access to performance counters. The underlying issue was with the Remote Registry service as it was leaking resources. This is resolved with hotfix 2699780.
Large Memory issue
Windows 2008 R2 has a networking issue when a server has more than 32GB RAM. This is covered in KB 2634907.
.NET Update
Hotfix 2497453 is required to resolve an issue with the .NET Framework. This issue manifests itself when Exchange 2010 SP1 is installed due to the Free/Busy intercept mechanism which was introduced in Exchange 2010 SP1. This issue is discussed here.
Exchange Service Pack
Exchange 2010 SP3 should already be installed, or you should at least be in the planning stages to install it. Exchange 2010 SP2 will move out of support on the 8th of April 2014.
Exchange Logging
Note that in Exchange 2010 not all logging is enabled by default. So if an issue occurs you may need to enable logging and then wait for the issue to reoccur.
IMAP Logging
Note that the log location must be set first, and then the logging can be enabled.
Set-ImapSettings -Server Exch-1 -LogFileLocation D:\Logs\IMAP
Set-ImapSettings -Server Exch-1 -ProtocolLogEnabled $true
POP Logging
Note that the log location must be set first, and then the logging can be enabled.
Set-PopSettings -Server <servername> -LogFileLocation D:\Logs\POP
Set-PopSettings -Server <servername> -ProtocolLogEnabled $True
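Once both settings are in place it is worth reading them back. The sketch below uses the Exch-1 example server from above. The restart lines are an assumption on my part: my understanding is that the POP3 and IMAP4 services need to be restarted before the new logging settings take effect, and the commands assume you are running them on the server itself.

```powershell
# Confirm that logging is enabled and where the logs will be written.
Get-ImapSettings -Server Exch-1 | Format-List ProtocolLogEnabled, LogFileLocation
Get-PopSettings  -Server Exch-1 | Format-List ProtocolLogEnabled, LogFileLocation

# Assumption: restart the services so the change is picked up (run locally on Exch-1).
Restart-Service MSExchangeIMAP4
Restart-Service MSExchangePOP3
```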
SMTP Receive Connector Logging
Note that the format is server\connector name. In the below example Exch-1 is the server, and "Default Exch-1" is the default receive connector on that server.
Set-ReceiveConnector "EXCH-1\Default EXCH-1" -ProtocolLoggingLevel Verbose
If you wanted to process all receive connectors:
Get-ReceiveConnector | Set-ReceiveConnector -ProtocolLoggingLevel Verbose
SMTP Send Connector Logging
Set-SendConnector Interwebs -ProtocolLoggingLevel Verbose
If you wanted to process all send connectors:
Get-SendConnector | Set-SendConnector -ProtocolLoggingLevel Verbose
SMTP Implicit Intra-Organisation Send Connector Logging
Set-TransportServer Exch-1 -IntraOrgConnectorProtocolLoggingLevel Verbose
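After changing the logging levels, a quick read-back shows whether every connector picked up the change and where the protocol logs will land. This is just a sketch that reuses the Exch-1 example server name from above.

```powershell
# List the logging level of every receive and send connector.
Get-ReceiveConnector | Format-Table Identity, ProtocolLoggingLevel -AutoSize
Get-SendConnector    | Format-Table Identity, ProtocolLoggingLevel -AutoSize

# Show the log folders and the intra-organisation connector logging level.
Get-TransportServer Exch-1 |
    Format-List ReceiveProtocolLogPath, SendProtocolLogPath, IntraOrgConnectorProtocolLoggingLevel
```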
RPC Client Access Logging
By default, throttling logging is disabled for the RPC client access service. Therefore, you will not see throttling information in the RPC Client Access logs. To enable throttling logging, follow these steps:
- Open the following file in a text editor, such as Notepad: C:\Program Files\Microsoft\Exchange Server\V14\Bin\Microsoft.Exchange.RpcClientAccess.Service.exe.config
- In the file, locate the <add key="LoggingTag" value="ConnectDisconnect, Logon, Failures, ApplicationData, Warnings" /> section.
- Type Throttling in the comma-separated string. For example, type Throttling in the string that resembles the following: <add key="LoggingTag" value="ConnectDisconnect, Logon, Failures, ApplicationData, Warnings, Throttling" />.
- Save and then close the file.
- Restart the RPC Client Access service.
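If you have many CAS servers, the manual steps above could be scripted along these lines. This is only a sketch, not a supported tool: the function names Add-LoggingTag and Enable-RcaThrottlingLogging are invented for this example, the script takes a backup copy first, and the service name in the final comment (MSExchangeRPC) is my assumption for the RPC Client Access service.

```powershell
# Hypothetical helper: append a tag to the comma-separated LoggingTag value
# unless it is already present.
function Add-LoggingTag {
    param(
        [Parameter(Mandatory = $true)][string]$Value,
        [string]$Tag = 'Throttling'
    )
    $tags = @($Value -split ',' | ForEach-Object { $_.Trim() } | Where-Object { $_ })
    if ($tags -notcontains $Tag) { $tags += $Tag }
    $tags -join ', '
}

# Hypothetical helper: back up the config file, then update the LoggingTag key.
function Enable-RcaThrottlingLogging {
    param([Parameter(Mandatory = $true)][string]$ConfigPath)

    Copy-Item -Path $ConfigPath -Destination "$ConfigPath.bak" -Force

    $xml = New-Object System.Xml.XmlDocument
    $xml.PreserveWhitespace = $true          # keep the file layout as it was
    $xml.Load($ConfigPath)

    $node = $xml.SelectSingleNode("//add[@key='LoggingTag']")
    if ($null -eq $node) { throw "LoggingTag key not found in $ConfigPath" }

    $node.SetAttribute('value', (Add-LoggingTag -Value $node.GetAttribute('value')))
    $xml.Save($ConfigPath)
}

# Example usage (full path as given in the steps above):
# Enable-RcaThrottlingLogging -ConfigPath 'C:\Program Files\Microsoft\Exchange Server\V14\Bin\Microsoft.Exchange.RpcClientAccess.Service.exe.config'
# Restart-Service MSExchangeRPC   # assumed service name
```

Running it a second time leaves the file unchanged, since the tag is only added when it is missing.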
Cluster Log Wrap
Be aware of the wrapping issue with the Windows 2008/2008 R2 cluster log .ETL files.
While we do NOT support directly manipulating the DAG’s underlying cluster, it is very useful to look at the cluster logs if there is an issue. By ensuring that the cluster logs are sized correctly there is less risk of losing valuable troubleshooting data.
The default cluster log size is 100 MB. In the examples below the new size is indicated by XXX. Size this so you have sufficient cluster log data retention.
To modify using PowerShell
Set-ClusterLog -Size XXX
To modify using cluster.exe
Cluster.exe LOG /Size:XXX
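Before changing anything, you may want to see what the cluster is currently set to. A minimal sketch follows; the value of 400 is purely an illustrative number and not a recommendation, so pick a size that suits your own retention needs.

```powershell
Import-Module FailoverClusters

# Show the current cluster log size (in MB) and logging level.
Get-Cluster | Format-List ClusterLogSize, ClusterLogLevel

# Illustrative value only.
Set-ClusterLog -Size 400
```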
Cluster Log Generation
In the newer versions of Windows, the failover cluster human readable log is not present on disk and must be explicitly generated. This is different from Windows 2003 and 2000 Failover Clustering where the readable log file was present without any intervention.
This can be done via the command prompt or PowerShell.
Command Prompt
Cluster.exe LOG /GEN
Look for the log on each cluster member in the local C:\Windows\Cluster\Reports folder.
Sometimes you may want to look at logs individually, but typically the command will look like this to dump the cluster log from all nodes to a specified central directory so you do not have to manually pull them together:
Cluster.exe LOG /GEN /COPY:<Directory>
If you want to get the logs only for the last 90 minutes then you can add the SPAN parameter. The below example copies the logs from all servers to the C:\Clustlog folder on the local server executing the command:
Cluster.exe LOG /GEN /COPY:"C:\Clustlog" /SPAN:90
Additional information can be found on TechNet.
PowerShell
Get-ClusterLog
Typically the command will look like this to dump the cluster log from all nodes to a specified directory:
Get-ClusterLog -Destination <directory>
Additional information can be found on TechNet.
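The PowerShell cmdlet can also limit the log to a recent time window, much like the SPAN switch of cluster.exe. A short sketch, assuming C:\Clustlog is an acceptable folder on the server running the command:

```powershell
Import-Module FailoverClusters

# Make sure the destination folder exists.
New-Item -ItemType Directory -Path C:\Clustlog -Force | Out-Null

# Generate logs on all nodes for the last 90 minutes and copy them to C:\Clustlog.
Get-ClusterLog -Destination C:\Clustlog -TimeSpan 90
```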
Cluster Heartbeat
We do not recommend changing the default cluster inter and intra subnet heartbeat intervals as a means to resolve underlying network issues. The network issue should be fixed. Increasing the heartbeat settings merely masks the underlying issue.
To check what is currently set we can run:
Cluster.exe /cluster:<ClusterName> /prop
This will return the following entries:
CrossSubnetDelay 1000
CrossSubnetThreshold 5
SameSubnetDelay 1000
SameSubnetThreshold 5
Or in the land of PowerShell, we can use Get-Cluster to see the properties but make sure that the PowerShell module is loaded up first:
Import-Module FailoverClusters
Then we can run:
Get-Cluster | Format-List *
Note that there is an asterisk after the Format-List command. That always gets me!!
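To put the default values in perspective: a heartbeat is sent every Delay milliseconds, and a node is considered down once Threshold heartbeats in a row have been missed. The sketch below just does that multiplication. The function name Get-HeartbeatToleranceMs is made up for this example, and the commented lines show one way to narrow the Get-Cluster output down to the heartbeat properties.

```powershell
# Hypothetical helper: how long heartbeats can be missed before a node is considered down.
function Get-HeartbeatToleranceMs {
    param(
        [Parameter(Mandatory = $true)][int]$DelayMs,
        [Parameter(Mandatory = $true)][int]$Threshold
    )
    $DelayMs * $Threshold
}

# Defaults shown above: 1000 ms delay, threshold of 5.
Get-HeartbeatToleranceMs -DelayMs 1000 -Threshold 5

# Against a live cluster, only the heartbeat properties:
# Import-Module FailoverClusters
# Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold
```

With the defaults, the cluster tolerates roughly five seconds of missed heartbeats.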
Cluster Logging – RouteHistoryLength
If the cluster thresholds are increased, then you will also have to decide whether you want to change the RouteHistoryLength logging option in the cluster.
In Windows Server 2012 there is additional logging to the Cluster.log for heartbeat traffic when heartbeats are dropped. By default the RouteHistoryLength setting is set to 10, which is twice the default threshold of 5. If you increase the SameSubnetThreshold or CrossSubnetThreshold values, it is recommended to increase RouteHistoryLength to twice the new threshold value, so that there is sufficient logging should you ever need to troubleshoot dropped heartbeat packets. The example below sets it to 20, which would suit a threshold of 10:
(Get-Cluster).RouteHistoryLength = 20
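The "twice the threshold" rule is easy to work out in your head, but here is a small sketch in case you want to script it. Get-RecommendedRouteHistoryLength is a made-up function name, and taking the larger of the two thresholds is my own assumption for the case where they differ.

```powershell
# Hypothetical helper: twice the larger of the two heartbeat thresholds.
function Get-RecommendedRouteHistoryLength {
    param(
        [int]$SameSubnetThreshold  = 5,
        [int]$CrossSubnetThreshold = 5
    )
    2 * [math]::Max($SameSubnetThreshold, $CrossSubnetThreshold)
}

# Thresholds of 10 (same subnet) and 5 (cross subnet).
Get-RecommendedRouteHistoryLength -SameSubnetThreshold 10 -CrossSubnetThreshold 5

# Against a live Windows Server 2012 cluster:
# $c = Get-Cluster
# $c.RouteHistoryLength = Get-RecommendedRouteHistoryLength `
#     -SameSubnetThreshold $c.SameSubnetThreshold -CrossSubnetThreshold $c.CrossSubnetThreshold
```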
Office Filter Pack
Office 2010 Filter pack SP2 is available. This should be deployed with all available updates for the Filter Pack from Microsoft Update.
Throttling Policy
Ensure that your users receive the appropriate throttling policy, and the same applies to service accounts!
The default throttling policy should remain unchanged, and you should create new throttling policies for each of the groups of users you wish to have different settings.
One other item worth mentioning is that some of the throttling infrastructure changes have gone unnoticed. This is generally when I see folks running:
Set-Mailbox mailboxname -ThrottlingPolicy MyCustomPolicy
That’s great, but that is only a mailbox. What about machines that need to interact with Exchange, where the identity is a computer object rather than a mailbox?
To address this, the Get-ThrottlingPolicyAssociation and Set-ThrottlingPolicyAssociation cmdlets were added in Exchange 2010 SP1. Use the Set-ThrottlingPolicyAssociation cmdlet to associate a throttling policy with a specific object. The object can be a user with a mailbox, a user without a mailbox, a contact, or a computer account.
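As a sketch of what that looks like for a computer account: the policy name ServiceAccountPolicy and the account CONTOSO\AppServer01$ are hypothetical, and the exact identity format your environment accepts for computer objects is something to verify in a lab first.

```powershell
# Hypothetical names: adjust the policy and account to your environment.
# Create a separate policy rather than editing the default one.
New-ThrottlingPolicy -Name ServiceAccountPolicy

# Associate the policy with a computer account (single quotes protect the trailing $).
Set-ThrottlingPolicyAssociation -Identity 'CONTOSO\AppServer01$' -ThrottlingPolicy ServiceAccountPolicy

# Read the association back to confirm it was applied.
Get-ThrottlingPolicyAssociation -Identity 'CONTOSO\AppServer01$' | Format-List
```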
Change Mailbox Quarantine Duration
Update 9-4-2014: Please see this other post for a more detailed discussion on quarantine.
The default timeout for mailbox quarantine is 6 hours in Exchange 2010. In the scenario where an exec’s mailbox gets quarantined at 09:00 local time, it will stay quarantined until 15:00 unless you take action.
This may not be acceptable for some organisations, and the default value can be changed.
The settings for the number of failures that lead to quarantining a mailbox, and for the amount of time that a mailbox should stay quarantined, are stored in the MailboxQuarantineCrashThreshold and MailboxQuarantineDurationInSeconds registry values in the following subkey:
HKLM\SYSTEM\CurrentControlSet\Services\MSExchangeIS\<Server Name>\Private-{db guid}
The default values are three failures for MailboxQuarantineCrashThreshold and 21,600 seconds (six hours) for MailboxQuarantineDurationInSeconds.
KB 2603736 discusses the issue.
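If you do decide to change the duration, something along these lines could set the registry value. Treat it as a sketch only and test it in a lab first: the function names are invented, I am assuming the value is a DWORD and that the subkey name uses the database GUID without braces, and DB01 in the usage comment is a hypothetical database name.

```powershell
# Hypothetical helper: convert hours to the seconds the registry value expects.
function ConvertTo-QuarantineSeconds {
    param([Parameter(Mandatory = $true)][double]$Hours)
    [int]($Hours * 3600)
}

# Hypothetical helper: write MailboxQuarantineDurationInSeconds for one database.
function Set-QuarantineDuration {
    param(
        [Parameter(Mandatory = $true)][string]$ServerName,
        [Parameter(Mandatory = $true)][guid]$DatabaseGuid,
        [Parameter(Mandatory = $true)][double]$Hours
    )
    # Assumption: the subkey is named Private-<guid> with no braces.
    $key = "HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeIS\$ServerName\Private-$DatabaseGuid"
    New-ItemProperty -Path $key -Name MailboxQuarantineDurationInSeconds `
        -PropertyType DWord -Value (ConvertTo-QuarantineSeconds -Hours $Hours) -Force | Out-Null
}

# The default of six hours expressed in seconds.
ConvertTo-QuarantineSeconds -Hours 6

# Example usage on the mailbox server, from the Exchange Management Shell:
# Set-QuarantineDuration -ServerName $env:COMPUTERNAME -DatabaseGuid (Get-MailboxDatabase DB01).Guid -Hours 2
```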
SCOM Management Pack
I wish I had money for every time I asked if SCOM is monitoring Exchange and the reply is yes! Then we find out that the MP was imported and nothing else was done. That is not really what we need. Make sure all the events that you are interested in are actually monitored. Do not assume the default MP is all you need.
You will also find that overrides will be necessary to suppress items that are not relevant to your business. For example:
- POP and IMAP are disabled by default in Exchange 2007 and 2010, yet the MP seeks to monitor them by default.
- You may not have any Internet-accessible CAS servers, for an array of reasons, so external URLs may not be populated. Those external monitors will need to be overridden to disable them.
CAS Namespaces
Be sure to set the CAS URLs as per design and not overlook any.
Also ensure that when new Exchange servers are deployed that their URLs are changed immediately to the correct values and not left at the default ones.
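A quick way to spot a server that was left at its default URLs is to dump them all and compare against the design. The sketch below only reads values and covers the common virtual directories; add or remove services to match what your design actually uses.

```powershell
# Autodiscover SCP value for each CAS server.
Get-ClientAccessServer | Format-List Name, AutoDiscoverServiceInternalUri

# Internal and external URLs for the common virtual directories.
Get-OwaVirtualDirectory         | Format-List Server, InternalUrl, ExternalUrl
Get-EcpVirtualDirectory         | Format-List Server, InternalUrl, ExternalUrl
Get-WebServicesVirtualDirectory | Format-List Server, InternalUrl, ExternalUrl
Get-ActiveSyncVirtualDirectory  | Format-List Server, InternalUrl, ExternalUrl
Get-OabVirtualDirectory         | Format-List Server, InternalUrl, ExternalUrl
```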
Exchange OAB Configuration
Please review the OAB to ensure that legacy elements have been removed.
Outlook Configuration
Be sure to manage Outlook settings fully via GPO. You have AD and GPOs so use them to your full advantage! Do not let users create Personal Storage Tribbles (PSTs), lock those settings down!
Outlook Build Level
Ensure that Outlook is properly patched to mitigate any security issues, and also to pick up fixes for known issues! Neglecting client maintenance will lead to end user impact and should not be overlooked.
Mailbox Auditing
Understand that mailbox auditing is disabled for all mailboxes by default. If you need, or ever will need, the ability to audit activity against mailboxes then you must manually enable this *BEFORE* an incident ever happens. If you do not enable this, then there will be no audit data to review.
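Enabling it is a one-liner per mailbox. The sketch below assumes Exchange 2010 SP1 or later and uses a hypothetical mailbox called exec01. Think about storage and retention before enabling it for every mailbox in one go.

```powershell
# Enable auditing for a single (hypothetical) mailbox.
Set-Mailbox -Identity 'exec01' -AuditEnabled $true

# Or enable it for every mailbox in the organisation.
Get-Mailbox -ResultSize Unlimited | Set-Mailbox -AuditEnabled $true

# Confirm the setting and see how long audit entries are kept.
Get-Mailbox -Identity 'exec01' | Format-List Name, AuditEnabled, AuditLogAgeLimit
```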
Administrator Audit logging
Administrator audit logging is enabled by default. This now saves to an arbitration mailbox in Exchange 2010 SP1 and beyond. It will log all changes made to the environment. Get cmdlets are not saved in the audit log.
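It is worth confirming the configuration rather than assuming it, and trying a search so you know what to expect when you need it for real. A small sketch, assuming Exchange 2010 SP1 or later. The seven day window and the Set-Mailbox filter are just examples.

```powershell
# Check whether admin audit logging is on and which cmdlets are covered.
Get-AdminAuditLogConfig | Format-List AdminAuditLogEnabled, AdminAuditLogCmdlets, AdminAuditLogAgeLimit

# Example search: who ran Set-Mailbox in the last seven days?
Search-AdminAuditLog -Cmdlets Set-Mailbox -StartDate (Get-Date).AddDays(-7) -EndDate (Get-Date)
```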
Cheers,
Rhoderick