Global Update Manager is a feature introduced in Windows Server 2012 R2 for failover clustering, which lets you configure how the cluster database is updated.  The three configuration values for a failover cluster are as follows:

 

Value 0: All nodes in the cluster receive and process an update before the cluster commits the change to the database.  The cluster database is always consistent because it is only updated after all nodes have processed the update.  Database reads occur locally on the node.

Value 1: Only a majority of nodes in the cluster must receive and process an update before it is committed to the cluster database.  For a database read, the cluster compares the latest time stamp from a majority of the running nodes and uses the data with the latest time stamp.

Value 2: Only a majority of nodes in the cluster must receive and process an update before it is committed to the cluster database.  Database reads occur locally on the node, so the data may be stale compared to the latest data held by the majority of nodes.

 

By default, value 1 is set automatically for a Hyper-V failover cluster, which means the cluster operates in an asynchronous mode.  A write only needs to be committed to a majority of nodes before the cluster moves on.  When a read request comes in on a node, however, that node must verify with the other nodes that its data is accurate, because no single node is guaranteed to hold the most current data.  A read request therefore has to be confirmed against a majority of nodes in the cluster before it can be satisfied.
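
If you want to see which mode a cluster is currently running in, a quick way to check is below.  This is a minimal sketch that assumes the FailoverClusters PowerShell module, which ships with the Failover Clustering feature, is available on the node you run it from.

# Import the module if it is not already loaded
Import-Module FailoverClusters

# 0 = all-node write / local read, 1 = majority write / majority read, 2 = majority write / local read
(Get-Cluster).DatabaseReadWriteMode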

 

This seems like a perfectly reasonable option because the cluster does not have to wait for all nodes to be notified of and acknowledge the state change before it is ready to process the next transaction, which can help prevent delays.

 

The issue with this option is that when a request for information is made against a node in the cluster, that node has to communicate with a majority of the nodes in the cluster to get confirmation before it can respond to the request.  For ad hoc requests this is fine; however, when requests are constantly being put to the cluster, it creates a massive communication load.

 

So, what can possibly send enough requests for information that it actually affects the performance of the Hyper-V cluster??  The tools being used to manage and monitor it… Specifically SCOM and SCVMM!!

 


 

In larger Hyper-V clusters, for example those with more than 10 nodes, the load from the information requests that SCOM and SCVMM put on the cluster can seriously start to affect performance.  The reason for this is the default configuration of Global Update Manager.  With it in the default mode of “1”, every time a request is made for information on a resource group (such as a virtual machine), a majority of the nodes all have to be checked in order to verify that the most up-to-date information is being returned.  And while this wouldn’t be too much of an issue if requests weren’t being made constantly, it is a huge issue when requests are being made on a regular and frequent basis, which is exactly what SCVMM and SCOM do… and specifically, the Cluster management pack!!
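
To get a feel for the kind of query involved, the snippet below enumerates resource group states from PowerShell.  SCOM and SCVMM use their own interfaces rather than this cmdlet, so treat it purely as an illustration of the sort of state read that, in mode “1”, each has to be validated against a majority of nodes.

# List every resource group (including virtual machine groups) and its current state
Get-ClusterGroup | Select-Object Name, OwnerNode, State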

 

At this point, you may be wondering why this causes a performance issue, since it is just communication across the cluster to validate the state of the information.  The reason it causes performance problems is that there is a CPU overhead on each of the majority of nodes being queried to provide the relevant information, and if VMQ is in use, there is a huge CPU overhead on the logical processors that have been assigned to it.

 

During my performance investigations, I saw that the two logical processors allocated to VMQ for the live migration network were running permanently at anywhere between 70% and 95% usage on all nodes in the cluster, while all other logical processors looked fine.  Because it was only these logical processors that were so highly utilized, it told me there must be a lot of traffic going across the live migration network.  At this point, I didn’t know what this traffic was or what was causing it, but as chance would have it, all cluster nodes were put into maintenance mode in SCOM in order to patch them, and at that point CPU usage dropped massively!  Not only did the usage of the highly utilized logical processors reduce by over 90%, the load on all other logical processors dropped too.
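
For anyone wanting to make a similar check on their own hosts, something along these lines can be used.  This is only a rough sketch: the adapter name is purely illustrative and the sampling values are arbitrary.

# Show which logical processors the VMQ queues for an adapter are mapped to (adapter name is an example)
Get-NetAdapterVmqQueue -Name "LiveMigration-NIC"

# Sample per-logical-processor utilization as the hypervisor sees it
Get-Counter '\Hyper-V Hypervisor Logical Processor(*)\% Total Run Time' -SampleInterval 5 -MaxSamples 3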

 

So, I knew at this point that something configured in SCOM was causing the performance issues, but I had no idea what.  Was it something in the Hyper-V management pack, the Core OS management pack, the Cluster management pack, etc.??  I therefore decided to take the hosts out of maintenance mode and start removing management packs.  As soon as I removed the Cluster management pack, and that change was applied to the host monitoring agents, the CPU load dropped drastically again, just as it had when the agents were placed into maintenance mode.

 

Now, knowing it was the Cluster management pack, I wanted to figure out what in the management pack was causing this, and it was this investigation that led me to find monitors that query cluster resource group states every 15 minutes.  On the Hyper-V cluster I was working on, there were over 26 hosts and almost 400 virtual machines, which means almost 400 requests for state are made against the cluster every 15 minutes, and every request has to communicate with at least 14 nodes to verify that valid data is being returned.  This is what was causing all the pain!!
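
As a back-of-the-envelope illustration using those numbers (the figures are rough and only meant to show the scale of the cross-node chatter):

$vms      = 400                              # resource groups being monitored
$nodes    = 26                               # cluster nodes
$majority = [math]::Floor($nodes / 2) + 1    # 14 nodes consulted for each read in mode 1
$reads15  = $vms * $majority                 # roughly 5,600 cross-node checks every 15 minutes
$readsHr  = $reads15 * 4                     # roughly 22,400 cross-node checks per hour
"$reads15 checks per 15-minute cycle, $readsHr per hour"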

 


 

With the cause of the performance issue found, what could I do to fix it other than disabling cluster resource monitoring??  Change the mode of Global Update Manager!!  Realizing that the traffic load was ultimately the problem, and that it came from calls against the cluster for resource state, I found the TechNet article on Global Update Manager, https://technet.microsoft.com/en-us/library/dn265972.aspx#BKMK_GUM.  Seeing the line stating that the cluster is in mode “1” by default for Hyper-V, and what that meant, it was logical to think that changing the mode to “0” would resolve the issue, as communication across the cluster nodes would no longer be required for reads.
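
For reference, the change itself is a one-liner with the FailoverClusters PowerShell module.  A minimal sketch, run against the cluster you want to change, followed by a check of the new value:

# Switch Global Update Manager to mode 0 (all-node write, local read)
(Get-Cluster).DatabaseReadWriteMode = 0

# Confirm the new value
(Get-Cluster).DatabaseReadWriteMode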

 

And, VOILA!!  That was the fix… Changing the cluster DatabaseReadWriteMode to 0 instantly removed the traffic issue and therefore improved performance.  This was verified by Microsoft as well, which gave me even more confidence that it was the right solution.

 

During my investigations, I did find a few forum posts about changing the DatabaseReadWriteMode to improve performance, but there was no real detail behind why it helps.  So, I hope this post provides the detail around what this setting is and why it is important.

 

 

Happy virtualising!!

David