clusters and not for any other type of workload. Remember also that Windows Server
2016 has site awareness and will utilize the site thresholds instead of subnet
thresholds, but by default the values are the same.
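To see which thresholds are in effect, the heartbeat settings can be read from the cluster object. The following is a minimal sketch, assuming it is run on a cluster node (or a management machine connected to the cluster) with the FailoverClusters module available. The values displayed depend on how the cluster has been configured.

```powershell
# Requires the FailoverClusters module (installed with the Failover Clustering tools).
Import-Module FailoverClusters

# Show the heartbeat thresholds (number of missed heartbeats tolerated)
# for same-subnet, cross-subnet, and cross-site communication.
Get-Cluster | Format-List -Property SameSubnetThreshold, CrossSubnetThreshold, CrossSiteThreshold
```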
Windows Server 2016 introduces additional resiliency options that address many of the
causes of service interruption related to the factors previously mentioned
(suboptimal cluster configurations). Previously, a transient failure in network or
storage connectivity would cause VMs to be moved or to crash, resulting in outages far
longer than the actual transient network or storage issue.
Compute resiliency is enabled by default and changes the behavior of nodes when they
lose connectivity to the rest of the cluster. Prior to Windows Server 2016, if a node lost
connectivity to other nodes in the cluster, it would become partitioned, lose quorum,
and no longer be able to host services. Any VMs running on the node would then be
restarted on other nodes in the cluster, causing an outage of many minutes,
whereas the actual network interruption may have been over in 15
seconds. In this scenario, clustering is causing a greater outage than the underlying
network issue. Compute resiliency enables the node to behave differently when it
loses connectivity to other nodes in the cluster, whether the cause is a networking
issue or even a crash of the cluster service.
Compute resiliency introduces two new cluster node states and a new VM state. The
new cluster node states are Isolated and Quarantined. Isolated reflects that the node
is no longer communicating with other nodes in the cluster and is no longer an active
member of the cluster; however, it can continue to host VM roles (this is the
important part). I cover the Quarantined state later in this section. The new VM
state is Unmonitored; it applies to VMs running on an Isolated node and reflects
that the VM is no longer being monitored by the cluster service.
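These states can be observed with the standard failover clustering cmdlets. The following is a minimal sketch, assuming it is run against a Windows Server 2016 cluster with the FailoverClusters module available. During an interruption, the states described above would be expected to appear in the State column, although the exact display can vary by tool and version.

```powershell
# Requires the FailoverClusters module (installed with the Failover Clustering tools).
Import-Module FailoverClusters

# List each cluster node with its current state.
# A node that has lost contact but is still hosting VMs is expected to show Isolated.
Get-ClusterNode | Format-Table -Property Name, State

# List clustered roles (including VM roles) with their owner node and current state.
Get-ClusterGroup | Format-Table -Property Name, OwnerNode, State
```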
While in the Isolated state, a cluster node will continue to run its virtual machines,
with the idea being that within a short time any transient problems such as a network
blip or cluster service crash will be resolved and the cluster node can rejoin the
cluster, all without any interruption to the running of the VM. By default, a node can
stay in the Isolated state for 4 minutes, after which time the VMs would be failed over
to another node, as the node itself would go to a down state and no longer be able to
host any cluster services.
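The 4-minute default is a cluster setting that can be inspected. The following is a minimal sketch, assuming the FailoverClusters module is available. It uses a wildcard rather than assuming exact property names, so it lists whatever resiliency-related properties the cluster exposes.

```powershell
# Requires the FailoverClusters module (installed with the Failover Clustering tools).
Import-Module FailoverClusters

# List the cluster-level properties whose names begin with "Resiliency".
# The wildcard avoids assuming the exact property names.
Get-Cluster | Format-List -Property Resiliency*
```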
The following two configurations are applicable to compute resiliency:
ResiliencyLevel Defines whether compute resiliency is always used (a value of 2,
which is the default) or is used only when the reason for the failure is
known, for example, if the cluster service crashes (a value of 1). This is
configured at the cluster level by using (Get-Cluster).ResiliencyLevel = <value>.
ResiliencyPeriod The number of seconds that a node can stay in the Isolated state.
A value of 0 configures the pre-2016 behavior of not going into Isolated state. The
default value is 240 (4 minutes), which is just less than the average time for a