With QoS correctly configured, you no longer have to use a dedicated network just for
clustering and can take advantage of converged environments without sacrificing
performance.
I’ve mentioned several times that heartbeats are sent once a second and that if five
consecutive heartbeats are missed, the node is considered unavailable and removed
from the cluster, and any resources it owns are moved to other nodes in the cluster.
Remember that the goal of clustering is to make services as available as possible; a
failed node needs to be detected as quickly as possible so that its resources and
therefore its workloads are restarted on another node as quickly as possible. The
challenge here, though, is that if the networking is not as well architected as you
would like, 5 seconds could be just a network hiccup and not a host failure (which
with today’s server hardware is far less common, as most components are redundant
in a server and motherboards don’t catch fire frequently). The outage caused by
moving virtual machines to other nodes and then booting them (because remember,
the cluster considered the unresponsive node gone and so it could not live-migrate the
VMs) is far bigger than the few seconds of network hiccup. This is seen commonly in
Hyper-V environments, where networking is not always given the consideration it
deserves, which makes 5 seconds very aggressive.
The frequency of the heartbeat and the threshold for missed heartbeats are
configurable:
SameSubnetDelay: Frequency of heartbeats, 1 second by default with a maximum of 2 seconds
SameSubnetThreshold: Number of heartbeats that can be missed consecutively, 5 by
default with a maximum of 120
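Multiplying the delay by the threshold gives the length of silence after which a node is declared unavailable. The following is only a rough sketch of that arithmetic; the function name Get-DetectionWindowSeconds is hypothetical, invented for illustration, and the values are expressed in seconds as in the text:

```powershell
# Hypothetical helper for illustration only: seconds of silence before a node
# is considered unavailable, given the heartbeat delay and the threshold.
function Get-DetectionWindowSeconds {
    param(
        [double]$DelaySeconds,   # time between heartbeats
        [int]$Threshold          # consecutive missed heartbeats tolerated
    )
    return $DelaySeconds * $Threshold
}

Get-DetectionWindowSeconds -DelaySeconds 1 -Threshold 5     # defaults: 5
Get-DetectionWindowSeconds -DelaySeconds 2 -Threshold 120   # both maximums: 240
```

The gap between 5 seconds and 240 seconds shows how much these two settings together can change the time it takes the cluster to react.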
You should be careful when modifying the values. Generally, don’t change the delay of
the heartbeat. Only the threshold value should be modified, but realize that the
greater the threshold, the greater the tolerance to network hiccups but the longer it
will take to react to an actual problem. A good compromise threshold value is 10,
which a Hyper-V cluster applies automatically. As soon as a virtual machine role
is created on a cluster in Windows Server 2012 R2 or later, the cluster goes into a
relaxed threshold mode (instead of the normal Fast Failover mode); a node is considered
unavailable after 10 missed heartbeats instead of 5. The value can be viewed using
PowerShell:
(Get-Cluster).SameSubnetThreshold
10
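To see all of the heartbeat settings together, or to change the threshold by hand, something along the following lines should work. This is a sketch that assumes the FailoverClusters module is installed and that the commands are run on a cluster node. Note that the delay properties are stored in milliseconds, so a 1-second delay appears as 1000.

```powershell
# Assumes the FailoverClusters module and a session on a cluster node.
Import-Module FailoverClusters

# List every heartbeat delay and threshold property in one view.
Get-Cluster | Format-List *Subnet*

# Change only the threshold; the delay is left at its default.
(Get-Cluster).SameSubnetThreshold = 10
```

Setting the property directly on the cluster object follows the same pattern as the read shown above, which keeps the change easy to verify by reading the value back.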
Without any configuration, the Hyper-V cluster in Windows Server 2012 R2 will
automatically use the relaxed threshold mode, allowing greater tolerance to network
hiccups. If you have cluster nodes in different locations, and therefore different
subnets, there are separate values for the heartbeat delay, CrossSubnetDelay (with a
higher maximum of 4 seconds), and the threshold, CrossSubnetThreshold (same maximum of 120).
Once again, for Hyper-V, the CrossSubnetThreshold value is automatically tuned to 20
instead of the default 5. Note that the automatic relaxed threshold is only for Hyper-V