Traditionally, a physical location was the boundary for a cluster for the following
reasons:
- Cluster applications historically required access to shared storage, typically provided by a SAN and connected using technologies such as iSCSI and Fibre Channel. Making that shared storage available to a remote site was usually not practical because of the latency introduced by accessing storage remotely, and having one site dependent on storage at another site defeated the point of a multisite cluster, which was to protect against a site failure. The alternative was SAN-level replication, which historically was either unavailable or prohibitively expensive. However, Windows Server 2016 does offer an in-box Storage Replica capability in the Datacenter SKU that can help offset this limitation (a configuration sketch follows this list).
- Nodes in a cluster required a high-quality, low-latency connection between them. This network was used for heartbeats between nodes to ensure that all nodes were healthy and available.
- Cluster resources required an IP address that could not change between locations. Most multisite environments use different IP networks at each location, which meant that to use clustering across sites, complex VLAN configurations and geographically stretched networks were required.
- Clusters used a special quorum disk that provided the foundation for protection against partitioning. This quorum disk always had to be available, which typically meant that it was located in a single physical location.
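The Storage Replica capability mentioned in the first point can be set up with a handful of PowerShell cmdlets once the feature is installed on both servers. The following is a minimal sketch only: the server names, replication group names, drive letters, and the choice of asynchronous mode are assumptions for illustration, and a real deployment needs appropriately sized log volumes and, for synchronous replication, a low round-trip latency between sites.

# Minimal Storage Replica sketch (Windows Server 2016 Datacenter).
# Server names, group names, and drive letters are placeholders.
New-SRPartnership `
    -SourceComputerName "SiteA-FS01" -SourceRGName "RG-SiteA" `
    -SourceVolumeName "D:" -SourceLogVolumeName "L:" `
    -DestinationComputerName "SiteB-FS01" -DestinationRGName "RG-SiteB" `
    -DestinationVolumeName "D:" -DestinationLogVolumeName "L:" `
    -ReplicationMode Asynchronous   # asynchronous suits higher-latency intersite links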
Windows Server 2008 and a shift in many datacenter applications removed these barriers to enabling multisite clusters. Key datacenter applications, such as SQL Server and Exchange, introduced options that did not require shared storage and instead leveraged their own data-replication technologies. Failover Clustering introduced changes that enabled multiple IP addresses to be allocated to a resource, with the address appropriate to the site where the resource was active being the one brought online. Failover Clustering also enabled more flexible heartbeat configurations that tolerated higher-latency networks; in addition, the reliance on a quorum disk was removed, with additional quorum models offered based on the number of nodes and even a file share located at a third site. Being able to run clusters across multiple locations without shared storage enables certain disaster-recovery options that will be discussed.
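As a rough illustration of these capabilities, the following PowerShell sketch relaxes the cross-subnet heartbeat settings, gives a clustered name an "or" dependency on one IP address resource per site, and replaces the quorum disk with a file share witness at a third site. The values, resource names, addresses, and share path are examples only; suitable settings depend on the latency and reliability of the intersite link.

# Relax the heartbeat settings used between nodes in different subnets.
(Get-Cluster).CrossSubnetDelay     = 2000   # milliseconds between heartbeats across subnets
(Get-Cluster).CrossSubnetThreshold = 10     # missed heartbeats before a node is treated as down

# Make the network name depend on either site's IP address resource
# (the resource names shown here are examples).
Set-ClusterResourceDependency -Resource "Cluster Name" `
    -Dependency "[IP Address 10.0.1.50] or [IP Address 192.168.1.50]"

# Use a file share hosted at a third site as the quorum witness instead of a disk.
Set-ClusterQuorum -FileShareWitness "\\witness-server\ClusterWitness"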
When designing a disaster-recovery solution, there are typically many options available, each offering a different level of recoverability. The recovery point objective (RPO) is the point in time to which you must be able to recover in the event of a disaster; for example, an RPO of 30 minutes means that no more than 30 minutes of data should be lost. The recovery time objective (RTO) is how quickly you need to be up and running after a disaster; for example, an RTO of four hours means the systems must be available within four hours of the failure. It's important to understand the RPO and RTO requirements for your systems when designing your disaster-recovery solution, and different systems will likely have different requirements.
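As a simple illustration of how an RPO maps to replication settings, the following sketch checks whether an asynchronous replication schedule leaves enough headroom to meet a 30-minute RPO in the worst case. All of the figures are hypothetical; substitute the behavior of your own replication technology.

# Hypothetical figures for a worst-case data-loss estimate.
$rpoMinutes          = 30   # requirement: lose no more than 30 minutes of data
$replicationInterval = 15   # minutes between replication cycles
$worstCaseCycleTime  = 5    # minutes a cycle may take to complete

# Worst case: the disaster strikes just before the in-flight cycle finishes,
# so the surviving copy is one interval plus one cycle behind.
$worstCaseDataLoss = $replicationInterval + $worstCaseCycleTime
if ($worstCaseDataLoss -le $rpoMinutes) {
    "Worst-case loss of $worstCaseDataLoss minutes meets the $rpoMinutes-minute RPO."
} else {
    "Worst-case loss of $worstCaseDataLoss minutes exceeds the $rpoMinutes-minute RPO."
}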