Internet of Things Architecture

(Elliott) #1

Design for failure


The overall tactic can be further divided into more specific tactics that are
presented as design choices here. The first sub-tactic is 'Acquiring more
resources than needed and replace failed ones (DC A.8)'. By applying this tactic,
more resources are allocated for task execution than are normally required.
Besides the essential resources, spare resources are reserved that could execute
the same task as the essential ones but are kept on hold. This is a precaution:
if a resource fails during runtime, a spare resource can take over the task of
the one that failed. Resource in this sense includes all computational
resources, network resources and IoT Resources, meaning all FCs in the ARM. A
typical FG that implements resource reservation is Service Organisation, which
is responsible for allocating IoT Services to service requests (see Section
4.2.2). Applying this tactic requires more resources than are essentially
needed.
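
The spare-resource idea can be sketched as a small pool that keeps reserved
resources on hold and swaps one in when an active resource fails. This is only
an illustrative sketch: the class name ResourcePool, its methods and the
resource names are hypothetical and are not defined by the ARM.

```python
class ResourcePool:
    """Holds active resources plus reserved spares (hypothetical helper)."""

    def __init__(self, active, spares):
        self.active = list(active)   # resources currently executing tasks
        self.spares = list(spares)   # resources kept on hold

    def replace_failed(self, failed):
        """Swap a failed active resource for a spare and return the spare."""
        if failed not in self.active:
            raise ValueError(f"{failed!r} is not an active resource")
        if not self.spares:
            raise RuntimeError("no spare resource left")
        spare = self.spares.pop(0)
        # The spare takes over the slot (and thus the task) of the failed one.
        self.active[self.active.index(failed)] = spare
        return spare


# Two essential resources and one spare (names are made up for illustration).
pool = ResourcePool(active=["sensor-a", "sensor-b"], spares=["sensor-c"])
replacement = pool.replace_failed("sensor-a")
print(replacement, pool.active)
```

The cost of the tactic is visible in the constructor: the spare is allocated up
front even though it does no work until a failure occurs.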


Another approach is to aim at having 'No FC or centralised FCs (DC A.9)'. The
goal is to develop designs that avoid single points of failure, such as
centralised FCs or FCs with just one instance. If such a single FC fails, no
other instance is available to replace its functionality. By applying this
tactic, the system provides more than one instance of each FC so that its
functionality can still be assured in case one instance becomes unavailable.
For the Service Organisation FG, the decentralised Service Choreography FC can
be preferred over Service Orchestration, which requires a central orchestration
engine (see Section 4.2.2). The decentralised choreography approach reduces the
risk of a single point of failure.


To apply the design choice 'Treat Long Latency as potential failure (DC A.10)',
the system design provides an FC that treats any long latency as a potential
failure. For instance, the round-trip time of a request-response protocol is
measured and a deadline is set as the acceptable limit. Once the deadline has
passed, the system treats the behaviour as a potential failure and reacts in an
appropriate manner, e.g., by querying another instance of the same FC.
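
A minimal sketch of this behaviour, assuming FC instances can be modelled as
plain callables: the round-trip time of each call is measured, and a reply that
arrives after the deadline is discarded and the next instance is queried. The
function call_with_deadline and the two example instances are hypothetical.
Note that this simple version only measures latency after the fact and does not
abort a slow call while it is still running.

```python
import time


def call_with_deadline(instances, request, deadline_s):
    """Query instances in order; a late or failing reply counts as a potential failure."""
    for instance in instances:
        start = time.monotonic()
        try:
            result = instance(request)
        except Exception:
            continue  # outright failure: fall through to the next instance
        round_trip = time.monotonic() - start
        if round_trip <= deadline_s:
            return result
        # Reply arrived after the deadline: treat as potential failure, try next.
    raise TimeoutError("no instance answered within the deadline")


def slow_instance(request):
    time.sleep(0.1)  # simulates long latency
    return f"slow:{request}"


def fast_instance(request):
    return f"fast:{request}"


answer = call_with_deadline([slow_instance, fast_instance], "temp?", deadline_s=0.02)
print(answer)
```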


Allowing component replication


The design choice 'State-machine (active) replication (DC A.11)' allows
detection of faults by replicating service requests and comparing the service
results to each other. If all results are identical, no fault is assumed. If
they differ, it still needs to be analysed which of the results is faulty and
which is correct [Wikipedia 2013d]. To apply this technique, some replication
functionality needs to be implemented that multiplies the request to different
instances of FCs. To assure fault detection, 2F+1 replicas of the tested FC need
to be held, where F is the number of faults to be detected. The fault detection
algorithm requires the tested FC to be modelled as a state machine.
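
The comparison step can be sketched as follows. The text leaves open how the
faulty result is identified; this sketch assumes simple majority voting, which
works because with 2F+1 replicas and at most F faulty ones, at least F+1
replicas still agree on the correct result. The function replicated_call and
the example replicas are hypothetical, and results are assumed to be hashable
and deterministic.

```python
from collections import Counter


def replicated_call(replicas, request, f):
    """Send the request to all replicas and vote on the result.

    Returns the majority result and the indices of replicas that disagreed.
    """
    if len(replicas) < 2 * f + 1:
        raise ValueError("need at least 2F+1 replicas to detect F faults")
    results = [replica(request) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < f + 1:
        raise RuntimeError("no result reached a majority of F+1 votes")
    suspects = [i for i, r in enumerate(results) if r != value]
    return value, suspects


def correct(x):
    return 2 * x


def faulty(x):
    return x + 1  # deviates from the expected behaviour


# F = 1 fault to detect, so 2F+1 = 3 replicas are held.
value, suspects = replicated_call([correct, correct, faulty], 21, f=1)
print(value, suspects)
```

An empty list of suspects corresponds to the case where all results are
identical and no fault is assumed.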


'Transactional replication (DC A.12)' is typically used in server-to-server
environments in which incremental information changes need to be propagated to
subscribers in near real time [Microsoft 2013].
