258 CATALYZING INQUIRY
In fact, many of the techniques described as self-healing are familiar to the decades-old hardware
field of reliable systems, also known as fault tolerance or high availability. These techniques, such as
fault detection, fault masking, and fault tolerance, are in common use when designing hardware to
improve the reliability and availability of large systems. This is most likely because hardware designers,
unlike software programmers, long ago accepted the unavoidable reality that components of their
designs will fail at some point. (It also helps immeasurably that hardware failures are often easier to
characterize than software failures.) In areas with extremely high demands for reliability, such as
aerospace or power plants, these fault-tolerance techniques have become quite sophisticated, as have
mechanisms for testing system operation. The oldest and most accepted use of the term self-healing is
found in networking;^34 networks from the original ARPANET (and even the public switched
telecommunications network) to modern peer-to-peer embedded networks are self-healing in the sense that
traffic is routed around unresponsive nodes.
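
The routing behavior described above can be illustrated with a minimal sketch. It is not the restoration technique of any particular network; it assumes a hypothetical four-node topology (`LINKS`) and a hypothetical `route` function that runs a breadth-first search while skipping any node reported as unresponsive.

```python
from collections import deque


def route(links, source, target, down=frozenset()):
    """Return a shortest path from source to target that avoids every
    node in `down`, or None if no such path exists."""
    if source in down or target in down:
        return None
    previous = {source: None}  # maps each visited node to its predecessor
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the predecessor chain back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            # Unresponsive nodes are simply never visited.
            if neighbor not in down and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology: two independent paths from A to D.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C"],
}
```

With all nodes up, `route(LINKS, "A", "D")` goes through B; if B is marked down, the same call with `down={"B"}` goes through C instead. The healing comes entirely from redundancy in the topology: once both B and C are down, no route remains.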
In contrast, until quite recently, efforts to ensure software quality focused on producing bug-free
products through careful design, code review, and extensive prerelease testing. However, when bugs
do occur, software typically has no ability to detect them, react to them, or continue to operate. This was
a workable strategy for much of the history of modern software, but the continuing rise in the
complexity of software applications has made formal review and correctness proofs inadequate to provide
minimum levels of reliability.^35
This rise in complexity, and the resulting rise in the human cost of configuring and maintaining
software applications, has spurred interest in self-healing as a way to shift much of that burden
back to the software. The idea is that, like its biological namesake, a self-healing system would
detect the presence of nonfunctioning (or, more challengingly, malfunctioning) components and initiate
some response to maintain proper overall functionality, preferably without requiring any centralized
or external agent (such as a system administrator). The most common implementation today seems to be
reconfiguration: if a fault is detected, a spare hardware component is brought into play. This is
“healing” only in the loosest sense, although it is certainly a valid fault-tolerance technique.
However, it does not translate well to software-only failures.
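
Reconfiguration of this kind can be sketched in a few lines. The example below is illustrative only and assumes hypothetical `Component` and `FailoverGroup` classes, with a fault "detected" simply by catching the error a failed component raises; real systems use far more elaborate detection.

```python
class Component:
    """A stand-in for a hardware unit that either works or has failed."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return f"{self.name} handled {request}"


class FailoverGroup:
    """Sends requests to the active component and swaps in a spare
    whenever the active component reports a fault."""

    def __init__(self, active, spares):
        self.active = active
        self.spares = list(spares)

    def handle(self, request):
        while True:
            try:
                return self.active.handle(request)
            except RuntimeError:
                if not self.spares:
                    raise  # no spare left: the fault can no longer be masked
                # Reconfigure: promote the next spare and retry the request.
                self.active = self.spares.pop(0)
```

The sketch also shows why the approach does not carry over to software-only failures: a spare running the same code would contain the same bug and fail on the same request.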
None of the systems that describe themselves as self-healing (such as Microsoft Windows 2000, IBM
DB2, or Sun’s Jini) seems to actually employ biological principles, other than in the grossest sense of
having redundancy. However, one research project that is inspired very explicitly by biology is Swarm
at the University of Virginia.^36 The Swarm programming model defines units as individual cells, which
can both reproduce through cellular division and die. Additionally, they can emit signals at various
strengths and respond to the aggregate strength of signals in the environment. For example, a system
set to grow to a certain size would start with a single cell that emits a small amount of signal and runs
a program that reproduces as long as the aggregate signal is below a certain threshold. The cells would
continue to divide until the total amount of signal reached that threshold, and then stop. If cells
were to fail or otherwise be deleted, the aggregate signal would drop, and other cells would respond by
dividing again to bring the signal back to the threshold. This is indeed a primitive form of self-healing.
However, this programming model is unlikely to catch on for complex tasks without significant
higher-level abstractions available.
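
The growth example above can be approximated with a short simulation. This is a sketch under simplifying assumptions, not the Swarm programming model itself: the `Cell` class and the `step` and `settle` functions are hypothetical, every cell emits the same fixed signal, all cells sense the same aggregate (no spatial decay), and only one cell divides per step.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    signal: float = 1.0  # assumed: a fixed amount of signal per cell


def step(cells, threshold):
    """Let one cell divide if the aggregate signal is below the threshold.
    Returns True if a division happened."""
    total = sum(cell.signal for cell in cells)
    if cells and total < threshold:
        cells.append(Cell(cells[0].signal))  # the daughter emits the same signal
        return True
    return False


def settle(cells, threshold):
    """Run steps until no cell divides; return the number of divisions."""
    divisions = 0
    while step(cells, threshold):
        divisions += 1
    return divisions
```

Starting from `cells = [Cell()]` with a threshold of 8.0, `settle` grows the population to eight cells. Deleting three of them (`del cells[:3]`) and calling `settle` again triggers three more divisions, which restores the aggregate signal. If every cell is deleted, nothing remains to divide, so this simple scheme cannot recover from total loss.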
(^34) W.D. Grover, “The Self-healing Network: A Fast Distributed Restoration Technique for Networks Using Digital Cross-
connect Machines,” Proceedings of the IEEE Global Telecommunications Conference, Tokyo, 1987, pp. 1090-1095.
(^35) In his lecture on receiving the ACM Turing Award in 1980, C.A.R. Hoare said, “There are two ways of constructing a
software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so
complicated that there are no obvious deficiencies.” Lecture available at
http://www.braithwaite-lee.com/opinions/p75-hoare.pdf.
(^36) S. George, D. Evans, and L. Davidson, “A Biologically Inspired Programming Model for Self-healing Systems,” Proceedings of
the First Workshop on Self-Healing Systems, November 2002, available at http://www-2.cs.cmu.edu/~garlan/woss02/.