Catalyzing Inquiry at the Interface of Computing and Biology

(nextflipdebug5) #1
COMPUTATIONAL TOOLS 87

4.4 Algorithms for Operating on Biological Data,


4.4.1 Preliminaries: DNA Sequence as a Digital String,


The digital nature of DNA is a central evolutionary innovation for many reasons—that is, the
“values” of the molecules making up the polymer are discrete and indivisible units. Just as an electronic
digital computer abstracts various continuous voltage levels as 0 and 1, DNA abstracts a three-dimen-
sional organization of atoms as A, T, G, and C. This has important biological benefits, including very
high-accuracy replication, common and simplified ways for associated molecules to bind to sites, and
low ambiguity in coding for proteins.
For human purposes in bioinformatics, however, the use of the abstraction of DNA as a digital
string has had other equally significant and related benefits. It is easy to imagine the opposite case, in
which DNA is represented as the three-dimensional locations of each atom in the macromolecule, and
comparison of DNA sequences is a painstaking process of comparing the full structures. Indeed, this is
very much the state of the art in representing proteins (which, although they can be represented as a
digital string of peptides, are more flexible than DNA, so the digital abstraction leaves out the critically
important features of folding). The digital abstraction includes much of the essential information of the
system, without including complicating higher- and lower-order biochemical properties.^81 The com-
parison of the state of the art in computational analysis of DNA sequences and protein sequences speaks
in part to the enormous advantage that the digital string abstraction offers when appropriate.
The most basic feature of the abstraction is that it treats the arrangement of physical matter as
information. An important advantage of this is that information-theoretic techniques can be applied to
specific DNA strings or to the overall alphabet of codon-peptide associations. For example, computer
science-developed concepts such as Hamming distance, parity, and error-correcting codes can be used
to evaluate the resilience of information in the presence of noise and close alternatives.^82
A second and very practical advantage is that as strings of letters, DNA sequences can be stored
efficiently and recognizably in the same format as normal text.^83 An entire human genome, for example, can
be stored in about 3 gigabytes, costing a few dollars in 2003. More broadly, this means that a vast array of
tools, software, algorithms, and software packages that were designed to operate on text could be adapted
with little or no effort to operate on DNA strings as well. More abstract examples include the long history of
research into algorithms to efficiently search, compare, and transform strings. For example, in 1974, an
algorithm for identifying the “edit distance” of two strings was discovered,^84 measuring the minimum
number of changes, transpositions, and insertions necessary to transform one string into another. Although
this algorithm was developed long before the genome era, it is useful to DNA analysis nonetheless.^85
Finally, the very foundation of computational theory is the Turing machine, an abstract model of
symbolic manipulation. Some very innovative research has shown that the DNA manipulations of some
single-celled organisms are Turing-complete,^86 allowing the application of a large tradition of formal
language analysis to problems of cellular machinery.


(^81) A. Regev and E. Shapiro, “Cellular Abstractions: Cells as Computation,” Nature 419(6905): 343, 2002.
(^82) D.A. MacDonaill, “A Parity Code Interpretation of Nucleotide Alphabet Composition,” Chemical Communications 18:2062-
2063, 2002.
(^83) Ideally, of course, a nucleotide could be stored using only two bits (or three to include RNA nucleotides as well). ASCII
typically uses eight bits to represent characters.
(^84) R.A. Wagner and M.J. Fischer, “The String-to-String Correction Problem,” Journal of the Association for Computing Machinery
21(1):168-173, 1974.
(^85) See for example, American Mathematical Society, “Mathematics and the Genome: Near and Far (Strings),” April 2002.
Available at http://www.ams.org/new-in-math/cover/genome5.html; M.S. Waterman, Introduction to Computational Biology:
Maps, Sequences and Genomes, Chapman and Hall, London, 1995; M.S. Waterman, “Sequence Alignments,” Mathematical Methods
for DNA Sequences, CRC, Boca Raton, FL, 1989, pp. 53-92.
(^86) L.F. Landweber and L. Kari, “The Evolution of Cellular Computing: Nature’s Solution to a Computational Problem,”
Biosystems 52(1-3):3-13, 1999.

Free download pdf