Catalyzing Inquiry at the Interface of Computing and Biology

COMPUTATIONAL TOOLS 89

A significant computational aspect of this example is that since the general problem of identifying
subgraphs is NP-complete,^91 the mere inspiration of using graph theory to represent proteins is insuffi-
cient; sophisticated algorithmic research is necessary to develop appropriate techniques, data representa-
tions, and heuristics that can sift through the enormous datasets in practical times. Similarly, the problem
involves subtle biological detail (e.g., what distance represents a significant spatial proximity, which
amino acids can be classified together), and could not be usefully attacked by computer scientists alone.

4.4.3 Algorithms and Voluminous Datasets,

Algorithms play an increasingly important role in the process of extracting information from large
biological datasets produced by high-throughput studies. Algorithms are needed to search, sort, align,
compare, contrast, and manipulate data related to a wide variety of biological problems and in support
of models of biological processes on a variety of spatial and temporal scales. For example, in the
language of automated learning and discovery, research is needed to develop algorithms for active and
cumulative learning; multitask learning; learning from labeled and unlabeled data; relational learning;
learning from large datasets; learning from small datasets; learning with prior knowledge; learning
from mixed-media data; and learning causal relationships.^92
The computational algorithms used for biological applications are likely to be rooted in mathematical
and statistical techniques used widely for other purposes (e.g., Bayesian networks, graph theory, principal
component analysis, hidden Markov models), but their adaptation to biological questions must address
the constraints that define biological events. Because critical features of many biological systems are not
known, algorithms must operate on the basis of working models and must frequently contend with a lack
of data and incomplete information about the system under study (though sometimes simulated data
suffices to test an algorithm). Thus, the results they provide must be regarded as approximate and provi-
sional, and the performance of algorithms must be tested and validated by empirical laboratory studies.
Algorithm development, therefore, requires the joint efforts of biologists and computer scientists.
Sections 4.4.4 through 4.4.9 describe certain biological problems and the algorithmic approaches to
solving them. Far from giving a comprehensive description, these sections are intended to illustrate the
complex substrate on which algorithms must operate and, further, to describe areas of successful and
prolific collaboration between computer scientists and biologists.
Some of the applications described below are focused on identifying or measuring specific at-
tributes, such as the identity of a gene, the three-dimensional structure of a protein, or the degree of
genetic variability in a population. At the heart of these lines of investigation is the quest to understand
biological function, (e.g., how genes interact, the physical actions of proteins, the physiological results
of genetic differences). Further opportunities to address biological questions are likely to be as diverse
as biology itself, although work on some of those questions is only nascent at this time.

4.4.4 Gene Recognition,

Although the complete genomic sequences of many organisms have been determined, not all of the genes
within those genomes have been identified. Difficulties in identifying genes from sequences of uncharacterized
DNA stem mostly from the complexity of gene organization and architecture. Just a small fraction of the
genome of a typical eukaryote consists of exons, that is, blocks of DNA that, when arranged according to their
sequence in the genome, constitute a gene; in the human genome, the fraction is estimated at less than 3 percent.

(^91) The notion of an NP-complete problem is rooted in the theory of computational complexity and has a precise technical
definition. For purposes of this report, it suffices to understand an NP-complete problem as one that is very difficult and would
take a long time to solve.
(^92) S. Thurn, C. Faloutsos, T. Mitchell, and L. Wasseterman, “Automated Learning and Discovery: State-of-the-Art and Research
Topics in a Rapidly Growing Field,” Summary of a Conference on Automated Learning and Discovery, Center for Automated Learn-
ing and Discovery, Carnegie Mellon University, Pittsburgh, PA, 1998.

Catalyzing Inquiry at the Interface of Computing and Biology

COMPUTATIONAL TOOLS 89

4.4.3 Algorithms and Voluminous Datasets,

4.4.4 Gene Recognition,

Get our desktop app

Company

Features

Documentation

Resources