Catalyzing Inquiry at the Interface of Computing and Biology

COMPUTATIONAL TOOLS 73

readable form and making use of databases of biological data and inferred networks, software based on
artificial intelligence research can make complex inferences using these encoded relationships, for example,
to check statements written in that ontology for consistency or to predict new relationships
between elements.^34 Such new relationships might include new metabolic pathways, regulatory rela-
tionships between genes, signaling networks, or other relationships. Other approaches rely on logical
frameworks more expressive than database queries and are able to reason about explanations for a
given feature or suggest plans for intervention to reach a desired state.^35
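
The two kinds of inference mentioned above, checking statements for consistency and predicting new relationships, can be illustrated with a minimal sketch. It assumes a toy ontology stored as (subject, relation, object) triples; the entity names, the relation names, and the single chaining rule are hypothetical and are not taken from any of the systems cited here.

```python
# Hypothetical toy ontology: (subject, relation, object) triples.
FACTS = {
    ("geneA", "encodes", "enzymeA"),
    ("enzymeA", "catalyzes", "reaction1"),
    ("reaction1", "part_of", "pathwayX"),
    ("geneB", "activates", "geneA"),
    ("geneB", "represses", "geneA"),
}

# Assumption: these relation pairs cannot both hold for the same subject and object.
EXCLUSIVE = [("activates", "represses")]


def infer_participation(facts):
    """Predict (gene, 'participates_in', pathway) by chaining
    encodes -> catalyzes -> part_of."""
    derived = set()
    for gene, r1, enzyme in facts:
        if r1 != "encodes":
            continue
        for e, r2, reaction in facts:
            if r2 != "catalyzes" or e != enzyme:
                continue
            for rx, r3, pathway in facts:
                if r3 == "part_of" and rx == reaction:
                    derived.add((gene, "participates_in", pathway))
    return derived


def find_conflicts(facts):
    """Return (subject, object, relation_a, relation_b) for each pair of
    mutually exclusive statements found in the facts."""
    conflicts = []
    for a, b in EXCLUSIVE:
        for s, r, o in facts:
            if r == a and (s, b, o) in facts:
                conflicts.append((s, o, a, b))
    return sorted(conflicts)


NEW_RELATIONSHIPS = infer_participation(FACTS)
CONFLICTS = find_conflicts(FACTS)
```

Here the chaining rule stands in for predicting a new relationship (a gene's participation in a pathway), and the exclusivity check stands in for consistency checking. Production systems use far richer logics than this nested loop.
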
Developing an ontology for automated reasoning can make use of many different sources. For
example, inference from gene-expression data using Bayesian networks can take advantage of online
sources of information about the likely probabilistic dependencies among expression levels of various
genes.^36 Machine-readable knowledge bases can be built from textbooks, review articles, or even the
Oxford Dictionary of Molecular Biology. The rapidly growing volume of publications in the biological
literature is another important source, because inclusion of the knowledge in these publications helps to
uncover relationships among various genes, proteins, and other biological entities referenced in the
literature.
An example of ontologies for automated reasoning is the ontology underlying the EcoCyc database.
The EcoCyc Pathway Database (http://ecocyc.org) describes the metabolic, transport, and genetic regulatory
networks of E. coli. EcoCyc structures a scientific theory about E. coli within a formal ontology so
that the theory is available for computational analysis.^37 Specifically, EcoCyc describes the genes and
proteins of E. coli as well as its metabolic pathways, transport functions, and gene regulation. The
underlying ontology encodes a diverse array of biochemical processes, including enzymatic reactions
involving small molecule substrates and macromolecular substrates, signal transduction processes,
transport events, and mechanisms of regulation of gene expression.^38
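
To suggest what it means for a theory to be "available for computational analysis," the following sketch encodes reactions and a pathway as simple records and asks questions of them. It is not the EcoCyc schema: the class names, fields, and the placeholder metabolites A through D are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Reaction:
    """A hypothetical enzymatic reaction record."""
    name: str
    substrates: frozenset
    products: frozenset
    enzyme: str


@dataclass
class Pathway:
    """A hypothetical pathway: a named collection of reactions."""
    name: str
    reactions: list = field(default_factory=list)

    def _produced(self):
        return set().union(*(r.products for r in self.reactions))

    def _consumed(self):
        return set().union(*(r.substrates for r in self.reactions))

    def external_inputs(self):
        """Metabolites consumed by the pathway but never produced within it."""
        return self._consumed() - self._produced()

    def end_products(self):
        """Metabolites produced by the pathway but never consumed within it."""
        return self._produced() - self._consumed()

    def enzymes(self):
        """Sorted names of the enzymes the pathway depends on."""
        return sorted({r.enzyme for r in self.reactions})


# Placeholder data: A -> B (enzyme E1), then B + C -> D (enzyme E2).
PATHWAY = Pathway(
    "pathwayX",
    [
        Reaction("r1", frozenset({"A"}), frozenset({"B"}), "E1"),
        Reaction("r2", frozenset({"B", "C"}), frozenset({"D"}), "E2"),
    ],
)
```

Once reactions are encoded this way, questions such as "what must be supplied from outside the pathway?" become set operations rather than a literature search, which is the general point of structuring the theory formally.
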


4.2.9 Annotations and Metadata


Annotation is auxiliary information associated with primary information contained in a database.
Consider, for example, the human genome database. The primary database consists of a sequence of
some 3 billion nucleotides, which contains genes, regulatory elements, and other material whose function
is unknown. To make sense of this enormous sequence, one must identify significant patterns within it.
Various pieces of the genome must be identified, and a given sequence might be
annotated as translation (e.g., “stop”), transcription (e.g., “exon” or “intron”), variation (“insertion”),
structural (“clone”), similarity, repeat, or experimental (e.g., “knockout,” “transgenic”). Identifying a
particular nucleotide sequence as a gene would itself be an annotation, and the protein corresponding
to it, including its three-dimensional structure characterized as a set of coordinates of the protein’s
atoms, would also be an annotation. In short, the sequence database includes the raw sequence data,
and the annotated version adds pertinent information such as gene coded for, amino acid sequence, or
other commentary to the database entry of raw sequence of DNA bases.^39
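
The distinction between raw sequence data and its annotated version can be sketched as a data structure. The example below assumes a hypothetical record layout with 0-based, end-exclusive coordinates; the eleven-base sequence and the two annotations are toy values chosen only to exercise the code, not real genomic data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Auxiliary information attached to a span of the primary sequence."""
    start: int      # 0-based, inclusive (an assumed convention)
    end: int        # exclusive
    category: str   # e.g. "translation", "transcription", "variation"
    label: str      # e.g. "stop", "exon", "insertion"


@dataclass(frozen=True)
class AnnotatedSequence:
    """Raw bases plus zero or more annotations; the bases never change."""
    bases: str
    annotations: tuple = ()

    def annotate(self, annotation):
        """Return a new entry with one more annotation added."""
        if not 0 <= annotation.start < annotation.end <= len(self.bases):
            raise ValueError("annotation lies outside the sequence")
        return AnnotatedSequence(self.bases, self.annotations + (annotation,))

    def span(self, annotation):
        """Bases covered by an annotation."""
        return self.bases[annotation.start:annotation.end]

    def labels_at(self, position):
        """Labels of all annotations covering a position, in insertion order."""
        return [a.label for a in self.annotations
                if a.start <= position < a.end]


# Toy data: the raw entry, then the same entry with two annotations.
RAW = AnnotatedSequence("ATGGCCTAAGG")
ENTRY = (
    RAW.annotate(Annotation(0, 9, "transcription", "exon"))
       .annotate(Annotation(6, 9, "translation", "stop"))
)
```

Keeping the bases immutable and layering annotations on top mirrors the point made above: the primary data stay fixed, while annotations accumulate as interpretation of them improves.
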


(^34) P.D. Karp, “Pathway Databases: A Case Study in Computational Symbolic Theories,” Science 293(5537):2040-2044, 2001.
(^35) C. Baral, K. Chancellor, N. Tran, N.L. Tran, A. Joy, and M. Berens, “A Knowledge Based Approach for Representing and
Reasoning About Signaling Networks,” Bioinformatics 20(Suppl. 1):I15-I22, 2004.
(^36) E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller, “Rich Probabilistic Models for Gene Expression,” Bioinformatics
17(Suppl. 1):S243-S252, 2001. (Cited in Hunter, “Ontologies for Programs, Not People,” 2002, Footnote 32.)
(^37) P.D. Karp, “Pathway Databases: A Case Study in Computational Symbolic Theories,” Science 293(5537):2040-2044, 2001; P.D.
Karp, M. Riley, M. Saier, I.T. Paulsen, J. Collado-Vides, S.M. Paley, A. Pellegrini-Toole, et al., “The EcoCyc Database,” Nucleic
Acids Research 30(1):56-58, 2002.
(^38) P.D. Karp, “An Ontology for Biological Function Based on Molecular Interactions,” Bioinformatics 16(3):269-285, 2000.
(^39) See http://www.biochem.northwestern.edu/holmgren/Glossary/Definitions/Def-A/Annotation.html.
