58 CATALYZING INQUIRY
These examples are drawn largely from the area of cell biology. The reason is not that these are the
only good examples of computational tools, but rather that a great deal of the activity in the field has
been the direct result of trying to make sense out of the genomic sequences that have been collected to
date. As noted in Chapter 2, the Human Genome Project (completed in draft in 2000) is arguably the
first large-scale project of 21st-century biology in which the need for powerful information technology
was manifest. Since then, computational tools for the analysis of genomic data, and by
extension data associated with the cell, have proliferated rapidly; thus, a large number of examples are
available from this domain.
4.2 Tools for Data Integration (^2)
As noted in Chapter 3, data integration is perhaps the most critical problem facing researchers as
they approach biology in the 21st century.
Box 4.1
Tool Challenges for Computer Science
Data Representation
- Next-generation genome annotation system with accuracy equal to or exceeding the best human
predictions
- Mechanism for multimodal representation of data
Analysis Tools
- Scalable methods of comparing many genomes
- Tools and analyses to determine how molecular complexes work within the cell
- Techniques for inferring and analyzing regulatory and signaling networks
- Tools to extract patterns in mass spectrometry datasets
- Tools for semantic interoperability
Visualization
- Tools to display networks and clusters at many levels of detail
- Approaches for interpreting data streams and comparing high-throughput data with simulation output
Standards
- Good software-engineering practices and standard definitions (e.g., a common component architecture)
- Standard ontology and data-exchange format for encoding complex types of annotation
Databases
- Large repository for microbial and ecological literature relevant to the “Genomes to Life” effort
- Big relational database derived by automatic generation of semantic metadata from the biological literature
- Databases that support automated versioning and identification of data provenance
- Long-term support of public sequence databases
SOURCE: U.S. Department of Energy, Report on the Computer Science Workshop for the Genomes to Life Program, Gaithersburg, MD,
March 6-7, 2002; available at http://DOEGenomesToLife.org/compbio/.
(^2) Sections 4.2.1, 4.2.4, 4.2.6, and 4.2.8 embed excerpts from S.Y. Chung and J.C. Wooley, “Challenges Faced in the Integration of
Biological Information,” in Bioinformatics: Managing Scientific Data, Z. Lacroix and T. Critchlow, eds., Morgan Kaufmann, San
Francisco, CA, 2003. (Hereafter cited as Chung and Wooley, 2003.)