366 CATALYZING INQUIRY
mental technique and laboratory procedure and instrumentation—much more so than most computer
scientists pay to the comparable areas in computer science. Thus, a computer scientist with insufficient
awareness of experimental design may not be accustomed to or even aware of techniques of formal
model or simulation validation.
In addition, biology has not traditionally looked to engineering for insight or inspiration. For
example, proteins come in a seemingly endless variety and do not necessarily have straightforward
analogues to engineering parts. Experimental biologists often focus on discovering new pieces
of cellular machinery and on how defective behavior stems from broken or missing pieces (e.g., muta-
tions). Experimental work is aimed at proving or disproving specific hypotheses, such as whether or not
a particular biochemical pathway is relevant to some cellular phenomena.
The training that computer scientists receive also emphasizes general solutions that come with
guarantees on worst-case performance. Biologists, by contrast, are interested in specific solutions that
relate to very particular (although voluminous) datasets. (A further complication is that biological data
are often erroneous and/or inconsistent, especially when collected in large volume.) By recognizing and
exploiting special characteristics of biologically significant datasets, special-purpose solutions can be
crafted that function much more effectively than general-purpose solutions. For example, in the prob-
lem of genomic sequence assembly, it turns out that by exploiting the information available concerning
the size of fragments, the number of choices for where a fragment might fit is sharply restricted.
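The effect of a size constraint can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any particular assembler: it assumes that the two ends of a fragment of roughly known size have each been matched to several candidate positions (for instance, because of repeats), and it keeps only the placements whose separation agrees with that size. The function name, the coordinates, and the tolerance are all hypothetical.

```python
from itertools import product


def consistent_placements(left_hits, right_hits, fragment_size, tolerance):
    """Return (left, right) position pairs compatible with the fragment size.

    left_hits and right_hits are candidate coordinates for the two ends of
    one fragment (hypothetical inputs chosen for illustration).
    """
    lo = fragment_size - tolerance
    hi = fragment_size + tolerance
    return [
        (left, right)
        for left, right in product(left_hits, right_hits)
        if lo <= right - left <= hi
    ]


# Each end matches three places, so there are nine placements a priori.
left_hits = [100, 5200, 40100]
right_hits = [2150, 9000, 42080]

pairs = consistent_placements(left_hits, right_hits, fragment_size=2000, tolerance=200)
print(len(left_hits) * len(right_hits), "candidates before the size constraint")
print(len(pairs), "candidates after:", pairs)
```

With these made-up numbers, the nine unconstrained placements shrink to two, which is the kind of restriction the paragraph above describes.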
The central role that experimental data play in biology explains why, to date,
computer scientists have made their most important contributions in areas in which the
details of some biological phenomena can be neglected to a significant extent. Thus, the abstraction
of DNA as merely a string of characters derived from a four-letter alphabet is a very powerful notion,
and considerable headway in genomics can be made knowing little else. To be sure, there are experi-
mental errors to take into account, and a model of the noisiness of the data must be developed, but the
underlying problem is fairly clear to a computer scientist.
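To make the string abstraction concrete, the following minimal Python sketch treats a DNA sequence as nothing more than text over the alphabet A, C, G, T. The example sequence and the helper names are invented for illustration, and the sketch ignores the experimental noise mentioned above.

```python
from collections import Counter

# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


seq = "ACGTACGTGA"  # made-up example sequence
print(reverse_complement(seq))
print(kmer_counts(seq, 3))
```

Both operations are ordinary string processing and need no biological knowledge beyond base pairing, which is the point of the abstraction.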
On the other hand, as the discussion in Section 4.4.1 makes clear, there are limits to this abstraction
that arise from just such “details.” Also, proteomics—in which the three-dimensional structure of a pro-
tein, rather than the linear sequence, determines its function—presents even greater challenges. To under-
stand the geometry of a three-dimensional structure, discrete mathematics—the stock in trade of the
computer scientist—is far less useful than continuous mathematics.^68 Furthermore, the properties and
characteristics of the specific amino acids in a protein matter a great deal to its structure and function,
whereas the various nucleotide bases are more or less equivalent from an informational standpoint. In
short, proteomics involves a much more substantial body of domain knowledge than does genomics.
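The contrast with string processing shows up even in the simplest structural calculation. The sketch below, which uses invented coordinates and hypothetical helper names, computes a length and an angle from three-dimensional points. The quantities involved are continuous, unlike the symbols of a sequence.

```python
import math


def sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    """Dot product of two 3D vectors."""
    return sum(x * y for x, y in zip(a, b))


def distance(p, q):
    """Euclidean distance between points p and q."""
    d = sub(p, q)
    return math.sqrt(dot(d, d))


def angle_deg(p, q, r):
    """Angle at q formed by the points p, q, r, in degrees."""
    u, v = sub(p, q), sub(r, q)
    cos_t = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))


# Made-up coordinates standing in for three atoms.
a, b, c = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
print(distance(a, b), angle_deg(a, b, c))
```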
One illustration related by a senior computer scientist working in biology is his original dream that,
with enough data,^69 it would be computationally straightforward to understand the mechanisms of
gene regulation. That is, with sufficient data on regulatory pathways, cascades, gene knockouts, expres-
sion levels, and their dependencies on environmental factors, how genetic regulatory networks work
would become reasonably clear. With the hindsight of several years, he now believes that this dream
was hopelessly naïve in that it did not account for the myriad exceptions and apparent special cases
inherent in biological data that make the biologist’s intellectual life very complicated indeed.
Finally, consider that many biologists are suspicious—or at least not yet persuaded—of the value and
importance of high-throughput measurement of biological systems (Section 7.2). Because many biologists
were educated and worked in an era in which data were scarce, experiments in biology have historically
been oriented toward hypothesis testing. High-throughput data collection drives in the opposite direction,
(^68) The reason is that geometric descriptions naturally involve continuous variables such as lengths and angles, and functions of
those variables.
(^69) Richard Karp, University of California, Berkeley, personal communication, July 29, 2002.