Catalyzing Inquiry at the Interface of Computing and Biology

COMPUTATIONAL TOOLS

4.2.1 Desiderata


If researcher A wants to use a database kept and maintained by researcher B, the “quick and dirty”
solution is for researcher A to write a program that will translate data from one format into another. For
example, many laboratories have used programs written in Perl to read, parse, extract, and transform
data from one form into another for particular applications.^3 Depending on the nature of the data
involved and the structure of the source databases, writing such a program may require intensive
coding.
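As an illustration of such a point-to-point translator, the following is a minimal sketch in Python rather than Perl. Both file layouts and all field names are hypothetical: researcher B is assumed to keep tab-delimited records (gene identifier, organism, sequence), and researcher A is assumed to need CSV with a sequence length instead of the sequence itself.

```python
import csv
import io

# Hypothetical source layout (researcher B): tab-delimited text, one record per line:
#   gene_id <TAB> organism <TAB> sequence
# Hypothetical target layout (researcher A): CSV with columns id, species, length


def translate(source_text: str) -> str:
    """Convert B's tab-delimited records into A's CSV layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "species", "length"])
    for line in source_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        gene_id, organism, sequence = line.split("\t")
        writer.writerow([gene_id, organism, len(sequence)])
    return out.getvalue()


if __name__ == "__main__":
    print(translate("g1\tE. coli\tATGC\ng2\tS. cerevisiae\tATGCCG\n"), end="")
```

Note that the translator hard-codes assumptions about B's column order and delimiter, so any change on B's side breaks it.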
Although such a fix is expedient, it is not scalable. That is, point-to-point solutions are not sustainable
in a large community in which it is assumed that everyone wants to share data with everyone else.
More formally, if there are N data sources to be integrated, and point-to-point solutions must be
developed, N(N – 1)/2 translation programs must be written. If one data source changes (as is highly
likely), N – 1 programs must be updated.
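To make the growth concrete, the two counts above can be evaluated for a few community sizes. The function names below are illustrative only, and the sample values of N are arbitrary.

```python
def point_to_point_translators(n: int) -> int:
    """Translation programs needed to connect every pair of n data sources."""
    return n * (n - 1) // 2


def programs_to_update(n: int) -> int:
    """Programs that must change when one of the n data sources changes."""
    return n - 1


if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n, point_to_point_translators(n), programs_to_update(n))
```

For 20 data sources, this gives 190 translation programs, 19 of which must be revisited whenever a single source changes.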
A more desirable approach to data integration is scalable. That is, a change in one database should
not necessitate a change on the part of every research group that wants to use those data. A number of
approaches are discussed below, but in general, Chung and Wooley argue that robust data integration
systems must be able to



  1. Access and retrieve relevant data from a broad range of disparate data sources;

  2. Transform the retrieved data into a common data model for data integration;

  3. Provide a rich common data model for abstracting retrieved data and presenting integrated data
    objects to the end-user applications;

  4. Provide a high-level expressive language to compose complex queries across multiple data
    sources and to facilitate data manipulation, transformation, and integration tasks; and

  5. Manage query optimization and other complex issues.
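The first three requirements, together with a very simple stand-in for the fourth, can be sketched as follows. This is an illustrative toy under stated assumptions, not the design proposed by Chung and Wooley: the GeneRecord model, the two sources, and their field names are all hypothetical, and query optimization (the fifth requirement) is not addressed.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class GeneRecord:
    """Hypothetical common data model shared by every source."""
    gene_id: str
    species: str
    sequence: str


# An adapter retrieves one source's native data and yields GeneRecord objects.
Adapter = Callable[[], Iterable[GeneRecord]]


def source_b() -> Iterator[GeneRecord]:
    # Hypothetical source that stores tab-delimited text.
    raw = "g1\tE. coli\tATGC\ng2\tS. cerevisiae\tATGCCG"
    for line in raw.splitlines():
        gene_id, species, seq = line.split("\t")
        yield GeneRecord(gene_id, species, seq)


def source_c() -> Iterator[GeneRecord]:
    # Hypothetical source that stores dictionaries with different field names.
    rows = [{"acc": "g3", "org": "E. coli", "dna": "TTGA"}]
    for row in rows:
        yield GeneRecord(row["acc"], row["org"], row["dna"])


def query(adapters: Iterable[Adapter],
          predicate: Callable[[GeneRecord], bool]) -> List[GeneRecord]:
    """Run one predicate across all sources through the common model."""
    return [rec for adapter in adapters for rec in adapter() if predicate(rec)]


if __name__ == "__main__":
    hits = query([source_b, source_c], lambda r: r.species == "E. coli")
    print([r.gene_id for r in hits])
```

In this arrangement, a format change in one source requires a change only to that source's adapter; code written against the common model is unaffected.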


Sections 4.2.2, 4.2.4, 4.2.5, 4.2.6, and 4.2.8 address a number of different approaches to dealing with
the data integration problem. These approaches are not, in general, mutually exclusive, and they may be
usable in combination to improve the effectiveness of a data integration solution.
Finally, biological databases are always changing, so integration is necessarily an ongoing task. Not
only are new data being integrated within the existing database structure (a structure established on the
basis of an existing intellectual paradigm), but biology is a field that changes quickly—thus requiring
structural changes in the databases that store data. In other words, biology does not have some “classical
core framework” that is reliably constant. Thus, biological paradigms must be redesigned from time
to time (on the scale of every decade or so) to keep up with advances, which means that no “gold
standards” to organize data are built into biology. Furthermore, as biology expands its attention to
encompass complexes of entities and events as well as individual entities and events, more coherent
approaches to describing new phenomena will become necessary (approaches that bring some
commonality and consistency to data representations of different biological entities) so that relationships
between different phenomena can be elucidated.
As one example, consider the potential impact of “-omic” biology, biology that is characterized by
a search for data completeness—the complete sequence of the human genome, a complete catalog of
proteins in the human body, the sequencing of all genomes in a given ecosystem, and so on. The
possibility of such completeness is unprecedented in the history of the life sciences and will almost
certainly require substantial revisions to the relevant intellectual frameworks.


(^3) The Perl programming language provides powerful and easy-to-use capabilities to search and manipulate text files. Because
of these strengths, Perl is a major component of much bioinformatics programming. At the same time, Perl is regarded by many
computer scientists as an unsafe language in which it is easy to make programs do dangerous things. In addition, many regard
the syntax and structure of most Perl programs as hard to understand long after they were written.
