Catalyzing Inquiry at the Interface of Computing and Biology

Ecological databases also rely on metadata to improve interoperability and compatibility among
disparate data collections.^62 Ecology is a field that demands access to large numbers of independent
datasets such as geographic information, weather and climate records, biological specimen collections,
population studies, and genetic data. These datasets are collected over long periods of time, possibly
decades or even centuries, by a diverse set of actors for different purposes. A commonly agreed-upon
format and vocabulary for metadata is essential for efficient cooperative access.

Furthermore, as data are increasingly collected by automated systems such as embedded systems
and distributed sensor networks, the applications that attempt to fuse the results into formats amenable
to algorithmic or human analysis must deal with high, always-on data rates, likely expressed in
shifting standards for representation. Again, early agreement on a basic system for sharing metadata
will be necessary if such applications are to be feasible.

In attempting to integrate or cross-query these data collections, a central issue is the naming of
species or higher-level taxa. The Linnean taxonomy is the oldest such effort in biology, of course, yet
because there is not yet (nor likely can ever be) complete agreement on taxa identification, entries in
different databases may contain different tags for members of the same species, or the same tag for
members that were later determined to be of different species. Taxa are often moved into different
groups, split, or merged with others; names are sometimes changed. A central effort to manage this is
the Integrated Taxonomic Information System (ITIS),^63 which began life as a U.S. interagency task force,
but today is a global cooperative effort between government agencies and researchers to arrive at a
repository for agreed-upon species names and taxonomic categorization. ITIS data are of varying quality,
and entries are tagged with three different quality indicators: credibility, which indicates whether or
not data have been reviewed; latest review, giving the year of the last review; and global completeness,
which records whether all species belonging to a taxon were included at the last review. These measurements
allow researchers to evaluate whether the data are appropriate for their use.
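
The two ideas in this paragraph, resolving a superseded name to its currently accepted one and then checking the quality indicators before using the entry, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TaxonRecord` layout, the field names, and the `resolve` and `fit_for_use` helpers are hypothetical and are not the ITIS schema or API, and the review year and completeness flags in the sample records are invented for the example rather than taken from ITIS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaxonRecord:
    """Hypothetical, simplified stand-in for one ITIS-style entry."""
    name: str                     # scientific name as stored in the database
    accepted_name: Optional[str]  # set when `name` is a synonym of another entry
    reviewed: bool                # "credibility": has the entry been reviewed?
    latest_review: Optional[int]  # year of the last review, if any
    globally_complete: bool       # "global completeness" at the last review


def resolve(name, records, max_hops=10):
    """Follow synonym links until an entry with no accepted_name is reached.

    Returns None when the name (or a link in the chain) is unknown.
    """
    by_name = {r.name: r for r in records}
    current = by_name.get(name)
    for _ in range(max_hops):
        if current is None or current.accepted_name is None:
            return current
        current = by_name.get(current.accepted_name)
    raise ValueError(f"synonym chain for {name!r} is too long or circular")


def fit_for_use(record, min_review_year, need_complete=False):
    """Apply the three quality indicators as a simple acceptance test."""
    if not record.reviewed or record.latest_review is None:
        return False
    if record.latest_review < min_review_year:
        return False
    return record.globally_complete or not need_complete


# Sample data: the synonymy is real (the cougar was formerly placed in Felis),
# but the quality values below are made up for illustration.
RECORDS = [
    TaxonRecord("Felis concolor", "Puma concolor", True, 2005, False),
    TaxonRecord("Puma concolor", None, True, 2005, False),
]

record = resolve("Felis concolor", RECORDS)
print(record.name, fit_for_use(record, min_review_year=2000))
```

Two databases that tag the same animal with different names would both resolve to the same accepted entry, and a study that needs every species in a taxon could pass `need_complete=True` to reject entries whose coverage was partial at the last review.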

In constructing such a database, many data standards questions arise. For example, ITIS uses naming
standards from the International Code of Botanical Nomenclature and the International Code of Zoological
Nomenclature. However, for the kingdom Protista, which at various times in biological science has
been considered more like an animal and more like a plant, both standards might apply. Dates and date
ranges provide another challenge: while there are many international standards for representing a calendar
date, in general these did not foresee the need to represent dates occurring millions or billions of years
ago. ITIS employs a representation for geologic ages, and this illustrates the type of challenge encountered
when stretching a set of data standards to encompass many data types and different methods of collection.

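
The date problem is easy to reproduce with an ordinary calendar type. The sketch below uses Python's standard `datetime` module, whose smallest representable year is 1, and then shows one possible workaround: storing an age as an interval in millions of years before present (Ma). The `GeologicAge` class and `calendar_date_or_none` helper are hypothetical illustrations and do not describe the representation ITIS actually uses; the Cretaceous bounds are approximate.

```python
import datetime
from dataclasses import dataclass


def calendar_date_or_none(year, month=1, day=1):
    """Return a datetime.date, or None when the calendar type cannot hold it."""
    try:
        return datetime.date(year, month, day)
    except (ValueError, OverflowError):
        return None


@dataclass(frozen=True)
class GeologicAge:
    """Hypothetical interval in millions of years before present (Ma)."""
    older_ma: float    # start of the interval (further in the past)
    younger_ma: float  # end of the interval (closer to the present)

    def __post_init__(self):
        if self.older_ma < self.younger_ma or self.younger_ma < 0:
            raise ValueError("expected older_ma >= younger_ma >= 0")

    def overlaps(self, other):
        """True when the two intervals share at least one point in time."""
        return (self.older_ma >= other.younger_ma
                and other.older_ma >= self.younger_ma)


# Approximate bounds for the Cretaceous period, used only as an example.
CRETACEOUS = GeologicAge(older_ma=145.0, younger_ma=66.0)

print(datetime.MINYEAR)                     # smallest year datetime accepts
print(calendar_date_or_none(-66_000_000))   # not representable as a date
print(CRETACEOUS.overlaps(GeologicAge(70.0, 60.0)))
```

Storing an interval rather than a single instant reflects the fact that such ages are usually known only as ranges, which is one reason a calendar-date standard cannot simply be extended backward.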
For issues of representing observations or collections, an important element is the Darwin Core, a
set of XML metadata standards for describing a biological specimen, including observations in the wild
and preserved items in natural history collections. Where ITIS attempts to improve communicability by
achieving agreement on precise name usage, Darwin Core^64 (and similar metadata efforts) concentrates
the effort on labeling and markup of data. This allows individual databases to use their own data
structures, formats, and representations, as long as the data elements are labeled by Darwin Core
keywords. Since the design demands on such databases will be substantially different, this is a useful
approach. Another attempt to standardize metadata for ecological data is the Access to Biological
Collections Data (ABCD) Schema,^65 which is richer and contains more information. These two approaches
indicate a common strategic choice: simpler standards are easier to adopt, and thus will likely
be more widespread, but are limited in their expressiveness; more complex standards can successfully


(^62) For a more extended discussion of the issues involved in maintaining ecological data, see W.K. Michener and J.W. Brunt,
eds., Ecological Data: Design, Management and Processing, Methods in Ecology, Blackwell Science, Maryland, 2000. A useful online
presentation can be found at http://www.soest.hawaii.edu/PFRP/dec03mtg/michener.pdf.
(^63) See http://www.itis.usda.gov.
(^64) See http://speciesanalyst.net/docs/dwc/.
(^65) See http://www.bgbm.org/TDWG/CODATA/Schema/default.htm.
