74 CATALYZING INQUIRY
Although the genomic research community uses annotation to refer to auxiliary information that
has biological function or significance, annotation could also be used as a way to trace the provenance
of data (discussed in greater detail in Section 3.7). For example, in a protein database, the utility of an
entry describing the three-dimensional structure of a protein would be greatly enhanced if entries also
included annotations that described the quality of data (e.g., their precision), uncertainties in the data,
the physical and chemical properties of the protein, various kinds of functional information (e.g., what
molecules bind to the protein, location of the active site), contextual information such as where in a cell
the protein is found and in what concentration, and appropriate references to the literature.
In principle, annotations can often be captured as unstructured natural language text. But for
maximum utility, machine-readable annotations are necessary. Thus, special attention must be paid to
the design and creation of languages and formats that facilitate machine processing of annotations. To
facilitate such processing, a variety of metadata tools are available. Metadata—or literally “data about
data”—are anything that describes data elements or data collections, such as the labels of the fields, the
units used, the time the data were collected, the size of the collection, and so forth. They are invaluable
not only for increasing the life span of data (by making it easier or even possible to determine the
meaning of a particular measurement), but also for making datasets comprehensible to computers. The
National Biological Information Infrastructure (NBII)^40 offers the following description:
Metadata records preserve the usefulness of data over time by detailing methods for data collection and
data set creation. Metadata greatly minimize duplication of effort in the collection of expensive digital
data and foster sharing of digital data resources. Metadata supports local data asset management such as
local inventory and data catalogs, and external user communities such as Clearinghouses and websites. It
provides adequate guidance for end-use application of data such as detailed lineage and context. Metadata
makes it possible for data users to search, retrieve, and evaluate data set information from the NBII’s
vast network of biological databases by providing standardized descriptions of geospatial and biological
data.
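To make the notion of "data about data" concrete, the following is a minimal sketch in Python of a metadata record carrying the kinds of items listed above (field labels, units, collection time, collection size). The class names, the field key "conc", and the sample values are hypothetical and chosen only for illustration; they do not come from the NBII or any other standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldMetadata:
    label: str  # human-readable name of the field
    units: str  # units in which the values are expressed


@dataclass(frozen=True)
class DatasetMetadata:
    fields: dict            # maps a field key to its FieldMetadata
    collected_at: datetime  # when the data were collected
    size: int               # number of records in the collection


def describe(value, key, meta):
    """Render a raw value together with its label and units."""
    field = meta.fields[key]
    return f"{field.label}: {value} {field.units}"


# Hypothetical dataset with a single field and a single record.
meta = DatasetMetadata(
    fields={"conc": FieldMetadata("Protein concentration", "micromolar")},
    collected_at=datetime(2001, 5, 17, tzinfo=timezone.utc),
    size=1,
)
record = {"conc": 2.5}
print(describe(record["conc"], "conc", meta))
```

Without the metadata, the bare value 2.5 cannot be interpreted; with it, both a person and a program can recover the quantity being measured and its units.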
A popular tool for the implementation of controlled metadata vocabularies is the Extensible Markup
Language (XML).^41 XML offers a way to serve and describe data in a uniform and automatically parsable
format and provides an open, nonproprietary solution for moving data between programs. Although XML is a
language for describing data, the descriptions themselves are articulated in XML-based vocabularies.
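As an illustration of what "automatically parsable" means in practice, the sketch below reads a small XML annotation using Python's standard xml.etree.ElementTree module. The element and attribute names (protein_entry, resolution, units, and so on) and the values are invented for this example and do not belong to any established vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation of a protein structure entry.
DOC = """
<protein_entry id="P0001">
  <name>example kinase</name>
  <resolution units="angstrom">2.1</resolution>
  <location>cytoplasm</location>
</protein_entry>
"""


def read_entry(xml_text):
    """Parse the annotation into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    resolution = root.find("resolution")
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "resolution": float(resolution.text),
        "resolution_units": resolution.get("units"),
        "location": root.findtext("location"),
    }


entry = read_entry(DOC)
print(entry["name"], entry["resolution"], entry["resolution_units"])
```

Because the units travel with the value as an attribute, a program receiving the document needs no out-of-band knowledge to interpret the measurement.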
Such vocabularies are useful for describing specific biological entities along with experimental
information associated with those entities. Some of the vocabularies have been developed in association
with specialized databases established by the community. Because of their common basis in XML,
however, one vocabulary can be translated to another using various tools, for example, Extensible
Stylesheet Language Transformations (XSLT).^42
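The idea of translating one vocabulary into another can be sketched as follows. Note that this is not XSLT: Python's standard library has no XSLT processor, so the example performs a simple tag-renaming pass with xml.etree.ElementTree as a stand-in for what a stylesheet would express declaratively. Both vocabularies and the mapping between them are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from tags in a source vocabulary to a target vocabulary.
TAG_MAP = {
    "protein_entry": "polypeptide",
    "name": "label",
    "location": "compartment",
}

SOURCE = (
    '<protein_entry id="P0001">'
    "<name>example kinase</name>"
    "<location>cytoplasm</location>"
    "</protein_entry>"
)


def translate(xml_text, tag_map):
    """Rename every element whose tag appears in tag_map; keep the rest unchanged."""
    root = ET.fromstring(xml_text.strip())
    for elem in root.iter():
        elem.tag = tag_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


translated = translate(SOURCE, TAG_MAP)
print(translated)
```

Real translations usually involve restructuring as well as renaming (for example, turning an attribute into a child element), which is where a dedicated transformation language is more appropriate than an ad hoc script.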
Examples of such XML-based dialects include the BIOpolymer Markup Language (BIOML),^43 designed
for annotating the sequences of biopolymers (e.g., genes, proteins) in such a way that all information
about a biopolymer can be logically and meaningfully associated with it. Much like HTML, the
language uses tags along with a series of attributes.
The Microarray Markup Language (MAML) was created by a coalition of developers
(www.beahmish.lbl.gov) to meet community needs for sharing and comparing the results of gene
expression experiments. That community proposed the creation of a Microarray Gene Expression Database
and defined the minimum information about a microarray experiment (MIAME) needed to enable
(^40) See http://www.nbii.gov/datainfo/metadata/.
(^41) H. Simon, Modern Drug Discovery, American Chemical Society, Washington, DC, 2001, pp. 69-71.
(^42) See http://www.w3c./TR/xslt.
(^43) See http://www.bioml.com/BIOML.