sharing. Consistent with the MIAME standards proposed by microarray users, MAML can be used to
describe experiments and results from all types of DNA arrays.
The Systems Biology Markup Language (SBML) is used to represent and model information in
systems simulation software, so that models of biological systems can be exchanged among different
software programs (e.g., E-Cell, StochSim). SBML, developed by the Caltech ERATO Kitano
systems biology project,^44 is organized around five categories of information: model, compartment,
geometry, specie, and reaction.
A downside of XML is that only a few of the largest and most used databases (e.g., GenBank)
support an XML interface. Other databases whose existence predates XML keep most of their data in
flat files. But this reality is changing, and database researchers are working to create conversion tools
and new database platforms based on XML. Additional XML-based vocabularies and translation tools
are needed.
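
As a rough illustration of what such a conversion tool does, the following minimal sketch turns one
flat-file record into XML using Python's standard library. The record layout (two-letter tags, "//" as
a record terminator), the tag-to-element mapping, and the element names are all hypothetical and are
not taken from any particular database's format.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat-file record: "TAG   value" lines, "//" ends a record.
FLAT_RECORD = """\
ID   SEQ0001
DE   hypothetical example protein
OS   Example organism
SQ   ATGGCGTTAGC
//
"""

# Hypothetical mapping from flat-file tags to XML element names.
TAG_TO_ELEMENT = {
    "ID": "accession",
    "DE": "description",
    "OS": "organism",
    "SQ": "sequence",
}


def flat_to_xml(text):
    """Convert flat-file text into an ElementTree rooted at <records>."""
    root = ET.Element("records")
    record = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):
            record = None  # the terminator closes the current record
            continue
        tag, _, value = line.partition(" ")
        if record is None:
            record = ET.SubElement(root, "record")
        name = TAG_TO_ELEMENT.get(tag)
        if name is None:
            continue  # skip tags this sketch does not know about
        ET.SubElement(record, name).text = value.strip()
    return root


root = flat_to_xml(FLAT_RECORD)
print(ET.tostring(root, encoding="unicode"))
```

A real converter would also need an agreed vocabulary (a schema) for the element names, which is the
role that the XML-based vocabularies mentioned above are meant to play.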
The data annotation process is complex and cumbersome when large datasets are involved, and
some efforts have been made to reduce the burden of annotation. For example, the Distributed Annota-
tion System (DAS) is a Web service for exchanging genome annotation data from a number of distrib-
uted databases. The system depends on the existence of a “reference sequence” and gathers “layers” of
annotation about the sequence that reside on third-party servers and are controlled by each annotation
provider. The data exchange standard (the DAS XML specification) enables layers to be provided in real
time from the third-party servers and overlaid to produce a single integrated view by a DAS client.
Success in the effort depends on the willingness of investigators to contribute annotation information
recorded on their respective servers, and on users’ learning about the existence of a DAS server (e.g.,
through ad hoc mechanisms such as link lists). DAS is also more or less specific to sequence annotation
and is not easily extended to other biological objects.
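
The layering idea can be sketched in a few lines of Python. This is only an illustration of the
overlay step under simplifying assumptions: the two "servers" are in-memory functions rather than
network services, the Feature fields and the reference identifier are hypothetical, and nothing here
follows the actual DAS XML specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Feature:
    start: int   # 1-based start on the reference sequence
    end: int     # inclusive end on the reference sequence
    label: str
    source: str  # which annotation provider supplied the feature


# Hypothetical stand-ins for independent third-party annotation servers.
def gene_layer(reference_id: str) -> List[Feature]:
    return [Feature(100, 900, "gene A", "gene-server")]


def snp_layer(reference_id: str) -> List[Feature]:
    return [
        Feature(450, 450, "SNP 1", "snp-server"),
        Feature(1200, 1200, "SNP 2", "snp-server"),
    ]


LAYERS: Dict[str, Callable[[str], List[Feature]]] = {
    "genes": gene_layer,
    "snps": snp_layer,
}


def integrated_view(reference_id, start, end, layers):
    """Overlay the features from every layer that overlap [start, end]."""
    view = []
    for fetch in layers.values():
        for feature in fetch(reference_id):
            if feature.start <= end and feature.end >= start:
                view.append(feature)
    # All layers share the reference coordinates, so one sort integrates them.
    return sorted(view, key=lambda f: (f.start, f.end, f.source))


view = integrated_view("chr_example", 1, 1000, LAYERS)
for feature in view:
    print(feature.start, feature.end, feature.label, feature.source)
```

The sketch works only because every provider reports positions against the same reference sequence,
which is why the system depends on the existence of that reference.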
Today, when biologists archive a newly discovered gene sequence in GenBank, for example, they
have various types of annotation software at their disposal to link it with explanatory data. Next-
generation annotation systems will have to do this for many other genome features, such as transcrip-
tion-factor binding sites and single nucleotide polymorphisms (SNPs), that most of today’s systems
don’t cover at all. Indeed, these systems will have to be able to create, annotate, and archive models of
entire metabolic, signaling, and genetic pathways. Next-generation annotation systems will have to be
built in a highly modular and open fashion, so that they can accommodate new capabilities and new
data types without anyone’s having to rewrite the basic code.
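
One common way to obtain this kind of modularity is a plug-in registry, sketched below in Python. The
registry, the decorator, and the two toy annotators are hypothetical, and the exact-match motif
searches are placeholders rather than realistic feature-detection methods. The point is only that a
new feature type is added by registering a new function, leaving the core annotate routine untouched.

```python
from typing import Callable, Dict, List

# Registry mapping a feature-type name to the function that finds it.
ANNOTATORS: Dict[str, Callable[[str], List[int]]] = {}


def register(feature_type):
    """Decorator that adds an annotator to the registry."""
    def decorator(func):
        ANNOTATORS[feature_type] = func
        return func
    return decorator


def find_all(sequence, motif):
    """Return every 0-based position at which motif occurs in sequence."""
    positions = []
    i = sequence.find(motif)
    while i != -1:
        positions.append(i)
        i = sequence.find(motif, i + 1)
    return positions


@register("start_codon")
def start_codons(sequence):
    return find_all(sequence, "ATG")


# A capability added later: the core code below does not change.
@register("tata_box")
def tata_boxes(sequence):
    return find_all(sequence, "TATAAA")


def annotate(sequence):
    """Run every registered annotator and merge the results by position."""
    results = []
    for feature_type, func in ANNOTATORS.items():
        for position in func(sequence):
            results.append({"type": feature_type, "position": position})
    return sorted(results, key=lambda r: r["position"])


annotations = annotate("GGTATAAAGGATGCCATGA")
for item in annotations:
    print(item["position"], item["type"])
```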
4.2.10 A Case Study: The Cell Centered Database^45
To illustrate the notions described above, it is helpful to consider an example of a database effort
that implements many of them. Techniques such as electron tomography are generating large amounts
of exquisitely detailed data on cells and their macromolecular organization that have to be exposed to
the greater scientific community. However, very few structured data repositories for community use
exist for the type of cellular and subcellular information produced using light and electron microscopy.
The Cell Centered Database (CCDB) addresses this need by developing a database for three-dimen-
sional light and electron microscopic information.^46
(^44) See http://www.cds.caltech.edu/erato.
(^45) Section 4.2.10 is adapted largely from M.E. Martone, S.T. Peltier, and M.H. Ellisman, “Building Grid Based Resources for
Neurosciences,” National Center for Microscopy and Imaging Research, Department of Neurosciences, University of California,
San Diego, unpublished and undated working paper.
(^46) M.E. Martone, A. Gupta, M. Wong, X. Qian, G. Sosinsky, B. Ludascher, and M.H. Ellisman, “A Cell-Centered Database for
Electron Tomographic Data,” Journal of Structural Biology 138(1-2):145-155, 2002; M.E. Martone, S. Zhang, S. Gupta, X. Qian, H.
He, D.A. Price, M. Wong, et al., “The Cell Centered Database: A Database for Multiscale Structural and Protein Localization Data
from Light and Electron Microscopy,” Neuroinformatics 1(4):379-396, 2003.