4.2.2 Data Standards
One obvious approach to data integration relies on technical standards that define representations of
data and hence provide an understanding of data that is common to all database developers. For obvious
reasons, standards are most relevant to future datasets. Legacy databases, which have been built around
unique data definitions, are much less amenable to a standards-driven approach to data integration.
Standards are indeed an essential element of efforts to achieve data integration of future datasets,
but the adoption of standards is a nontrivial task. For example, community-wide standards for data
relevant to a certain subject almost certainly differ from those that might be adopted by individual
laboratories, which are the focus of the “small-instrument, multi-data-source” science that characterizes
most public-sector biological research.
Ideally, source data from these projects flow together into larger national or international data
resources that are accessible to the community. Adopting community standards, however, entails local
compromises (e.g., nonoptimal data structuring and semantics, greater expense), and the budgets that
characterize small-instrument, single-data-source science generally provide inadequate support
for local data management and usually no support at all for contributions to a national data repository.
If data from such diverse sources are to be maintained centrally, researchers and laboratories must have
incentives and support to adopt broader standards in the name of the community’s greater good. In this
regard, funding agencies and journals have considerable leverage; by requiring researchers to deposit
data that conform to community standards, for example, they may be able to provide such incentives.
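To make the notion of conformance concrete, a deposit pipeline might perform a simple completeness check against the community standard before accepting a submission. The sketch below is purely illustrative: the required fields, the record contents, and the check itself are assumptions made for this example, not any actual community standard.

```python
# Minimal sketch (illustrative assumptions only): check a local record
# against a hypothetical community standard before deposit in a shared
# repository. The field names do not correspond to any real standard.

REQUIRED_FIELDS = {"organism", "sample_id", "protocol", "measurement", "units"}

def missing_fields(record: dict) -> list:
    """Return the required fields that the local record does not supply."""
    return sorted(REQUIRED_FIELDS - set(record))

local_record = {"organism": "S. cerevisiae",
                "sample_id": "ySR128",
                "measurement": 0.82}

gaps = missing_fields(local_record)
if gaps:
    print("Cannot deposit; record lacks required fields:", gaps)
```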
At the same time, data standards cannot resolve the integration problem by themselves even for
future datasets. One reason is that in some fast-moving and rapidly changing areas of science (such as
biology), it is likely that the data standards existing at any given moment will not cover some new
dimension of data. A novel experiment may make measurements that existing data standards did not
anticipate. (For example, sequence databases—by definition—do not integrate methylation data; and yet
methylation is an essential characteristic of DNA that falls outside primary sequence information.) As
knowledge and understanding advance, the meaning attached to a term may change over time. A second
reason is that standards are difficult to impose on legacy systems, because legacy datasets are usually very
difficult to convert to a new data standard and conversion almost always entails some loss of information.
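The methylation example can be made concrete with a small, hypothetical sketch. A record type fixes the fields its standard anticipated; an unanticipated measurement can only be attached as an ad hoc annotation that other databases have no agreed way to interpret. All names below are illustrative assumptions.

```python
# Minimal sketch (illustrative, not from the report): a fixed record schema
# cannot anticipate every new kind of measurement, so unanticipated data
# such as methylation calls end up in an open-ended "annotations" slot
# outside the standardized fields.

from dataclasses import dataclass, field

@dataclass
class SequenceRecord:
    accession: str
    sequence: str                                     # fields the standard anticipated
    annotations: dict = field(default_factory=dict)   # everything it did not

rec = SequenceRecord(accession="X00001", sequence="ACGTACGT")

# A later experiment measures methylation; the standard has no field for it,
# so the data can only be stored as a nonstandard, non-interoperable annotation.
rec.annotations["methylation_sites"] = [2, 5]
```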
As a result, data standards themselves must evolve as the science they support changes. Because
standards cannot be propagated instantly throughout the relevant biological community, database A
may be based on Version 12.1 of a standard, and database B on Version 12.4 of the “same” standard. It
would be desirable if the differences between Versions 12.1 and 12.4 were not large and a basic level of
integration could still be maintained, but this is not ensured in an environment of varying options
within standards, multiple releases and versions of products, and so on. In short, much of the devil of
ensuring data integration is in the detail of implementation.
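As a hypothetical illustration of this version-skew problem, the sketch below maps records produced under two releases of the "same" standard onto a shared core of fields so that a basic level of integration can be preserved. The version numbers, field renamings, and mapping tables are assumptions made for the sake of the example.

```python
# Minimal sketch (illustrative assumptions throughout): two databases emit
# records under different releases of the "same" standard; an integrator
# projects each release onto a shared core so basic integration still holds
# even when field names drift between versions.

FIELD_MAP = {
    "12.1": {"organism": "organism", "expr_level": "expression"},
    "12.4": {"species": "organism", "expression_value": "expression"},
}

def to_common_core(record: dict, version: str) -> dict:
    """Project a version-specific record onto the shared core fields."""
    mapping = FIELD_MAP[version]
    return {core: record[src] for src, core in mapping.items() if src in record}

a = to_common_core({"organism": "E. coli", "expr_level": 3.2}, "12.1")
b = to_common_core({"species": "E. coli", "expression_value": 3.4}, "12.4")
assert set(a) == set(b) == {"organism", "expression"}
```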
Experience in the database world suggests that standards gaining widespread acceptance in the
commercial marketplace tend to have a long life span, because the marketplace tends to weed out weak
standards before they become widely accepted. Once a standard is widely used, industry is often moti-
vated to maintain compliance with this accepted standard, but standards created by niche players in the
market tend not to survive. This point is of particular relevance in a fragmented research environment and
suggests that standards established by strong consortia of multiple players are more likely to endure.
4.2.3 Data Normalization
An important issue related to data standards is data normalization. Data normalization is the process
through which data taken on the “same” biological phenomenon by different instruments, procedures, or
researchers can be rendered comparable. Such comparability problems can arise in many different contexts:
4. Section 4.2.3 is based largely on a presentation by C. Ball, “The Normalization of Microarray Data,” presented at the AAAS 2003 meeting in Denver, Colorado.