of these approaches are labor-intensive, and an outstanding challenge in the area of data normalization
is to develop approaches to minimize systematic bias that demand less labor and expense.
4.2.4 Data Warehousing
Data warehousing is a centralized approach to data integration. The maintainer of the data ware-
house obtains data from other sources and converts them into a common format, with a global data
schema and indexing system for integration and navigation. Such systems have a long track record of
success in the commercial world, especially for resource management functions (e.g., payroll, inven-
tory). These systems are most successful when the underlying databases can be maintained in a con-
trolled environment that allows them to be reasonably stable and structured. Data warehousing is
dominated by relational database management systems (RDBMS), which offer a mature and widely
accepted database technology and a standard high-level query language (SQL).
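To make the architecture concrete, the sketch below (in Python, with invented source feeds, field names, and a toy global schema rather than those of any real warehouse) transforms records from two heterogeneous sources into a single relational table and then answers a question over the integrated data with ordinary SQL.

```python
# Minimal warehouse-style sketch: records from two hypothetical source feeds
# (hard-coded dictionaries standing in for real biological data sources) are
# transformed into one global schema and loaded into a local relational store.
import sqlite3

# Hypothetical extracts from two autonomous sources, each with its own field names.
SOURCE_A = [{"acc": "U49845", "organism": "S. cerevisiae", "len": 5028}]
SOURCE_B = [{"id": "NM_000546", "species": "H. sapiens", "length": 2512}]

def transform(record, mapping):
    """Map a source-specific record onto the warehouse's global schema."""
    return {global_field: record[source_field]
            for global_field, source_field in mapping.items()}

# Global schema: (accession, organism, seq_length), with one mapping per source.
MAPPINGS = {
    "source_a": {"accession": "acc", "organism": "organism", "seq_length": "len"},
    "source_b": {"accession": "id", "organism": "species", "seq_length": "length"},
}

conn = sqlite3.connect(":memory:")  # the centralized warehouse store
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, seq_length INTEGER)"
)

# Load step: every source is converted to the common format before insertion.
for name, records in [("source_a", SOURCE_A), ("source_b", SOURCE_B)]:
    for rec in records:
        row = transform(rec, MAPPINGS[name])
        conn.execute(
            "INSERT INTO sequences (accession, organism, seq_length) VALUES (?, ?, ?)",
            (row["accession"], row["organism"], row["seq_length"]),
        )

# Once integrated, a single SQL query spans what were originally separate sources.
for row in conn.execute("SELECT accession, organism FROM sequences ORDER BY seq_length DESC"):
    print(row)
```

The hard-coded mappings are also where the fragility enters: whenever a source changes its data model, the corresponding mapping and loader must be rewritten, which is essentially the churn problem Stein describes below.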
However, biological data are often qualitatively different from the data contained in commercial
databases. Furthermore, biological data sources are much more dynamic and unpredictable, and few
public biological data sources use structured database management systems. Data warehouses are often
troubled by a lack of synchronization between the data they hold and the original database from which
those data derive because of the time lag involved in refreshing the data warehouse store. Data ware-
housing efforts are further complicated by the issue of updates. Stein writes:^6
One of the most ambitious attempts at the warehouse approach [to database integration] was the Inte-
grated Genome Database (IGD) project, which aimed to combine human sequencing data with the multi-
ple genetic and physical maps that were the main reagent for human genomics at the time. At its peak,
IGD integrated more than a dozen source databases, including GenBank, the Genome Database (GDB)
and the databases of many human genetic-mapping projects. The integrated database was distributed to
end-users complete with a graphical front end.... The IGD project survived for slightly longer than a
year before collapsing. The main reason for its collapse, as described by the principal investigator on the
project (O. Ritter, personal communication, as relayed to Stein), was the database churn issue. On aver-
age, each of the source databases changed its data model twice a year. This meant that the IGD data
import system broke down every two weeks and the dumping and transformation programs had to be
rewritten—a task that eventually became unmanageable.
Also, because of the breadth and volume of biological databases, the effort involved in maintaining
a comprehensive data warehouse is enormous—and likely prohibitive. Such an effort would have to
integrate diverse biological information, ranging from sequence and structure data to the functions of
biochemical pathways and genetic polymorphisms.
Still, data warehousing is a useful approach for specific applications that are worth the expense of
intense data cleansing to remove potential errors, duplications, and semantic inconsistency.^7 Two cur-
rent examples of data warehousing are GenBank and the International Consortium for Brain Mapping
(ICBM) (the latter is described in Box 4.2).
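As a rough illustration of what such cleansing involves, the sketch below (with records and normalization rules invented for the example, not drawn from GenBank or ICBM) collapses duplicate entries that differ only in identifier case and whitespace, and flags records that disagree semantically about the same accession so they can be reviewed before loading.

```python
# Illustrative cleansing pass before warehouse loading: normalize identifiers,
# drop exact duplicates, and flag records that conflict on the same accession.
# The records and normalization rules are invented for this example.

def normalize(record):
    """Canonicalize fields so superficially different duplicates compare equal."""
    return {
        "accession": record["accession"].strip().upper(),
        "organism": " ".join(record["organism"].split()).lower(),
    }

def cleanse(records):
    seen = {}       # accession -> first cleaned record
    conflicts = []  # records that disagree with an earlier entry for the same accession
    for rec in map(normalize, records):
        prior = seen.get(rec["accession"])
        if prior is None:
            seen[rec["accession"]] = rec      # first occurrence: keep it
        elif prior != rec:
            conflicts.append(rec)             # same key, different content: needs review
        # identical duplicates are silently dropped
    return list(seen.values()), conflicts

raw = [
    {"accession": "u49845 ", "organism": "Saccharomyces  cerevisiae"},
    {"accession": "U49845", "organism": "saccharomyces cerevisiae"},  # duplicate after cleaning
    {"accession": "U49845", "organism": "homo sapiens"},              # semantic conflict
]
clean, flagged = cleanse(raw)
print(len(clean), "clean records;", len(flagged), "flagged for manual review")
```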
4.2.5 Data Federation
The data federation approach to integration is not centralized and does not require a “master”
database. Instead, scientists maintain their own specialized databases encapsulating their particular
areas of expertise and retain control of the primary data, while still making those data available to
other researchers. In other words, the underlying data sources are autonomous. Data federation often
(^6) Reprinted by permission from L.D. Stein, “Integrating Biological Databases,” Nature Reviews Genetics 4(5):337-345, 2003. Copy-
right 2005 Macmillan Magazines Ltd.
(^7) R. Resnick, “Simplified Data Mining,” pp. 51-52 in Drug Discovery and Development, 2000. (Cited in Chung and Wooley, 2003.)