5.3 Macromolecular Sequence Databases

5.3.1 Nucleotide Sequence Databases


GenBank http://www.ncbi.nlm.nih.gov/Genbank
GenBank is a comprehensive database that contains publicly available DNA
sequences for more than 140,000 named organisms. The sequences are
primarily obtained through submissions from individual laboratories and batch
submissions from large-scale sequencing projects (Benson et al. 2004). As of
February 2004, GenBank contained over 37 billion bases in over 32 million
sequence records. GenBank uses its own non-XML text format.
Most submissions to GenBank are made using the BankIt web service or the
Sequin program, and accession numbers are assigned by GenBank staff upon
receipt. Daily data exchange with the EMBL Data Library in the U.K. and the
DNA Data Bank of Japan (DDBJ) helps ensure worldwide coverage.
GenBank is accessible through NCBI’s retrieval system, Entrez, which integrates
data from the major DNA and protein sequence databases along with
taxonomy, genome mapping, protein structure, and domain information, and the
biomedical journal literature via PubMed.
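To make the "non-XML text format" concrete, the following is a minimal Python
sketch of reading one record in the GenBank flat-file style. It assumes the
usual layout in which a keyword occupies the first 12 columns, the sequence
follows an ORIGIN line, and a line starting with // ends the record. The
record text, the accession TOY0001, and the function name parse_flat_record
are made up for illustration; this is not NCBI's parser and it ignores the
internal structure of the feature table.

```python
import re

# Toy record written for illustration only; it is NOT a real GenBank entry.
SAMPLE_RECORD = """\
LOCUS       TOY0001                   24 bp    DNA     linear   SYN 01-JAN-2004
DEFINITION  Made-up example sequence.
ACCESSION   TOY0001
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""


def parse_flat_record(text):
    """Return (fields, sequence) for one GenBank-style flat-file record.

    Assumption: keywords sit in columns 1-12 and continuation lines leave
    those columns blank. Feature-table lines are folded into the preceding
    keyword rather than parsed properly.
    """
    fields = {}
    sequence_parts = []
    current_key = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines carry a base coordinate and spaces; keep letters.
            sequence_parts.append(re.sub(r"[^A-Za-z]", "", line))
            continue
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword == "ORIGIN":
            in_origin = True
        elif keyword:
            current_key = keyword
            fields[current_key] = value
        elif current_key:
            # Continuation of the previous keyword's value.
            fields[current_key] += " " + value
    return fields, "".join(sequence_parts)


fields, sequence = parse_flat_record(SAMPLE_RECORD)
print(fields["ACCESSION"], len(sequence))
```

For real work an established library parser is the safer choice; the sketch
only shows why the format is easy to read line by line but awkward to
validate compared with XML.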

EMBL http://www.ebi.ac.uk/embl
The EMBL Nucleotide Sequence Database, maintained at the European
Bioinformatics Institute (EBI), incorporates, organizes, and distributes nucleotide
sequences from public sources (Kulikova et al. 2004). The database is part of
an international collaboration with DDBJ and GenBank. Data are exchanged
between the collaborating databases on a daily basis. The Webin web service
is the preferred system for individual submission of nucleotide sequences,
including third-party annotation (TPA) and alignment data. Automatic
submission procedures are used for submission of data from large-scale genome
sequencing centers and from the European Patent Office. Database releases
are produced quarterly.
EMBL uses its own non-XML text format, but the XEMBL project has made
it possible to obtain EMBL data in the AGAVE XML format (Wang et al. 2002).
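As a rough illustration of the EMBL text format, the Python sketch below
groups the lines of one entry by their two-letter line code. It assumes the
common conventions that data start in column 6, XX lines are spacers,
sequence lines begin with blanks, and // terminates the entry. The entry
text, the accession TOY0001, and the function name parse_embl_entry are
invented for the example, and the layout of the ID line in particular is only
indicative.

```python
from collections import defaultdict

# Toy entry written for illustration only; it is NOT a real EMBL record.
SAMPLE_ENTRY = """\
ID   TOY0001    standard; DNA; SYN; 24 BP.
XX
AC   TOY0001;
XX
DE   Made-up example sequence.
XX
SQ   Sequence 24 BP; 6 A; 6 C; 6 G; 6 T; 0 other;
     acgtacgtac gtacgtacgt acgt                                        24
//
"""


def parse_embl_entry(text):
    """Return (lines_by_code, sequence) for one EMBL-style entry.

    Assumption: each annotation line starts with a two-letter code and its
    data begin at column 6; lines starting with two blanks hold sequence.
    """
    lines_by_code = defaultdict(list)
    sequence_parts = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate stray empty lines
        if line.startswith("//"):
            break  # end-of-entry marker
        code = line[:2]
        if code == "XX":
            continue  # spacer line, carries no data
        if code == "  ":
            # Sequence line: drop spaces and the trailing base count.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            lines_by_code[code].append(line[5:].strip())
    return dict(lines_by_code), "".join(sequence_parts)


codes, sequence = parse_embl_entry(SAMPLE_ENTRY)
print(codes["AC"], len(sequence))
```

Grouping by line code keeps repeated codes (for example several DE lines) in
order, which is one simple way to hold the data before mapping it onto an XML
representation.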
The latest EMBL data collection can be accessed via FTP, email, and web
interfaces. The EBI’s Sequence Retrieval System (SRS) integrates and links
the main nucleotide and protein databases as well as many other specialist
molecular biology databases. For sequence similarity searching, a variety of
tools (e.g., FASTA and BLAST) are available that allow users to compare their
own sequences against the data in EMBL and other databases.

DDBJ http://www.ddbj.nig.ac.jp
DDBJ is maintained at the National Institute of Genetics in Japan (Miyazaki