3.8 Integration of Sequence Phylogeny and Domain Architecture
You will obtain a phylogenetic tree resembling the evolutionary
history of the analyzed sequences. The Pfam domain annotation of
the analyzed proteins will be displayed at the leaves of the tree. This
provides an initial impression of how the domain architectures and
thus the molecular functions of the sequences have evolved.
- Upload the set of sequences used for tree reconstruction into
DoMosaics [30]. Annotate Pfam-A domains [28] in the input
sequences by running a hmmscan analysis [31] within DoMosaics.
Once the Pfam annotation has completed, DoMosaics
will provide you with a graphical representation of the domain
architectures.
- Upload a phylogenetic tree, and DoMosaics will let you integrate
the phylogenetic information with the Pfam domain
architecture of the sequences that were used for tree
reconstruction (see Note 20). This then serves as an excellent basis
for formulating more comprehensive hypotheses concerning the
evolution of protein families and of their functionality. An
example is shown in Fig. 12.
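DoMosaics performs the annotation and the display through its graphical interface. If you want to inspect the domain architectures outside DoMosaics, the following minimal sketch may help. It assumes that hmmscan was run separately on the command line with the `--domtblout` option, and it relies on the column layout of the HMMER3 per-domain table. The function name `domain_architectures` and the independent E-value cutoff of 1e-3 are illustrative assumptions and are not part of the protocol.

```python
from collections import defaultdict


def domain_architectures(domtblout_lines, max_ievalue=1e-3):
    """Return {sequence id: [domain names ordered along the sequence]}.

    Assumes HMMER3 --domtblout columns (0-based, whitespace separated):
    0 = domain (target) name, 3 = sequence (query) name,
    12 = independent E-value, 19 = envelope start.
    The cutoff is a hypothetical default, not a recommended value.
    """
    hits = defaultdict(list)
    for line in domtblout_lines:
        # Skip blank lines and the '#' header/footer lines of the table.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split()
        domain, seq_id = cols[0], cols[3]
        i_evalue = float(cols[12])
        env_from = int(cols[19])
        if i_evalue <= max_ievalue:
            hits[seq_id].append((env_from, domain))
    # Sort the domains of each sequence by their start position.
    return {seq: [name for _, name in sorted(doms)]
            for seq, doms in hits.items()}
```

The function accepts any iterable of lines, so an open file handle of the table can be passed directly. Sequences without a hit below the cutoff do not appear in the result.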
4 Notes
- Be aware that many of the provided genome sequences are
unpublished. While this does not affect the analysis of the data,
you may require permission to publish the results.
- When working with gene sets from two or more pathways, it is
a good idea to analyze a single combined nonredundant gene
set. This avoids repeated calls of the same programs, and it
ensures that you use the same set of parameters for all proteins.
You can divide the proteins and the newly generated metadata
at a later point in the analysis. Note that individual proteins
might be represented in more than one pathway.
- It happens now and then that the number of sequence identifiers
for which you have obtained cross-references differs from
the original number of sequences. The reason is that the opti-
mal case of a one-to-one relationship between identifiers in
different databases is not always accomplished. Make sure to
track and explain any difference in the number of sequences
before and after the cross-referencing step to avoid information
loss or the accumulation of redundancies.
- Pan-species databases, such as the nonredundant protein database
of NCBI, are not a particularly good choice for such
analyses. It is almost impossible to assess which species are
represented by which sequences, and thus the database is simply
a huge black box.
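The notes above on combined nonredundant gene sets and on cross-referencing identifiers can be illustrated with a short sketch. It assumes that each pathway is available as a plain list of gene identifiers and that the cross-references are available as (source identifier, target identifier) pairs. The function names `combine_gene_sets` and `audit_cross_references` and the data layout are hypothetical and do not correspond to any tool used in this protocol.

```python
def combine_gene_sets(pathways):
    """Merge per-pathway gene lists into one nonredundant list.

    `pathways` maps a pathway name to a list of gene identifiers
    (hypothetical layout). Returns the sorted unique identifiers and a
    dict recording the pathways of each gene, so that the combined set
    can be divided again later in the analysis.
    """
    membership = {}
    for pathway, genes in pathways.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(pathway)
    gene_ids = sorted(membership)
    return gene_ids, {g: sorted(p) for g, p in membership.items()}


def audit_cross_references(query_ids, mapping_pairs):
    """Report every deviation from a one-to-one identifier mapping."""
    targets = {}  # source id -> set of target ids
    for source, target in mapping_pairs:
        targets.setdefault(source, set()).add(target)
    sources = {}  # target id -> set of source ids
    for source, mapped in targets.items():
        for target in mapped:
            sources.setdefault(target, set()).add(source)
    return {
        # Queries without any cross-reference (information loss).
        "unmapped": sorted(q for q in set(query_ids) if q not in targets),
        # One source with several targets (gain in sequence number).
        "one_to_many": {s: sorted(t) for s, t in targets.items()
                        if len(t) > 1},
        # Several sources sharing one target (redundancy).
        "many_to_one": {t: sorted(s) for t, s in sources.items()
                        if len(s) > 1},
    }
```

An empty report in all three categories indicates a one-to-one mapping. Any other outcome shows which identifiers account for the difference in sequence numbers before and after cross-referencing.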
136 Arpit Jain et al.