AMPK Methods and Protocols


3.8 Integration of Sequence Phylogeny and Domain Architecture


You will obtain a phylogenetic tree that approximates the
evolutionary history of the analyzed sequences. The Pfam domain
annotation of the analyzed proteins will be displayed at the leaves
of the tree. This provides an initial impression of how the domain
architectures, and thus the molecular functions of the sequences,
have evolved.


  1. Upload the set of sequences used for tree reconstruction into
    DoMosaics [30]. Annotate Pfam-A domains [28] in the input
    sequences by running a hmmscan analysis [31] within
    DoMosaics. Once the Pfam annotation has completed, DoMosaics
    will provide you with a graphical representation of the domain
    architectures.

  2. Upload a phylogenetic tree, and DoMosaics will let you
    integrate the phylogenetic information with the Pfam domain
    architecture of the sequences that were used for tree
    reconstruction (see Note 20). This then serves as an excellent
    basis to formulate more comprehensive hypotheses concerning
    the evolution of protein families and of their functionality.
    An example is shown in Fig. 12.
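
The pairing of tree leaves with domain architectures that DoMosaics
performs in its graphical interface can also be sketched outside the
tool, for example to check the annotation in a script. The following
Python sketch is not the DoMosaics implementation. It assumes that
hmmscan was run separately with the --domtblout option, that the
protein identifiers in the hmmscan output are identical to the leaf
labels of an unquoted Newick tree, and that an independent E-value
cutoff of 1e-3 is acceptable. The function names and the cutoff are
illustrative choices, not part of the protocol.

```python
import re
from collections import defaultdict


def parse_domtblout(text, max_ievalue=1e-3):
    """Return {protein_id: [domain names ordered along the sequence]}.

    `text` is the content of a file written by `hmmscan --domtblout`.
    For hmmscan, column 1 is the Pfam model and column 4 is the query protein.
    The cutoff `max_ievalue` is an assumed default, not a recommended value.
    """
    hits = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        domain, protein = fields[0], fields[3]
        ievalue = float(fields[12])          # independent E-value of this domain
        env_from, env_to = int(fields[19]), int(fields[20])  # envelope coordinates
        if ievalue <= max_ievalue:
            hits[protein].append((env_from, env_to, domain))
    # Sort by start position so the list reads N- to C-terminus.
    return {p: [d for _, _, d in sorted(h)] for p, h in hits.items()}


def newick_leaves(newick):
    """Return leaf labels of a Newick string in order of appearance.

    Assumes unquoted labels without whitespace. Leaves follow '(' or ',',
    whereas internal node labels follow ')', so they are not captured.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def leaf_architectures(newick, architectures):
    """Pair every tree leaf with its domain architecture (empty list if none)."""
    return [(leaf, architectures.get(leaf, [])) for leaf in newick_leaves(newick)]
```

Leaves that come back with an empty architecture either carry no
Pfam-A domain above the cutoff or have an identifier that does not
match between the tree and the sequence file. The second case is
worth ruling out before interpreting the tree.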


4 Notes



  1. Note that many of the provided genome sequences are
    unpublished. While this does not affect the analysis of the
    data, you may require permission to publish the results.

  2. When working with gene sets from two or more pathways, it is
    a good idea to analyze a single combined nonredundant gene
    set. This avoids repeated calls of the same programs, and it
    ensures that you use the same set of parameters for all proteins.
    You can divide the proteins and the newly generated metadata
    at a later point in the analysis. Note that individual proteins
    might be represented in more than one pathway.

  3. Occasionally, the number of sequence identifiers for which
    you have obtained cross-references differs from the original
    number of sequences. The reason is that the optimal case of a
    one-to-one relationship between identifiers in different
    databases does not always hold. Make sure to track and explain
    any difference in the number of sequences before and after the
    cross-referencing step to avoid information loss or the
    accumulation of redundancies.

  4. Pan-species databases, such as the nonredundant protein
    database of NCBI, are not a particularly good choice for such
    analyses. It is almost impossible to assess which species are
    represented by which sequences, and thus the database is simply
    a huge black box.
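
The bookkeeping described in Notes 2 and 3 can be sketched in a few
lines of Python. This is a minimal illustration and not part of the
original protocol. It assumes that each pathway is available as a
list of protein identifiers and that the cross-referencing step
returns pairs of (source identifier, target identifier). The function
names and the structure of the report are hypothetical.

```python
from collections import defaultdict


def combine_pathways(pathways):
    """Merge pathway gene sets into one nonredundant set (Note 2).

    `pathways` maps a pathway name to an iterable of protein identifiers.
    Returns the sorted nonredundant identifiers and, for each identifier,
    the sorted list of pathways it belongs to, so that results can be
    divided by pathway again later.
    """
    membership = defaultdict(set)
    for name, ids in pathways.items():
        for pid in ids:
            membership[pid].add(name)
    return sorted(membership), {p: sorted(n) for p, n in membership.items()}


def audit_mapping(source_ids, mapping_pairs):
    """Report every deviation from a one-to-one identifier mapping (Note 3).

    `mapping_pairs` is an iterable of (source_id, target_id) tuples.
    """
    targets = defaultdict(set)   # source id -> all target ids it maps to
    sources = defaultdict(set)   # target id -> all source ids mapping to it
    for src, tgt in mapping_pairs:
        targets[src].add(tgt)
        sources[tgt].add(src)
    return {
        "n_source": len(set(source_ids)),
        "n_target": len(sources),
        # identifiers lost during cross-referencing
        "unmapped": sorted(set(source_ids) - set(targets)),
        # one identifier expands into several (potential redundancy)
        "one_to_many": {s: sorted(t) for s, t in targets.items() if len(t) > 1},
        # several identifiers collapse into one (potential information loss)
        "many_to_one": {t: sorted(s) for t, s in sources.items() if len(s) > 1},
    }
```

If "unmapped", "one_to_many", and "many_to_one" are all empty, the
numbers of sequences before and after cross-referencing agree and no
further explanation is needed. Otherwise, the three entries list
exactly the identifiers that account for the difference.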


136 Arpit Jain et al.
