- In a nutshell, a gene missed in the annotation of a draft genome
does little harm to the analysis if there is a second genome
from a species in the same clade where the gene has been
correctly identified.
- The extension of precompiled orthologous groups with
HaMStR [25] is in principle straightforward. However, the
naming conventions for sequences in this package are very strict,
and a first-time user may find it difficult to meet all
requirements. We therefore recommend starting with
HaMStR_OneSeq instead.
- Please be aware that the file names used here are only examples and
may differ in the version of the program you are using.
- For a more stringent ortholog identification, replace the
-refspec option with -strict and omit any specification of a
reference species. The “-strict” option in HaMStR tells the
program to confirm orthology of a candidate sequence from
the target species for each sequence and species represented in
the core ortholog set. Refer to the HaMStR manual for further
details.
- Remember that orthology specifies only the evolutionary
relationship between two sequences. It does not indicate
whether the two sequences also perform the same function.
- If you run the PhyloProfile application for the first time, it may
perform some preprocessing on your data, such as mapping the
NCBI taxonomy IDs to species names. Simply follow the
guidelines of the tool. Once the preprocessing is completed, a restart
of the application might be required.
- Exploring phylogenetic profiles for the first time is not easy. It
requires keeping the evolutionary relationships of the analyzed
species in mind, together with all possible evolutionary events
explaining the presence/absence pattern of proteins in these
species. Only then will the phylogenetic profile start making
sense. As an example, imagine you find orthologs to a particular
protein in all mammals except, say, the dog. It is then safe to
assume that the corresponding protein was present in the last
common ancestor of all mammals. It follows that the
corresponding gene was either lost on the dog lineage or
erroneously missed in the annotation of the dog genome. You
would need to look into the genome assembly of the dog to
differentiate between the two possibilities. If, however, a second
species that is more closely related to dogs than to any
other species in your collection also lacks the protein, the “loss
hypothesis” gains weight, as it appears less likely that
the same gene has been missed in two independent
genome reconstructions. Of course, you could ask to what
extent the reconstructions are indeed independent. Imagine
138 Arpit Jain et al.