- In cases where the alignment contains many gaps, you can add
an optional post-processing step with Guidance [40] or TCS
[41]. This reduces noise in the data caused by incorrectly
aligned residues. Alternatively, you can remove alignment
columns in which the fraction of sequences represented by an amino
acid does not exceed a certain threshold. A limit of 50% is
common; however, you can increase this value if a more stringent
analysis is desired.
- Maximum likelihood tree reconstruction requires the a priori
definition of a protein sequence evolution model. Use the
(post-processed) alignment as input for a ProtTest run
(v3.4.1) [34]. Let ProtTest define its own tree for the analysis
and choose AIC as a model selection criterion. Make sure to
enable the testing for modeling substitution rate heterogeneity
across sites (option G), invariant sites (option I), and for using
empirical amino acid frequencies (option F). The top-ranked
model according to the AIC is the one that provides the best fit
to your data and should be used for downstream maximum
likelihood tree reconstruction (see Note 15).
- Use the multiple sequence alignment and the model for
sequence evolution as input for the maximum likelihood
(ML) tree reconstruction with RAxML [35]. When you add
the option -f a, RAxML will compute the maximum likelihood
tree together with rapid bootstrap support. Specify the number
of bootstrap replicates with the option -N. A complete
program call could read something like this:
raxml -n outputFileName -s alignmentFileName -m PROTGAMMAILGF
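The column-filtering alternative described in the first item above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by Guidance or TCS: the function names `read_fasta` and `filter_columns`, the parameter `min_occupancy`, and the choice of "-" and "." as gap symbols are assumptions made for this example. A column is kept only if the fraction of sequences with a residue at that position exceeds the threshold.

```python
from typing import Dict

# Assumption: these characters mark gaps in the alignment.
GAP_CHARS = set("-.")


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into an ordered {identifier: sequence} mapping."""
    records: Dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def filter_columns(alignment: Dict[str, str],
                   min_occupancy: float = 0.5) -> Dict[str, str]:
    """Drop columns whose fraction of non-gap residues does not exceed min_occupancy."""
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(seqs)
    keep = [
        col for col in range(length)
        if sum(seq[col] not in GAP_CHARS for seq in seqs) / n_seqs > min_occupancy
    ]
    return {name: "".join(seq[col] for col in keep)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    # Toy alignment invented for illustration only.
    toy = ">seq1\nMK-A-\n>seq2\nM--AV\n>seq3\nMR--V\n>seq4\nM---V\n"
    for name, seq in filter_columns(read_fasta(toy), min_occupancy=0.5).items():
        print(name, seq)
```

Raising `min_occupancy` above 0.5 corresponds to the more stringent analysis mentioned above.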
[Figure 9. Panel (A): FASTA format; panel (B): Phylip format. Both panels show the same example alignment of six sequences (A1_YEAST, PRKAA1_SORCE, A1_ENTHI, A1_ARATH, PRKAA1_HUMAN, PRKAA1_MOUSE) with 157 columns. In panel (B) the long identifiers appear as PRKAA1_SOR, PRKAA1_HUM, and PRKAA1_MOU.]
Fig. 9 Commonly used multiple sequence alignment formats. (a) FASTA and (b) Phylip. Note that the
sequence identifiers in Phylip format are typically limited to a maximum of ten characters. Format converters
will therefore shorten longer identifiers from the FASTA format (e.g., “PRKAA1_SORCE”) to the maximum length of
ten characters (“PRKAA1_SOR”) in Phylip format. Pay particular attention that the
identifiers are still unique after the conversion.
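The uniqueness check mentioned in the caption can be automated before converting an alignment. The sketch below assumes that the converter simply keeps the first ten characters of each identifier, as in the "PRKAA1_SORCE" to "PRKAA1_SOR" example; actual converters may shorten names differently. The function name `find_truncation_clashes` and the two identifiers with the suffixes "_1" and "_2" are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Typical identifier limit in Phylip format, as stated in the caption.
PHYLIP_NAME_LENGTH = 10


def find_truncation_clashes(identifiers: Iterable[str],
                            max_length: int = PHYLIP_NAME_LENGTH) -> Dict[str, List[str]]:
    """Group identifiers by their truncated form and return groups with more than one member.

    Assumption: truncation keeps the first max_length characters.
    Exact duplicates among the input identifiers are reported as well.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for identifier in identifiers:
        groups[identifier[:max_length]].append(identifier)
    return {short: names for short, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Identifiers from the example alignment in Fig. 9.
    figure_ids = ["A1_YEAST", "PRKAA1_SORCE", "A1_ENTHI",
                  "A1_ARATH", "PRKAA1_HUMAN", "PRKAA1_MOUSE"]
    print(find_truncation_clashes(figure_ids))

    # Hypothetical identifiers that become identical after truncation.
    print(find_truncation_clashes(["PRKAA1_HUMAN_1", "PRKAA1_HUMAN_2"]))
```

If the returned dictionary is not empty, rename the listed sequences before running the conversion.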
132 Arpit Jain et al.