size of results, while less significant hits can be discarded by
unchecking their boxes, before the MSA procedure. The third
option is the selection of the similarity matrix. A similarity
matrix assigns a score to each amino acid pair, proportional
to the probability of observing the corresponding substitution
in nature. BLOSUM62 is the generally accepted matrix for
detecting weak similarities between protein sequences.
BLOSUM80 performs better at finding highly similar proteins,
and BLOSUM45 is a better choice for detecting distantly
related sequences. The Auto option lets the algorithm choose
the best matrix for the query sequence.
The fourth option filters low-complexity regions in proteins
to avoid false matches specific to those regions. If the query
protein is known to include low-complexity regions, this filter
can be selected to obtain more specific results. The fifth
option switches gapped alignment on or off; the default
selection is “yes.” The last option limits the number of
returned hits. Usually, 250 hits or fewer are sufficient to
build an MSA.
- The sources of protein annotations in UniProt include: (1)
in-house manual curation, (2) in-house automated predictions,
and (3) imports from external resources. In this way, UniProt
aims to provide comprehensive information on the properties
of proteins. However, the included data do not cover the
results of most of the external predictive approaches available
in the literature. For this reason, it is natural to observe
differences between the information in UniProt and the results
of a prediction method (e.g., functional site information
obtained from different resources in our case study). UniProt
aims to incorporate only the most reliable annotations backed
by strong evidence; as a result, a predictive approach may
provide relatively higher coverage of a query protein, although
a portion of its predictions may be false positives.
Nevertheless, predicted information obtained from other
resources can be evaluated along with UniProt annotations. For
example, predicted active residues can be compared with the
mutagenesis and variation information provided in UniProt
protein pages, which can be used to infer disease relations
and for drug targeting.
- The parameters of Clustal Omega are as follows: The first
option is the selection of the input data type as protein,
DNA, or RNA. The second is the output format, which can be
selected among Clustal, Pearson, MSF, PHYLIP, etc. The
“More options” button reveals further parameters of the tool,
such as the number of iterations at different steps of the
algorithm (e.g., sequence alignment and tree generation). A
higher number of iterations refines the results at the cost
of longer computation times. Usually the default values
66 Heval Atas et al.