COMPUTATIONAL TOOLS 105
can be divided into three areas depending on the similarity of the target protein to proteins of known
structure: comparative (also known as homology) modeling, fold recognition (also known as thread-
ing), and de novo/new fold methods (also known as ab initio). This traditional division of prediction
methods has become blurred as the methods in each category incorporate detailed information used by
methods in the other categories.
In comparative (or homology) modeling, one or more template proteins of known structure with
high sequence homology (greater than 25 percent sequence identity) to the target sequence are identi-
fied. The target and template sequences are aligned through multiple sequence alignment (similar to
comparative genomics), and a three-dimensional structure of the target protein is generated from the
coordinates of the aligned residues of the template proteins. Finally, the model is evaluated using a
variety of criteria, and if necessary, the alignment and the three-dimensional model are refined until a
satisfactory model is obtained.
If no reliable template protein can be identified from sequence homology alone, the prediction
problem is denoted as a fold recognition (or threading) problem. The primary goal is to identify one or
more folds in the template proteins that are consistent with the target sequence. In the classical thread-
ing methods, known as “rigid body assembly,” a model is constructed from a library of known core
regions, loops, side chains, and folds, and the target sequence is then threaded onto the known folds.
After evaluating how well the model fits the known folds, the best fit is chosen. The assumption in fold
recognition is that only a finite number of folds exist and most existing folds can be identified from
known structures in the PDB. Indeed, as new sequences are deposited and more protein structures are
solved, there appear to be fewer and fewer unique folds. When two sequences share more than 25
percent similarity (or sequence identity), their structures are expected to have similar folds. However,
there are still remaining issues such as the high rate of false positives in fold recognition, and therefore,
the resulting alignment with the fold structure is poor. At 30 percent sequence identity, the fraction of
incorrectly aligned residues is about 20 percent, and the number rises sharply with further decreases in
sequence similarity. This limits the usefulness of comparative modeling.^138
If no template structure (or fold) can be identified with confidence by sequence homology methods,
the target sequence may be modeled using new fold prediction methods. The goal in this prediction
method rests on the biological assumption that proteins adopt their lowest free energy conformation as
their functional state. Thus, computational methods to predict structure ab initio comprise three ele-
ments: (1) protein geometry, (2) potential energy functions, and (3) an energy space search method
(energy minimization method). First, setting protein geometry involves determining the number of
particles to be used to represent the protein structure (for example, all-atom, united-atom, or virtual-
atom model) and the nature of the space where atoms can be allocated (e.g., continuous (off-lattice) or
discrete (lattice) model). In a simple ab initio folding such as a virtual-atom lattice model, one virtual
atom represents a number of atoms in a protein (i.e., the backbone is represented as a sequence of alpha
carbons) and an optimization method searches only the predetermined lattice points for positions of the
virtual atoms to minimize the energy functions. Second, the potential energy functions in ab initio
models include covalent terms, such as bond stretching, bond angle stretching, improper dihedrals, and
torsional angles, and noncovalent terms, such as electrostatic and van der Waals forces. The use of
molecular mechanics for refinement in comparative modeling is equivalent to ab initio calculation using
all atoms in an off-lattice model. Third, many optimizations tools, such as genetic algorithms, Monte
Carlo, simulated annealing, branch and bound, and successive quadratic programming (SQP), have
been used to search for the global minimum in the energy (or structure) spaces with a number of local
minima. These approaches have provided encouraging results, although the performance of each
method may be limited by the shape of the energy space.
(^138) T. Head-Gordon and J. Wooley, “Computational Challenges in Structural and Functional Genomics,” IBM Systems Journal
40(2):265-296, 2001.