Computational Drug Discovery and Design

relevant pH and it is thus suspected that molecular charge can
influence drug–target interactions). Duplicates should also be
removed so that no single compound has an exaggerated influence
on the model. Chemical functions and moieties that can be
represented in multiple ways (aromatic rings, nitro groups, etc.)
should be standardized. Tautomeric groups should also be curated.
Many of the previous steps can be performed in an automated
manner by specialized software applications (e.g., ChemAxon’s
Standardizer). It is advisable, though, to manually verify a
random subset of the resulting curated molecular structures to
confirm that the curation worked as intended. Also note that some
software applications used for molecular descriptor calculation
impose restrictions on the molecular representations that can be
input (e.g., explicit or implicit hydrogens, and aromatic rings).
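
As a rough illustration of the duplicate-removal step, the sketch below collapses records that share the same structure key. It assumes each structure has already been standardized elsewhere (for instance, into a canonical string produced by a standardization tool). The record layout, the function name dedupe_records, and the choice of the median to merge replicate activity values are assumptions made for this example, not part of the protocol described above.

```python
from collections import defaultdict
from statistics import median


def dedupe_records(records):
    """Collapse records that share the same standardized structure key.

    `records` is an iterable of (structure_key, activity) pairs, where
    structure_key is assumed to be an already-standardized representation.
    Returns a dict with one entry per unique structure.
    """
    grouped = defaultdict(list)
    for key, activity in records:
        # Strip stray whitespace so trivially different keys still match.
        grouped[key.strip()].append(activity)
    # Assumption: replicate activities are merged by taking their median.
    return {key: median(values) for key, values in grouped.items()}


if __name__ == "__main__":
    raw = [
        ("c1ccccc1O", 5.2),
        ("c1ccccc1O", 5.6),  # same structure key reported twice
        ("CCO", 3.1),
    ]
    curated = dedupe_records(raw)
    print(len(curated), "unique structures kept out of", len(raw))
```

Merging replicates is only one possible policy. If the replicate values disagree strongly, discarding the compound or checking the original source may be more appropriate.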


  1. Once the dataset has been compiled, it is typically split into a
    training set (to calibrate the model) and an independent test set
    (which will be used to estimate the model’s predictive ability).
    Partitioning the dataset is not a trivial task. Often, training and
    test sets are obtained through random sampling or activity
    range algorithms. These approaches are especially appropriate
    when training and test sets are comparable in size, but better
    results are expected with more rational partitioning procedures
    such as sphere exclusion algorithms when test sets are small in
    comparison with the corresponding training sets. This is the
    typical situation: only 10–20% of the dataset is usually reserved
    for the test set. If active and inactive compounds are included in
    the training set, it is preferable that both categories be balanced
    in order to avoid bias toward predicting the overrepresented
    category.
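
The sphere exclusion idea mentioned in this note can be sketched as follows. This is a simplified variant written for illustration, not a reference implementation: each selected sphere center goes to the training set, and every unassigned compound inside that sphere goes to the test set, so each test compound lies within the chosen radius of a training compound. The function name, the use of Euclidean distance, and the choice of centers in input order are assumptions. In practice the descriptors would normally be scaled first, and the radius tuned until the test set reaches the desired size.

```python
import math


def sphere_exclusion_split(points, radius):
    """Return (train_idx, test_idx) for a list of descriptor vectors.

    Simplified variant: each sphere center is assigned to the training
    set, and every not-yet-assigned point within `radius` of that center
    is assigned to the test set.
    """
    remaining = list(range(len(points)))
    train_idx, test_idx = [], []
    while remaining:
        # Assumption: the next center is simply the first unassigned point.
        center = remaining.pop(0)
        train_idx.append(center)
        outside = []
        for i in remaining:
            if math.dist(points[center], points[i]) <= radius:
                test_idx.append(i)   # inside the sphere: test set
            else:
                outside.append(i)    # still available as a future center
        remaining = outside
    return train_idx, test_idx


if __name__ == "__main__":
    # Toy 2D descriptor space with two tight pairs and one isolated point.
    descriptors = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (10.0, 0.0)]
    train, test = sphere_exclusion_split(descriptors, radius=0.5)
    print("training indices:", train)
    print("test indices:", test)
```

A smaller radius yields more sphere centers and therefore a larger training set and a smaller test set. Isolated compounds always end up in the training set under this variant.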

  2. Molecular descriptors are numerical variables that reflect
    chemical information encoded within a symbolic representation of a
    chemical compound. A wide variety of descriptors is available to
    reflect different aspects of a molecule, from simple functional
    group counts to time-demanding quantum descriptors. Two
    fundamental aspects should be considered at this stage: first, the
    throughput speed associated with each kind of descriptor; second,
    the interpretability of each type of descriptor. If the models are
    intended to be used to screen large collections of chemicals, the
    selected molecular descriptors should ideally be easy to compute.
    If the model is expected to describe structure–activity
    relationships of a small number of chemically similar compounds,
    more computationally demanding descriptors can be afforded.
    If 3D descriptors are considered, an interesting question is
    what conformation should be used to compute the corresponding
    descriptor values. An ideal solution to account for the


Alan Talevi
