relevant pH and it is thus suspected that molecular charge can
influence drug–target interactions). Duplicates should also be
removed so that no single compound has an exaggerated influence
on the model. Some chemical functions and moieties that can
be represented in multiple ways should be standardized: aro-
matic rings, nitro groups, etc. Tautomeric groups should also
be curated. Many of the previous steps can be performed in an
automated manner by specialized software applications (e.g.,
ChemAxon’s Standardizer). It is advisable, though, to manu-
ally verify a random subset of the resulting curated molecular
structures to ensure everything has gone well. Also note that
some software applications used for molecular descriptor calculation
impose restrictions on the molecular representations they accept as
input (e.g., explicit versus implicit hydrogens, or the encoding of
aromatic rings).
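
The duplicate-removal step can be illustrated with a minimal Python sketch. It assumes each record already carries a canonical structure key (for instance, a canonical identifier string produced by an earlier standardization step); the function name `merge_duplicates`, the toy records, and the policy of averaging replicate activities are illustrative assumptions, not a procedure prescribed by this chapter.

```python
from collections import defaultdict
from statistics import mean


def merge_duplicates(records):
    """Collapse records that share a structure key into a single entry.

    records: iterable of (structure_key, activity) pairs. The key is assumed
    to be a canonical identifier obtained after standardization; plain string
    comparison cannot detect duplicates written in different notations.
    """
    grouped = defaultdict(list)
    for key, activity in records:
        grouped[key.strip()].append(activity)
    # One entry per unique structure. Averaging replicates is an assumed
    # policy; keeping the median or discarding discordant values are
    # equally plausible alternatives.
    return {key: mean(values) for key, values in grouped.items()}


# Toy dataset (hypothetical values): the first and third records are the
# same compound reported twice.
dataset = [
    ("c1ccccc1O", 5.0),
    ("CCO", 4.0),
    ("c1ccccc1O", 7.0),
]
curated = merge_duplicates(dataset)
```

After merging, every compound contributes exactly one data point to the model, whichever policy is used to reconcile the replicate activity values.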
- Once the dataset has been compiled, it is typically split into a
training set (to calibrate the model) and an independent test set
(which will be used to estimate the model’s predictive ability).
Partitioning the dataset is not a trivial task. Often, training and
test sets are obtained through random sampling or activity
range algorithms. These approaches are especially appropriate
when training and test sets are comparable in size, but better
results are expected with more rational partitioning procedures
such as sphere exclusion algorithms when test sets are small in
comparison with the corresponding training sets. This is the
typical situation: only 10–20% of the dataset is usually reserved
for the test set. If active and inactive compounds are included in
the training set, it is preferred that both categories are balanced
in order to avoid bias toward the prediction of the overrepresented
category.
- Molecular descriptors are numerical variables that reflect chemical
information encoded within a symbolic representation of a
chemical compound. There is an extensive diversity of descrip-
tors available to reflect different aspects of a molecule, from
simple functional group counts to time-demanding quantum
descriptors. Two fundamental aspects should be considered at this
stage: first, the throughput speed associated with each kind of
descriptor; second, the interpretability of each type of
descriptor. If the models are intended to be used to screen
large collections of chemicals, the selected molecular descrip-
tors should ideally be easy to compute. If the model is expected
to describe structure–activity relationships within a small number
of chemically similar compounds, more computationally
demanding descriptors can be afforded.
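
Once descriptor values are available, the sphere exclusion partitioning mentioned in the previous note can be sketched on the descriptor vectors. The following Python sketch is a simplified variant, not the exact algorithm of any particular publication: each sphere center goes to the training set, the nearest compound inside the sphere goes to the test set, and the remaining compounds in the sphere go to the training set. The function name, the toy two-dimensional descriptors, and the radius are all illustrative assumptions; in practice descriptors would normally be scaled before distances are computed.

```python
from math import dist


def sphere_exclusion_split(points, radius):
    """Split compound indices into training and test sets.

    points: list of descriptor vectors (tuples of floats).
    radius: sphere radius in descriptor space (a tunable assumption).
    """
    remaining = list(range(len(points)))
    train, test = [], []
    while remaining:
        # Take the next unassigned compound as the sphere center.
        center = remaining.pop(0)
        train.append(center)
        inside = [i for i in remaining
                  if dist(points[center], points[i]) <= radius]
        if inside:
            # The closest neighbor of the center represents the sphere in
            # the test set; the other members join the training set.
            nearest = min(inside, key=lambda i: dist(points[center], points[i]))
            test.append(nearest)
            train.extend(i for i in inside if i != nearest)
            remaining = [i for i in remaining if i not in inside]
    return train, test


# Toy descriptor space (hypothetical values): two tight clusters and one
# isolated compound.
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
               (5.0, 5.0), (5.1, 5.0), (10.0, 10.0)]
train_idx, test_idx = sphere_exclusion_split(descriptors, radius=1.0)
```

With this construction every test compound lies within the chosen radius of at least one training compound, which is the property that makes such rational partitioning attractive when the test set is small. An isolated compound always ends up in the training set.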
If 3D descriptors are considered, an interesting question is
which conformation should be used to compute the corresponding
descriptor values. An ideal solution to account for the