- Throughout Section 3, we assumed that the data frame of
activity data was already sorted by signal inhibition in decreasing
order. While sorting the data frame is not essential for
fitting the machine learning models in the later section, you
may want to sort your datasets for the heat map visualization,
for example, to show the ten molecules with the highest inhibition
activity. To sort the data frame df, you can use the
sort_values method of a pandas data frame object.
For example, the following code sorts the molecules stored as a
data frame df from most active to least active:
df = df.sort_values('Signal-Inhibition', ascending=False).
More information about the sort_values method can be
found in the official pandas documentation at
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html.
- While we recommend working with 3D structures because they
provide spatial relationships between chemical groups, molecular
features can also be derived from 1D string representations
of molecules or 2D structural representations. For example,
so-called molecular fingerprints, which encode the presence of
certain substructures or atom types, can be computed using the
open-source toolkit OpenBabel
(https://openbabel.org/docs/dev/Fingerprints/intro.html).
- To convert a 1D or 2D representation of a molecule into a 3D
structure as input for spatial functional group matching, which
for the DKPES dataset was done via Screenlamp [10] using
ROCS overlays (OpenEye Scientific Software, Santa Fe, NM;
https://www.eyesopen.com/rocs), you may find the following
tools helpful:
  - The CACTUS online SMILES translator and structure file
    generator (https://cactus.nci.nih.gov/translate/).
  - OMEGA (OpenEye Scientific Software, Santa Fe, NM;
    https://www.eyesopen.com/omega), which creates multiple
    favorable 3D conformers of a given structure from 1D,
    2D, or 3D representations [38, 39]. This software is
    available free for academic researchers upon completion of a
    license agreement with OpenEye.
- Further, you may find the BioPandas toolkit [40] helpful
(http://rasbt.github.io/biopandas/), which reads 3D
structures from the common MOL2 file format into the pandas
data frame format. This can be useful if you are working
with large MOL2 databases that contain thousands or
millions of structures that you want to filter for certain
properties prior to generating overlays via ROCS or computing
the functional group matching patterns via Screenlamp
(https://github.com/psa-lab/screenlamp).
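
As a minimal sketch of the sorting step described in the first note above, the following example sorts a small data frame by signal inhibition and keeps the most active molecules. The data frame contents, the Molecule column, and the choice of keeping two rows are hypothetical and serve only as an illustration; only the Signal-Inhibition column name and the sort_values call are taken from the note. With the actual activity data, df.head(10) would return the ten most active molecules.

```python
import pandas as pd

# Hypothetical stand-in for the activity data frame; the molecule
# names and inhibition values below are made up for illustration.
df = pd.DataFrame({
    'Molecule': ['mol_a', 'mol_b', 'mol_c', 'mol_d'],
    'Signal-Inhibition': [0.12, 0.87, 0.45, 0.66],
})

# Sort from most active to least active (highest inhibition first).
df = df.sort_values('Signal-Inhibition', ascending=False)

# Keep the top rows, e.g., as input for a heat map visualization.
# Here we keep two rows because the toy data frame has only four.
top = df.head(2)
print(top)
```

Note that sort_values returns a new data frame and leaves the original row labels in place; if consecutive row labels are needed after sorting, calling reset_index(drop=True) on the result is one option.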
332 Sebastian Raschka et al.