specific choice of statistical units whose ability to mirror the reference population is unknown [4]. As an example of selection bias, consider a model that tries to correlate the folding rate of protein molecules with some of their structural properties: the data set on which we train the model is necessarily a limited selection of the entire protein Universe (if only because we do not know the 3D structure of all proteins). This produces a mixture of "data-set-specific" and "valid on the entire protein Universe" features across all the variables considered. The statistical theory of sampling, while very powerful when the reference population is well defined, is largely out of scope for much of modern biological research, where we use a particular model system with the goal of generalizing the results beyond its realm (e.g., from cell lines to the entire organ).
Thus, we are left with the problem of the "degree of generalization" (and hence the actual meaning) of our findings: the "overfitting" effect gives us empirical guidance in this respect. In machine learning experiments [3] it was noted that, beyond a given level of accuracy (e.g., degree of correlation between predicted and observed values), the generalization ability (the performance of the model on a test set not used for model building) starts to decline. This decline stems from the fact that, beyond a certain level of accuracy, the fitting procedure starts to model noise: the more parameters that can be adjusted to fit the data, the faster the procedure becomes entrenched in modeling "data-set-specific" (and thus non-generalizable) details.
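A minimal numerical sketch of this effect is given below. The data, the polynomial model, and the degrees tried are illustrative assumptions, not the setup of [3]; the point is only that, as the number of adjustable parameters grows, agreement on the training half keeps improving while the correlation on the held-out half eventually degrades.

```python
# Overfitting sketch: fit polynomials of increasing degree to noisy data and
# compare correlation on training points vs. held-out points.
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # signal + noise

train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)          # even/odd split

for degree in (1, 3, 5, 8, 12):
    coeffs = P.polyfit(x[train], y[train], degree)              # fit on training half only
    r_train = np.corrcoef(P.polyval(x[train], coeffs), y[train])[0, 1]
    r_test = np.corrcoef(P.polyval(x[test], coeffs), y[test])[0, 1]
    print(f"degree {degree:2d}: r(train) = {r_train:.3f}, r(test) = {r_test:.3f}")
```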
In [3], the authors demonstrate that the overfitting problem can be tackled by reducing the dimensionality of the system, which corresponds to reducing the degrees of freedom of the modeling procedure. This means that (as suggested by Fig. 1) relying on more information is not necessarily a good thing. This statement may sound paradoxical in these times of ever-increasing computational power and huge data sets (see http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics), but it is crucial to pay attention to the above issues if we want to avoid a sort of thermal death of science.
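The following sketch illustrates this remedy under assumed, synthetic conditions (the data sizes, the two latent factors, and the use of scikit-learn's PCA and LinearRegression are choices made for the example, not the procedure of [3]): with many partly redundant descriptors and few statistical units, a full-dimensional regression fits the training data almost perfectly but generalizes poorly, whereas a regression on the first two principal components retains predictive power on the test set.

```python
# Dimensionality reduction as a guard against overfitting:
# compare regression on all descriptors vs. regression on two principal components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n, p = 60, 40                                    # few units, many descriptors
latent = rng.normal(size=(n, 2))                 # two "real" underlying factors
X = latent @ rng.normal(size=(2, p)) + rng.normal(scale=0.5, size=(n, p))
y = latent[:, 0] - latent[:, 1] + rng.normal(scale=0.3, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

full = LinearRegression().fit(X_tr, y_tr)        # uses all 40 descriptors
pca = PCA(n_components=2).fit(X_tr)              # collapse to 2 components
reduced = LinearRegression().fit(pca.transform(X_tr), y_tr)

r = lambda a, b: np.corrcoef(a, b)[0, 1]
print("full model  r(test) =", round(r(full.predict(X_te), y_te), 3))
print("PCA-reduced r(test) =", round(r(reduced.predict(pca.transform(X_te)), y_te), 3))
```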
2 Meaningful Syntheses
In his seminal 1901 paper [5], Karl Pearson concisely defined the main goal of Principal Component Analysis (PCA): "In many physical, statistical, and biological investigations it is desirable to represent a system of points in plane, three, or higher dimensioned space by the 'best fitting' straight line or plane." The need to collapse multidimensional information scattered over different (and sometimes heterogeneous) descriptors into a smaller number of relevant dimensions is one of the main pillars of scientific knowledge and, as