Systems Biology (Methods in Molecular Biology)

specific choice of statistical units whose degree of mirroring of the reference population is unknown [4]. As an example of selection bias, consider a model that tries to correlate folding rate with some structural properties of protein molecules: the data set on which we train the model is necessarily a limited selection of the entire protein Universe (if only because we do not know the 3D structure of all proteins). This results in a mixing of “data set specific” and “valid on the entire protein Universe” features across all the considered variables. The statistical theory of sampling, while very powerful when the reference population is well defined, is of little help in a large part of modern biological research, where we use a particular model system with the goal of generalizing the results outside its realm (e.g., from cell lines to the entire organ). Thus, we are left with the problem of the “degree of generalization” (and hence the actual meaning) of our findings: the “overfitting” effect gives us empirical guidance in this respect. In machine learning experiments [3] it was noted that, beyond a given level of accuracy (e.g., degree of correlation between the predicted and observed values), the generalization ability (performance of the model on a test set not used for model building) starts to decline. This decline stems from the fact that, beyond a certain level of accuracy, the fitting procedure starts to model noise: the more parameters that can be adjusted to fit the data, the faster the procedure becomes entrenched in modeling “data set specific” (and thus non-generalizable) details. In [3], the authors demonstrate that the overfitting problem can be tackled by reducing the dimensionality of the system, which corresponds to reducing the degrees of freedom of the modeling procedure. This means that (as suggested by Fig. 1) relying on more information is not necessarily a good thing. This statement may sound paradoxical in these times of ever-increasing computational power and huge data sets (see http://omics.org/index.php/Alphabetically_ordered_list_of_omes_and_omics), but it is crucial to pay attention to the above issues if we want to avoid a sort of thermal death of science.
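The overfitting effect described above can be reproduced on a toy problem. The following is only a minimal sketch under stated assumptions, not the procedure used in [3]: the linear "general law", the noise level, the sample sizes, and the function names `fit_and_score` and `overfitting_curve` are all hypothetical choices made for illustration. Polynomials of increasing degree (i.e., with more adjustable parameters) are fitted to a small noisy training set and then scored on an independent test set.

```python
import numpy as np


def fit_and_score(degree, x_train, y_train, x_test, y_test):
    """Fit a least-squares polynomial and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


def overfitting_curve(n_train=12, n_test=200, noise_sd=0.5, seed=0):
    """Score polynomial fits of increasing degree on synthetic data."""
    rng = np.random.default_rng(seed)

    def general_law(x):
        # Hypothetical "valid everywhere" relationship (an assumption).
        return 2.0 * x

    # Small training sample and a larger, independent test sample,
    # both corrupted by Gaussian noise of the same standard deviation.
    x_train = np.linspace(-1.0, 1.0, n_train)
    x_test = rng.uniform(-1.0, 1.0, n_test)
    y_train = general_law(x_train) + rng.normal(0.0, noise_sd, n_train)
    y_test = general_law(x_test) + rng.normal(0.0, noise_sd, n_test)

    # Degree n_train - 1 has as many coefficients as training points,
    # so the last fit passes through every noisy training value.
    degrees = list(range(1, n_train))
    scores = [fit_and_score(d, x_train, y_train, x_test, y_test) for d in degrees]
    return degrees, scores


degrees, scores = overfitting_curve()
for d, (train_mse, test_mse) in zip(degrees, scores):
    print(f"degree {d:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

Because the polynomial models are nested, the training error can only decrease as the degree grows, whereas the test error is expected to worsen once the extra coefficients start fitting the noise of this particular training sample. Restricting the degree plays the role of the reduction in degrees of freedom discussed in the text.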

2 Meaningful Syntheses

In his seminal 1901 paper [5], Karl Pearson succinctly stated the main goal of Principal Component Analysis (PCA): “In many physical, statistical and biological investigations it is desirable to represent a system of points in plane, three or higher dimensioned space by the ‘best fitting’ straight line or plane.” The need to collapse multidimensional information scattered over different (and sometimes heterogeneous) descriptors into a lower number of relevant dimensions is one of the main pillars of scientific knowledge and, as

