Systems Biology (Methods in Molecular Biology)

we said above, the best antidote against the irrelevance of our
findings [6].
Pearson continues: “In nearly all the cases dealt with in the text-
books of least squares, the variables on the right of our equations are
treated as independent, those on the left as dependent variables.” This
implies that the minimization of the sum of squared distances only
deals with the dependent (y) variable. The variance along the
independent (x) variable, being a consequence of the scientist's
choices (e.g., dose, time of observation...), is supposed to be
strictly controlled and thus does not enter into the evaluation of the
"fit" of the model.
The novelty of PCA lies in a different look at reality, one much
closer to the actual situation of systems biology, where the
traditional distinction between independent and dependent
variables is blurred. Karl Pearson [5] recognized this point as
crucial:
In many cases of physics and biology, however, the ‘independent’ variable is
subject to just as much deviation or error as the ‘dependent’ variable, we do not,
for example, know x accurately and then proceed to find y, but both x and y are
found by experiment or observation.

This new attitude is the core of the peculiar “best fitting”
procedure set forth by Pearson. Figure 2 reports on the left the
original plot of Karl Pearson and on the right the classical regression
scheme [6].
In PCA (left panel) the distances to minimize are perpendicular
to the model of the data (the straight line corresponding to the first
principal component of the x, y space), while in the classical regression
model (right panel) the distances are perpendicular to the x axis,
because the only uncertainty taken into account refers to y. This
apparently minor geometrical detail encompasses a sort of revolution
in the style of doing science [6]. The "real thing" (the structure
to be approximated by the least squares approach) is no longer the
"results as such" (the actual values of the observables, which we know
are an intermingled mixture of "general" and "singular" information)
but their "meaningful syntheses," corresponding to the principal
components.
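The geometric contrast between the two fitting criteria can be made concrete numerically. The sketch below (not from the chapter; it uses synthetic data and assumes numpy) fits the same two-dimensional cloud both ways: the ordinary regression slope, which minimizes vertical (y-direction) squared distances, and the first-principal-component slope, which minimizes perpendicular squared distances. When x carries as much error as y, the regression slope is attenuated toward zero, while the principal-component slope stays near the true relation.

```python
import numpy as np

# Synthetic example: x and y are two equally noisy observations
# of the same underlying quantity t (true slope between them = 1).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
x = t + rng.normal(scale=0.3, size=200)
y = t + rng.normal(scale=0.3, size=200)

# Classical regression: minimize vertical squared distances (error only in y)
cov = np.cov(x, y)
slope_ols = cov[0, 1] / cov[0, 0]

# PCA: the first principal component of the centered (x, y) cloud
# minimizes perpendicular squared distances (error in both variables)
X = np.column_stack([x - x.mean(), y - y.mean()])
_, _, vt = np.linalg.svd(X, full_matrices=False)
slope_pca = vt[0, 1] / vt[0, 0]

# Because x is noisy, the OLS slope is biased below 1 (attenuation),
# while the PCA (total least squares) slope recovers the true slope.
print(slope_ols, slope_pca)
```

This is why, when both variables are "found by experiment or observation," the principal-component line is the more faithful model of the data cloud.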
The discriminatory principle at the basis of PCA has to do with
a classical information theory axiom [7, 8]: the signal (meaningful,
general) part of information carried by the data corresponds to the
correlated variance (the flux of variability shared by different
variables).
This choice has a physical counterpart in the dynamics of
complex systems [9]: the uncorrelated part of information corre-
sponds to the so-called noise floor, i.e., to the minor components of
a data set.
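The signal/noise-floor split can be illustrated with a small simulation (my own sketch, not from the chapter; it assumes numpy and a single shared latent factor). Ten variables are all driven by one common source of variability plus independent noise: the correlated variance concentrates in the first eigenvalue of the correlation matrix, while the remaining components form the flat noise floor.

```python
import numpy as np

# Ten variables sharing one latent factor: the "flux of variability
# shared by different variables" loads on the first principal component.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 1))                 # shared signal
noise = rng.normal(scale=0.5, size=(500, 10))      # independent noise
data = latent + noise                              # each column = signal + noise

# Eigenvalues of the correlation matrix = variance carried by each
# principal component (descending order)
eigvals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

# The first eigenvalue is large (correlated, "signal" variance);
# the other nine are small and nearly equal (the noise floor).
print(eigvals)
```

The flat tail of near-equal minor eigenvalues is exactly the "noise floor" referred to above: uncorrelated variance that no component can summarize better than any other.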
Principal components are both the “best summary,” in a least
square sense, of the information present in the data cloud, and the

60 Alessandro Giuliani
