Encyclopedia of Sociology

CORRELATION AND REGRESSION ANALYSIS

transformed into ‘‘standard deviates’’ or ‘‘z measures.’’ Each value in a distribution may be transformed into a ‘‘z measure’’ by finding its deviation from the mean of the distribution and dividing by the standard deviation (the square root of the variance) of that distribution. Thus

Zx = (X − X̄) / √[ Σ(X − X̄)² / N ]          (7)
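Equation 7 can be illustrated with a short Python sketch (the data values here are made up for the example):

```python
import math

def z_scores(xs):
    """Transform values into standard deviates: each value's deviation
    from the mean, divided by the standard deviation (population form,
    dividing the sum of squared deviations by N, as in Equation 7)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]

scores = z_scores([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
# → [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

By construction the standardized values have mean 0 and standard deviation 1, which is what makes the regression results for standard deviates in the following equations possible.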

When both the X and Y measures have been thus
standardized, ryx = rxy is the slope of the regression
of Y on X, and of X on Y. For standard deviates,
the Y intercept is necessarily 0, and the following
equation holds:


Ẑy = ryx Zx          (8)

where Ẑy = the regression prediction for the ‘‘Z
measure’’ of Y, given X; Zx = the standard deviates
of X; and ryx = rxy = the Pearsonian correlation
between X and Y.
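The claim that r is the slope of the regression of standardized Y on standardized X, and equally of standardized X on standardized Y, can be checked directly; the following Python sketch (with made-up data values) standardizes both variables and regresses each on the other:

```python
import math

def standardize(v):
    """Convert values to standard deviates (Equation 7)."""
    n = len(v)
    m = sum(v) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in v) / n)
    return [(a - m) / sd for a in v]

def slope(xs, ys):
    """Least-squares slope of the regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
zx, zy = standardize(x), standardize(y)

# For standard deviates, the slope of Zy on Zx equals the slope of
# Zx on Zy, and both equal the Pearson correlation ryx = rxy.
r_yx = slope(zx, zy)   # ≈ 0.822 for this data
r_xy = slope(zy, zx)
assert abs(r_yx - r_xy) < 1e-9
```

For unstandardized measures the two slopes byx and bxy generally differ; standardization is what makes them coincide in a single symmetric coefficient.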


Like the slope byx for unstandardized measures, the slope for standardized measures, r, may be positive or negative. But unlike byx, r is always between 0 and 1.0 in absolute value. The correlation coefficient, r, will be 0 when the standardized regression line is horizontal, so that the two variables do not covary at all (and, incidentally, when the regression toward the mean, which was Galton’s original interest, is complete). On the other hand, r will be 1.0 or −1.0 when all values of Zy fall precisely on the regression line rZx. This means that when r = +1.0, for every case Zx = Zy; that is, each case deviates from the mean on X by exactly as much, and in the same direction, as it deviates from the mean on Y, when those deviations are measured in their respective standard deviation units. And when r = −1.0, the deviations from the mean measured in standard deviation units are exactly equal, but they are in opposite directions. (It is also true that when r = 1.0, there is no regression toward the mean, although this is very rarely of any interest in contemporary applications.) More commonly, r will be neither 0 nor 1.0 in absolute value but will fall between these extremes: closer to 1.0 in absolute value when the Zy values cluster closely around the regression line (which, in this standardized form, implies that the slope will be near 1.0), and closer to 0 when they scatter widely around the regression line.

But while r has a precise meaning (it is the slope of the regression line for standardized measures), that meaning is not intuitively understandable as a measure of the degree to which one variable can be accurately predicted from the other. The square of the correlation coefficient, r², does have such an intuitive meaning. Briefly stated, r² indicates the proportion of the possible reduction in prediction error (measured by the variance of actual values around predicted values) that is achieved by shifting from (a) Ȳ, the mean of Y, as the prediction, to (b) the regression line values as the prediction. Otherwise stated,

r² = (Variance of Y values around Ȳ − Variance of Y values around Ŷ) / (Variance of Y values around Ȳ)          (9)

The denominator of Equation 9 is called the total variance of Y. It is the sum of two components: (1) the variance of the Y values around Ŷ, and (2) the variance of the Ŷ values around Ȳ. Hence the numerator of Equation 9 is equal to the variance of the Ŷ values (the regression values) around Ȳ. Therefore

r² = (Variance of Ŷ values around Ȳ) / (Variance of Y values around Ȳ) = proportion of variance explained          (10)
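The proportional-reduction-in-error reading of r² in Equations 9 and 10 can be verified numerically; this Python sketch (with made-up data values) compares the error variance of predicting Ȳ for every case against the error variance of predicting from the regression line:

```python
def r_squared(xs, ys):
    """r-squared as a proportional reduction in prediction error:
    compare the variance of Y around its mean (predicting the mean
    for every case) with the variance of Y around the regression
    line (predicting a + b*x for each case), as in Equation 9."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    total = sum((y - my) ** 2 for y in ys) / n                    # around the mean of Y
    residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / n  # around the line
    return (total - residual) / total

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 6.0]
r2 = r_squared(x, y)   # ≈ 0.676 for this data
```

Equivalently, by the decomposition of the total variance noted above, the same value is obtained by dividing the variance of the regression predictions around the mean by the total variance.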

Even though it has become common to refer to r² as the proportion of variance ‘‘explained,’’ such terminology should be used with caution. There are several possible reasons for two variables to be correlated, and some of these reasons are inconsistent with the connotations ordinarily attached to terms such as ‘‘explanation’’ or ‘‘explained.’’ One possible reason for the correlation between two variables is that X influences Y. This is presumably the reason for the positive correlation between education and income; higher education facilitates earning a higher income, and it is appropriate to refer to a part of the variation in income as being ‘‘explained’’ by variation in education. But there is also the possibility that two variables are correlated because both are measures of the same