Encyclopedia of Sociology

CORRELATION AND REGRESSION ANALYSIS

dimension. For example, among twentieth-centu-
ry nation-states, there is a high correlation be-
tween the energy consumption per capita and the
gross national product per capita. These two vari-
ables are presumably correlated because both are
indicators of the degree of industrial development.
Hence, one variable does not “explain”
variation in the other, if “explain” has any of its
usual meanings. Two variables may also be correlated
because both are influenced by a common
cause, in which case the two variables are “spuriously
correlated.” For example, among elementary-school
children, reading ability is positively correlated
with shoe size. This correlation appears
not because large feet facilitate learning, and not
because both are measures of the same underlying
dimension, but because both are influenced by
age. As they grow older, schoolchildren learn to
read better and their feet grow larger. Hence, shoe
size and reading ability are “spuriously correlated”
because of the dependence of both on age. It
would therefore be misleading to conclude from
the correlation between shoe size and reading
ability that part of the variation in reading ability is
“explained” by variation in shoe size, or vice versa.
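The shoe-size example can be illustrated with a small simulation. The sketch below is not part of the original entry: the coefficients, noise levels, age range, and sample size are hypothetical values chosen only for illustration, and `pearson` is a helper defined here. Each variable is generated from age plus independent noise, so neither one influences the other, yet the two come out substantially correlated.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 2,000 schoolchildren aged 6 to 12.
ages = [rng.uniform(6, 12) for _ in range(2000)]

# Assumed data-generating rules: each variable depends on age plus
# independent noise; neither variable appears in the other's rule.
shoe_size = [0.8 * a + rng.gauss(0, 1.0) for a in ages]
reading = [10.0 * a + rng.gauss(0, 10.0) for a in ages]

r_shoe_reading = pearson(shoe_size, reading)
print("r(shoe size, reading ability) =", round(r_shoe_reading, 2))
```

The printed coefficient is clearly positive even though the simulation contains no causal path between shoe size and reading ability; the common dependence on age produces the whole association.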


In the attempt to discover the reasons for the
correlation between two variables, it is often useful
to include additional variables in the analysis. Sev-
eral techniques are available for doing so.


PARTIAL CORRELATION

One may wish to explore the correlation between
two variables with a third variable “held constant.”
The partial correlation coefficient may be used for
this purpose. If the only reason for the correlation
between shoe size and reading ability is that
both are influenced by variation in age, then the
correlation should disappear when the influence
of variation in age is made nil—that is, when age is
held constant. Given a sufficiently large number of
cases, age could be held constant by considering
each age grouping separately—that is, one could
examine the correlation between shoe size and
reading ability among children who are six years
old, among children who are seven years old, eight
years old, etc. (And one presumes that there would
be no correlation between reading ability and shoe
size among children who are homogeneous in
age.) But such a procedure requires a relatively
large number of children in each age grouping.
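Holding age constant by considering each age grouping separately can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the data are simulated, ages are whole years from 6 to 11, and the coefficients and the `pearson` helper are hypothetical choices made for the example.

```python
import random
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated children: (age in whole years, shoe size, reading ability).
# Both measures depend only on age plus independent noise (assumed).
children = []
for _ in range(3000):
    age = rng.randint(6, 11)
    shoe = 0.8 * age + rng.gauss(0, 1.0)
    reading = 10.0 * age + rng.gauss(0, 10.0)
    children.append((age, shoe, reading))

# Correlation in the pooled sample, ignoring age.
overall_r = pearson([c[1] for c in children], [c[2] for c in children])

# Correlation computed separately within each age grouping.
by_age = defaultdict(list)
for age, shoe, reading in children:
    by_age[age].append((shoe, reading))

within_r = {
    age: pearson([s for s, _ in rows], [r for _, r in rows])
    for age, rows in sorted(by_age.items())
}

print("pooled:", round(overall_r, 2))
for age, r in within_r.items():
    print("age", age, "n =", len(by_age[age]), "r =", round(r, 2))
```

The within-age coefficients hover near zero while the pooled coefficient is large. Each grouping here holds roughly 500 simulated children; with far fewer cases per grouping the within-age coefficients would fluctuate widely, which is the practical limitation noted above.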


Lacking such a large sample, one may hold age
constant by “statistical adjustment.”

To understand the underlying logic of partial
correlation, one considers the regression residuals
(i.e., for each case, the discrepancy between the
regression line value and the observed value of the
predicted variable). For example, the regression
residual of reading ability on age for a given case is
the discrepancy between the actual reading ability
and the predicted reading ability based on age.
Each residual will be either positive or negative
(depending on whether the observed reading abili-
ty is higher or lower than the regression predic-
tion). Each residual will also have a specific value,
indicating how much higher or lower than the age-
specific mean (i.e., regression line values) the read-
ing ability is for each person. The complete set of
these regression residuals, each being a deviation
from the age-specific mean, describes the pattern
of variation in reading abilities that would obtain if
all of these schoolchildren were identical in age.
Similarly, the regression residuals for shoe size on
age describe the pattern of variation that would
obtain if all of these schoolchildren were identical
in age. Hence, the correlation between the two sets
of residuals—(1) the regression residuals of shoe
size on age and (2) the regression residuals of
reading ability on age—is the correlation between
shoe size and reading ability, with age “held constant.”
In practice, it is not necessary to find each
regression residual to compute the partial correlation,
because shorter computational procedures
have been developed. Hence,
$$
r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^{2})(1 - r_{yz}^{2})}} \qquad (11)
$$

where rxy⋅z = the partial correlation coefficient between X and
Y, holding Z constant; rxy = the bivariate correlation
coefficient between X and Y; rxz = the bivariate
correlation coefficient between X and Z; and
ryz = the bivariate correlation coefficient between
Y and Z.

It should be evident from equation 11 that if Z
is unrelated to both X and Y, controlling for Z will
yield a partial correlation that does not differ from
the bivariate correlation. If all correlations are
positive, each increase in the correlation between
the control variable, Z, and each of the focal