Encyclopedia of Sociology

CORRELATION AND REGRESSION ANALYSIS

Typically the number of cases is insufficient to
yield a stable estimate of each of a series of condi-
tional means, one for each level of X. Means based
on a relatively small number of cases are inaccu-
rate because of sampling variation, and a line
connecting such unstable conditional means may
not be straight even though the true regression
line is. Hence, one assumes that the regression line
is a straight line unless there are compelling rea-
sons for assuming otherwise; one can then use the
X and Y values for all cases together to estimate the
Y intercept, ayx, and the slope, byx, of the regres-
sion line that is best fit by the criterion of least
squares. This criterion requires predicted values
for Y that will minimize the sum of squared devia-
tions between the predicted values and the ob-
served values. Hence, a ‘‘least squares’’ regression
line is the straight line that yields a lower sum of
squared deviations between the predicted (regres-
sion line) values and the observed values than does
any other straight line. One can find the parame-
ters of the ‘‘least squares’’ regression line for a
given set of X and Y values by computing


b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}    (3)

a_{yx} = \bar{Y} - b_{yx} \bar{X}    (4)

These parameters (substituted in equation 2)
describe the straight regression line that best fits
by the criterion of least squares. By substituting
the X value for a given case into equation 2, one
can then find Ŷ for that case. Otherwise stated,
once ayx and byx have been computed, equation 2
will yield a precise predicted income level (Ŷ) for
each education level.
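As a concrete sketch of equations 2 through 4, the slope, intercept, and predicted values can be computed in a few lines of Python; the education and income figures below are hypothetical illustrations, not data from this article:

```python
# Least-squares slope (equation 3) and intercept (equation 4),
# then predicted values (equation 2): Y-hat = a_yx + b_yx * X.
# The data below are hypothetical illustrative values.

X = [8, 10, 12, 12, 14, 16, 16, 18]   # years of education
Y = [18, 22, 27, 25, 30, 38, 35, 42]  # income, in $1,000s

N = len(X)
mean_x = sum(X) / N
mean_y = sum(Y) / N

# Equation 3: b_yx = sum((X - X-bar)(Y - Y-bar)) / sum((X - X-bar)^2)
b_yx = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / \
       sum((x - mean_x) ** 2 for x in X)

# Equation 4: a_yx = Y-bar - b_yx * X-bar
a_yx = mean_y - b_yx * mean_x

# Equation 2: a predicted income level for each education level
Y_hat = [a_yx + b_yx * x for x in X]
```

With these values in hand, the fit of the line can be judged by how closely the observed Y values cluster around the predicted values in Y_hat.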


These predicted values may be relatively good
or relatively poor predictions, depending on wheth-
er the actual values of Y cluster closely around the
predicted values on the regression line or spread
themselves widely around that line. The variance
of the Y values (income levels in this illustration)
around the regression line will be relatively small if
the Y values cluster closely around the predicted
values (i.e., when the regression line provides rela-
tively good predictions). On the other hand, the
variance of Y values around the regression line will
be relatively large if the Y values are spread widely
around the predicted values (i.e., when the regres-
sion line provides relatively poor predictions). The
variance of the Y values around the regression
predictions is defined as the mean of the squared
deviations between them. The variances around
each of the values along the regression line are
assumed to be equal. This is known as the assump-
tion of homoscedasticity (homogeneous scatter or
variance). When the variances of the Y values
around the regression predictions are larger for
some values of X than for others (i.e., when
homoscedasticity is not present), then X serves as a
better predictor of Y in one part of its range than
in another. The homoscedasticity assumption is
usually at least approximately true.
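One rough way to probe the homoscedasticity assumption is to compare the variance of the residuals in different parts of the range of X. The sketch below (again with hypothetical data) splits the cases at the mean of X; markedly unequal residual variances in the two groups would suggest heteroscedasticity:

```python
# Compare the residual variance around the regression line for low
# and high values of X; roughly equal variances are consistent with
# the homoscedasticity assumption. Data are hypothetical.

X = [8, 10, 12, 12, 14, 16, 16, 18]
Y = [18, 22, 27, 25, 30, 38, 35, 42]

N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / \
    sum((x - mean_x) ** 2 for x in X)
a = mean_y - b * mean_x

# Residuals: deviations of observed Y from the regression predictions
resid = [y - (a + b * x) for x, y in zip(X, Y)]

def variance(values):
    """Mean squared deviation from the mean (population variance)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Split the cases at the mean of X and compare residual variances
low = [r for x, r in zip(X, resid) if x <= mean_x]
high = [r for x, r in zip(X, resid) if x > mean_x]
var_low, var_high = variance(low), variance(high)
```

With only a handful of cases per group, such a comparison is of course unstable, for the same sampling-variation reasons discussed above for conditional means.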

The variance around the regression line is a
measure of the accuracy of the regression predic-
tions. But it is not an easily interpreted measure of
the degree of correlation because it has not been
‘‘normed’’ to vary within a limited range. Two
other measures, closely related to each other, pro-
vide such a normed measure. These measures,
which are always between zero and one in absolute
value (i.e., sign disregarded), are: (a) the correla-
tion coefficient, r, which is the measure devised by
Karl Pearson; and (b) the square of that coeffi-
cient, r^2, which, unlike r, can be interpreted as a
percentage.

Pearson’s correlation coefficient, r, can be
computed using the following formula:

r_{yx} = r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y}) \, / \, N}{\sqrt{\left[ \sum (X - \bar{X})^2 / N \right] \left[ \sum (Y - \bar{Y})^2 / N \right]}}    (5)
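Equation 5 can be sketched directly in Python; the data are hypothetical, and the computation follows the formula term by term (the covariance in the numerator, the square root of the product of the variances in the denominator):

```python
# Pearson's r (equation 5): the covariance of X and Y divided by the
# square root of the product of their variances. Data are hypothetical.

X = [8, 10, 12, 12, 14, 16, 16, 18]
Y = [18, 22, 27, 25, 30, 38, 35, 42]

N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N

cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / N
var_x = sum((x - mean_x) ** 2 for x in X) / N
var_y = sum((y - mean_y) ** 2 for y in Y) / N

r = cov_xy / (var_x * var_y) ** 0.5
```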

The numerator in equation 5 is known as the
covariance of X and Y. The denominator is the
square root of the product of the variances of X
and Y. Hence, equation 5 may be rewritten:

r_{yx} = r_{xy} = \frac{\mathrm{Covariance}(X, Y)}{\sqrt{[\mathrm{Variance}(X)]\,[\mathrm{Variance}(Y)]}}    (6)
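Because the divisor N appears in both the numerator and the denominator of equation 5, it cancels, so equation 6 yields the same r whether the covariance and variances are computed with divisor N or N − 1. A brief sketch with hypothetical data:

```python
# Equation 6: r as covariance over the square root of the product of
# the variances. The divisor (N or N - 1) cancels between numerator
# and denominator, so either convention yields the same r.
# Data are hypothetical.

X = [8, 10, 12, 12, 14, 16, 16, 18]
Y = [18, 22, 27, 25, 30, 38, 35, 42]

def r_from_moments(xs, ys, ddof):
    """Pearson's r from covariance and variances with divisor n - ddof."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - ddof)
    var_y = sum((y - my) ** 2 for y in ys) / (n - ddof)
    return cov / (var_x * var_y) ** 0.5

r_population = r_from_moments(X, Y, ddof=0)  # divisor N, as in equation 5
r_sample = r_from_moments(X, Y, ddof=1)      # divisor N - 1
```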

While equation 5 may serve as a computing
guide, neither equation 5 nor equation 6 shows why
r describes the degree to which two variables
covary. Such understanding may be enhanced by
stating that r is the slope of the least squares
regression line when both X and Y have been