Encyclopedia of Sociology

CORRELATION AND REGRESSION ANALYSIS

Typically the number of cases is insufficient to
yield a stable estimate of each of a series of condi-
tional means, one for each level of X. Means based
on a relatively small number of cases are inaccu-
rate because of sampling variation, and a line
connecting such unstable conditional means may
not be straight even though the true regression
line is. Hence, one assumes that the regression line
is a straight line unless there are compelling rea-
sons for assuming otherwise; one can then use the
X and Y values for all cases together to estimate the
Y intercept, ayx, and the slope, byx, of the regres-
sion line that is best fit by the criterion of least
squares. This criterion requires predicted values
for Y that will minimize the sum of squared devia-
tions between the predicted values and the ob-
served values. Hence, a ‘‘least squares’’ regression
line is the straight line that yields a lower sum of
squared deviations between the predicted (regres-
sion line) values and the observed values than does
any other straight line. One can find the parame-
ters of the ‘‘least squares’’ regression line for a
given set of X and Y values by computing


b_{yx} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}    (3)

a_{yx} = \bar{Y} - b_{yx} \bar{X}    (4)

These parameters (substituted in equation 2)
describe the straight regression line that best fits
by the criterion of least squares. By substituting
the X value for a given case into equation 2, one
can then find Ŷ for that case. Otherwise stated,
once ayx and byx have been computed, equation 2
will yield a precise predicted income level (Ŷ) for
each education level.
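As a concrete sketch of equations 2 through 4, the slope, intercept, and predicted values can be computed in a few lines of Python; the education and income figures below are hypothetical illustrations, not data from this article:

```python
# Least-squares slope (equation 3) and intercept (equation 4),
# then predicted values (equation 2): Y-hat = a_yx + b_yx * X.
# The data below are hypothetical illustrative values.

X = [8, 10, 12, 12, 14, 16, 16, 18]   # years of education
Y = [18, 22, 27, 25, 30, 38, 35, 42]  # income, in $1,000s

N = len(X)
mean_x = sum(X) / N
mean_y = sum(Y) / N

# Equation 3: b_yx = sum((X - X-bar)(Y - Y-bar)) / sum((X - X-bar)^2)
b_yx = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / \
       sum((x - mean_x) ** 2 for x in X)

# Equation 4: a_yx = Y-bar - b_yx * X-bar
a_yx = mean_y - b_yx * mean_x

# Equation 2: a predicted income level for each education level
Y_hat = [a_yx + b_yx * x for x in X]
```

With these values in hand, the fit of the line can be judged by how closely the observed Y values cluster around the predicted values in Y_hat.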


These predicted values may be relatively good
or relatively poor predictions, depending on wheth-
er the actual values of Y cluster closely around the
predicted values on the regression line or spread
themselves widely around that line. The variance
of the Y values (income levels in this illustration)
around the regression line will be relatively small if
the Y values cluster closely around the predicted
values (i.e., when the regression line provides rela-
tively good predictions). On the other hand, the
variance of Y values around the regression line will
be relatively large if the Y values are spread widely
around the predicted values (i.e., when the regres-
sion line provides relatively poor predictions). The
variance of the Y values around the regression
predictions is defined as the mean of the squared
deviations between them. The variances around
each of the values along the regression line are
assumed to be equal. This is known as the assump-
tion of homoscedasticity (homogeneous scatter or
variance). When the variances of the Y values
around the regression predictions are larger for
some values of X than for others (i.e., when
homoscedasticity is not present), then X serves as a
better predictor of Y in one part of its range than
in another. The homoscedasticity assumption is
usually at least approximately true.
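One rough way to probe the homoscedasticity assumption is to compare the variance of the residuals in different parts of the range of X. The sketch below (again with hypothetical data) splits the cases at the mean of X; markedly unequal residual variances in the two groups would suggest heteroscedasticity:

```python
# Compare the residual variance around the regression line for low
# and high values of X; roughly equal variances are consistent with
# the homoscedasticity assumption. Data are hypothetical.

X = [8, 10, 12, 12, 14, 16, 16, 18]
Y = [18, 22, 27, 25, 30, 38, 35, 42]

N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / \
    sum((x - mean_x) ** 2 for x in X)
a = mean_y - b * mean_x

# Residuals: deviations of observed Y from the regression predictions
resid = [y - (a + b * x) for x, y in zip(X, Y)]

def variance(values):
    """Mean squared deviation from the mean (population variance)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Split the cases at the mean of X and compare residual variances
low = [r for x, r in zip(X, resid) if x <= mean_x]
high = [r for x, r in zip(X, resid) if x > mean_x]
var_low, var_high = variance(low), variance(high)
```

With only a handful of cases per group, such a comparison is of course unstable, for the same sampling-variation reasons discussed above for conditional means.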

The variance around the regression line is a
measure of the accuracy of the regression predic-
tions. But it is not an easily interpreted measure of
the degree of correlation because it has not been
‘‘normed’’ to vary within a limited range. Two
other measures, closely related to each other, pro-
vide such a normed measure. These measures,
which are always between zero and one in absolute
value (i.e., sign disregarded), are: (a) the correla-
tion coefficient, r, which is the measure devised by
Karl Pearson; and (b) the square of that coeffi-
cient, r^2, which, unlike r, can be interpreted as a
percentage.

Pearson’s correlation coefficient, r, can be
computed using the following formula:

r_{yx} = r_{xy} = \frac{\sum (X - \bar{X})(Y - \bar{Y}) \, / \, N}{\sqrt{\left[ \sum (X - \bar{X})^2 / N \right] \left[ \sum (Y - \bar{Y})^2 / N \right]}}    (5)
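Equation 5 can be sketched directly in Python; the data are hypothetical, and the computation follows the formula term by term (the covariance in the numerator, the square root of the product of the variances in the denominator):

```python
# Pearson's r (equation 5): the covariance of X and Y divided by the
# square root of the product of their variances. Data are hypothetical.

X = [8, 10, 12, 12, 14, 16, 16, 18]
Y = [18, 22, 27, 25, 30, 38, 35, 42]

N = len(X)
mean_x, mean_y = sum(X) / N, sum(Y) / N

cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / N
var_x = sum((x - mean_x) ** 2 for x in X) / N
var_y = sum((y - mean_y) ** 2 for y in Y) / N

r = cov_xy / (var_x * var_y) ** 0.5
```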

The numerator in equation 5 is known as the
covariance of X and Y. The denominator is the
square root of the product of the variances of X
and Y. Hence, equation 5 may be rewritten:

r_{yx} = r_{xy} = \frac{\mathrm{Covariance}(X, Y)}{\sqrt{[\mathrm{Variance}(X)]\,[\mathrm{Variance}(Y)]}}    (6)
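Because the divisor N appears in both the numerator and the denominator of equation 5, it cancels, so equation 6 yields the same r whether the covariance and variances are computed with divisor N or N − 1. A brief sketch with hypothetical data:

```python
# Equation 6: r as covariance over the square root of the product of
# the variances. The divisor (N or N - 1) cancels between numerator
# and denominator, so either convention yields the same r.
# Data are hypothetical.

X = [8, 10, 12, 12, 14, 16, 16, 18]
Y = [18, 22, 27, 25, 30, 38, 35, 42]

def r_from_moments(xs, ys, ddof):
    """Pearson's r from covariance and variances with divisor n - ddof."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - ddof)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - ddof)
    var_y = sum((y - my) ** 2 for y in ys) / (n - ddof)
    return cov / (var_x * var_y) ** 0.5

r_population = r_from_moments(X, Y, ddof=0)  # divisor N, as in equation 5
r_sample = r_from_moments(X, Y, ddof=1)      # divisor N - 1
```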

While equation 5 may serve as a computing
guide, neither equation 5 nor equation 6 shows why
r describes the degree to which two variables
covary. Such understanding may be enhanced by
stating that r is the slope of the least squares
regression line when both X and Y have been