Encyclopedia of Sociology

CORRELATION AND REGRESSION ANALYSIS

The children of very tall parents were, on average, shorter than their parents, while the children of very short parents tended to exceed their parents in height (cited in Walker 1929). Galton referred to this as "reversion" or the "law of regression" (i.e., regression to the average height of the species). Galton also saw in his graphs and tables a feature that he named the "co-relation" between variables. The statures of kinsmen are "co-related" variables, Galton stated, meaning, for example, that when the father was taller than average, his son was likely also to be taller than average. Although Galton devised a way of summarizing in a single figure the degree of "co-relation" between two variables, it was Galton's associate Karl Pearson who developed the "coefficient of correlation," as it is now applied. Galton's original interest, the phenomenon of regression toward the mean, is no longer germane to contemporary correlation and regression analysis, but the term "regression" has been retained with a modified meaning.


Although Galton and Pearson originally focused their attention on bivariate (two-variable) correlation and regression, in current applications more than two variables are typically incorporated into the analysis to yield partial correlation coefficients, multiple regression analysis, and several related techniques that facilitate the informed interpretation of the linkages between pairs of variables. This summary begins with two variables and then moves to the consideration of more than two variables.


Consider a very large sample of cases, with a measure of some variable, X, and another variable, Y, for each case. To make the illustration more concrete, consider a large number of adults and, for each, a measure of their education (years of school completed = X) and their income (dollars earned over the past twelve months = Y). Subdivide these adults by years of school completed, and for each such subset compute a mean income for a given level of education. Each such mean is called a conditional mean and is represented by Ȳ|X, that is, the mean of Y for a given value of X.
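
As a concrete illustration, the following sketch computes conditional means from a small education-and-income dataset; the figures are invented purely for demonstration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (invented) cases: (years of school completed = X, annual income = Y).
cases = [(10, 25000), (10, 27000), (12, 30000),
         (12, 34000), (16, 48000), (16, 52000)]

# Subdivide the cases by years of school completed.
incomes_by_education = defaultdict(list)
for years, income in cases:
    incomes_by_education[years].append(income)

# Each conditional mean, Ȳ|X, is the mean of Y for a given value of X.
for years in sorted(incomes_by_education):
    y_bar_given_x = mean(incomes_by_education[years])
    print(f"X = {years} years of school: mean income = {y_bar_given_x:.0f}")
```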


Imagine now an ordered arrangement of the
subsets from left to right according to the years of
school completed, with zero years of school on the
left, followed by one year of school, and so on
through the maximum number of years of school
completed in this set of cases, as shown in Figure 1.


Assume that each of the Ȳ|X values (i.e., the mean income for each level of education) falls on a straight line, as in Figure 1. This straight line is the regression line of Y on X. Thus the regression line of Y on X is the line that passes through the mean Y for each value of X, for example, the mean income for each educational level.

If this regression line is a straight line, as shown in Figure 1, then the income associated with each additional year of school completed is the same whether that additional year of school represents an increase, for example, from six to seven years of school completed or from twelve to thirteen years. While one can analyze curvilinear regression, a straight regression line greatly simplifies the analysis. Some (but not all) curvilinear regressions can be made into straight-line regressions by a relatively simple transformation of one of the variables (e.g., taking a logarithm). The common assumption that the regression line is a straight line is known as the assumption of rectilinearity, or more commonly (even if less precisely) as the assumption of linearity.
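
A minimal sketch of such a straightening transformation, using invented data that follow an exponential curve: taking the logarithm of Y turns a curvilinear relationship of the form Y = c·e^(kX) into a straight line in X.

```python
import math

# Invented curvilinear data following Y = 2 * exp(0.3 * X) exactly.
xs = [0, 1, 2, 3, 4, 5]
ys = [2 * math.exp(0.3 * x) for x in xs]

# After the transformation Y' = ln(Y), the relationship is linear:
# ln(Y) = ln(2) + 0.3 * X, so successive differences in Y' are constant.
log_ys = [math.log(y) for y in ys]
diffs = [log_ys[i + 1] - log_ys[i] for i in range(len(log_ys) - 1)]
print(diffs)  # each difference is 0.3 (up to rounding): a constant slope
```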

The slope of the regression line reflects one feature of the relationship between two variables. If the regression line slopes "uphill," as in Figure 1, then Y increases as X increases, and the steeper the slope, the more Y increases for each unit increase in X. In contrast, if the regression line slopes "downhill" as one moves from left to right, Y decreases as X increases, and the steeper the slope, the more Y decreases for each unit increase in X. If the regression line does not slope at all but is perfectly horizontal, then there is no relationship between the variables. But the slope does not tell how closely the two variables are "co-related" (i.e., how closely the values of Y cluster around the regression line).
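
This distinction can be illustrated numerically. In the sketch below, built on invented data, two data sets share the same slope, but the Y values cluster around the regression line far more tightly in one than in the other, as Pearson's coefficient of correlation reflects.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(0, 50, dtype=float)

# Two invented data sets with the same underlying slope (2.0) but
# different amounts of scatter around the regression line.
y_tight = 2.0 * x + rng.normal(0, 2, size=x.size)   # small scatter
y_loose = 2.0 * x + rng.normal(0, 30, size=x.size)  # large scatter

for label, y in [("tight", y_tight), ("loose", y_loose)]:
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]  # Pearson's coefficient of correlation
    print(f"{label}: slope = {slope:.2f}, r = {r:.2f}")
# Both slopes are near 2.0, yet r is near 1.0 only for the tightly clustered data.
```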

A regression line may be represented by a simple mathematical formula for a straight line. Thus:

Ȳ|X = a_yx + b_yx X    (1)

where Ȳ|X = the mean Y for a given value of X, or the regression line value of Y given X; a_yx = the Y intercept (i.e., the predicted value of Ȳ|X when X = 0); and b_yx = the slope of the regression of Y on X (i.e., the amount by which Ȳ|X increases or decreases, depending on whether b_yx is positive or negative, for each one-unit increase in X).
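
As a sketch of how equation (1) is used, with invented numbers and with least-squares estimation assumed for the fitting step, one can estimate b_yx as the ratio of the covariance of X and Y to the variance of X, obtain a_yx from the two means, and then read predicted values of Ȳ|X off the fitted line.

```python
import numpy as np

# Invented education (X) and income (Y) data, for illustration only.
x = np.array([8, 10, 12, 12, 14, 16, 16, 18], dtype=float)
y = np.array([22000, 26000, 31000, 33000,
              38000, 45000, 47000, 52000], dtype=float)

# Least-squares estimates (assumed here as the fitting criterion):
# b_yx = cov(X, Y) / var(X), and a_yx = mean(Y) - b_yx * mean(X).
b_yx = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
a_yx = y.mean() - b_yx * x.mean()

# Equation (1): the regression line value of Y for a given X.
def y_bar_given_x(x_value):
    return a_yx + b_yx * x_value

print(f"intercept a_yx = {a_yx:.0f}, slope b_yx = {b_yx:.0f}")
print(f"predicted mean income at X = 13 years: {y_bar_given_x(13):.0f}")
```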