CORRELATION AND REGRESSION ANALYSIS

States because such coefficients are affected by the
variances in the two populations.


The multiple correlation coefficient, R, is defined as the correlation between the observed values of Y and the values of Y predicted by the multiple regression equation. It would be unnecessarily tedious to calculate the multiple correlation coefficient in that way. The more convenient computational procedure is to compute R^2 (for two predictors, and analogously for more than two predictors) by the following:


R^2 = b^*_{y1 \cdot 2}\, r_{y1} + b^*_{y2 \cdot 1}\, r_{y2} \qquad (19)

Like r^2, R^2 varies from 0 to 1.0 and indicates the proportion of variance in the criterion that is ‘‘explained’’ by the predictors. Alternatively stated, R^2 is the percent of the possible reduction in prediction error (measured by the variance of actual values around predicted values) that is achieved by shifting from (a) Ȳ, the mean of Y, as the prediction to (b) the multiple regression values, Ŷ, as the prediction.
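
To make equation 19 concrete: b^*_{y1·2} and b^*_{y2·1} are the standardized regression coefficients (beta weights) of the two predictors, and r_{y1} and r_{y2} are their zero-order correlations with the criterion. The following sketch in Python (synthetic data; all variable names are illustrative, not part of the original exposition) computes R^2 both ways and confirms that equation 19 reproduces the defining correlation between observed and predicted Y.

    import numpy as np

    # Synthetic data: a criterion y and two correlated predictors.
    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)
    y = 0.6 * x1 + 0.3 * x2 + rng.normal(size=n)

    # Zero-order correlations among the three variables.
    ry1 = np.corrcoef(y, x1)[0, 1]
    ry2 = np.corrcoef(y, x2)[0, 1]
    r12 = np.corrcoef(x1, x2)[0, 1]

    # Standardized regression coefficients (b*) for the two-predictor case.
    b1 = (ry1 - ry2 * r12) / (1 - r12 ** 2)
    b2 = (ry2 - ry1 * r12) / (1 - r12 ** 2)

    # Equation 19.
    r_squared = b1 * ry1 + b2 * ry2

    # The definition: R is the correlation between observed y and the
    # values predicted by the multiple regression equation.
    X = np.column_stack([np.ones(n), x1, x2])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r_squared_def = np.corrcoef(y, y_hat)[0, 1] ** 2

    print(r_squared, r_squared_def)  # identical up to rounding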


VARIETIES OF MULTIPLE REGRESSION

The basic concept of multiple regression has been adapted to a variety of purposes other than those for which the technique was originally developed. The following paragraphs provide a brief summary of some of these adaptations.


Dummy Variable Analysis. As originally conceived, the correlation coefficient was designed to describe the relationship between continuous, normally distributed variables. Dichotomized predictors such as gender (male and female) were introduced early in bivariate regression and correlation, which led to the ‘‘point biserial correlation coefficient’’ (Walker and Lev 1953). For example, if one wishes to examine the correlation between gender and income, one may assign a ‘‘0’’ to each instance of male and a ‘‘1’’ to each instance of female to have numbers representing the two categories of the dichotomy. The unstandardized regression coefficient, computed as specified above in equation 3, is then the difference between the mean income for the two categories of the dichotomous predictor, and the computational formula for r (equation 5) will yield the point biserial correlation coefficient, which can be interpreted much like any other r. It was then only a small step to the inclusion of dichotomies as predictors in multiple regression analysis, and then to the creation of a set of dichotomies from a categorical variable with more than two subdivisions—that is, to dummy variable analysis (Cohen 1968; Bohrnstedt and Knoke 1988; Hardy 1993).
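
A minimal sketch, assuming equation 3 is the usual covariance-over-variance formula for the bivariate slope (the data and names here are hypothetical), illustrates both claims: the unstandardized coefficient for a 0/1 predictor is the difference between the two group means, and the ordinary r formula yields the point biserial correlation.

    import numpy as np

    # Hypothetical data: gender coded 0 (male) / 1 (female), and income.
    rng = np.random.default_rng(1)
    gender = rng.integers(0, 2, size=500)
    income = 30000 + 5000 * gender + rng.normal(0, 8000, size=500)

    # Unstandardized slope from the bivariate regression of income on
    # gender, computed as covariance over predictor variance.
    slope = np.cov(gender, income)[0, 1] / np.var(gender, ddof=1)

    # The slope equals the difference between the two group means.
    mean_diff = income[gender == 1].mean() - income[gender == 0].mean()

    # The ordinary r formula applied to the 0/1 predictor gives the
    # point biserial correlation coefficient.
    r_pb = np.corrcoef(gender, income)[0, 1]

    print(slope, mean_diff, r_pb)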

Religious denomination—e.g., Protestant, Catholic, and Jewish—serves as an illustration. From these three categories, one forms two dichotomies, called ‘‘dummy variables.’’ In the first of these, for example, cases are classified as ‘‘1’’ if they are Catholic, and ‘‘0’’ otherwise (i.e., if Protestant or Jewish). In the second of the dichotomies, cases are classified as ‘‘1’’ if they are Jewish, and ‘‘0’’ otherwise (i.e., if Protestant or Catholic). In this illustration, Protestant is the ‘‘omitted’’ or ‘‘reference’’ category (but Protestants can be identified as those who are classified ‘‘0’’ on both of the other dichotomies). The resulting two dichotomized ‘‘dummy variables’’ can serve as the only predictors in a multiple regression equation, or they may be combined with other predictors. When the dummy variables mentioned are the only predictors, the unstandardized regression coefficient for the predictor in which Catholics are classified ‘‘1’’ is the difference between the mean Y for Catholics and Protestants (the ‘‘omitted’’ or ‘‘reference’’ category). Similarly, the unstandardized regression coefficient for the predictor in which Jews are classified ‘‘1’’ is the difference between the mean Y for Jews and Protestants. When the dummy variables are included with other predictors, the unstandardized regression coefficients are the same except that the difference of each mean from the mean of the ‘‘reference’’ category has been statistically adjusted to control for each of the other predictors in the regression equation.
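
A short sketch with hypothetical data (the group effects are arbitrary) verifies these interpretations: each dummy coefficient equals the difference between that group's mean Y and the Protestant (reference) mean, and the intercept equals the reference mean itself.

    import numpy as np

    # Hypothetical data: denomination and a criterion Y, with Protestant
    # as the omitted (reference) category; effect sizes are arbitrary.
    rng = np.random.default_rng(2)
    denom = rng.choice(["Protestant", "Catholic", "Jewish"], size=600)
    y = rng.normal(50, 10, size=600) + np.where(denom == "Catholic", 3, 0) \
        + np.where(denom == "Jewish", 7, 0)

    # Two dummy variables formed from the three categories.
    d_cath = (denom == "Catholic").astype(float)
    d_jew = (denom == "Jewish").astype(float)

    # Multiple regression of Y on the two dummies alone.
    X = np.column_stack([np.ones(len(y)), d_cath, d_jew])
    intercept, b_cath, b_jew = np.linalg.lstsq(X, y, rcond=None)[0]

    # Each coefficient is that group's mean Y minus the reference mean.
    prot_mean = y[denom == "Protestant"].mean()
    print(b_cath, y[denom == "Catholic"].mean() - prot_mean)
    print(b_jew, y[denom == "Jewish"].mean() - prot_mean)
    print(intercept, prot_mean)  # the intercept is the reference mean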

The development of ‘‘dummy variable analysis’’ allowed multiple regression analysis to be linked to the experimental statistics developed by R. A. Fisher, including the analysis of variance and covariance. (See Cohen 1968.)
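
The link can be illustrated directly: with the dummy coding above, the overall regression F, computed as F = (R^2/k) / ((1 - R^2)/(n - k - 1)) with k predictors, equals the one-way analysis-of-variance F across the three groups. A sketch, regenerating the same synthetic data:

    import numpy as np
    from scipy import stats

    # Same hypothetical denomination data as in the previous sketch.
    rng = np.random.default_rng(2)
    denom = rng.choice(["Protestant", "Catholic", "Jewish"], size=600)
    y = rng.normal(50, 10, size=600) + np.where(denom == "Catholic", 3, 0) \
        + np.where(denom == "Jewish", 7, 0)

    # One-way analysis of variance across the three groups.
    groups = [y[denom == g] for g in ("Protestant", "Catholic", "Jewish")]
    f_anova, _ = stats.f_oneway(*groups)

    # Regression of y on the two dummy variables yields the same F test.
    X = np.column_stack([np.ones(len(y)),
                         (denom == "Catholic").astype(float),
                         (denom == "Jewish").astype(float)])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2
    n, k = len(y), 2
    f_reg = (r2 / k) / ((1 - r2) / (n - k - 1))

    print(f_anova, f_reg)  # identical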

Logistic Regression. Early students of correlation anticipated the need for a measure of correlation when the predicted or dependent variable was dichotomous. Out of this came (a) the phi coefficient, which can be computed by applying the computational formula for r (equation 5) to two dichotomies, each coded ‘‘0’’ or ‘‘1,’’ and (b) the tetrachoric correlation coefficient, which uses the frequencies in a 2 × 2 table to estimate the correlation between the continuous, normally distributed variables that the two dichotomies are assumed to represent.
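
Both points about the phi coefficient can be checked in a few lines (two synthetic 0/1 items; the 70 percent agreement rate is arbitrary): r applied to the 0/1 codes matches the familiar cell-frequency formula for a 2 × 2 table.

    import numpy as np

    # Two hypothetical yes/no items coded 0/1; y agrees with x about 70%
    # of the time, so the two dichotomies are positively correlated.
    rng = np.random.default_rng(3)
    x = rng.integers(0, 2, size=400)
    y = np.where(rng.random(400) < 0.7, x, 1 - x)

    # The phi coefficient: the ordinary computational formula for r
    # (equation 5) applied to the 0/1 codes.
    phi = np.corrcoef(x, y)[0, 1]

    # Equivalent cell-frequency form for the 2 x 2 table.
    n11 = np.sum((x == 1) & (y == 1))
    n10 = np.sum((x == 1) & (y == 0))
    n01 = np.sum((x == 0) & (y == 1))
    n00 = np.sum((x == 0) & (y == 0))
    phi_cells = (n11 * n00 - n10 * n01) / np.sqrt(
        float((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))

    print(phi, phi_cells)  # the two computations agree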