CORRELATION AND REGRESSION ANALYSIS

States because such coefficients are affected by the
variances in the two populations.


The multiple correlation coefficient, R, is defined as the correlation between the observed values of Y and the values of Y predicted by the multiple regression equation. It would be unnecessarily tedious to calculate the multiple correlation coefficient in that way. The more convenient computational procedure is to compute R^2 (for two predictors, and analogously for more than two predictors) by the following:


R^2 = b^*_{y1 \cdot 2}\, r_{y1} + b^*_{y2 \cdot 1}\, r_{y2} \qquad (19)

Like r^2, R^2 varies from 0 to 1.0 and indicates the proportion of variance in the criterion that is ‘‘explained’’ by the predictors. Alternatively stated, R^2 is the percent of the possible reduction in prediction error (measured by the variance of actual values around predicted values) that is achieved by shifting from (a) Ȳ, the mean of Y, as the prediction to (b) the multiple regression values, Ŷ, as the prediction.
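
To make equation 19 concrete: b^*_{y1·2} and b^*_{y2·1} are the standardized regression coefficients (beta weights) of the two predictors, and r_{y1} and r_{y2} are their zero-order correlations with the criterion. The following sketch in Python (synthetic data; all variable names are illustrative, not part of the original exposition) computes R^2 both ways and confirms that equation 19 reproduces the defining correlation between observed and predicted Y.

    import numpy as np

    # Synthetic data: a criterion y and two correlated predictors.
    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.5 * x1 + rng.normal(size=n)
    y = 0.6 * x1 + 0.3 * x2 + rng.normal(size=n)

    # Zero-order correlations among the three variables.
    ry1 = np.corrcoef(y, x1)[0, 1]
    ry2 = np.corrcoef(y, x2)[0, 1]
    r12 = np.corrcoef(x1, x2)[0, 1]

    # Standardized regression coefficients (b*) for the two-predictor case.
    b1 = (ry1 - ry2 * r12) / (1 - r12 ** 2)
    b2 = (ry2 - ry1 * r12) / (1 - r12 ** 2)

    # Equation 19.
    r_squared = b1 * ry1 + b2 * ry2

    # The definition: R is the correlation between observed y and the
    # values predicted by the multiple regression equation.
    X = np.column_stack([np.ones(n), x1, x2])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r_squared_def = np.corrcoef(y, y_hat)[0, 1] ** 2

    print(r_squared, r_squared_def)  # identical up to rounding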


VARIETIES OF MULTIPLE REGRESSION

The basic concept of multiple regression has been adapted to a variety of purposes other than those for which the technique was originally developed. The following paragraphs provide a brief summary of some of these adaptations.


Dummy Variable Analysis. As originally conceived, the correlation coefficient was designed to describe the relationship between continuous, normally distributed variables. Dichotomized predictors such as gender (male and female) were introduced early in bivariate regression and correlation, which led to the ‘‘point biserial correlation coefficient’’ (Walker and Lev 1953). For example, if one wishes to examine the correlation between gender and income, one may assign a ‘‘0’’ to each instance of male and a ‘‘1’’ to each instance of female to have numbers representing the two categories of the dichotomy. The unstandardized regression coefficient, computed as specified above in equation 3, is then the difference between the mean income for the two categories of the dichotomous predictor, and the computational formula for r (equation 5) will yield the point biserial correlation coefficient, which can be interpreted much like any other r. It was then only a small step to the inclusion of dichotomies as predictors in multiple regression analysis, and then to the creation of a set of dichotomies from a categorical variable with more than two subdivisions—that is, to dummy variable analysis (Cohen 1968; Bohrnstedt and Knoke 1988; Hardy 1993).
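
A minimal sketch, assuming equation 3 is the usual covariance-over-variance formula for the bivariate slope (the data and names here are hypothetical), illustrates both claims: the unstandardized coefficient for a 0/1 predictor is the difference between the two group means, and the ordinary r formula yields the point biserial correlation.

    import numpy as np

    # Hypothetical data: gender coded 0 (male) / 1 (female), and income.
    rng = np.random.default_rng(1)
    gender = rng.integers(0, 2, size=500)
    income = 30000 + 5000 * gender + rng.normal(0, 8000, size=500)

    # Unstandardized slope from the bivariate regression of income on
    # gender, computed as covariance over predictor variance.
    slope = np.cov(gender, income)[0, 1] / np.var(gender, ddof=1)

    # The slope equals the difference between the two group means.
    mean_diff = income[gender == 1].mean() - income[gender == 0].mean()

    # The ordinary r formula applied to the 0/1 predictor gives the
    # point biserial correlation coefficient.
    r_pb = np.corrcoef(gender, income)[0, 1]

    print(slope, mean_diff, r_pb)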

Religious denomination—e.g., Protestant, Catholic, and Jewish—serves as an illustration. From these three categories, one forms two dichotomies, called ‘‘dummy variables.’’ In the first of these, for example, cases are classified as ‘‘1’’ if they are Catholic, and ‘‘0’’ otherwise (i.e., if Protestant or Jewish). In the second of the dichotomies, cases are classified as ‘‘1’’ if they are Jewish, and ‘‘0’’ otherwise (i.e., if Protestant or Catholic). In this illustration, Protestant is the ‘‘omitted’’ or ‘‘reference’’ category (but Protestants can be identified as those who are classified ‘‘0’’ on both of the other dichotomies). The resulting two dichotomized ‘‘dummy variables’’ can serve as the only predictors in a multiple regression equation, or they may be combined with other predictors. When the dummy variables mentioned are the only predictors, the unstandardized regression coefficient for the predictor in which Catholics are classified ‘‘1’’ is the difference between the mean Y for Catholics and Protestants (the ‘‘omitted’’ or ‘‘reference’’ category). Similarly, the unstandardized regression coefficient for the predictor in which Jews are classified ‘‘1’’ is the difference between the mean Y for Jews and Protestants. When the dummy variables are included with other predictors, the unstandardized regression coefficients are the same except that the difference of each mean from the mean of the ‘‘reference’’ category has been statistically adjusted to control for each of the other predictors in the regression equation.
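
A short sketch with hypothetical data (the group effects are arbitrary) verifies these interpretations: each dummy coefficient equals the difference between that group's mean Y and the Protestant (reference) mean, and the intercept equals the reference mean itself.

    import numpy as np

    # Hypothetical data: denomination and a criterion Y, with Protestant
    # as the omitted (reference) category; effect sizes are arbitrary.
    rng = np.random.default_rng(2)
    denom = rng.choice(["Protestant", "Catholic", "Jewish"], size=600)
    y = rng.normal(50, 10, size=600) + np.where(denom == "Catholic", 3, 0) \
        + np.where(denom == "Jewish", 7, 0)

    # Two dummy variables formed from the three categories.
    d_cath = (denom == "Catholic").astype(float)
    d_jew = (denom == "Jewish").astype(float)

    # Multiple regression of Y on the two dummies alone.
    X = np.column_stack([np.ones(len(y)), d_cath, d_jew])
    intercept, b_cath, b_jew = np.linalg.lstsq(X, y, rcond=None)[0]

    # Each coefficient is that group's mean Y minus the reference mean.
    prot_mean = y[denom == "Protestant"].mean()
    print(b_cath, y[denom == "Catholic"].mean() - prot_mean)
    print(b_jew, y[denom == "Jewish"].mean() - prot_mean)
    print(intercept, prot_mean)  # the intercept is the reference mean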

The development of ‘‘dummy variable analysis’’ allowed multiple regression analysis to be linked to the experimental statistics developed by R. A. Fisher, including the analysis of variance and covariance. (See Cohen 1968.)
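
The link can be illustrated directly: with the dummy coding above, the overall regression F, computed as F = (R^2/k) / ((1 - R^2)/(n - k - 1)) with k predictors, equals the one-way analysis-of-variance F across the three groups. A sketch, regenerating the same synthetic data:

    import numpy as np
    from scipy import stats

    # Same hypothetical denomination data as in the previous sketch.
    rng = np.random.default_rng(2)
    denom = rng.choice(["Protestant", "Catholic", "Jewish"], size=600)
    y = rng.normal(50, 10, size=600) + np.where(denom == "Catholic", 3, 0) \
        + np.where(denom == "Jewish", 7, 0)

    # One-way analysis of variance across the three groups.
    groups = [y[denom == g] for g in ("Protestant", "Catholic", "Jewish")]
    f_anova, _ = stats.f_oneway(*groups)

    # Regression of y on the two dummy variables yields the same F test.
    X = np.column_stack([np.ones(len(y)),
                         (denom == "Catholic").astype(float),
                         (denom == "Jewish").astype(float)])
    y_hat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    r2 = np.corrcoef(y, y_hat)[0, 1] ** 2
    n, k = len(y), 2
    f_reg = (r2 / k) / ((1 - r2) / (n - k - 1))

    print(f_anova, f_reg)  # identical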

Logistic Regression. Early students of correlation anticipated the need for a measure of correlation when the predicted or dependent variable was dichotomous. Out of this came (a) the phi coefficient, which can be computed by applying the computational formula for r (equation 5) to two dichotomies, each coded ‘‘0’’ or ‘‘1,’’ and (b) the tetrachoric correlation coefficient, which uses the frequencies in a 2 × 2 table to estimate the correlation between the continuous, normally distributed variables that the two dichotomies are assumed to represent.
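
Both points about the phi coefficient can be checked in a few lines (two synthetic 0/1 items; the 70 percent agreement rate is arbitrary): r applied to the 0/1 codes matches the familiar cell-frequency formula for a 2 × 2 table.

    import numpy as np

    # Two hypothetical yes/no items coded 0/1; y agrees with x about 70%
    # of the time, so the two dichotomies are positively correlated.
    rng = np.random.default_rng(3)
    x = rng.integers(0, 2, size=400)
    y = np.where(rng.random(400) < 0.7, x, 1 - x)

    # The phi coefficient: the ordinary computational formula for r
    # (equation 5) applied to the 0/1 codes.
    phi = np.corrcoef(x, y)[0, 1]

    # Equivalent cell-frequency form for the 2 x 2 table.
    n11 = np.sum((x == 1) & (y == 1))
    n10 = np.sum((x == 1) & (y == 0))
    n01 = np.sum((x == 0) & (y == 1))
    n00 = np.sum((x == 0) & (y == 0))
    phi_cells = (n11 * n00 - n10 * n01) / np.sqrt(
        float((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)))

    print(phi, phi_cells)  # the two computations agree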