Encyclopedia of Sociology

CORRELATION AND REGRESSION ANALYSIS

information in the form of two dichotomies to
estimate the Pearsonian correlation for the corre-
sponding continuous variables, assuming the
dichotomies result from dividing two continuous
and normally distributed variables by arbitrary
cutting points (Kelley 1947; Walker and Lev 1953;
Carroll 1961).


These early developments readily suggested
use of a dichotomous variable, coded
‘‘0’’ or ‘‘1,’’ as the predicted variable in a multiple
regression analysis. The predicted value is then the
conditional proportion, which is the conditional
mean for a dichotomized predicted variable. But
this was not completely satisfactory in some cir-
cumstances because the regression predictions are,
under some conditions, proportions greater than
1 or less than 0. Logistic regression (Retherford
1993; Kleinman 1994; Menard 1995) is responsive
to this problem. After coding the predicted variable
‘‘0’’ or ‘‘1,’’ the predicted variable is transformed
to a logit (that is, the logarithm of the
‘‘odds,’’ which is to say the logarithm of the ratio of
the number of 1’s to the number of 0’s). With
the logit as the predicted variable, impossible
regression predictions do not result, but the
unstandardized logistic regression coefficients,
describing changes in the logarithm of the ‘‘odds,’’
lack the intuitive meaning of ordinary regression
coefficients. An additional computation is required
to describe the change in the predicted
proportion for a given one-unit change in a predictor,
with all other predictors in the equation held
constant.
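
The additional computation mentioned above can be sketched as
follows. This is a minimal illustration, not a fitted model: the
intercept and coefficients are hypothetical values chosen only for
the example, and the helper names `inverse_logit` and
`change_in_proportion` are assumptions introduced here.

```python
import math

def inverse_logit(log_odds):
    """Convert a log-odds value back to a proportion between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def change_in_proportion(intercept, coefs, x, j):
    """Change in the predicted proportion when predictor j rises by one
    unit, with all other predictors held at the values given in x."""
    log_odds_before = intercept + sum(b * v for b, v in zip(coefs, x))
    log_odds_after = log_odds_before + coefs[j]
    return inverse_logit(log_odds_after) - inverse_logit(log_odds_before)

# Hypothetical logistic regression coefficients (illustration only).
intercept = -2.0
coefs = [0.8, 0.5]

# The same one-unit change in predictor 0 shifts the predicted
# proportion by different amounts at different starting points.
low = change_in_proportion(intercept, coefs, [0.0, 0.0], 0)
mid = change_in_proportion(intercept, coefs, [2.0, 1.0], 0)
print(round(low, 3), round(mid, 3))

# By contrast, a hypothetical linear prediction for a 0/1 variable
# can leave the 0-1 range, which the logit version cannot do.
linear_pred = 0.2 + 0.3 * 3.0
logit_pred = inverse_logit(intercept + coefs[0] * 30.0)
print(linear_pred > 1.0, 0.0 < logit_pred <= 1.0)
```

Because the change in the proportion depends on the starting values
of all predictors, it has to be reported for stated predictor
values rather than as a single number.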


Path Analysis. The interpretation of multiple
regression coefficients can be difficult or impossible
when the predictors include an undifferentiated
set of causes, consequences, or spurious correlates
of the predicted variable. Path analysis was
developed by Sewall Wright (1934) to facilitate the
interpretation of multiple regression coefficients
by making explicit assumptions about causal struc-
ture and including as predictors of a given variable
only those variables that precede that given vari-
able in the assumed causal structure. For example,
if one assumes that Y is influenced by X1 and X2,
and X1 and X2 are, in turn, both influenced by Z1,
Z2, and Z3, this specifies the assumed causal structure.
One may then proceed to write multiple
regression equations to predict X1, X2, and Y,
including in each equation only those variables
that come prior in the assumed causal order. For
example, the Z variables are appropriate predictors
in the equation predicting X1 because they are
assumed causes of X1. But X2 is not an appropriate
predictor of X1 because it is assumed to be a
spurious correlate of X1 (i.e., X1 and X2 are presumed
to be correlated only because they are both
influenced by the Z variables, not because one
influences the other). And Y is not an appropriate
predictor of X1 because Y is assumed to be an
effect of X1, not one of its causes. When the
assumptions about the causal structure linking a
set of variables have been made explicit, the appro-
priate predictors for each variable have been iden-
tified from this assumed causal structure, and the
resulting equations have been estimated by the
techniques of regression analysis, the result is a
path analysis, and each of the resulting coefficients
is said to be a ‘‘path coefficient’’ (if expressed in
standardized form) or a ‘‘path regression coeffi-
cient’’ (if expressed in unstandardized form).
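
The procedure described above can be sketched in code. The example
below is only an illustration under stated assumptions: the data
are simulated, the generating coefficients are hypothetical, and
the equation for Y is assumed to include only X1 and X2 (its
assumed direct causes). The helper names `solve`, `standardize`,
and `path_coefficients` are introduced here for the sketch.

```python
import random

def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def standardize(v):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(v)
    mean = sum(v) / n
    sd = (sum((u - mean) ** 2 for u in v) / n) ** 0.5
    return [(u - mean) / sd for u in v]

def path_coefficients(outcome, predictors):
    """Standardized regression coefficients of outcome on predictors."""
    y = standardize(outcome)
    xs = [standardize(p) for p in predictors]
    n = len(y)
    # Normal equations for standardized variables: Rxx * b = rxy.
    rxx = [[sum(p * q for p, q in zip(xi, xj)) / n for xj in xs] for xi in xs]
    rxy = [sum(p * q for p, q in zip(xi, y)) / n for xi in xs]
    return solve(rxx, rxy)

# Simulated data that follow the assumed causal order Z -> X -> Y.
rng = random.Random(0)
n = 5000
z1 = [rng.gauss(0, 1) for _ in range(n)]
z2 = [rng.gauss(0, 1) for _ in range(n)]
z3 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [0.5 * a + 0.3 * b + 0.2 * c + rng.gauss(0, 1)
      for a, b, c in zip(z1, z2, z3)]
x2 = [0.4 * a + 0.1 * b + 0.6 * c + rng.gauss(0, 1)
      for a, b, c in zip(z1, z2, z3)]
y = [0.7 * a + 0.4 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Each equation uses only variables that come prior in the causal order.
paths_x1 = path_coefficients(x1, [z1, z2, z3])  # X2 and Y excluded
paths_x2 = path_coefficients(x2, [z1, z2, z3])  # X1 and Y excluded
paths_y = path_coefficients(y, [x1, x2])
print(paths_x1, paths_x2, paths_y)
```

Dropping the call to `standardize` inside `path_coefficients` would
require adding an intercept term, and would yield unstandardized
path regression coefficients instead.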

If the assumed causal structure is correct, a
path analysis allows one to ‘‘decompose’’ a correlation
between two variables into ‘‘direct effects,’’
‘‘indirect effects,’’ and, potentially, a ‘‘spurious
component’’ as well (Land 1969; Bohrnstedt and
Knoke 1988; McClendon 1994).
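
A small numerical sketch may make the decomposition concrete. It
assumes a hypothetical four-variable structure that is not taken
from the text: a common cause C influences both X and Y, and X
influences Y both directly and through a mediator M. All path
coefficients are invented standardized values, and the simulated
data serve only to check that the three components add up to the
correlation.

```python
import random

# Hypothetical standardized path coefficients (illustration only).
a, c = 0.5, 0.3      # common cause: C -> X and C -> Y
b1, b2 = 0.6, 0.4    # mediated path: X -> M and M -> Y
d = 0.2              # X -> Y, not mediated by M

direct = d           # direct effect of X on Y
indirect = b1 * b2   # indirect effect of X on Y through M
spurious = a * c     # component due to the common cause C
implied_r = direct + indirect + spurious

def correlation(u, v):
    """Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    var_u = sum((p - mu) ** 2 for p in u)
    var_v = sum((q - mv) ** 2 for q in v)
    return cov / (var_u * var_v) ** 0.5

# Residual standard deviations chosen so that every variable has
# variance 1, which keeps the coefficients standardized.
explained_y = (d ** 2 + b2 ** 2 + c ** 2
               + 2 * d * b2 * b1 + 2 * d * c * a + 2 * b2 * c * a * b1)
sd_x = (1 - a ** 2) ** 0.5
sd_m = (1 - b1 ** 2) ** 0.5
sd_y = (1 - explained_y) ** 0.5

rng = random.Random(1)
xs, ys = [], []
for _ in range(20000):
    c_val = rng.gauss(0, 1)
    x = a * c_val + rng.gauss(0, sd_x)
    m = b1 * x + rng.gauss(0, sd_m)
    y = d * x + b2 * m + c * c_val + rng.gauss(0, sd_y)
    xs.append(x)
    ys.append(y)

sample_r = correlation(xs, ys)
print(direct, indirect, spurious, implied_r, sample_r)
```

The decomposition holds only if the assumed causal structure is
correct; under a different structure the same correlation would be
divided differently.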

For example, we may consider the correlation
between the occupational achievement of a set of
fathers and the occupational achievement of their
sons. Some of this correlation may occur because
the father’s occupational achievement influences
the educational attainment of the son, and the
son’s educational attainment, in turn, influences
his occupational achievement. This is an ‘‘indirect
effect’’ of the father’s occupational achievement
on the son’s occupational achievement ‘‘through’’
(or ‘‘mediated by’’) the son’s education. A ‘‘direct
effect,’’ on the other hand, is an effect that is not
mediated by any variable included in the analysis.
Such mediating variables could probably be found,
but if they have not been identified and included
in this particular analysis, then the effects mediat-
ed through them are grouped together as the
‘‘direct effect’’—that is, an effect not mediated by
variables included in the analysis. If the father’s
occupational achievement and the son’s occupa-
tional achievement are also correlated, in part,
because both are influenced by a common cause
(e.g., a common hereditary variable), then that