In a scatterplot, the predictor variable is traditionally represented on the abscissa, or
X-axis, and the criterion variable on the ordinate, or Y-axis. If the eventual purpose of the
study is to predict one variable from knowledge of the other, the distinction is obvious; the
criterion variable is the one to be predicted, whereas the predictor variable is the one from
which the prediction is made. If the problem is simply one of obtaining a correlation coef-
ficient, the distinction may be obvious (incidence of cancer would be dependent on amount
smoked rather than the reverse, and thus incidence would appear on the ordinate), or it may
not (neither running speed nor number of trials to criterion is obviously in a dependent
position relative to the other). Where the distinction is not obvious, it is irrelevant which
variable is labeled X and which Y.
Consider the three scatter diagrams in Figure 9.1. Figure 9.1a is plotted from data re-
ported by St. Leger, Cochrane, and Moore (1978) on the relationship between infant mor-
tality, adjusted for gross national product, and the number of physicians per 10,000
population.^1 Notice the fascinating result that infant mortality increases with the number
of physicians. That is certainly an unexpected result, but it is almost certainly not due to
chance. (As you look at these data and read the rest of the chapter you might think about
possible explanations for this surprising result.)
The lines superimposed on Figures 9.1a–9.1c represent those straight lines that “best
fit the data.” How we determine that line will be the subject of much of this chapter. I have
included the lines in each of these figures because they help to clarify the relationships.
These lines are what we will call the regression lines of Y predicted on X (abbreviated “Y
on X”), and they represent our best prediction of Yᵢ for a given value of Xᵢ, for the ith subject
or observation. Given any specified value of X, the corresponding height of the regression
line represents our best prediction of Y (designated Ŷ, and read “Y hat”). In other
words, we can draw a vertical line from Xᵢ to the regression line and then move horizontally
to the Y-axis and read Ŷᵢ.
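Reading a prediction off the regression line can also be done numerically. The following is only a minimal sketch: it assumes the "best fit" line is the ordinary least-squares line, and it uses a small made-up data set, not the St. Leger, Cochrane, and Moore data. The names fit_line and predict are illustrative helpers, not anything defined in this chapter.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of Y on X.

    Assumption: "best fit" is taken to mean ordinary least squares.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products of deviations, and sum of squared X deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(x, intercept, slope):
    """Height of the regression line at x, i.e. the predicted value Y-hat."""
    return intercept + slope * x

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

intercept, slope = fit_line(xs, ys)
y_hat = predict(3, intercept, slope)  # predicted Y for a subject with X = 3
print(f"Y-hat = {intercept:.2f} + {slope:.2f} * X; at X = 3, Y-hat = {y_hat:.2f}")
```

Calling predict with a given X is the numerical counterpart of drawing a vertical line up to the regression line and reading across to the Y-axis.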
The degree to which the points cluster around the regression line (in other words, the
degree to which the actual values of Y agree with the predicted values) is related to the
correlation (r) between X and Y. Correlation coefficients range between +1 and −1. For
Figure 9.1a, the points cluster very closely about the line, indicating that there is a strong
linear relationship between the two variables. If the points fell exactly on the line, the correlation
would be +1.00. As it is, the correlation is actually .81, which represents a high
degree of relationship for real variables in the behavioral sciences.
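To make the range of r concrete, here is a short sketch that computes the Pearson product-moment correlation for three made-up data sets. It assumes that r here is the usual Pearson coefficient; pearson_r is an illustrative helper, and none of the numbers come from Figure 9.1.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation: cross-product of deviations scaled by both spreads."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
rising = [2 * x + 1 for x in xs]     # points exactly on a rising line
falling = [10 - 3 * x for x in xs]   # points exactly on a falling line
scattered = [2, 4, 5, 4, 5]          # points only loosely clustered around a line

r_rising = pearson_r(xs, rising)        # at the upper limit of the range
r_falling = pearson_r(xs, falling)      # at the lower limit of the range
r_scattered = pearson_r(xs, scattered)  # strictly between the two limits

print(r_rising, r_falling, round(r_scattered, 2))
```

Swapping the two arguments of pearson_r leaves the result unchanged, which is why, when only a correlation is wanted, it does not matter which variable is labeled X and which Y.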
In Figure 9.1b I have plotted data on the relationship between life expectancy (for
males) and per capita expenditure on health care for 23 developed (mostly European) coun-
tries. These data are found in Cochrane, St. Leger, and Moore (1978). At a time when there
is considerable discussion nationally about the cost of health care, these data give us pause.
If we were to measure the health of a nation by life expectancy (admittedly not the only,
and certainly not the best, measure), it would appear that the total amount of money we
spend on health care bears no relationship to the resultant quality of health (assuming that
different countries apportion their expenditures in similar ways). (Several hundred thou-
sand dollars spent on transplanting an organ from a baboon into a 57-year-old male, as was
done a few years ago, may increase his life expectancy by a few years, but it is not going to
make a dent in the nation’s life expectancy. A similar amount of money spent on prevention
efforts with young children, however, may eventually have a very substantial effect,
hence the inclusion of this example in a text primarily aimed at psychologists.) The two
(^1) Some people have asked how mortality can be negative. The answer is that this is the mortality rate adjusted for
gross national product. After adjustment the rate can be negative.