In addition, the percentage of students taking the SAT varies drastically from state to state,
with 81% of the students in Connecticut taking the exam but only 4% of those in Utah. The states with
the lowest percentages tend to be in the Midwest, with the highest in the Northeast. In states
where a small percentage of the students are taking the exam, those are most likely to be the
best students who have their eyes on being admitted to the best schools. These are students
who are likely to do well. In Massachusetts and Connecticut, where most of the students take
the SAT—the less able as well as the more able—the poorer students are going to lower the
state average relative to states whose best students are mainly the ones being tested. If this
were true, we would expect to see a negative relationship between the percentage of students
taking the exam and the state’s mean score. This is what we see when we look at the correlation
between SAT and LogPctSAT and at the scatterplot in the lower right of Figure 15.1.
Looking at One Predictor While Controlling for Another
The question that now arises is what would happen if we used both variables (Expend and
LogPctSAT) simultaneously as predictors of the SAT score. What this really means, though
it may not be immediately obvious, is that we will look at the relationship between Expend
and SAT controlling for LogPctSAT. (We will also look at the relationship between
LogPctSAT and SAT controlling for Expend.) When I say that we are controlling for
LogPctSAT, I mean that we are looking at the relationship while holding LogPctSAT constant.
Imagine that we had many thousands of states instead of only 50. Imagine also that we
could pull out a collection of states that had exactly the same percentage of students taking
the SAT—e.g., 60%. Then we could look at only the students from those states and compute
the correlation and regression coefficient for predicting SAT from Expend. Then we could
draw another sample of states, perhaps those with 40% of their students taking the exam.
Again we could correlate Expend and SAT for only those states and compute a regression
coefficient. Notice that I have calculated 2 correlations and 2 regression coefficients here,
each with PctSAT held constant at a specific value (40% or 60%). Because we are only
imagining that we had thousands of states, we can go further and imagine that we repeated
this process many times, with PctSAT held at a specific value each time. For each of those
analyses we would obtain a regression coefficient for the relationship between Expend and
SAT, and an average of those many regression coefficients will be very close to the overall
regression coefficient that we will shortly examine. The same would be true if we averaged the
correlations. (Without introducing a more complex model we are assuming that whatever the
relationship between SAT and Expend, it is the same for each level of PctSAT.)
Because in our imaginary exercise each correlation is based on a sample with a fixed
value of LogPctSAT, each correlation is independent of LogPctSAT. In other words, if
every state included in one of our correlations had 35% of its students taking the SAT, then
LogPctSAT doesn’t vary and it can’t have an effect on the relationship between Expend
and SAT. That means that our correlation and regression coefficient between those two
variables have controlled for LogPctSAT.
Obviously we don’t have thousands of states—we only have 50 and that is not likely to
get much larger. However, that does not stop us from mathematically estimating what we
would obtain if we could carry out the imaginary exercise that I just explained. And that is
exactly what multiple regression is all about.
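
The imaginary exercise described above can be tried out on made-up data. What follows is only a minimal sketch in Python with NumPy, not an analysis of the actual state data: the number of "states," the participation levels, and every coefficient in the data-generating model are assumptions chosen purely for illustration. The sketch computes the Expend slope within each fixed participation level, averages those slopes, and compares the average with the Expend coefficient from a regression that uses both predictors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: many thousands of "states", as in the thought experiment.
n = 20000
pct_levels = np.arange(5, 85, 5)        # assumed participation rates (%)
pct = rng.choice(pct_levels, size=n)    # each state gets one of those rates
log_pct = np.log(pct)

# Assumed data-generating model (NOT the real SAT data):
# spending tends to be higher where participation is higher, and the mean
# score rises with spending but falls with log participation.
expend = 3.0 + 0.04 * pct + rng.normal(0, 1, n)
sat = 1100 + 10 * expend - 80 * log_pct + rng.normal(0, 30, n)

def slope(x, y):
    """Slope of the simple regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# 1. Simple regression of SAT on Expend, ignoring participation.
simple_slope = slope(expend, sat)

# 2. The imaginary exercise: one slope per fixed participation level,
#    then the average of those slopes.
within = [slope(expend[pct == p], sat[pct == p]) for p in pct_levels]
avg_within = np.mean(within)

# 3. Multiple regression of SAT on Expend and LogPctSAT together.
X = np.column_stack([np.ones(n), expend, log_pct])
coefs, *_ = np.linalg.lstsq(X, sat, rcond=None)
multiple_slope = coefs[1]               # coefficient for Expend

print("simple slope:            ", round(simple_slope, 2))
print("average within-level slope:", round(avg_within, 2))
print("multiple-regression slope: ", round(multiple_slope, 2))
```

With these assumed values the simple slope should come out negative, whereas the averaged within-level slope and the multiple-regression coefficient should both be positive and close to each other. The agreement depends on the assumption built into the simulation that the Expend slope is the same at every level of participation.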
The Multiple Regression Equation
There are ways to think about multiple regression other than fixing the level of one or
more variables, but before I discuss those I will go ahead and run a multiple regression on
these data. I used SPSS to do so, and the results are shown in Exhibit 15.1. I specifically