The first step in examining the data has already been carried out in Figure 15.1 with
graphical presentations of important variables. At that point we noted that most of the variables
were fairly messy, with the percentage of students taking the SAT being decidedly bimodal.
SAT scores were also somewhat bimodal, and much of that can probably be related to
the bimodal nature of PctSAT. For reasons that will become clear shortly we used the log
of PctSAT rather than PctSAT itself. This at least had the effect of reducing the curvilinear
relationship between the SAT scores and the percentage of students in each state taking the
SAT. None of our variables had extreme outliers, especially after we used a log transforma-
tion of PctSAT.
The fact that we don’t have more outliers when we look at the variables individually does
not necessarily mean that all is well. There is still the possibility of having multivariate
outliers. A case might seem to have reasonable scores on each of the variables taken separately
but have an unusual combination of scores on two or more variables. For example, it is
not uncommon to be 6 feet tall, nor is it uncommon to weigh 125 pounds. But it clearly would
be unusual to be 6 feet tall and weigh 125 pounds.
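The height and weight example can be made concrete with a small sketch. The sample below is hypothetical (it is not taken from any data set in this chapter), and the squared Mahalanobis distance is used only as one common index of how unusual a combination of scores is; it is an assumption of this illustration, not a statistic introduced above. The case of interest is 72 inches and 125 pounds.

```python
from statistics import mean, stdev

# Hypothetical reference sample (NOT from the text): height (inches), weight (pounds).
heights = [62, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]
weights = [115, 125, 130, 140, 145, 150, 160, 165, 175, 180, 190, 200]


def mahalanobis_sq(x, y, xs, ys):
    """Squared Mahalanobis distance of the point (x, y) from the centroid of (xs, ys)."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((a - mx) ** 2 for a in xs) / (n - 1)
    syy = sum((b - my) ** 2 for b in ys) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = x - mx, y - my
    # Quadratic form using the inverse of the 2 x 2 covariance matrix.
    return (syy * dx ** 2 - 2 * sxy * dx * dy + sxx * dy ** 2) / det


# The case in question: 6 feet tall (72 inches) and 125 pounds.
case_height, case_weight = 72, 125

# Univariate view: standardized score on each variable taken separately.
z_height = (case_height - mean(heights)) / stdev(heights)
z_weight = (case_weight - mean(weights)) / stdev(weights)

# Multivariate view: distance from the centroid, allowing for the correlation.
d2 = mahalanobis_sq(case_height, case_weight, heights, weights)

print(f"z (height) = {z_height:.2f}, z (weight) = {z_weight:.2f}")
print(f"squared Mahalanobis distance = {d2:.1f}")
```

In this hypothetical sample each standardized score for the case falls within two standard deviations of its mean, yet the squared distance from the centroid is large because height and weight are strongly correlated.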
Having temporarily satisfied ourselves that the data set does not contain unreasonable
data points and that the distributions are not seriously distorted, a useful second step is to
conduct a preliminary regression analysis using all the variables, as we have done. I say
“preliminary” because the point here is to use that analysis to examine the data rather than
as an end in itself.
Instead of jumping directly into the educational expenditure data set, we will first in-
vestigate diagnostic tools with a smaller data set created to illustrate the use of those tools.
These data are shown below and are plotted in Figure 15.5.
X:   1   1   3   3   3   4   5   5   7   6  10  13
Y:   1   2   3   5   7   6   8  10  10   5   4  14
The three primary classes of diagnostic statistics, each of which is represented in
Figure 15.5, are
1. Distance, which is useful in identifying potential outliers in the dependent variable (Y).
2. Leverage ($h_i$), which is useful in identifying potential outliers in the independent variables
($X_1, X_2, \ldots, X_p$).
3. Influence, which combines distance and leverage to identify unusually influential observations.
An observation is influential if the location of the regression surface would
change markedly depending on the presence or absence of that observation.
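As a rough sketch of how the three classes of statistics might be computed for the small data set, the code below fits the least-squares line of Y on X and then obtains a residual, a leverage value, and Cook's D for each case. Cook's D is used only as one common measure of influence and is an assumption of this sketch. The split of the printed digits into 12 (X, Y) pairs is likewise an assumption, and all variable names are illustrative.

```python
from statistics import mean

# Data as read from the listing in the text; how the printed digits
# separate into 12 pairs is an assumption of this sketch.
X = [1, 1, 3, 3, 3, 4, 5, 5, 7, 6, 10, 13]
Y = [1, 2, 3, 5, 7, 6, 8, 10, 10, 5, 4, 14]

n = len(X)
x_bar, y_bar = mean(X), mean(Y)

# Least-squares slope and intercept for the regression of Y on X.
ss_x = sum((x - x_bar) ** 2 for x in X)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / ss_x
intercept = y_bar - slope * x_bar

# Distance: the residual, Y_i minus the predicted value.
residuals = [y - (intercept + slope * x) for x, y in zip(X, Y)]

# Leverage with a single predictor: h_i = 1/n + (X_i - X_bar)^2 / SS_X.
leverage = [1 / n + (x - x_bar) ** 2 / ss_x for x in X]

# Influence: Cook's D (an assumed choice of influence measure), which
# grows with both the squared residual and the leverage.
p = 2  # parameters estimated: intercept and slope
mse = sum(e ** 2 for e in residuals) / (n - p)
cooks_d = [
    (e ** 2 / (p * mse)) * (h / (1 - h) ** 2)
    for e, h in zip(residuals, leverage)
]

print(f"Y-hat = {intercept:.3f} + {slope:.3f} X")
print(f"{'X':>3} {'Y':>3} {'resid':>8} {'h_i':>7} {'Cook D':>8}")
for x, y, e, h, d in zip(X, Y, residuals, leverage, cooks_d):
    print(f"{x:>3} {y:>3} {e:>8.3f} {h:>7.3f} {d:>8.3f}")
```

Scanning the three columns separately shows that a case can be extreme on one statistic without being extreme on the others, which is why all three are worth examining.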
Our most common measure of distance is the residual ($Y_i - \hat{Y}_i$). It measures the vertical
distance between any point and the regression line. Points A and C in Figure 15.5 have
Figure 15.5 Scatterplot of Y on X