large residuals (they lie far from the regression line). Such points may represent random
error, they may be data that are incorrectly recorded, or they may reflect unusual cases
that don’t really belong in this data set. (An example of this last point would arise if we
were trying to predict physical reaction time as a function of cognitive processing features
of a task, and our subjects included one individual who suffered from a neuromuscular
disorder that seriously slowed his reaction time.) Residuals are a standard feature of all
regression analyses, and you should routinely request and examine them in running your
analyses.
Leverage (often denoted $h_i$, or “hat diag”) measures the degree to which a case is unusual
with respect to the predictor variables. In the case of one predictor, leverage is simply a
function of the deviation of the score on that predictor from the predictor mean. Point B in
Figure 15.5 is an example of a point with high leverage because the X score for that point (13)
is far from $\bar{X}$. Most programs for multiple regression compute and print the leverage of each
observation if requested. Possible values of leverage range from a low of $1/N$ to a high of
1.0, with a mean of $(p + 1)/N$, where $p$ = the number of predictors. Stevens (1992) recommends
looking particularly closely at those leverage values that exceed $3(p + 1)/N$.
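As a rough illustration of the one-predictor case, the sketch below computes leverage for the X values listed in Exhibit 15.2. It assumes the standard simple-regression formula $h_i = 1/N + (X_i - \bar{X})^2 / \sum_j (X_j - \bar{X})^2$, which is not written out in the text; the function name `leverage` is made up for this example.

```python
# X values for the 12 observations listed in Exhibit 15.2
x = [1, 1, 3, 3, 3, 4, 5, 5, 6, 7, 10, 13]


def leverage(x):
    """Leverage h_i for a one-predictor regression (assumed standard formula)."""
    n = len(x)
    mean_x = sum(x) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)  # sum of squared deviations of X
    return [1 / n + (xi - mean_x) ** 2 / ss_x for xi in x]


h = leverage(x)
n = len(x)
p = 1                       # number of predictors
mean_h = sum(h) / n         # should equal (p + 1) / N
cutoff = 3 * (p + 1) / n    # screening value attributed to Stevens (1992)

for obs, (xi, hi) in enumerate(zip(x, h), start=1):
    flag = "  <-- exceeds 3(p + 1)/N" if hi > cutoff else ""
    print(f"obs {obs:2d}  X = {xi:2d}  h = {hi:.2f}{flag}")
print(f"mean leverage = {mean_h:.4f}, cutoff = {cutoff:.2f}")
```

With these data only the observation at X = 13 should exceed the cutoff, and the computed values should come close to the HAT DIAG column of the exhibit.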
Points that are high on either distance or leverage do not necessarily have an important
influence on the regression, but they have the potential for it. In order for a point to be high
on influence, it must have relatively high values on both distance and leverage. In Figure
15.5, Point B is very high on leverage, but it has a relatively small residual (distance). Point
A, on the other hand, has a large residual but, because it is near the mean on X, has low
leverage. Point C is high on leverage and has a large residual, suggesting that it is high on
influence. The most common measure of influence is known as Cook’s D. It is a function
of the sum of the squared changes in $b_j$ that would occur if the $i$th observation were
removed from the data and the analysis rerun.
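The "remove and rerun" idea can be sketched directly. The code below deletes each observation in turn, refits the line, and measures how far the fitted values move. It assumes one common way of writing Cook's D, $D_i = \sum_j (\hat{Y}_j - \hat{Y}_{j(i)})^2 / [(p + 1)\,MS_{\text{residual}}]$, which is algebraically a function of the changes in the coefficients; the helper names `fit_line` and `cooks_d` are hypothetical, and the data are the X and Y values listed in Exhibit 15.2.

```python
# X and Y values for the 12 observations listed in Exhibit 15.2
x = [1, 1, 3, 3, 3, 4, 5, 5, 6, 7, 10, 13]
y = [1, 2, 3, 5, 7, 6, 8, 10, 5, 10, 4, 14]


def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sp_xy / ss_x
    return mean_y - slope * mean_x, slope


def cooks_d(x, y):
    """Cook's D by deleting each case and refitting (assumed formula, see lead-in)."""
    n = len(x)
    n_coef = 2  # p + 1 coefficients: intercept and one slope
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - n_coef)
    result = []
    for i in range(n):
        # Refit without observation i, then predict all N cases from that fit
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((fi - (a_i + b_i * xi)) ** 2 for xi, fi in zip(x, fitted))
        result.append(shift / (n_coef * mse))
    return result


d = cooks_d(x, y)
for obs, di in enumerate(d, start=1):
    print(f"obs {obs:2d}  Cook's D = {di:.2f}")
```

The largest value should belong to the 11th observation, the case that combines a large residual with fairly high leverage.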
Exhibit 15.2 contains various diagnostic statistics for the data shown in Figure 15.5.
These diagnostics were produced by SAS, but similar statistics would be produced by
almost any other program.
To take the diagnostic statistics in order, consider first the column headed Resid, which
is a measure of distance. This column reflects what we can already see in Figure 15.5—
that the 8th and 11th observations have the largest residuals. Considering that the Y values
range only from 1 to 14, a residual of -5.89 seems substantial.
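For readers who want to check the Resid column by hand, here is a minimal sketch that fits the least-squares line to the X and Y values listed in Exhibit 15.2 and ranks the cases by absolute residual. The variable names are illustrative, and the printed values may differ from the exhibit in the last decimal place.

```python
# X and Y values for the 12 observations listed in Exhibit 15.2
x = [1, 1, 3, 3, 3, 4, 5, 5, 6, 7, 10, 13]
y = [1, 2, 3, 5, 7, 6, 8, 10, 5, 10, 4, 14]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares slope and intercept for one predictor
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Observation numbers (1-based), ordered from largest to smallest |residual|
ranked = sorted(range(1, n + 1), key=lambda obs: abs(residuals[obs - 1]), reverse=True)

for obs in ranked[:3]:
    print(f"obs {obs:2d}  predicted = {predicted[obs - 1]:5.2f}  "
          f"residual = {residuals[obs - 1]:5.2f}")
print(f"Y ranges from {min(y)} to {max(y)}")
```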
15.10 Regression Diagnostics 541
Exhibit 15.2 Diagnostic statistics for data in Figure 15.5

          OBS    X    Y    PRED    RESID   RSTUDENT   HAT DIAG H    MSE    COOK’S D
            1    1    1    3.23    -2.23     -0.87       0.20       8.22     0.10
            2    1    2    3.23    -1.22     -0.47       0.20       8.71     0.03
            3    3    3    4.71    -1.71     -0.62       0.11       8.55     0.03
            4    3    5    4.71     0.29      0.10       0.11       8.91     0.00
            5    3    7    4.71     2.29      0.85       0.11       8.26     0.05
            6    4    6    5.45     0.55      0.19       0.09       8.88     0.00
            7    5    8    6.19     1.81      0.65       0.08       8.52     0.02
“A” ->      8    5   10    6.19     3.81      1.49       0.08       7.16     0.09
            9    6    5    6.93    -1.93     -0.69       0.09       8.46     0.02
           10    7   10    7.77     2.33      0.86       0.11       8.24     0.05
“C” ->     11   10    4    9.89    -5.89     -3.54       0.26       3.73     1.01
“B” ->     12   13   14   12.11     1.89      0.98       0.54       8.06     0.55