Stepwise Regression
The stepwise regression method is more or less the reverse of the backward elimination
method.^11 However, because at each stage we do not have all of the other variables in the
model and therefore immediately available to test, as we did with backward elimination,
we will go about it in a slightly different way.
Stepwise regression relies on the fact that
$$R^2_{0.123 \ldots p} = r^2_{01} + r^2_{0(2.1)} + r^2_{0(3.12)} + \cdots$$
If we define variable 1 as that variable with the highest validity (correlation with the
criterion), then the first step in the process involves only variable 1. We then calculate all
semipartials of the form $r_{0(i.1)}$, $i = 2 \ldots p$. The variable (assume that it is $X_2$) with the
highest (first-order) semipartial correlation with the criterion is the one that will produce the
greatest increment in $R^2$. This variable is then entered and we obtain the regression of $Y$ on
$X_1$ and $X_2$. We now test to see whether that variable contributes significantly to the model
containing two variables. We could either test the regression coefficient or the semipartial
correlation directly, or test to see if there was a significant increment in $R^2$. The result would
be the same. Because the test on the increment in $R^2$ will prove useful later, we will do it
that way here. A test on the difference between an $R^2$ based on $f$ predictors and an $R^2$ based
on $r$ predictors (where the $r$ predictors are a subset of the $f$ predictors) is given by
$$F(f - r,\; N - f - 1) = \frac{(N - f - 1)(R^2_f - R^2_r)}{(f - r)(1 - R^2_f)}$$
where $R^2_f$ is the $R^2$ for the full model (here $R^2_{0.12}$), $R^2_r$ is the $R^2$ for the reduced
model (here $R^2_{0.1}$), $f$ is the number of predictors in the full model, and $r$ is the number of
predictors in the reduced model.
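The test on the increment in R^2 is easy to compute by hand, but a short function makes the
arithmetic explicit. What follows is only a minimal sketch in Python: the function name
r2_increment_f and the numerical values (N = 50, R^2 of .30 with one predictor and .40 with
two) are hypothetical, chosen for illustration and not taken from any data set in the text.

```python
def r2_increment_f(r2_full, r2_reduced, n, f, r):
    """F statistic for the increment in R^2 when going from r to f predictors.

    Implements F = (N - f - 1)(R2_f - R2_r) / ((f - r)(1 - R2_f)).
    Returns (F, df_numerator, df_denominator).
    """
    if not (0 <= r < f < n - 1):
        raise ValueError("need 0 <= r < f < N - 1")
    if not (0.0 <= r2_reduced <= r2_full < 1.0):
        raise ValueError("need 0 <= R2_reduced <= R2_full < 1")
    df_num = f - r          # number of predictors added
    df_den = n - f - 1      # residual df for the full model
    f_stat = (df_den * (r2_full - r2_reduced)) / (df_num * (1.0 - r2_full))
    return f_stat, df_num, df_den


# Hypothetical values, for illustration only (not from the text).
f_stat, df1, df2 = r2_increment_f(0.40, 0.30, n=50, f=2, r=1)
print(f"F({df1}, {df2}) = {f_stat:.3f}")
```

With these made-up numbers the statistic is about 7.83 on 1 and 47 df, which would then be
compared with the critical value of F (or converted to a probability) in the usual way.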
This process is repeated until the addition of further variables produces no significant
(by whatever criterion we wish to use) improvement. At each step in the process, before
we add a new variable we first ask whether a variable that was added on an earlier step
should now be removed on the grounds that it is no longer making a significant
contribution. If the test on a variable falls below “F to remove” (or above “p to remove”), that
variable is removed before another variable is added. Procedures that do not include this step
are often referred to as forward selection procedures.
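The logic of the procedure can be sketched in a few lines of Python. This is a rough
illustration under stated assumptions, not the algorithm used by any particular statistical
package: the function names, the thresholds f_enter = 4.0 and f_remove = 3.9, and the small
data set (built so that the predictors are uncorrelated) are all hypothetical. Dropping the
removal check from the loop would turn the sketch into forward selection.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for j in range(col, n + 1):
                m[i][j] -= factor * m[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(data, y, names):
    """R^2 from the least-squares regression of y on the named predictors (with intercept)."""
    if not names:
        return 0.0
    n = len(y)
    cols = [[1.0] * n] + [data[name] for name in names]
    xtx = [[sum(u * v for u, v in zip(c1, c2)) for c2 in cols] for c1 in cols]
    xty = [sum(u * v for u, v in zip(c, y)) for c in cols]
    b = solve(xtx, xty)
    pred = [sum(bj * c[i] for bj, c in zip(b, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def f_change(r2_full, r2_reduced, n, f, r):
    """F for the difference between R^2 with f predictors and R^2 with r of them."""
    if r2_full >= 1.0:
        return float("inf")
    return ((n - f - 1) * (r2_full - r2_reduced)) / ((f - r) * (1.0 - r2_full))


def stepwise(data, y, f_enter=4.0, f_remove=3.9, max_passes=100):
    """Return predictor names in order of entry (thresholds are illustrative assumptions)."""
    if f_remove > f_enter:
        raise ValueError("f_remove should not exceed f_enter, or variables may cycle")
    n = len(y)
    model = []
    for _ in range(max_passes):
        changed = False
        # Before adding anything, ask whether an earlier variable should now be removed.
        if model:
            r2_full = r_squared(data, y, model)
            f_out = {}
            for name in model:
                rest = [v for v in model if v != name]
                f_out[name] = f_change(r2_full, r_squared(data, y, rest),
                                       n, len(model), len(rest))
            weakest = min(f_out, key=f_out.get)
            if f_out[weakest] < f_remove:
                model.remove(weakest)
                changed = True
        # Enter the candidate giving the largest increment in R^2, if it is large enough.
        candidates = [v for v in data if v not in model]
        if candidates and n - len(model) - 2 > 0:
            r2_reduced = r_squared(data, y, model)
            f_in = {name: f_change(r_squared(data, y, model + [name]), r2_reduced,
                                   n, len(model) + 1, len(model))
                    for name in candidates}
            best = max(f_in, key=f_in.get)
            if f_in[best] > f_enter:
                model.append(best)
                changed = True
        if not changed:
            break
    return model


# Hypothetical data: three uncorrelated predictors; y depends on x1 and x2 only.
data = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
y = [-5.0, -4.0, -1.0, -2.0, 2.0, 1.0, 4.0, 5.0]
selected = stepwise(data, y)
print(selected)
```

In this contrived example x1 has the highest validity and enters first, x2 then produces the
largest increment in R^2 and enters second, and x3 (which is unrelated to y) never enters.
With real, correlated predictors the removal check matters far more than it does here.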
Of the three variable selection methods discussed here, the stepwise regression method is
probably the best. Both Draper and Smith (1981) and Darlington (1990) recommend it as the
best compromise between finding an “optimal” equation for predicting future randomly
selected data sets from the same population and finding an equation that predicts the maximum
variance for the specific data set under consideration. I would go even further. Instead of
saying that it is the best compromise, I would say that it is the best of a set of poor choices. I
recommend against any mechanistic way of arriving at a final solution. You need to make use of
what you know about your variables and what you see in separate regressions.
Cross-Validation
The stumbling block for most multiple regression studies is the concept of
cross-validation of the regression equation against an independent data set. For example, we
might break our data into two or more data sets and derive a regression equation for the
(^11) The terminology here is terrible, but you’ll just have to bear with me. Backward elimination is a stepwise
procedure, as is forward selection, but when we refer to the stepwise approach we normally mean the
procedure that I’m about to discuss.