Applied Statistics and Probability for Engineers

12-6 ASPECTS OF MULTIPLE REGRESSION MODELING 453

regressors from a set that quite likely includes all the important variables, but we are sure that
not all these candidate regressors are necessary to adequately model the response Y.
In such a situation, we are interested in variable selection; that is, screening the candidate
variables to obtain a regression model that contains the “best” subset of regressor variables. We
would like the final model to contain enough regressor variables so that in the intended use of the
model (prediction, for example) it will perform satisfactorily. On the other hand, to keep model
maintenance costs to a minimum and to make the model easy to use, we would like the model to
use as few regressor variables as possible. The compromise between these conflicting objectives
is often called finding the “best” regression equation. However, in most problems, no single
regression model is “best” in terms of the various evaluation criteria that have been proposed. A
great deal of judgment and experience with the system being modeled is usually necessary to
select an appropriate set of regressor variables for a regression equation.
No single algorithm will always produce a good solution to the variable selection problem.
Most of the currently available procedures are search techniques, and to perform satisfactorily,
they require interaction with judgment by the analyst. We now briefly discuss some of the more
popular variable selection techniques. We assume that there are K candidate regressors, x1, x2,
…, xK, and a single response variable y. All models will include an intercept term β0, so the
model with all variables included would have K + 1 terms. Furthermore, the functional form of
each candidate variable (for example, x1 = 1/x, x2 = ln x, etc.) is assumed to be correct.

All Possible Regressions
This approach requires that the analyst fit all the regression equations involving one candidate
variable, all regression equations involving two candidate variables, and so on. Then these
equations are evaluated according to some suitable criteria to select the “best” regression
model. If there are K candidate regressors, there are 2^K total equations to be examined. For
example, if K = 4, there are 2^4 = 16 possible regression equations; while if K = 10, there are
2^10 = 1024 possible regression equations. Hence, the number of equations to be examined
increases rapidly as the number of candidate variables increases. However, there are some
very efficient computing algorithms for all possible regressions available and they are widely
implemented in statistical software, so it is a very practical procedure unless the number of
candidate regressors is fairly large.
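The search described above can be sketched with a short NumPy routine that fits every nonempty subset of the candidate regressors by least squares and records its R^2. The data, variable names, and helper functions below are illustrative, not from the text:

```python
# All-possible-regressions sketch: enumerate every nonempty subset of the
# K candidate regressors, fit each by least squares, and record R^2.
import itertools
import numpy as np

def fit_r2(X, y):
    """Fit least squares with an intercept term and return R^2."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

def all_possible_regressions(X, y):
    """Evaluate every nonempty subset of the K candidate columns of X."""
    K = X.shape[1]
    results = {}
    for size in range(1, K + 1):
        for subset in itertools.combinations(range(K), size):
            results[subset] = fit_r2(X[:, list(subset)], y)
    return results

# Simulated example with K = 4 candidates, so 2^4 - 1 = 15 subset models
# (the 16th of the 2^K models is the intercept-only model).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 2.0 + 1.5 * X[:, 0] - 0.8 * X[:, 2] + rng.normal(scale=0.5, size=50)
models = all_possible_regressions(X, y)
print(len(models))  # prints 15
```

The dictionary keys identify each subset, so the analyst can then rank the 2^K models by any of the criteria discussed next.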
Several criteria may be used for evaluating and comparing the different regression models
obtained. A commonly used criterion is based on the value of R^2 or the value of the
adjusted R^2, R^2(adj). Basically, the analyst continues to increase the number of variables in the
model until the increase in R^2 or R^2(adj) is small. Often, we will find that R^2(adj) will
stabilize and actually begin to decrease as the number of variables in the model increases.
Usually, the model that maximizes R^2(adj) is considered to be a good candidate for the best
regression equation. Because we can write R^2(adj) = 1 − {MSE/[SST/(n − 1)]} and SST/(n − 1)
is a constant, the model that maximizes the R^2(adj) value also minimizes the mean square error,
so this is a very attractive criterion.
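The equivalence between maximizing the adjusted R^2 and minimizing MSE can be verified numerically. The sketch below, on simulated data with illustrative names, computes R^2(adj) directly from the identity R^2(adj) = 1 − MSE/[SST/(n − 1)]; since SST/(n − 1) is fixed for a given data set, ranking models by R^2(adj) (descending) and by MSE (ascending) gives the same order:

```python
# Demonstrates the identity R^2_adj = 1 - MSE / [SST / (n - 1)].
import numpy as np

def adjusted_r2_and_mse(X, y):
    """Return (R^2_adj, MSE) for a least-squares fit with an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    p = A.shape[1]                               # parameters, including beta_0
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    mse = sse / (n - p)                          # mean square error
    r2_adj = 1.0 - mse / (sst / (n - 1))         # the identity used in the text
    return r2_adj, mse

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 1.0 + X[:, 0] + rng.normal(size=40)
r2_adj, mse = adjusted_r2_and_mse(X, y)
```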
Another criterion used to evaluate regression models is the Cp statistic, which is a measure
of the total mean square error for the regression model. We define the total standardized
mean square error for the regression model as

Γp = (1/σ²) Σ(i=1..n) E[Ŷi − E(Yi)]²
   = (1/σ²) { Σ(i=1..n) [E(Yi) − E(Ŷi)]² + Σ(i=1..n) V(Ŷi) }
   = (1/σ²) [ (bias)² + variance ]
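In practice, this quantity is estimated by Mallows' Cp statistic, Cp = SSE(p)/σ̂² − n + 2p, where p is the number of parameters in the subset model and σ̂² is usually taken to be the mean square error of the full model containing all K candidates. A sketch of that estimate, on simulated data with illustrative names:

```python
# Mallows' Cp sketch: Cp = SSE(p)/sigma2_hat - n + 2p,
# with sigma^2 estimated by the MSE of the full K-regressor model.
import numpy as np

def sse(X, y):
    """Residual sum of squares for a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

def mallows_cp(X_subset, X_full, y):
    """Cp for a subset model, using the full model's MSE for sigma^2."""
    n = len(y)
    p = X_subset.shape[1] + 1                    # subset parameters, incl. intercept
    k_full = X_full.shape[1] + 1
    sigma2_hat = sse(X_full, y) / (n - k_full)   # MSE of the full model
    return sse(X_subset, y) / sigma2_hat - n + 2 * p

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))
y = 3.0 + 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=60)
cp_sub = mallows_cp(X[:, :2], X, y)   # subset containing both active regressors
cp_full = mallows_cp(X, X, y)         # full model: Cp = K + 1 by construction
```

A subset model with negligible bias should have Cp close to its own parameter count p, while the full model always satisfies Cp = K + 1 exactly, since its SSE defines σ̂².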
