
be found in Draper and Smith (1998, p. 136). The maximum likelihood
result can also be found in Draper and Smith (1998, p. 137).
In practice, once we consider multiple regression, there is the
question of how many candidate variables should be included in the
model. Moreover, some of the variables that we think affect the dependent
variable may be related to each other, so that different
selections of subsets of the variables may produce essentially the same
predictions. In such cases, we have a phenomenon called
multicollinearity.
When this happens, it is not a good idea to include all the variables.
This is because there may be different sets of values that could be used
for the parameters to fit the data almost identically. When this is the
case, the estimates are unstable, meaning that slight changes in the data
could produce large changes in the regression parameters. Consequently,
multicollinearity must be avoided.
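To make this instability concrete, the following small Python sketch
(with synthetic data; the setup is ours for illustration, not taken from
the examples in this chapter) fits a least squares regression with two
nearly collinear predictors and then refits after a tiny perturbation
of the response:

import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly collinear with x1
y = 2.0 * x1 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Refit after a slight perturbation of the response.
y2 = y + rng.normal(scale=0.05, size=n)
beta2, *_ = np.linalg.lstsq(X, y2, rcond=None)

# The fitted values barely change, but the coefficients on x1 and x2
# can shift by a large amount relative to the change in fit, since
# many (b1, b2) pairs with b1 + b2 near 2 fit almost equally well.
print(beta)
print(beta2)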
There are diagnostics for determining when multicollinearity or
near multicollinearity occurs. Belsley et al. (1980) cover this in detail.
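As a rough illustration of one such diagnostic, the Python sketch below
computes variance inflation factors (VIFs) with numpy; a common rule of
thumb treats VIFs above about 10 as a warning sign, and the
condition-number diagnostic of Belsley et al. flags large values (often
quoted as above about 30). The function here is our own illustrative
implementation, not a published one:

import numpy as np

def vif(X):
    # Variance inflation factor for each column of the predictor
    # matrix X (no intercept column): VIF_j = 1 / (1 - R^2_j), where
    # R^2_j comes from regressing column j on the other columns.
    out = []
    for j in range(X.shape[1]):
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        rss = np.sum((X[:, j] - A @ coef) ** 2)
        tss = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out.append(tss / rss)  # equals 1 / (1 - R^2_j)
    return np.array(out)

# A related check is the condition number of the column-scaled design:
# np.linalg.cond(X / np.linalg.norm(X, axis=0))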
Another way to avoid multicollinearity is to use one of the many
possible procedures for selecting a subset of the independent variables.
Among the possibilities are best subset selection (requiring an
evaluation of all possible subsets, which can be a very large number),
forward selection (adding variables one at a time based on an F-to-enter
criterion), backward selection (starting with all variables in the model
and removing them one at a time based on an F-to-exit criterion), and
stepwise selection (at each stage, when a proper subset of the variables
is in the regression model, the F-to-enter and F-to-exit criteria are
both examined to decide whether the next step should be to add or drop
a variable, and which variable to add or remove).
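To show how one of these procedures operates, here is a minimal Python
sketch of forward selection; the function name and the F-to-enter
threshold of 4.0 are illustrative choices rather than fixed conventions:

import numpy as np

def forward_select(X, y, f_to_enter=4.0):
    # Greedy forward selection: at each stage, add the candidate with
    # the largest F statistic, provided it exceeds the F-to-enter value.
    n, p = X.shape
    selected, remaining = [], list(range(p))
    rss_current = np.sum((y - y.mean()) ** 2)  # intercept-only model
    while remaining:
        best = None
        for j in remaining:
            A = np.column_stack([np.ones(n)] + [X[:, k] for k in selected + [j]])
            resid = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
            rss_j = resid @ resid
            f_stat = (rss_current - rss_j) / (rss_j / (n - A.shape[1]))
            if best is None or f_stat > best[1]:
                best = (j, f_stat, rss_j)
        if best[1] < f_to_enter:
            break  # no candidate clears the F-to-enter criterion
        selected.append(best[0])
        remaining.remove(best[0])
        rss_current = best[2]
    return selected  # indices of the chosen columns, in entry order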
Other texts on regression cover these methods in detail; it is not
important to cover them in this text. These methods are all available
in most statistical packages that include multiple regression.
We will illustrate multiple regression by again using the Florida
2000 Presidential Election results. We will attempt to predict Buchanan's
votes in Palm Beach on the basis of the data from all the other counties,
but not simply use Bush's or Gore's or Nader's votes in a simple linear
regression. Rather, we will look at a multiple regression model using
Bush, Gore, and Nader, and the possible subsets of these. We hope to
get a better prediction by using more than one predictor, but we also
realize that these vote totals are positively correlated because of the
common influence of county size on every candidate's total.
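As a sketch of how such a multiple regression could be fit, the Python
code below uses synthetic placeholder totals, since the actual county
data are not reproduced on this page; every number in it is illustrative
only:

import numpy as np

rng = np.random.default_rng(1)
n = 66  # the other Florida counties, with Palm Beach held out

# Placeholder vote totals standing in for the real county data;
# all of them scale with a common county-size factor, which is what
# induces the positive correlation among the predictors.
size = rng.lognormal(10.0, 1.0, n)
bush = size * rng.uniform(0.30, 0.60, n)
gore = size * rng.uniform(0.30, 0.60, n)
nader = size * rng.uniform(0.005, 0.02, n)
buchanan = 0.003 * size + rng.normal(0.0, 20.0, n)

# Fit Buchanan's totals on Bush, Gore, and Nader jointly.
X = np.column_stack([np.ones(n), bush, gore, nader])
beta, *_ = np.linalg.lstsq(X, buchanan, rcond=None)
print("intercept and slopes for Bush, Gore, Nader:", beta)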
