Robert_V._Hogg,_Joseph_W._McKean,_Allen_T._Craig

9.6. A Regression Problem

As a final note, in Model 9.6.1 we have centered the x’s; i.e., subtracted $\bar{x}$ from
$x_i$. In practice, usually we do not precenter the x’s. Instead, we fit the model
$y_i = \alpha^* + \beta x_i + e_i$. In this case, the least squares estimates, and hence the mles, minimize the
sum of squares
$$\sum_{i=1}^{n} (y_i - \alpha^* - \beta x_i)^2. \tag{9.6.11}$$

In Exercise 9.6.1, the reader is asked to show that the estimate of $\beta$ remains the
same as in expression (9.6.5), while $\hat{\alpha}^* = \bar{y} - \hat{\beta}\bar{x}$. We use this noncentered model
in the following example.
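The relationship between the centered and noncentered fits can be checked numerically. The following is a minimal sketch on simulated data, not the Olympic data; the vectors x and y and the object names fit_c and fit_n are hypothetical and chosen only for illustration.

```r
# Hypothetical simulated data (not the Olympic data set).
set.seed(1)
x <- seq(1900, 2000, by = 4)
y <- 12 - 0.004 * x + rnorm(length(x), sd = 0.05)

# Centered fit: y_i = alpha + beta * (x_i - xbar) + e_i
fit_c <- lm(y ~ I(x - mean(x)))
# Noncentered fit: y_i = alpha* + beta * x_i + e_i
fit_n <- lm(y ~ x)

beta_c     <- unname(coef(fit_c)[2])  # slope from the centered fit
beta_n     <- unname(coef(fit_n)[2])  # slope from the noncentered fit
alpha_star <- unname(coef(fit_n)[1])  # intercept from the noncentered fit

# The two slopes should agree, and the noncentered intercept
# should equal ybar - betahat * xbar, up to rounding error.
all.equal(beta_c, beta_n)
all.equal(alpha_star, mean(y) - beta_n * mean(x))
```

Both comparisons should return TRUE, which is consistent with the result stated for Exercise 9.6.1; a numerical check of this kind does not replace the algebraic argument.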


Example 9.6.1 (Men’s 1500 meters). As a numerical illustration, consider data
drawn from the Olympics. The response of interest is the winning time of the men’s
1500 meters, while the predictor is the year of the Olympics. The data were taken
from Wikipedia and can be found in olym1500mara.rda. Assume the R vectors
for the winning times and year are time and year, respectively. There are n = 27
data points. The top panel of Figure 9.6.2 shows a scatterplot of the data that is
computed by the R command
par(mfrow=c(2,1));plot(time~year,xlab="Year",ylab="Winning time")
The winning times are steadily decreasing over time and, based on this plot, a simple
linear model seems reasonable. The time for 2016 is clearly an outlier, but it
is the correct time. Before proceeding to inference, though, we check the quality
of the fit of the model. The following R commands obtain the least squares fit
(overlaying it on the scatterplot in Figure 9.6.2), the fitted values, and the residuals.
These are used to obtain the residual plot that is displayed in the bottom panel of
Figure 9.6.2.
fit <- lm(time~year); abline(fit)
ehat <- fit$resid; yhat <- fit$fitted.values
plot(ehat~yhat,xlab="Fitted values",ylab="Residuals")
Recall a “good” fit is indicated by a random scatter in the residual plot. This does
not appear to be the case. There is a dependence^4 between adjacent points over
time. This dependence is apparent from the scatterplot too. In a time series course,
this dependence would be investigated.
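One simple way to quantify dependence between adjacent residuals is the lag-1 sample correlation. The sketch below uses simulated data with autocorrelated errors, since the Olympic data are not reproduced here; the names year_sim, time_sim, fit_sim, and r1, as well as the error model, are assumptions made only for illustration.

```r
# Hypothetical simulated series with AR(1) errors (not the Olympic data).
set.seed(2)
year_sim <- seq(1896, 2000, by = 4)                    # 27 time points
e_sim <- as.numeric(arima.sim(model = list(ar = 0.7),
                              n = length(year_sim), sd = 0.03))
time_sim <- 12.3 - 0.0044 * year_sim + e_sim

# Least squares fit and residuals, as in the example.
fit_sim  <- lm(time_sim ~ year_sim)
ehat_sim <- resid(fit_sim)

# Lag-1 sample correlation: each residual paired with its predecessor.
n_sim <- length(ehat_sim)
r1 <- cor(ehat_sim[-1], ehat_sim[-n_sim])
r1
```

A value of r1 well away from zero suggests that adjacent residuals are related, which is the kind of pattern a random scatter would not show. With the actual data loaded, the same calculation could be applied to ehat.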
Because of this dependence, the following inference is approximate. The command
summary(fit) produces the table of coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.325411   1.039402  11.858 9.26e-12
year        -0.004376   0.000530  -8.257 1.31e-08
Hence, the prediction equation is $\hat{y} = 12.33 - 0.0044\,\text{year}$. Based on the slope estimate,
we predict the winning time to drop by about 0.004 minutes every year. For a 95%
confidence interval for the slope, the t-critical value via R is qt(.975,25), which
computes to 2.060. Using the standard error in the summary table, the following R
commands compute the confidence interval for the slope parameter:
err=0.000530*2.060;lb=-0.004376-err;ub=-0.004376+err;ci=c(lb,ub)
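The same interval can be written as estimate ± t-critical value × standard error. The following sketch reuses only the numbers printed in the coefficient table; the variable names est, se, dof, and tcrit are hypothetical, and the conversion to seconds per four-year cycle is an added illustration that assumes the Games are four years apart.

```r
# Values copied from the summary table for the slope (year).
est <- -0.004376   # slope estimate, minutes per year
se  <-  0.000530   # standard error of the slope
dof <- 25          # n - 2 = 27 - 2 degrees of freedom

# t-critical value for a 95% two-sided interval.
tcrit <- qt(0.975, dof)

# Confidence interval: estimate -/+ tcrit * se.
ci <- est + c(-1, 1) * tcrit * se
round(ci, 5)       # -0.00547 -0.00328

# Point estimate expressed in seconds per four-year cycle (assumed spacing).
est * 4 * 60
```

The interval excludes zero, in agreement with the small p-value for year in the table. With the data loaded and fit defined as above, confint(fit, "year") should return essentially the same interval without retyping the table values.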


(^4) This dependence is not surprising. The runners race against each other but they also try to
beat the Olympic record.
