to predict the weight of the student whose height is on the slip of paper you have drawn. What
is your best guess as to the weight of the student?
solution: In the absence of any known relationship between height and weight, your best guess
would have to be the average weight of all the students. You know the weights vary about the
average and that is about the best you could do.
If we guessed at the weight of each student using the average, we would be wrong most of the time. If
we took each of those errors and squared them, we would have what is called the sum of squares total
(SST). It’s the total squared error of our guesses when our best guess is simply the mean of the weights of
all students, and represents the total variability of y.
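The idea of SST can be sketched in a few lines of Python. The weights below are invented for illustration; the point is only that SST is the squared error you accumulate when every prediction is simply the mean.

```python
# Hypothetical weights (in pounds) for a small class.
weights = [120, 135, 150, 110, 165, 140]

mean_w = sum(weights) / len(weights)          # the "best guess" for every student
sst = sum((w - mean_w) ** 2 for w in weights) # sum of squares total (SST)

print(mean_w, sst)
```

Each student's error is the distance from the mean, and SST is the sum of those squared distances, i.e., the total variability of y.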
Now suppose we have a least-squares regression line that we want to use as a model for predicting
weight from height. It is, of course, the LSRL we discussed in detail earlier in this chapter, and our hope
is that there will be less error in prediction than by using ȳ. Now, we still have errors from the
regression line (called residuals, remember?). We call the sum of the squares of those errors the sum of squared
errors (SSE). So, SST represents the total error from using ȳ as the basis for predicting weight from
height, and SSE represents the total error from using the LSRL. SST – SSE represents the benefit of using
the regression line rather than ȳ for prediction. That is, by using the LSRL rather than ȳ, we have
explained a certain proportion of the total variability by regression.
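The comparison between SST and SSE can be sketched as follows. The height/weight pairs are made up for illustration, and the LSRL is fit with the usual least-squares formulas:

```python
# Hypothetical data: heights (inches) and weights (pounds).
heights = [62, 65, 68, 61, 72, 66]
weights = [120, 135, 150, 110, 165, 140]
n = len(heights)

mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Least-squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
sxx = sum((x - mean_x) ** 2 for x in heights)
b = sxy / sxx
a = mean_y - b * mean_x

# SST: squared error predicting with the mean.
sst = sum((y - mean_y) ** 2 for y in weights)
# SSE: squared error (sum of squared residuals) predicting with the LSRL.
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(heights, weights))

print(sst, sse, sst - sse)
```

SSE can never exceed SST, because the least-squares line fits the data at least as well as the horizontal line at the mean; the difference SST – SSE is the variability the regression has explained.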
The proportion of the total variability in y that is explained by the regression of y on x is called the
coefficient of determination. The coefficient of determination is symbolized by r². Based on the
above discussion, we note that

r² = (SST – SSE)/SST.
It can be shown algebraically, although it isn't easy to do so, that this r² is actually the square of the
familiar r, the correlation coefficient. Many computer programs will report the value of r² only (usually
as "R-sq"), which means that we must take the square root of r² if we want to know r (remember
that r and b, the slope of the regression line, are either both positive or both negative, so you can check the
sign of b to determine the sign of r if all you are given is r²). The TI-83/84 calculator will report both r
and r², as well as the regression coefficients, when you do LinReg(a+bx).
example: Consider the following output for a linear regression:
We can see that the LSRL for these data is ŷ = –1.95 + 0.8863x, and r² = 53.2% = 0.532. This means that
53.2% of the total variability in y can be explained by the regression of y on x. Further,
r = √0.532 = 0.729 (r is positive since b = 0.8863 is positive). We learn more about the other items in the
printout later.
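The arithmetic of recovering r from an "R-sq" value can be sketched as follows, using the numbers from the example output (0.532 for R-sq and 0.8863 for the slope):

```python
import math

r_sq = 0.532   # "R-sq" from the printout
b = 0.8863     # slope from the printout

# r has the same magnitude as sqrt(r_sq) and the same sign as the slope b.
r = math.copysign(math.sqrt(r_sq), b)

print(round(r, 3))  # → 0.729
```

Had the slope been negative, `copysign` would have returned –0.729 instead.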
You might note that there are two standard errors (estimates of population standard deviations) in the