The scatterplot (see page 94) leads us to believe that the form of this relationship is linear. This, together with the fact that r = 0.864 for these data, leads us to say that we have a strong, positive, linear association between
the variables. Suppose we wanted to predict the score of a person who studied for 2.75 hours. If we knew
we were working with a linear model—a line that seemed to fit the data well—we would feel confident
about using the equation of the line to make such a prediction. We are looking for a line of best fit. We
want to find a regression line—a line that can be used for predicting response values from explanatory
values. In this situation, we would use the regression line to predict the exam score for a person who
studied 2.75 hours.
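To make the prediction step concrete, here is a minimal sketch in Python. The study-time and exam-score pairs below are made up for illustration only (the actual data accompany the scatterplot on page 94); the sketch simply fits a least-squares line to the points and evaluates it at x = 2.75 hours.

import numpy as np

# Hypothetical (hours studied, exam score) data, used only for illustration;
# the real data set is the one shown in the scatterplot referenced in the text.
hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
scores = np.array([58, 63, 67, 72, 74, 80, 84, 89])

# Fit the least-squares line y-hat = a + b*x.
b, a = np.polyfit(hours, scores, deg=1)   # polyfit returns [slope, intercept]

# Use the regression line to predict the exam score after 2.75 hours of study.
predicted = a + b * 2.75
print(f"y-hat = {a:.2f} + {b:.2f}x; predicted score at 2.75 hours: {predicted:.1f}")

However the coefficients are obtained, the point is the same: once the line is in hand, prediction is a single substitution of x = 2.75 into ŷ = a + bx.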
The line we are looking for is called the least-squares regression line. We could draw a variety of
lines on our scatterplot trying to determine which has the best fit. Let ŷ be the predicted value of y for a
given value of x. Then y – ŷ represents the error in prediction. We want our line to minimize errors in
prediction, so we might first think that Σ(y – ŷ) would be a good measure (y – ŷ is the actual value minus
the predicted value). However, because our line is going to average out the errors in some fashion, we
find that Σ(y – ŷ) = 0. To get around this problem, we use Σ(y – ŷ)². This expression will vary with
different lines and is sensitive to the fit of the line. That is, Σ(y – ŷ)² is small when the linear fit is good
and large when it is not.
The least-squares regression line (LSRL) is the line that minimizes the sum of squared errors. If ŷ =
a + bx is the LSRL, then ŷ minimizes Σ(y – ŷ)².
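To see why Σ(y – ŷ) fails as a criterion while Σ(y – ŷ)² succeeds, the short Python sketch below (again using made-up data) compares the least-squares line to an arbitrary competing line: the residuals of the fitted line sum to essentially zero, yet its sum of squared errors is the smaller of the two.

import numpy as np

# Illustrative data only (same hypothetical points as before).
hours  = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
scores = np.array([58, 63, 67, 72, 74, 80, 84, 89])

b, a = np.polyfit(hours, scores, deg=1)        # least-squares slope and intercept

def errors(intercept, slope):
    resid = scores - (intercept + slope * hours)   # y - y-hat for each point
    return resid.sum(), (resid ** 2).sum()

print(errors(a, b))      # LSRL: residuals sum to ~0, smallest possible sum of squares
print(errors(50, 12))    # arbitrary line: residuals need not cancel, larger sum of squares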
Digression for calculus students only: It should be clear that finding the a and b for the line ŷ = a + bx
that minimize Σ(y – ŷ)² is a typical calculus problem. The difference is that the sum of squared errors is a
function of two variables (a and b), so it requires multivariable calculus to derive the minimizing values.
That is, you need to be beyond first-year calculus to derive the results that follow.
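For readers who have seen partial derivatives, a brief sketch of that multivariable-calculus argument: treat the sum of squared errors as a function of a and b, set both partial derivatives equal to zero, and solve the resulting pair of linear equations. With x̄ and ȳ denoting the sample means, the standard result is
\[
S(a,b) = \sum_i \left(y_i - a - b x_i\right)^2,
\qquad
\frac{\partial S}{\partial a} = -2\sum_i \left(y_i - a - b x_i\right) = 0,
\qquad
\frac{\partial S}{\partial b} = -2\sum_i x_i\left(y_i - a - b x_i\right) = 0,
\]
and solving this pair of equations gives
\[
b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}.
\]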