In this equation, $SS_Y$, which you know to be equal to $\sum (Y - \bar{Y})^2$, is the sum
of squares of Y and represents the total of
- The part of the sum of squares of Y that is related to X [i.e., $SS_Y(r^2)$]
- The part of the sum of squares of Y that is independent of X [i.e., $SS_{residual}$]
In the context of our example, we are talking about that part of the number of symptoms
people exhibited that is related to how many stressful life events they had experienced, and
that part that is related to other things. The quantity $SS_{residual}$ is the sum of squares
of Y that is independent of X and is a measure of the amount of error remaining even after
we use X to predict Y. These concepts can be made clearer with a second example.
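Before turning to that example, the partition can be checked with round numbers. The values below are invented purely for illustration and do not come from the stress and symptoms data: assume $SS_Y = 200$ and $r = .50$.

```latex
\begin{align*}
SS_Y(r^2)      &= 200(.50^2) = 200(.25) = 50 \\
SS_{residual}  &= SS_Y(1 - r^2) = 200(.75) = 150 \\
SS_Y           &= SS_Y(r^2) + SS_{residual} = 50 + 150 = 200
\end{align*}
```

With these assumed values, one quarter of the sum of squares of Y is related to X and the remaining three quarters is error left over after prediction.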
Suppose we were interested in studying the relationship between amount of cigarette
smoking (X) and age at death (Y). As we watch people die over time, we notice several
things. First, we see that not all die at precisely the same age. There is variability in age at
death regardless of smoking behavior, and this variability is measured by
$SS_Y = \sum (Y - \bar{Y})^2$. We also notice that some people smoke more than others. This
variability in smoking regardless of age at death is measured by
$SS_X = \sum (X - \bar{X})^2$. We further find that cigarette smokers tend to die earlier
than nonsmokers, and heavy smokers earlier than light smokers. Thus, we write a regression
equation to predict Y from X. Since people differ in their smoking behavior, they will also
differ in their predicted life expectancy ($\hat{Y}$), and we will label this variability
$SS_{\hat{Y}} = \sum (\hat{Y} - \bar{Y})^2$. This last measure is variability in Y that is
directly attributable to variability in X, since different values of $\hat{Y}$ arise from
different values of X and the same values of $\hat{Y}$ arise from the same value of X; that
is, $\hat{Y}$ does not vary unless X varies.
We have one last source of variability: the variability in the life expectancy of those
people who smoke exactly the same amount. This is measured by $SS_{residual}$ and is the
variability in Y that cannot be explained by the variability in X (since these people do not
differ in the amount they smoke). These several sources of variability (sums of squares) are
summarized in Table 9.5.
If we considered the absurd extreme in which all of the nonsmokers die at exactly age
72 and all of the smokers smoke precisely the same amount and die at exactly age 68, then
all of the variability in life expectancy is directly predictable from variability in smoking
behavior. If you smoke you will die at 68, and if you don’t you will die at 72. Here
$SS_{\hat{Y}} = SS_Y$, and $SS_{residual} = 0$.
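The extreme case just described can be checked numerically. The following is a minimal sketch, not part of the original text. The sample of four nonsmokers and four smokers and the figure of 20 cigarettes per day are assumptions made up for illustration. It fits the least-squares line and then computes the sum of squares of Y, the sum of squares of the predicted values, and the residual sum of squares.

```python
from math import fsum

# Hypothetical sample (not from the text): four nonsmokers and four smokers.
# X = cigarettes per day; 20 is an arbitrary stand-in for "the same amount".
cigarettes = [0, 0, 0, 0, 20, 20, 20, 20]
age_at_death = [72, 72, 72, 72, 68, 68, 68, 68]

n = len(cigarettes)
mean_x = fsum(cigarettes) / n
mean_y = fsum(age_at_death) / n

# Least-squares slope and intercept for predicting Y from X.
ss_x = fsum((x - mean_x) ** 2 for x in cigarettes)
sp_xy = fsum((x - mean_x) * (y - mean_y)
             for x, y in zip(cigarettes, age_at_death))
slope = sp_xy / ss_x
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in cigarettes]

# Total, predicted, and residual sums of squares.
ss_y = fsum((y - mean_y) ** 2 for y in age_at_death)
ss_y_hat = fsum((p - mean_y) ** 2 for p in predicted)
ss_residual = fsum((y - p) ** 2 for y, p in zip(age_at_death, predicted))

print(f"SS_Y        = {ss_y:.2f}")
print(f"SS_Y-hat    = {ss_y_hat:.2f}")
print(f"SS_residual = {ss_residual:.2f}")
```

With these invented numbers the total and predicted sums of squares both come to 32 and the residual sum of squares to 0 (up to floating-point rounding), which is the pattern the paragraph describes.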
As a more realistic example, assume smokers tend to die earlier than nonsmokers, but
within each group there is a certain amount of variability in life expectancy. This is a
situation in which some of $SS_Y$ is attributable to smoking ($SS_{\hat{Y}}$) and some is
not ($SS_{residual}$).
What we want to be able to do is to specify what percentage of the overall variability in
262 Chapter 9 Correlation and Regression
Table 9.5 Sources of variance in regression for the study of smoking and life
expectancy
$SS_X$ = variability in amount smoked = $\sum (X - \bar{X})^2$
$SS_Y$ = variability in life expectancy = $\sum (Y - \bar{Y})^2$
$SS_{\hat{Y}}$ = variability in life expectancy directly attributable to variability in
smoking behavior = $\sum (\hat{Y} - \bar{Y})^2$
$SS_{residual}$ = variability in life expectancy that cannot be attributed to variability in
smoking behavior = $\sum (Y - \hat{Y})^2 = SS_Y - SS_{\hat{Y}}$
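The four quantities in Table 9.5 can be computed directly from a set of paired scores. The sketch below is illustrative only. The function name `regression_sums_of_squares` and the ten data points are hypothetical and do not come from the text. It also reports the ratio of the predicted sum of squares to the sum of squares of Y, which is one way to express the share of variability in life expectancy that goes with smoking.

```python
from math import fsum


def regression_sums_of_squares(xs, ys):
    """Return the four sums of squares of Table 9.5 for predicting ys from xs."""
    n = len(xs)
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n

    ss_x = fsum((x - mean_x) ** 2 for x in xs)
    ss_y = fsum((y - mean_y) ** 2 for y in ys)
    sp_xy = fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # Least-squares line used to obtain the predicted values.
    slope = sp_xy / ss_x
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]

    ss_y_hat = fsum((p - mean_y) ** 2 for p in predicted)
    ss_residual = fsum((y - p) ** 2 for y, p in zip(ys, predicted))
    return {"SS_X": ss_x, "SS_Y": ss_y,
            "SS_Y_hat": ss_y_hat, "SS_residual": ss_residual}


# Hypothetical data: cigarettes per day and age at death for ten people.
cigarettes = [0, 0, 5, 10, 10, 20, 20, 30, 40, 40]
age_at_death = [78, 71, 75, 70, 74, 66, 72, 69, 61, 67]

ss = regression_sums_of_squares(cigarettes, age_at_death)
for name, value in ss.items():
    print(f"{name:12s} = {value:9.2f}")

proportion_explained = ss["SS_Y_hat"] / ss["SS_Y"]
print(f"SS_Y_hat / SS_Y = {proportion_explained:.3f}")
```

Because people who smoke the same amount die at different ages in these made-up data, the residual sum of squares is greater than zero, and the predicted and residual sums of squares add up to the sum of squares of Y.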