otherwise stated. (We do occasionally fit straight lines to curvilinear data, but we do so on
the assumption that the line will be sufficiently accurate for our purpose—although the
standard error of prediction might be poorly estimated. There are other forms of regression
besides linear regression, but we will not discuss them here.)
As mentioned earlier, whether or not we make various assumptions depends on what
we wish to do. If our purpose is simply to describe data, no assumptions are necessary. The
regression line and r best describe the data at hand, without the necessity of any assumptions
about the population from which the data were sampled.
If our purpose is to assess the degree to which variance in Y is linearly attributable to
variance in X, we again need make no assumptions. This is true because $s^2_Y$ and $s^2_{Y \cdot X}$ are
both unbiased estimators of their corresponding parameters, independent of any underlying
assumptions, and $(SS_Y - SS_{residual})/SS_Y$ is algebraically equivalent to $r^2$.
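The equivalence between the proportional reduction in the sum of squares and the squared correlation can be checked numerically. The following is a minimal sketch in Python; the data values and the helper name `r_squared_two_ways` are hypothetical and serve only as an illustration.

```python
def r_squared_two_ways(x, y):
    """Return (r squared, (SS_Y - SS_residual) / SS_Y) for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    # Squared Pearson correlation computed directly.
    r_sq = sp_xy ** 2 / (ss_x * ss_y)

    # Least-squares slope and intercept, then the residual sum of squares.
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    return r_sq, (ss_y - ss_residual) / ss_y


# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9, 7.4, 7.6]

r_sq, proportion = r_squared_two_ways(x, y)
print(r_sq, proportion)  # the two values agree up to rounding error
```

No distributional assumption enters this calculation; both quantities are simply descriptions of the sample at hand.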
If we want to set confidence limits on b or Y, or if we want to test hypotheses about $b^*$,
we will need to make the conditional assumptions of homogeneity of variance and normality
in arrays of Y. The assumption of homogeneity of variance is necessary to ensure that $s^2_{Y \cdot X}$ is
representative of the variance of each array, and the assumption of normality is necessary
because we use the standard normal distribution.
If we want to use r to test the hypothesis that $\rho = 0$, or if we wish to establish confidence
limits on $\rho$, we will have to assume that the (X, Y) pairs are a random sample from a
bivariate-normal distribution, but keep in mind that for many studies the significance of r
is not particularly an issue, nor do we often want to set confidence limits on $\rho$.
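One common way to obtain approximate confidence limits on $\rho$ under the bivariate-normal assumption is Fisher's r-to-z' transformation. The sketch below is an illustration of that approach, not necessarily the procedure intended at this point in the text; the function name `fisher_confidence_limits` and the values r = .50 and N = 50 are hypothetical.

```python
import math


def fisher_confidence_limits(r, n, z_crit=1.96):
    """Approximate confidence limits on rho via Fisher's transformation.

    Assumes the (X, Y) pairs are a random sample from a bivariate-normal
    population; z_crit = 1.96 corresponds to 95% limits.
    """
    z_prime = math.atanh(r)        # same as 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)    # standard error of z'
    lower = math.tanh(z_prime - z_crit * se)   # back-transform to the r scale
    upper = math.tanh(z_prime + z_crit * se)
    return lower, upper


# Hypothetical sample result, for illustration only.
lower, upper = fisher_confidence_limits(r=0.50, n=50)
print(lower, upper)
```

The limits are not symmetric about r, because the sampling distribution of r is skewed whenever $\rho$ is not zero.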
9.14 Factors That Affect the Correlation
The correlation coefficient can be substantially affected by characteristics of the sample.
Two such characteristics are the restriction of the range (or variance) of X and/or Y and the
use of heterogeneous subsamples.
The Effect of Range Restrictions
A common problem concerns restrictions on the range over which X and Y vary. The effect
of such range restrictions is to alter the correlation between X and Y from what it would
have been if the range had not been so restricted. Depending on the nature of the data, the
correlation may either rise or fall as a result of such restriction, although most commonly r
is reduced.
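A small simulation can illustrate the more common outcome, in which restriction lowers r. The sketch below is written under stated assumptions: the population correlation of .60, the cutoff of 0.5 on X, the sample size, and the helper name `pearson_r` are all hypothetical choices made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp_xy / math.sqrt(ss_x * ss_y)


rng = random.Random(42)   # fixed seed so the run is repeatable
true_r = 0.6              # hypothetical population correlation

# Generate linearly related (X, Y) pairs with unit variances.
x_all, y_all = [], []
for _ in range(5000):
    x = rng.gauss(0, 1)
    y = true_r * x + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
    x_all.append(x)
    y_all.append(y)

# Restrict the range: keep only cases with X above a hypothetical cutoff.
restricted = [(x, y) for x, y in zip(x_all, y_all) if x > 0.5]
x_res = [pair[0] for pair in restricted]
y_res = [pair[1] for pair in restricted]

r_full = pearson_r(x_all, y_all)
r_restricted = pearson_r(x_res, y_res)
print(r_full, r_restricted)  # the restricted r is the smaller of the two
```

Because the relationship here is linear throughout, restricting X removes variance in X without removing the scatter in Y, and the correlation falls.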
With the exception of very unusual circumstances, restricting the range of X will increase
r only when the restriction results in eliminating some curvilinear relationship. For
example, if we correlated reading ability with age, where age ran from 0 to 70 years, the
data would be decidedly curvilinear (flat to about age 4, rising to about 17 years of age and
then leveling off) and the correlation, which measures linear relationships, would be relatively
low. If, however, we restricted the range of ages to 5 to 17 years, the correlation
would be quite high, since we would have eliminated those values of Y that were not varying
linearly as a function of X.
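The reading-ability example can be sketched with simulated data. Everything numeric here is an assumption made for illustration: the 0-to-100 score scale, the noise level, the sample size, and the function names `reading_score` and `pearson_r` are hypothetical; only the shape (flat to about age 4, rising to about 17, then level) comes from the example above.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp_xy / math.sqrt(ss_x * ss_y)


def reading_score(age, rng):
    """Hypothetical reading score: flat, then rising, then level, plus noise."""
    if age < 4:
        level = 0.0
    elif age < 17:
        level = 100.0 * (age - 4) / 13
    else:
        level = 100.0
    return level + rng.gauss(0, 10)


rng = random.Random(7)    # fixed seed so the run is repeatable
ages = [rng.uniform(0, 70) for _ in range(2000)]
scores = [reading_score(age, rng) for age in ages]

# Full age range (0 to 70) versus the restricted range (5 to 17).
r_full = pearson_r(ages, scores)
kept = [(a, s) for a, s in zip(ages, scores) if 5 <= a <= 17]
r_restricted = pearson_r([a for a, _ in kept], [s for _, s in kept])
print(r_full, r_restricted)  # the restricted r is the larger of the two
```

In this case the restriction discards exactly the ages at which the score is not changing linearly with age, so the correlation rises rather than falls.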
The more usual effect of restricting the range of X or Y is to reduce the correlation. This problem
is especially pertinent in the area of test construction, since here criterion measures (Y) may
be available for only the higher values of X. Consider the hypothetical data in Figure 9.8. This