The Essentials of Biostatistics for Physicians, Nurses, and Clinicians


9.5 INSENSITIVITY OF RANK TESTS TO OUTLIERS


With univariate data, outliers are the extremely large or
extremely small observations. For bivariate data, it is less obvious what
should constitute an outlier, as there are many directions to consider.
Observations that are extreme in both dimensions will usually be outliers,
but not always. For example, if data are bivariate normal, the
contours of constant probability are ellipses whose major axis is along
the linear regression line. When the data are highly correlated, these
ellipses are elongated.
If a bivariate observation falls on or near the regression line, it is
a likely observation. If the correlation is positive and X and Y
are both large or both small, we may not want to consider such observations
to be outliers. The real outliers are the points that are far from
the center along the semi-minor axis. Another measure, called the influence
function, determines a different direction, namely the direction that
most strongly affects the estimate of a parameter. For the Pearson correlation,
the contours of constant influence are hyperbolae. So outliers with
respect to correlation are values that are far out on the hyperbolic
contours.
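
The point about elliptical contours can be checked numerically. The sketch
below is only an illustration under stated assumptions: it uses the squared
Mahalanobis distance, the standard quantity whose level sets are the
constant-probability ellipses of a bivariate normal distribution, and it
assumes a hypothetical population with zero means, unit standard deviations,
and correlation 0.9. The function name and the two example points are
invented for this illustration and do not come from the text.

```python
def mahalanobis_sq(x, y, mean_x=0.0, mean_y=0.0, sd_x=1.0, sd_y=1.0, rho=0.0):
    """Squared Mahalanobis distance of (x, y) for a bivariate normal.

    Points with equal values of this distance lie on the same
    constant-probability ellipse.
    """
    zx = (x - mean_x) / sd_x  # standardized X
    zy = (y - mean_y) / sd_y  # standardized Y
    return (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (1.0 - rho * rho)


rho = 0.9  # assumed (hypothetical) high positive correlation

# Extreme in both dimensions, but X and Y are both large together.
d_on = mahalanobis_sq(2.0, 2.0, rho=rho)

# Less extreme in each dimension, but X is large while Y is small.
d_off = mahalanobis_sq(1.5, -1.5, rho=rho)

print(f"(2.0, 2.0):  squared distance = {d_on:.2f}")
print(f"(1.5, -1.5): squared distance = {d_off:.2f}")
```

Under these assumptions the second point is much farther from the center
than the first, even though each of its coordinates is smaller in
magnitude.
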
We noted previously that outliers affect the mean and variance
estimates, and they can also affect the bivariate correlation. So
confidence intervals and hypothesis tests can be invalidated by
outliers. However, nonparametric procedures are designed to apply to
a wide variety of distributions, and so should not be sensitive to outliers.
Rank tests clearly are insensitive to outliers because a very large
value is only one rank higher than the next largest, and this does not
depend at all on the magnitude of the observations or how far apart
they are.
As an illustration, consider the following data set of 10 values,
whose ordered values are 16, 16.5, 16.5, 16.5, 17, 19.5, 21, 23, 24,
and 30. The largest value, 30, clearly appears to be an outlier.
The sample mean is 20, and half the range is 7, whereas the
number 30 is 10 units removed from the mean. The largest and
second largest observations are separated by six units, but in terms
of ranks, 24 has rank 9 and 30 has rank 10, a difference in rank that
is the same as between 23 and 24, which have ranks 8 and 9,
respectively.
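
The arithmetic in this illustration can be reproduced with a short script.
This is a minimal sketch, not a procedure given in the text: the helper
midranks is a hypothetical name, it assigns tied values the average of
their ranks (a common convention that the text does not specify), and the
value 300 is an invented replacement for 30 used only to exaggerate the
outlier.

```python
from statistics import mean


def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


data = [16, 16.5, 16.5, 16.5, 17, 19.5, 21, 23, 24, 30]
inflated = data[:-1] + [300]  # hypothetical: move the outlier much farther out

print("mean:", mean(data))
print("half the range:", (max(data) - min(data)) / 2)
print("ranks:", midranks(data))
print("mean with 300 in place of 30:", mean(inflated))
print("ranks unchanged:", midranks(inflated) == midranks(data))
```

Replacing 30 with 300 moves the sample mean from 20 to 47, while every
rank, including rank 10 for the largest value, stays the same.
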
