Robert_V._Hogg,_Joseph_W._McKean,_Allen_T._Craig

4.4. Order Statistics 259

56 70 89 94 96 101 102 102
102 105 106 108 110 113 116
For these data, since $n + 1 = 16$, the realizations of the five-number summary are
$y_1 = 56$, $Q_1 = y_4 = 94$, $Q_2 = y_8 = 102$, $Q_3 = y_{12} = 108$, and $y_{15} = 116$. Hence,
based on the five-number summary, the data range from 56 to 116; the middle 50%
of the data range from 94 to 108; and the middle of the data occurred at 102. The
data are in the file eg4.4.4data.rda.
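
The summary above can be checked with a short sketch. It is written in Python rather than R, and it assumes the sample quantile is taken as the order statistic of rank $p(n + 1)$, which is how the ranks 4, 8, and 12 arise when $n + 1 = 16$. The function name five_number_summary is illustrative and does not come from the text.

```python
# Data from Example 4.4.4, already in increasing order (n = 15).
data = [56, 70, 89, 94, 96, 101, 102, 102,
        102, 105, 106, 108, 110, 113, 116]


def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the order statistic of rank p*(n+1).

    Assumption: n + 1 is divisible by 4, so every rank is a whole number.
    """
    y = sorted(values)
    n = len(y)
    if (n + 1) % 4 != 0:
        raise ValueError("this sketch needs n + 1 divisible by 4")
    k = (n + 1) // 4  # rank of Q1; Q2 and Q3 sit at ranks 2k and 3k
    # Ranks are 1-based in the text, so subtract 1 for Python indexing.
    return (y[0], y[k - 1], y[2 * k - 1], y[3 * k - 1], y[-1])


summary = five_number_summary(data)
print(summary)  # (56, 94, 102, 108, 116)
```

For sample sizes where $n + 1$ is not a multiple of 4, a rounding or interpolation rule would be needed; the sketch deliberately refuses those cases instead of guessing one.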


The five-number summary is the basis for a useful and quick plot of the data.
This is called a boxplot of the data. The box encloses the middle 50% of the
data and a line segment is usually used to indicate the median. The extreme order
statistics, however, are very sensitive to outlying points. So care must be used in
placing these on the plot. We make use of the box and whisker plots defined by
John Tukey. In order to define this plot, we need to define a potential outlier. Let
$h = 1.5(Q_3 - Q_1)$ and define the lower fence (LF) and the upper fence (UF) by

$$LF = Q_1 - h \quad \text{and} \quad UF = Q_3 + h. \tag{4.4.6}$$
Points that lie outside the fences, i.e., outside the interval $(LF, UF)$, are called
potential outliers and they are denoted by the symbol “O” on the boxplot. The
whiskers then protrude from the sides of the box to what are called the adjacent
points, which are the points within the fences but closest to the fences. Exercise
4.4.2 shows that the probability of an observation from a normal distribution being
a potential outlier is 0.006977.
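
As a numerical check of the probability just quoted, the following Python sketch applies the fence definitions to the population quartiles of a standard normal distribution. Working with the standard normal is an assumption made for convenience; because the quartiles and $h$ shift and scale with the distribution, any normal distribution should give the same probability. The variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal, mean 0 and standard deviation 1

# Population quartiles of the standard normal.
q1 = z.inv_cdf(0.25)
q3 = z.inv_cdf(0.75)

# Fences as in (4.4.6), with h = 1.5 * (Q3 - Q1).
h = 1.5 * (q3 - q1)
lf = q1 - h
uf = q3 + h

# Probability that a single observation falls outside (LF, UF).
p_outlier = z.cdf(lf) + (1.0 - z.cdf(uf))
print(round(p_outlier, 6))
```

The printed value can be compared with the 0.006977 stated above.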
Example 4.4.5 (Example 4.4.4, Continued). Consider the data given in Example
4.4.4. For these data, $h = 1.5(108 - 94) = 21$, $LF = 73$, and $UF = 129$. Hence the
observations 56 and 70 are potential outliers. There are no outliers on the high side
of the data. The lower adjacent point is 89. The boxplot of the data set is given in
Panel A of Figure 4.4.1, which was computed by the R segment boxplot(x), where
the R vector x contains the data.
Note that the point 56 lies 38 units below $Q_1$, just short of $2h = 42$. Some
statisticians call a point more than $2h$ beyond a quartile an “outlier” and label it
with a symbol other than “O,” but we do not make this distinction.
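
The fence calculations of Example 4.4.5 can be reproduced with a few lines of Python. This is a sketch only: the quartiles are typed in from the five-number summary of Example 4.4.4 rather than recomputed, and the variable names are illustrative.

```python
# Data from Example 4.4.4.
data = [56, 70, 89, 94, 96, 101, 102, 102,
        102, 105, 106, 108, 110, 113, 116]

q1, q3 = 94, 108  # quartiles taken from the five-number summary

h = 1.5 * (q3 - q1)
lf = q1 - h  # lower fence
uf = q3 + h  # upper fence

# Points strictly inside the open interval (LF, UF) are "within the fences";
# everything else is flagged as a potential outlier.
inside = [x for x in data if lf < x < uf]
outliers = [x for x in data if not (lf < x < uf)]

# Adjacent points: the points within the fences that are closest to them.
lower_adjacent = min(inside)
upper_adjacent = max(inside)

print(h, lf, uf)                       # 21.0 73.0 129.0
print(outliers)                        # [56, 70]
print(lower_adjacent, upper_adjacent)  # 89 116
```

The whiskers of the boxplot would therefore extend from the box down to 89 and up to 116, with 56 and 70 plotted individually.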


In practice, we often assume that the data follow a certain distribution. For
example, we may assume that $X_1, \ldots, X_n$ are a random sample from a normal
distribution with unknown mean and variance. Thus the form of the distribution
of $X$ is known, but the specific parameters are not. Such an assumption needs to
be checked and there are many statistical tests which do so; see D’Agostino and
Stephens (1986) for a thorough discussion of such tests. As our second statistical
application of quantiles, we discuss one such diagnostic plot in this regard.
We consider the location and scale family. Suppose $X$ is a random variable
with cdf $F((x - a)/b)$, where $F(x)$ is known but $a$ and $b > 0$ may not be. Let
$Z = (X - a)/b$; then $Z$ has cdf $F(z)$. Let $0 < p < 1$ and let $\xi_{X,p}$ be the $p$th quantile
of $X$. Let $\xi_{Z,p}$ be the $p$th quantile of $Z = (X - a)/b$. Because $F(z)$ is known, $\xi_{Z,p}$
is known. But
$$p = P[X \le \xi_{X,p}] = P\left[ Z \le \frac{\xi_{X,p} - a}{b} \right],$$