Inner fences and adjacent values can cause some confusion. Think of a herd of cows
scattered around a field. (I spent most of my life in Vermont, so cows seem like a natural
example.) The fence around the field represents the inner fence of the boxplot. The cows
closest to but still inside the fence are the adjacent values. Don’t worry about the cows that
have escaped outside the fence and are wandering around on the road. They are not
involved in the calculations at this point. (They will be the outliers.)
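The fence-and-cow picture can be made concrete with a short computation. What follows is a minimal Python sketch, not the procedure of any particular statistics program. It assumes the common convention that the hinges are the medians of the lower and upper halves of the ordered data and that the inner fences lie 1.5 times the H-spread beyond the hinges. The data values and the function names (hinges, inner_fences, adjacent_values) are made up for illustration.

```python
from statistics import median


def hinges(values):
    """Return (lower hinge, upper hinge) as medians of the two halves.

    Assumption: when n is odd, the overall median belongs to both halves.
    Other programs may split the data differently.
    """
    xs = sorted(values)
    half = (len(xs) + 1) // 2
    return median(xs[:half]), median(xs[-half:])


def inner_fences(values, k=1.5):
    """Return (lower fence, upper fence): hinges -/+ k times the H-spread."""
    lower, upper = hinges(values)
    spread = upper - lower
    return lower - k * spread, upper + k * spread


def adjacent_values(values, k=1.5):
    """Return the most extreme data points still inside the fences."""
    lo_fence, hi_fence = inner_fences(values, k)
    inside = [x for x in values if lo_fence <= x <= hi_fence]
    return min(inside), max(inside)


# Hypothetical "cows": most are near the middle of the field, one has escaped.
cows = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 9, 20]

print("hinges:", hinges(cows))
print("inner fences:", inner_fences(cows))
print("adjacent values:", adjacent_values(cows))
```

Notice that the fences themselves need not be data points. They only mark the boundary, whereas the adjacent values are always actual observations.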
Now we are ready to draw the boxplot. First, we draw and label a scale that covers the
whole range of the obtained values. This has been done at the bottom of Table 2.8. We then
draw a rectangular box from Q1 to Q3, with a vertical line representing the location of the
median. Next we draw lines (whiskers) from the quartiles out to the adjacent values.
Finally we plot the locations of all points that are more extreme than the adjacent values.
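The drawing steps just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are hypothetical integers, the quartiles are computed as medians of the lower and upper halves, the fences use the common 1.5 multiplier, and the names boxplot_parts and render are my own. The picture is crude text output, with two columns per scale unit so that half-unit values land on a column.

```python
from statistics import median


def boxplot_parts(values, k=1.5):
    """Collect the pieces needed to draw a boxplot (assumed conventions)."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2
    q1, q3 = median(xs[:half]), median(xs[-half:])
    lo_fence = q1 - k * (q3 - q1)
    hi_fence = q3 + k * (q3 - q1)
    inside = [x for x in xs if lo_fence <= x <= hi_fence]
    return {
        "scale": (xs[0], xs[-1]),             # step 1: range of the scale
        "box": (q1, q3),                      # step 2: box from Q1 to Q3
        "median": median(xs),                 # step 2: line at the median
        "whiskers": (inside[0], inside[-1]),  # step 3: out to adjacent values
        "outliers": [x for x in xs if x < lo_fence or x > hi_fence],  # step 4
    }


def render(parts):
    """Return a two-line text picture: the boxplot and a labeled scale."""
    lo, hi = parts["scale"]

    def col(v):
        # Two columns per unit; exact for integer data and half-unit medians.
        return int(2 * (v - lo))

    row = [" "] * (col(hi) + 1)
    w_lo, w_hi = parts["whiskers"]
    for c in range(col(w_lo), col(w_hi) + 1):
        row[c] = "-"                          # whiskers
    q1, q3 = parts["box"]
    for c in range(col(q1), col(q3) + 1):
        row[c] = "#"                          # box drawn over the whiskers
    row[col(parts["median"])] = "|"           # median line
    for x in parts["outliers"]:
        row[col(x)] = "*"                     # points beyond the whiskers
    scale = str(lo).ljust(len(row) - len(str(hi))) + str(hi)
    return "".join(row) + "\n" + scale


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 9, 20]  # hypothetical values
parts = boxplot_parts(data)
picture = render(parts)
print(picture)
```

The box is drawn after the whiskers so that it overwrites them, and the outliers are plotted last, which mirrors the order of the steps in the text.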
From Table 2.8 we can see several important things. First, the central portion of the
distribution is reasonably symmetric. This is indicated by the fact that the median lies in the
center of the box, and it was also apparent from the stem-and-leaf display. We can also see
that the distribution as a whole is positively skewed, because the whisker on the right is
substantially longer than the one on the left. This also was apparent from the stem-and-leaf display, although
not so clearly. Finally, we see that we have four outliers, where an outlier is defined here as
any value more extreme than the whiskers (and therefore more extreme than the adjacent
values). The stem-and-leaf display did not show the position of the outliers nearly so
graphically as does the boxplot.
Outliers deserve special attention. An outlier could represent an error in measurement,
in data recording, or in data entry, or it could represent a legitimate value that just happens
to be extreme. For example, our data represent length of hospitalization, and a full-term
infant might have been born with a physical defect that required extended hospitalization.
Because these are actual data, it was possible to go back to hospital records and look more
closely at the four extreme cases. On examination, it turned out that the two most extreme
scores were attributable to errors in data entry and were readily correctable. The other two
extreme scores were caused by physical problems of the infants. Here a decision was
required by the project director as to whether the problems were sufficiently severe to cause
the infants to be dropped from the study (both were retained as subjects). The two corrected
values were 3 and 5 instead of 33 and 20, respectively, and a new boxplot for the corrected
data is shown in Figure 2.14. This boxplot is identical to the one shown in Table 2.8 except
for the spacing and the two largest values. (You should verify for yourself that the corrected
data set would indeed yield this boxplot.)
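A check of this kind is easy to automate. The sketch below is a hedged illustration only: the list recorded is hypothetical and is not the data of Table 2.8, although the two corrections (33 to 3 and 20 to 5) are the ones described above. It assumes hinges computed as medians of the halves and fences at 1.5 times the H-spread, and the function name outliers is my own.

```python
from statistics import median


def outliers(values, k=1.5):
    """Return the values lying beyond the inner fences (assumed convention)."""
    xs = sorted(values)
    half = (len(xs) + 1) // 2
    lower, upper = median(xs[:half]), median(xs[-half:])
    spread = upper - lower
    lo_fence, hi_fence = lower - k * spread, upper + k * spread
    return [x for x in xs if x < lo_fence or x > hi_fence]


# Hypothetical lengths of stay, including two data-entry errors (33 and 20).
recorded = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 6, 11, 12, 20, 33]

# Corrections taken from the text. In real work you would correct by case
# identifier rather than by value; matching on value is a simplification.
corrections = {33: 3, 20: 5}
corrected = [corrections.get(x, x) for x in recorded]

print("outliers before correction:", outliers(recorded))
print("outliers after correction:", outliers(corrected))
```

With these made-up values, four points are flagged before the correction and two afterward. In this small example the hinges shift slightly once the errors are fixed, so it is worth recomputing the whole boxplot rather than simply erasing the two points.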
From what has been said, it should be evident that boxplots are extremely useful
tools for examining data with respect to dispersion. I find them particularly useful for
screening data for errors and for highlighting potential problems before subsequent
analyses are carried out. Boxplots are presented often in the remainder of this book as
visual guides to the data.
A word of warning: Different statistical computer programs may vary in the ways they
define the various elements in boxplots. (See Frigge, Hoaglin, and Iglewicz [1989] for an
extensive discussion of this issue.) You may find two different programs that produce
slightly different boxplots for the same set of data. They may even identify different
50 Chapter 2 Describing and Exploring Data
Figure 2.14 Boxplot for corrected data from Table 2.8 (scale from 0 to 10; two outliers marked with asterisks)