A SIGNIFICANT F IN AN ANALYSIS OF VARIANCE is simply an indication that not all the popu-
lation means are equal. It does not tell us which means are different from which other
means. As a result, the overall analysis of variance often raises more questions than it
answers. We now face the problem of examining differences among individual means, or
sets of means, for the purpose of isolating significant differences or testing specific hypothe-
ses. We want to be able to make statements of the form , and , but
the first three means are different from the last two, and all of them are different from.
Many different techniques for making comparisons among means are available, and
the list grows each year. Here we will consider the most common and useful ones. A thor-
ough discussion of this topic can be found in Miller (1981), Hochberg and Tamhane (1987),
and Toothaker (1991). Keselman, Holland, and Cribbie (2005) offer a
review of some of the newer methods. The papers by Games (1978a, 1978b) are also help-
ful, as is the paper by Games and Howell (1976) on the treatment of unequal sample sizes.
It may be helpful to the reader to understand how this chapter has changed through vari-
ous editions. The changes largely reflect the way people look at experimental results. Origi-
nally this chapter covered a few of the most common test procedures and left it at that. Then
as time went on I kept adding to the number of procedures and focused at length on ways to
make many individual comparisons among means. But in this edition I am deliberately go-
ing in the other direction. I am emphasizing the fact that we should direct our attention to
those differences we really care about and not fill our results section with all of the other
differences that we can test but don’t actually care about. This philosophy carries over to
calculating effect sizes and selecting appropriate error terms. Taking a standard multiple
comparison test such as Tukey’s (which is an excellent test for the purpose for which it was
designed) and then testing every conceivable pairwise null hypothesis is a very poor idea. It
wastes power, it often leads to the use of inappropriate error terms, it gives poor measures of
effect size, and it generally confuses what is often a clear and simple set of results. The fact
that you are able to do something is rarely a sufficient reason for actually doing it.
12.1 Error Rates
The major issue in any discussion of multiple-comparison procedures is the question of
the probability of Type I errors. Most differences among alternative techniques result from
different approaches to the question of how to control these errors. The problem is in part
technical, but it is really much more a subjective question of how you want to define the
error rate and how large you are willing to let the maximum possible error rate be.
Here we will distinguish two basic ways of specifying error rates, or the probability of
Type I errors.^1 (Later we will discuss an alternative view of error rates called the False Dis-
covery Rate, which has received a lot of attention in the last few years.) In doing so, we shall
use the terminology that has become more or less standard since an extremely important un-
published paper by Tukey in 1953. (See also Ryan, 1959; O’Neil & Wetherill, 1971.)
Error Rate per Comparison (PC)
We have used the error rate per comparison (PC) in the past and it requires little elabo-
ration. It is the probability of making a Type I error on any given comparison. If, for
^1 There is another error rate called the error rate per experiment (PE), which is the expected number of Type I
errors in a set of comparisons. The error rate per experiment is not a probability, and we typically do not attempt
to control it directly. We can easily calculate it, however, as PE = cα, where c is the number of comparisons and
α is the per comparison error rate.
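To make the relation PE = cα concrete, the following minimal sketch (not from the text; the values c = 10, α = .05, n = 20, and all variable names are illustrative assumptions) simulates many sets of c independent two-sample t tests in which the null hypothesis is true for every comparison. The observed rejection rate per comparison approximates PC = α, and the average number of rejections per set approximates PE = cα = 0.50.

```python
# A minimal sketch, not from the text: every comparison below is between two
# samples drawn from identical populations, so any rejection is a Type I error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
alpha = 0.05          # per comparison error rate (PC)
c = 10                # comparisons per "experiment" (assumed for illustration)
n = 20                # observations per group
n_experiments = 5000  # number of simulated experiments

errors_per_experiment = []
for _ in range(n_experiments):
    errors = 0
    for _ in range(c):
        x = rng.normal(loc=0, scale=1, size=n)
        y = rng.normal(loc=0, scale=1, size=n)
        _, p = stats.ttest_ind(x, y)   # H0 is true, so p < alpha is a Type I error
        errors += (p < alpha)
    errors_per_experiment.append(errors)

print("Observed PC:", np.mean(errors_per_experiment) / c)  # close to alpha = .05
print("Observed PE:", np.mean(errors_per_experiment))      # close to c * alpha = 0.50
```

With 100 such comparisons at α = .05 the expected number of Type I errors rises to 5, which is the compounding problem that the multiple-comparison procedures discussed in this chapter are designed to control.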