Because we have a two-tailed test (t_d was taken from two-tailed tables), the sign of the
difference is irrelevant. The last two differences exceed 7.24 and are therefore declared
to be significant.
In the case in which the groups have unequal sample sizes or heterogeneous variances,
a test on the difference in treatment means is given by the same general procedure we used
with the Tukey test. Of the tests that we have discussed, when you want to compare one group
against each of the other groups I would recommend Dunnett's test.
Benjamini-Hochberg Test
Each of the post hoc tests that we have been discussing has focused on controlling the fam-
ilywise error rate (FWE), and several of them have been sequential tests, which change the
critical value as you move through a series of comparisons. Benjamini and Hochberg
(1995, 2000) have developed tests that are becoming more popular, are sequential, and are
not based on the FWE. They advocate using what they call the False Discovery Rate
(FDR) instead of the familywise error rate. When Tukey began advocating FWE in the
early 1950s he, perhaps unintentionally, oriented our thinking almost exclusively toward
controlling the probability of even one Type I error. When you compute a familywise rate,
you are dealing with the probability of one or more Type I errors. In effect you are saying
that your whole set of conclusions is erroneous when you make even one Type I error.
(Curiously we don’t consider our conclusions to be erroneous if we make Type II errors.)
Hochberg and Benjamini have looked at the problem somewhat differently and asked
“What percentage of the significant results (“discoveries”) that we have found are false dis-
coveries?” Suppose that we carry out nine comparisons (either simple contrasts, complex
contrasts, tests on a single mean, or any other test). We find that there are four significant
effects but, unknown to us, one of those significant effects is really a Type I error. The FDR
is then defined as

    FDR = Number of false rejections / Number of total rejections = 1/4 = .25
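As a quick check on the arithmetic, the definition can be sketched in a couple of lines of Python. The counts below are simply the hypothetical ones from the nine-comparison example just described, not data from any real study:

```python
# Hypothetical counts from the nine-comparison example:
# four significant results, one of which (unknown to us) is a Type I error.
false_rejections = 1
total_rejections = 4

# FDR = (number of false rejections) / (number of total rejections)
fdr = false_rejections / total_rejections
print(fdr)  # 0.25
```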
I will take an example of a simple “thought experiment” from Maxwell and Delaney (2004),
who have an excellent discussion of the FDR. Imagine that we have a situation in which we
test 10 null hypotheses, three of which are known to be false and the others true. Suppose that
we mentally run our experiment 100 times, testing all 10 hypotheses for each run. Further
suppose that we have very considerable power to reject false null hypotheses, so that we
nearly always reject the three false null hypotheses. Finally assume that we have chosen a
critical value so as to set the experimentwise error rate at .20. (You probably think that .20 is
too high, but bear with me.) Then out of our 100 hypothetical experiments, 80 percent of the
time we will make no Type I errors and 20 percent of the time we will make one Type I error
(assuming that we don’t make two Type I errors in any experiment). Because we have a great
deal of power, we will almost always reject the three false null hypotheses. Here our FWE is
.20, which perhaps made you wince. But what about the FDR? Given the description above,
we will make no errors in 80 percent of the experiments. In the other 20 experiments we will
make one Type I error and three correct rejections, for an FDR of 1/4 = .25 for those 20 experiments
and an FDR of 0 for the other 80 experiments. Over the long haul of 100 experiments,
the average FDR will be .05, while the FWE will be .20. Thus the critical value that sets the
FWE at .20 leaves the FDR at only .05. The problem is how we choose that critical
value. Unfortunately, that choice is quite complicated in the general case, but fortunately
it is fairly simple in the case of either independent contrasts or pairwise contrasts. See
Keselman, Cribbie, and Holland (1999). In this chapter I have been a strong advocate of pair-
wise contrasts, so restricting ourselves to that case is not particularly onerous.
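The Maxwell and Delaney thought experiment can be checked with a short Monte Carlo sketch. Nothing below comes from a particular dataset: the per-test critical value is simply solved from the requirement that the FWE over the seven true null hypotheses equal .20, and, matching the assumptions in the text, power against the three false nulls is taken to be essentially 1.0 and each test is treated as independent:

```python
import random

random.seed(1)

N_TRUE, N_FALSE = 7, 3     # 7 true null hypotheses, 3 false ones
FWE_TARGET = 0.20

# Per-test alpha chosen so that P(at least one Type I error
# across the 7 true nulls) = .20, i.e. 1 - (1 - alpha)**7 = .20.
alpha = 1 - (1 - FWE_TARGET) ** (1 / N_TRUE)

n_experiments = 100_000
fwe_count = 0     # experiments with at least one Type I error
fdr_sum = 0.0     # running sum of per-experiment FDR

for _ in range(n_experiments):
    # Power is assumed to be ~1, so all 3 false nulls are rejected.
    # Each true null is falsely rejected with probability alpha.
    false_rejections = sum(random.random() < alpha for _ in range(N_TRUE))
    if false_rejections > 0:
        fwe_count += 1
    fdr_sum += false_rejections / (false_rejections + N_FALSE)

print(f"alpha per test = {alpha:.4f}")                       # about .031
print(f"estimated FWE  = {fwe_count / n_experiments:.3f}")   # near .20
print(f"estimated FDR  = {fdr_sum / n_experiments:.3f}")     # near .05
```

The simulation reproduces the point of the example: the same critical value that yields a familywise error rate of .20 yields a false discovery rate of only about .05.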
396 Chapter 12 Multiple Comparisons Among Treatment Means