3.5.5 What should “tests” do?
The previous discussion has attempted to be clear about why the “probabilities”
of the usual hypothesis testing procedures should not be conflated with the
“probability that the hypothesis is true.”
What, then, is the “heart of the problem”? One argument, now associated with
Mayo (1996), is that hypothesis tests should be used to put propositions to “severe”
tests. The purpose of the probabilities for the non-Bayesian is to ascertain, as much
as one can, how reliable specific procedures are at detecting errors in one’s beliefs.
What is a severe test? In C.S. Peirce’s words:
[After posing a question or theory], the next business in order is to commence
deducing from it whatever experimental predictions are extremest and most
unlikely ... in order to subject them to the test of experiment. The process of
testing it will consist, not in examining the facts, in order to see how well they
accord with the hypothesis, but on the contrary in examining such of the prob-
able consequences of the hypothesis as would be capable of direct verification,
especially those consequences which would be very unlikely or surprising in
case the hypothesis were not true. When the hypothesis has sustained a testing
as severe as the present state of our knowledge ... renders imperative, it will
be admitted provisionally ... subject of course to reconsideration. (Peirce, 1958,
7.182 and 7.231, as cited in Mayo, 1996)
Perhaps no better account can be given than Peirce’s quotation. A nice quick gloss
of a slightly more formal version of this idea is given in Mayo (2003):
Hypothesis H passes a severe test T with x if:
(i) x agrees or “fits” H (for a suitable notion of fit).
(ii) with very high probability, test T would have produced a result that fits H
less well than x, if H were false or incorrect.
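Condition (ii) is the “severity” requirement. As a rough sketch (an illustrative paraphrase, not Mayo’s own formula), with d(X) an abstract test statistic and “fit” left informal, it can be written as a single probability statement:

```latex
% Sketch of condition (ii); d(X) is an abstract test statistic, ``fit'' is left
% informal, and the semicolon is read ``computed under the supposition that''.
% The notation is an illustrative paraphrase, not Mayo's own formula.
\Pr\bigl(\, d(X) \text{ fits } H \text{ less well than the observed } x \;;\; H \text{ false} \,\bigr) \;\approx\; 1
```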
Mayo (1996) gives a nice example of why error probabilities of themselves are not
enough, and why specification of an “appropriate” test statistic is a key ingredient.
Mayo’s example involves testing whether the probability of heads is 0.35 (H_0)
against the alternative that it is 0.10 (H_1). It is an “artificial” example, but it does not
suffer the defect of the previous example – namely, that the test is not the best in
its class.
Suppose it is agreed that four coins will be tossed and that the most powerful
test of size 0.1935 will be chosen. The following table shows the likelihood of
observing various outcomes in advance of the experiment:
# Heads          0        1        2        3        4
P(· | H_0)    0.1785   0.3845   0.3105   0.1115   0.0150
P(· | H_1)    0.6561   0.2916   0.0486   0.0036   0.0001
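These entries are simply binomial probabilities. The following sketch (my own, not from the text) reproduces them for four tosses under H_0: p = 0.35 and H_1: p = 0.10 using only the standard library; the rejection-region calculation at the end is an assumption, included only because its size happens to equal the 0.1935 quoted above.

```python
# Reproduce the table: probability of each number of heads in four tosses
# under H0: P(heads) = 0.35 and H1: P(heads) = 0.10.
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n independent tosses with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 4
p_h0, p_h1 = 0.35, 0.10

print("# Heads    " + "".join(f"{k:>8d}" for k in range(n + 1)))
print("P(.|H0)    " + "".join(f"{binom_pmf(k, n, p_h0):8.4f}" for k in range(n + 1)))
print("P(.|H1)    " + "".join(f"{binom_pmf(k, n, p_h1):8.4f}" for k in range(n + 1)))

# For reference only (an assumption, not necessarily Mayo's intended test): a
# rejection region of {0 heads, 4 heads} has size P(0|H0) + P(4|H0), which is
# 0.1785 + 0.0150 = 0.1935 under H0, matching the size quoted in the text.
size = binom_pmf(0, n, p_h0) + binom_pmf(4, n, p_h0)
print(f"Size of reject-on-{{0,4}} region under H0: {size:.4f}")
```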