characteristics. Such an explanation will usually sug-
gest appropriate further analysis and interpretation.
In the absence of an explanation, heterogeneity of
treatment effect, as evidenced, for example, by
marked quantitative interactions, implies that alter-
native estimates of the treatment effect, giving dif-
ferent weights to the centers, may be needed to
substantiate the robustness of the estimates of treat-
ment effect. It is even more important to understand
the basis of any heterogeneity characterized by
marked qualitative interactions, and failure to find
an explanation may necessitate further clinical trials
before the treatment effect can be reliably predicted.
(ICH, E9, 3.2)
Multiplicity
Clinical trials always include multiple end points
and/or multiple comparisons between treatments.
For example, in a clinical trial of a new drug for
asthma, one may want to analyze the change in the
Forced Expiratory Volume in 1 s (FEV1) as well as
the change in the total asthma symptoms score, the
subject’s morning and evening symptom severity
scores, the investigator’s global improvement
score and perhaps other end points. In a dose–
response trial with placebo, low dose, intermediate
dose and high dose, the investigator may want to
compare the three dose groups to the control and
perhaps the different dose groups with each other.
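As an illustration of how quickly comparisons accumulate in such a design, consider the following minimal Python sketch; the particular set of comparisons listed is an assumption for illustration only, since a real analysis plan would specify the comparisons of interest explicitly.

```python
from itertools import combinations

arms = ["placebo", "low dose", "intermediate dose", "high dose"]

# Each active dose compared with placebo: 3 comparisons.
vs_control = [(dose, "placebo") for dose in arms[1:]]

# Each pair of active doses compared with each other: 3 more comparisons.
between_doses = list(combinations(arms[1:], 2))

print(len(vs_control) + len(between_doses))  # 6 treatment comparisons in total
```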
The issue of multiplicity is that when perform-
ing multiple statistical tests, the error probability
associated with the inferences made is inflated. To
see this, let us consider a simple situation where
one is interested in performing two statistical tests
on independent sets of data, each at a significance
level of 0.05. Thus, the probability that each of the
two tests will be declared significant erroneously
(type I error) is 0.05. However, the probability that
at least one of the two tests will be declared
significant erroneously is 0.0975. The probability
that at least one of the tests of interest will be
declared significant erroneously is called the
experiment-wise error rate. If we perform three
0.05 level tests, the experiment-wise error rate
increases to 0.143. In practical terms, this means
that if we perform multiple tests and make multi-
ple inferences, each one at a reasonably low error
probability, the likelihood that some of these
inferences will be erroneous could be appreciable.
To correct for this, one must conduct each individual test at a decreased significance level, with the result that either the power of the tests will be reduced or the sample size must be increased to accommodate the desired power. This
could make the trial prohibitively expensive. Sta-
tisticians sometimes refer to the need to adjust the
significance level so that the experiment-wise
error rate is controlled, as the statistical penalty
for multiplicity.
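To make the arithmetic above concrete, here is a minimal Python sketch of these calculations. The Bonferroni correction shown at the end is one common (and conservative) example of such an adjustment, offered purely as an illustration rather than as the only way of paying the statistical penalty.

```python
# Experiment-wise (familywise) error rate when k independent tests are
# each performed at significance level alpha:
#   P(at least one erroneous significant result) = 1 - (1 - alpha)**k
def experimentwise_error_rate(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** k

print(experimentwise_error_rate(0.05, 2))  # 0.0975
print(experimentwise_error_rate(0.05, 3))  # 0.142625, i.e. about 0.143

# One common (conservative) adjustment is the Bonferroni correction:
# perform each individual test at level alpha / k, which keeps the
# experiment-wise error rate at or below the nominal alpha.
adjusted_level = 0.05 / 3
print(adjusted_level)                                # about 0.0167 per test
print(experimentwise_error_rate(adjusted_level, 3))  # about 0.0492, below 0.05
```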
The need to control the experiment-wise error
rate may not apply to exploratory analyses. Statis-
ticians often perform formal statistical tests for
exploratory purposes. In such cases, no formal hypotheses are stated and no inferences are made based on them. Even though formally performing an exploratory test involves the same steps as inferential testing, it is conceptually different because of the absence of a null hypothesis. The p-value obtained in such a test should be viewed as a measure of the inconsistency of the data with the underlying assumptions of the test, rather than as an error probability involved in making a causal inference.
In summary, one should limit the number of
inferential tests to be performed to the minimum
necessary for making the desired causal inferences.
These tests must be specified in the study protocol and
the appropriate adjustments to the error probabil-
ities must be made. Similarly, one should remem-
ber that when multiple tests are performed without
adjustment, as would be the case in an exploratory testing situation, one should expect to see spurious
statistically significant results that may or may not
be meaningful. This last comment applies particu-
larly to statistical tests performed on adverse events
and laboratory data. Adverse events reported in a
study are often summarized by reporting their incidences by body system. Often, dozens
of categories are listed. When formal statistical
tests are applied to these data, some of these tests
will result in p-values less than the customary 0.05. The researcher should be cognizant of this issue and not jump to conclusions.
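As a rough, hypothetical illustration of the scale of the problem, the following Python sketch assumes 40 adverse event categories and independent tests at the 0.05 level; both the number of categories and the independence of the tests are illustrative assumptions, not values from the text.

```python
# Hypothetical illustration: many adverse event categories, each tested
# at the customary 0.05 level, when there is truly no treatment difference.
alpha = 0.05
n_categories = 40  # illustrative number of adverse event categories

# Expected number of spurious "significant" findings.
expected_false_positives = alpha * n_categories

# Probability of at least one spurious finding (treating tests as independent).
prob_at_least_one = 1 - (1 - alpha) ** n_categories

print(expected_false_positives)       # 2.0
print(round(prob_at_least_one, 2))    # about 0.87
```

Under these assumptions, roughly two spurious "significant" findings are expected even when there is no true treatment difference at all.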
It is strongly advisable to specify in advance the particular safety tests to be performed inferentially if there