characteristics. Such an explanation will usually sug-
gest appropriate further analysis and interpretation.
In the absence of an explanation, heterogeneity of
treatment effect, as evidenced, for example, by
marked quantitative interactions, implies that alter-
native estimates of the treatment effect, giving dif-
ferent weights to the centers, may be needed to
substantiate the robustness of the estimates of treat-
ment effect. It is even more important to understand
the basis of any heterogeneity characterized by
marked qualitative interactions, and failure to find
an explanation may necessitate further clinical trials
before the treatment effect can be reliably predicted.
(ICH, E9, 3.2)
Multiplicity
Clinical trials always include multiple end points
and/or multiple comparisons between treatments.
For example, in a clinical trial of a new drug for
asthma, one may want to analyze the change in the
Forced Expiratory Volume in 1 s (FEV1) as well as
the change in the total asthma symptoms score, the
subject’s morning and evening symptom severity
scores, the investigator’s global improvement
score and perhaps other end points. In a dose–
response trial with placebo, low dose, intermediate
dose and high dose, the investigator may want to
compare the three dose groups to the control and
perhaps the different dose groups with each other.
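As an illustration of how quickly comparisons accumulate in such a design, consider the following minimal Python sketch; the particular set of comparisons listed is an assumption for illustration only, since a real analysis plan would specify the comparisons of interest explicitly.

```python
from itertools import combinations

arms = ["placebo", "low dose", "intermediate dose", "high dose"]

# Each active dose compared with placebo: 3 comparisons.
vs_control = [(dose, "placebo") for dose in arms[1:]]

# Each pair of active doses compared with each other: 3 more comparisons.
between_doses = list(combinations(arms[1:], 2))

print(len(vs_control) + len(between_doses))  # 6 treatment comparisons in total
```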
The issue of multiplicity is that when perform-
ing multiple statistical tests, the error probability
associated with the inferences made is inflated. To
see this, let us consider a simple situation where
one is interested in performing two statistical tests
on independent sets of data, each at a significance
level of 0.05. Thus, the probability that each of the
two tests will be declared significant erroneously
(type I error) is 0.05. However, the probability that
at least one of the two tests will be declared
significant erroneously is 0.0975. The probability
that at least one of the tests of interest will be
declared significant erroneously is called the
experiment-wise error rate. If we perform three
0.05 level tests, the experiment-wise error rate
increases to 0.143. In practical terms, this means
that if we perform multiple tests and make multi-
ple inferences, each one at a reasonably low error
probability, the likelihood that some of these
inferences will be erroneous could be appreciable.
To correct for this, one must conduct each individual test at a decreased significance level, with the result that either the power of the tests will be reduced or the sample size must be increased to accommodate the desired power. This
could make the trial prohibitively expensive. Sta-
tisticians sometimes refer to the need to adjust the
significance level so that the experiment-wise
error rate is controlled, as the statistical penalty
for multiplicity.
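To make the arithmetic above concrete, here is a minimal Python sketch of these calculations. The Bonferroni correction shown at the end is one common (and conservative) example of such an adjustment, offered purely as an illustration rather than as the only way of paying the statistical penalty.

```python
# Experiment-wise (familywise) error rate when k independent tests are
# each performed at significance level alpha:
#   P(at least one erroneous significant result) = 1 - (1 - alpha)**k
def experimentwise_error_rate(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** k

print(experimentwise_error_rate(0.05, 2))  # 0.0975
print(experimentwise_error_rate(0.05, 3))  # 0.142625, i.e. about 0.143

# One common (conservative) adjustment is the Bonferroni correction:
# perform each individual test at level alpha / k, which keeps the
# experiment-wise error rate at or below the nominal alpha.
adjusted_level = 0.05 / 3
print(adjusted_level)                                # about 0.0167 per test
print(experimentwise_error_rate(adjusted_level, 3))  # about 0.0492, below 0.05
```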
The need to control the experiment-wise error
rate may not apply to exploratory analyses. Statis-
ticians often perform formal statistical tests for
exploratory purposes. In such cases, no formal hypotheses are stated and no inferences are made based on them. Even though formally performing an exploratory test involves the same steps as inferential testing, it is conceptually different because of the absence of a null hypothesis. The p-value obtained in such a test should be viewed as a measure of the inconsistency of the data with the underlying assumptions of the test, rather than as an error probability involved in making a causal inference.
In summary, one should limit the number of
inferential tests to be performed to the minimum
necessary for making the desired causal inferences.
These tests must be specified in the study protocol and
the appropriate adjustments to the error probabil-
ities must be made. Similarly, one should remem-
ber that when multiple tests are performed without
adjustment, as would be the case in an exploratory testing situation, one should expect to see spurious
statistically significant results that may or may not
be meaningful. This last comment applies particu-
larly to statistical tests performed on adverse events
and laboratory data. Adverse events reported in a
study are often summarized by reporting their incidences by body system. Often, dozens
of categories are listed. When formal statistical
tests are applied to these data, some of these tests
will result in p-values less than the customary 0.05. The researcher should be cognizant of this issue and not jump to conclusions.
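As a rough, hypothetical illustration of the scale of the problem, the following Python sketch assumes 40 adverse event categories and independent tests at the 0.05 level; both the number of categories and the independence of the tests are illustrative assumptions, not values from the text.

```python
# Hypothetical illustration: many adverse event categories, each tested
# at the customary 0.05 level, when there is truly no treatment difference.
alpha = 0.05
n_categories = 40  # illustrative number of adverse event categories

# Expected number of spurious "significant" findings.
expected_false_positives = alpha * n_categories

# Probability of at least one spurious finding (treating tests as independent).
prob_at_least_one = 1 - (1 - alpha) ** n_categories

print(expected_false_positives)       # 2.0
print(round(prob_at_least_one, 2))    # about 0.87
```

Under these assumptions, roughly two spurious "significant" findings are expected even when there is no true treatment difference at all.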
It is strongly advisable to specify in advance the particular safety tests to be performed inferentially if there