analysts and healthcare professionals with the ability to assess the
validity of the results generated by a data mining algorithm.
Additionally, data mining is not data dredging, which is a pejorative
term used to imply the repeated evaluation of a data set, usually
involving multiple comparisons with no prior defined method, to find
some ‘statistically significant’ event. Given the statistical problems
associated with conducting multiple comparisons, such a ‘statistically
significant’ event may merely be a random finding that only gets noted
due to the multiple comparisons, or data dredging.
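The multiple-comparisons problem can be illustrated with a small calculation. The sketch below is not from the text: it assumes the comparisons are statistically independent and that each is tested at a significance level of 0.05, and the function name `familywise_error_rate` is chosen here for illustration only. Under those assumptions, the chance of at least one spurious ‘significant’ finding rises quickly with the number of comparisons.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `num_tests` tests.

    Assumes the tests are independent and that every null hypothesis is
    true, so each test has probability `alpha` of a false positive.
    """
    if num_tests < 0:
        raise ValueError("num_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    # P(no false positive in any test) = (1 - alpha) ** num_tests
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for m in (1, 10, 100):
        print(f"{m:>3} comparisons: {familywise_error_rate(m):.3f}")
```

With 10 independent comparisons the probability of at least one chance finding is already about 40%, and with 100 it exceeds 99%, which is why an unplanned search across many drug–event pairs can be expected to turn up ‘significant’ results by chance alone.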
40.2 Methods
Before any data mining algorithms or models are
used on a database, it is important to first make sure
that the data have been collected appropriately and
that they have been organized and checked for
accuracy. Subsequently, there is a choice from
among multiple data mining methods that can be
used. Among these are the Multi-Item Gamma Poisson Shrinker (MGPS)
algorithm, which generates an Empirical Bayesian Geometric Mean (EBGM)
score, the Proportional Reporting Ratio (PRR) method and the Bayesian
Neural Network approach (DuMouchel, 1999; Evans et al., 2001; Bate
et al., 1998). Both the MGPS and PRR methods will generate similar
drug–event combinations for further investigation when the observed
number of cases with the drug–event combination is greater than 20 or
the expected number of cases with the drug–event combination is < 1.
EBGM is a statistical measure of disproportionality, comparing the
observed and expected reporting frequency within a database. The
determination of the expected reporting frequency assumes complete
independence of cases associated with either a drug or an event. Thus,
in a hypothetical database of 100 cases, if Drug Z represented 20 cases
in the database and there were 10 cases of rhabdomyolysis, the expected
reporting frequency would be 20/100 (probability of Drug Z) × 10/100
(probability of rhabdomyolysis) × 100 cases (total database size) = 2
expected cases. If the observed number of drug–event cases was 8, then
the relative reporting ratio (RR) would be 8/2 (N/E) = 4 and the EBGM
would be about 4, depending on the amount of ‘shrinkage’ that occurs
based on the model (see Figure 40.1).
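The arithmetic in the Drug Z example can be reproduced with a short sketch. The function names `expected_count` and `relative_reporting_ratio` are illustrative choices, not part of any data mining package, and the code covers only the expected count E and the ratio RR = N/E. It does not attempt the empirical Bayesian ‘shrinkage’ step that turns RR into an EBGM, which depends on the fitted MGPS model.

```python
def expected_count(drug_cases: int, event_cases: int, total_cases: int) -> float:
    """Expected number of drug-event cases if drug and event were independent.

    Equivalent to (drug_cases / total) * (event_cases / total) * total,
    written in its simplified form to limit floating-point rounding.
    """
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    return drug_cases * event_cases / total_cases


def relative_reporting_ratio(observed: float, expected: float) -> float:
    """Relative reporting ratio RR = N / E (observed over expected)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


if __name__ == "__main__":
    # Hypothetical database from the text: 100 cases in total,
    # 20 involving Drug Z and 10 involving rhabdomyolysis.
    e = expected_count(drug_cases=20, event_cases=10, total_cases=100)
    rr = relative_reporting_ratio(observed=8, expected=e)
    print(f"E = {e}, RR = {rr}")
```

Run on the figures from the text, this gives E = 2 and RR = 4, matching the worked example. An EBGM derived from the same counts would be close to 4 only to the extent that the model applies little shrinkage.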
The larger the number of adverse event (AE)
reports for a particular drug (for a drug that has
- N: The observed number of cases with the combination of items.
- E: The expected number of cases with the combination, calculated as:

  $$E = \frac{\text{Observed \# cases with DRUG}}{\text{Total \# cases}} \times \frac{\text{Observed \# cases with EVENT}}{\text{Total \# cases}} \times \text{Total \# cases}$$

- RR: Relative reporting ratio (the same as N/E). Observed number of
  cases with the combination divided by the expected number of cases
  with the combination. This may be viewed as a sampling estimate of
  the true value of observed/expected for the particular combination of
  drug and event.
- EBGM: Empirical Bayesian Geometric Mean. A more stable estimate than
  RR; the so-called ‘shrinkage’ estimate.
- EB05: A value such that there is less than a 5% probability that the
  true value of observed/expected lies below it.
- EB95: A value such that there is less than a 5% probability that the
  true value of observed/expected lies above it.
- 90% CI: The interval from EB05 to EB95 may be considered to be the
  ‘90% confidence interval’.
Figure 40.1 Empirical Bayesian Geometric Mean (EBGM) terms