interest (e.g. results from preclinical, clinical,
pharmacoepidemiologic or other available
studies).
Data mining techniques should always be used in
conjunction with, and not in place of, analyses of
single case reports. Data mining techniques facil-
itate the evaluation of spontaneous reports by
using statistical methods to detect potential signals
for further evaluation. These techniques do not
quantify the magnitude of risk, and caution should
be exercised when comparing drugs. Further,
when using data mining techniques, consideration
should be given to the threshold established
for detecting signals, as this will have implications
for the sensitivity and specificity of the method
(a high threshold is associated with high specifi-
city and low sensitivity). Confounding factors that
influence spontaneous AE reporting are not
removed by data mining.
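The effect of the signal threshold can be illustrated with a minimal sketch. It assumes the proportional reporting ratio (PRR), one commonly used disproportionality measure, as the statistic; the text above does not name a specific method. The report counts, the drug names and the helper names (ReportCounts, prr, flag_signals) are invented for illustration and do not come from any real database.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReportCounts:
    """2x2 table of spontaneous report counts (hypothetical field names)."""
    drug_event: int      # reports naming the drug of interest and the event of interest
    drug_other: int      # reports naming the drug with any other event
    others_event: int    # reports naming any other drug with the event
    others_other: int    # reports naming any other drug with any other event


def prr(c: ReportCounts) -> float:
    """Proportional reporting ratio: the event's share of the drug's reports
    divided by its share of all other drugs' reports.
    Assumes every row of the table contains at least one event report."""
    share_drug = c.drug_event / (c.drug_event + c.drug_other)
    share_others = c.others_event / (c.others_event + c.others_other)
    return share_drug / share_others


def flag_signals(tables: Dict[str, ReportCounts], threshold: float) -> List[str]:
    """Return the drugs whose PRR meets or exceeds the chosen threshold."""
    return [name for name, counts in tables.items() if prr(counts) >= threshold]


# Invented counts, for illustration only.
tables = {
    "drug_A": ReportCounts(30, 970, 200, 19800),
    "drug_B": ReportCounts(15, 985, 200, 19800),
    "drug_C": ReportCounts(10, 990, 200, 19800),
}

strict = flag_signals(tables, threshold=2.0)   # higher threshold: fewer flags
lenient = flag_signals(tables, threshold=1.2)  # lower threshold: more flags
print(strict, lenient)
```

With these invented counts, the stricter threshold flags only drug_A, while the more lenient one also flags drug_B. A flagged pair is a candidate for case-level review, not a measure of risk, and the reporting biases that shaped the counts are still present in the ratio.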
40.7 Privacy
Privacy concerns are becoming more important as
data mining becomes more common. Besides issues
of data ownership, questions abound about who
has access to the data, how much identifying
information the database contains and
how the results of the data mining will be used.
Furthermore, laws in both the United
States and Europe regulate data privacy, and
the FDA has separate rules on data
integrity and traceability. All of these issues
affect the way data are collected and mined
and how the results are used.
The European Union’s Directive on Data Pro-
tection bars the movement of personal data to
countries that do not have sufficient data privacy
laws in place. Additionally, the US Health Insur-
ance Portability and Accountability Act (HIPAA)
sets national standards for the protection of health
information, as applied to the three types of cov-
ered entities: health plans, healthcare clearing-
houses and healthcare providers who conduct
certain healthcare transactions electronically
(HHS OCR HIPAA Privacy, 2003). This law was
enacted in recognition of the fact that advances in
electronic technology could erode the privacy of
health information.
The discussion about data mining and privacy is
only just beginning. There will be increased scrutiny
of data mining and its impact on privacy in the
years to come. This is especially true as consumers
and lawmakers become more aware of, and concerned
about, the potential for data mining, if used
improperly, to violate the privacy rights of individuals. At
the same time, however, governments are actively
engaged in data mining for national security and
law enforcement purposes, as they too begin to
recognize the tremendous value of using this
powerful technique. Nevertheless, as long as the
data that are collected contain any potentially iden-
tifying information, legal, ethical and privacy
questions will need to be addressed.
40.8 Limitations
The biggest limitation of data mining is the
quality of the data. Simply put, the results of the
analyses are only as good as the data from which
they are derived. The best databases are those that
are relevant, complete, large and frequently
updated, and that contain rich, high-quality data.
Unfortunately, many databases were designed for
purposes entirely different from those for which
they are being data mined.
Additionally, as errors can easily occur in data-
bases, it cannot be assumed that the data they
contain are entirely correct. Even after ‘data clean-
ing’ – a process to remove obvious errors and
duplicates – there may be inherent errors or mis-
classification in the data being collected, particu-
larly if there is subjectivity involved in the
measurement that is used. Furthermore, in large,
constantly changing databases, there must be rules
in place for the data mining algorithm to capture
the most current data.
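As a minimal sketch of the two rules described above, the following removes duplicates and keeps only the most current version of each case. The record layout (case_id, version, received, drug, event) and the function name latest_unique_reports are hypothetical; real safety databases use their own schemas and often need fuzzier duplicate matching than a shared case identifier.

```python
from datetime import date

# Hypothetical case-report records; field names and values are illustrative only.
reports = [
    {"case_id": "C1", "version": 1, "received": date(2006, 1, 10),
     "drug": "drug_A", "event": "rash"},
    {"case_id": "C1", "version": 2, "received": date(2006, 2, 3),
     "drug": "drug_A", "event": "rash"},       # follow-up to C1
    {"case_id": "C2", "version": 1, "received": date(2006, 1, 15),
     "drug": "drug_B", "event": "nausea"},
    {"case_id": "C2", "version": 1, "received": date(2006, 1, 15),
     "drug": "drug_B", "event": "nausea"},     # exact duplicate of C2
]


def latest_unique_reports(records):
    """Keep one record per case_id: the one with the highest version number.

    Exact duplicates collapse because they share a case_id and version;
    older versions are dropped in favour of the latest follow-up."""
    latest = {}
    for rec in records:
        current = latest.get(rec["case_id"])
        if current is None or rec["version"] > current["version"]:
            latest[rec["case_id"]] = rec
    return list(latest.values())


cleaned = latest_unique_reports(reports)
for rec in cleaned:
    print(rec["case_id"], rec["version"], rec["event"])
```

A rule of this kind addresses only obvious duplicates and stale versions. It does nothing about misclassification or subjective measurements, which remain in the cleaned data.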
Lastly, because the results obtained from the
data mining process can be difficult to interpret, it
is extremely useful for the results to be presented in
a graphical form that allows the user to interact
with both the data and the results. This allows the
end user to further explore and better understand
the results obtained. By being able to go from a