100 CATALYZING INQUIRY
rely on training sets provided by the researchers, in which the class membership of the data is known. Then,
when presented with experimental data, supervised methods apply the learning from the training set to
perform similar classifications. One such technique is support vector machines (SVMs), which are
useful for highly multidimensional data. SVMs map the data into a “feature space” and then create
(through one of a large number of possible algorithms) a hyperplane that separates the classes. Another
common method is the artificial neural network (see XREF), which trains on a dataset with defined class
membership; if the neural network classifies a member of the training set incorrectly, the error
backpropagates through the system and updates the weights. Unsupervised and supervised methods can
be combined for “semisupervised” learning methods, in which heterogeneous training data can include both
classified and unclassified examples.^128
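The idea of a separating hyperplane can be illustrated with a minimal sketch of a linear classifier trained by subgradient descent on the regularized hinge loss, which is one of many possible ways to fit an SVM. The function names, toy data, and parameter values (`lam`, `lr`, `epochs`) below are illustrative assumptions and do not come from the text; the "feature space" here is simply the original two-dimensional input.

```python
# Minimal linear SVM sketch: subgradient descent on the regularized hinge loss.
# All names, data, and parameter values are illustrative assumptions.

def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Return (w, b) describing a hyperplane w.x + b = 0.

    points: list of equal-length feature vectors; labels: +1 or -1.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is misclassified or inside the margin: move the hyperplane.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: apply only the regularization shrinkage.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane on which it falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Made-up training set: two features per sample, with known class membership.
points = [[2.0, 2.5], [2.5, 3.0], [3.0, 2.0],
          [-2.0, -2.5], [-2.5, -1.5], [-3.0, -2.0]]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, x) for x in points]
```

In practice, a nonlinear kernel would replace the plain dot product when the classes are not linearly separable in the original space.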
However, no single analytic method is optimal for every dataset. Thus, it would be useful to develop a
scheme that can guide users in choosing an appropriate method (e.g., in hierarchical clustering, an
appropriate combination of similarity measure, linkage method, and criterion for determining the number
of clusters) to achieve a reasonable analysis of their own datasets.
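To make the dependence on these choices concrete, the following sketch runs a naive agglomerative clustering on four made-up expression profiles under two different similarity measures. The profiles, function names, and the fixed target of two clusters are assumptions chosen for illustration; the point is only that the same data can group differently depending on the measure selected.

```python
import math

# Naive agglomerative clustering with a pluggable distance and linkage.
# Profiles and names are hypothetical, chosen only to illustrate the point.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_distance(a, b):
    """1 minus the Pearson correlation: small when profiles rise and fall together."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1.0 - cov / (sa * sb)


def agglomerate(profiles, distance, linkage, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    linkage is min (single linkage) or max (complete linkage).
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(distance(profiles[a], profiles[b])
                            for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


profiles = [
    [1, 2, 3, 4],      # gene 0: rising, low amplitude
    [10, 20, 30, 40],  # gene 1: rising, high amplitude
    [4, 3, 2, 1],      # gene 2: falling, low amplitude
    [40, 30, 20, 10],  # gene 3: falling, high amplitude
]

by_magnitude = agglomerate(profiles, euclidean, max, n_clusters=2)
by_shape = agglomerate(profiles, correlation_distance, max, n_clusters=2)
```

Euclidean distance groups the profiles by amplitude (genes 0 and 2 together), whereas correlation distance groups them by the direction of change (genes 0 and 1 together), so neither result is "correct" without reference to the biological question being asked.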
Ultimately, it is desirable to go beyond correlations and associations in the analysis of gene expression
data to seek causal relationships. It is an elementary truism of statistics that indications of correlation
are not by themselves indicators of causality; an experimental manipulation of one or more variables
is always necessary to conclude a causal relationship. Nevertheless, analysis of microarray data
can be helpful in suggesting experiments that might be particularly fruitful in uncovering causal
relationships. Bayesian analysis allows one to make inferences about the possible structure of a genetic
regulatory pathway on the basis of microarray data, but even advocates of such analysis recognize the
need for experimental testing. One study goes so far as to suggest that automated
processing of microarray data can propose interesting experiments that will shed light on causal
relationships, even if the existing data themselves do not support causal inferences.^129
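The gap between correlation and causation can be seen in a small simulation, sketched here with entirely synthetic data. It assumes a hypothetical hidden regulator that drives two genes, A and B, with no direct effect of A on B; the coefficients, noise levels, and sample size are arbitrary choices made for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_samples = 500

# Observational setting: a hidden regulator drives both genes.
regulator = [rng.gauss(0, 1) for _ in range(n_samples)]
gene_a = [2.0 * r + rng.gauss(0, 0.3) for r in regulator]
gene_b = [1.5 * r + rng.gauss(0, 0.3) for r in regulator]
observed_r = pearson(gene_a, gene_b)

# Simulated manipulation: the experimenter sets gene A's level directly,
# independent of the regulator. Gene B still follows the regulator only.
gene_a_forced = [rng.gauss(0, 1) for _ in range(n_samples)]
gene_b_after = [1.5 * r + rng.gauss(0, 0.3) for r in regulator]
intervened_r = pearson(gene_a_forced, gene_b_after)
```

Under these assumptions the observed correlation between A and B is strong, yet it largely disappears once A is manipulated directly, which is the kind of evidence only an experiment can supply.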
4.4.8 Data Mining and Discovery
4.4.8.1 The First Known Biological Discovery from Mining Databases^130
By the early 1970s, the simian sarcoma virus had been determined to cause cancer in certain species
of monkeys. In 1983, the responsible oncogene within the virus was sequenced. At around the same
time, and entirely independently, a partial amino acid sequence of an important growth factor in
humans, the platelet-derived growth factor (PDGF), was also determined. PDGF was known to cause
cultured cells to proliferate in a cancer-like manner. Russell Doolittle compared the two sequences and
found a high degree of similarity between them, indicating a possible connection between an oncogene
and a normal human gene. In this case, the indication was that the simian sarcoma virus acted on cells
in monkeys in a manner similar to the action of PDGF on human cells.
(^128) T. Li, S. Zhu, Q. Li, and M. Ogihara, “Gene Functional Classification by Semisupervised Learning from Heterogeneous
Data,” pp. 78-82 in Proceedings of the ACM Symposium on Applied Computing, ACM Press, New York, 2003.
(^129) C. Yoo and G. Cooper, “An Evaluation of a System That Recommends Microarray Experiments to Perform to Discover
Gene-regulation Pathways,” Artificial Intelligence in Medicine 31(2):169-182, 2004, available at
http://www.phil.cmu.edu/projects/genegroup/papers/yoo2003a.pdf.
(^130) Adapted from S.G.E. Andersson and L. Klasson, “Navigating Through the Databases,” available at
http://artedi.ebc.uu.se/course/overview/navigating_databases.html. The original Doolittle article was
published as R.F. Doolittle, M.W. Hunkapiller, L.E. Hood, S.G. Devare, K.C. Robbins, S.A. Aaronson, and
H.N. Antoniades, “Simian Sarcoma Virus onc Gene, v-sis, Is Derived from the Gene (or Genes) Encoding a
Platelet-derived Growth Factor,” Science 221(4607):275-277, 1983.