98 CATALYZING INQUIRY
Many analytic techniques have been developed and applied to the problem of revealing biologically
significant patterns in microarray data. Various statistical tests (e.g., t-test, F-test) have been used
to identify, out of thousands of genes, those with significant changes in expression; such genes have
received widespread attention as potential diagnostic markers or drug targets for disease, stages of
development, and other cellular phenotypes. Many classification tools (e.g., Fisher’s Discriminant Analysis,
Bayesian classifiers, artificial neural networks, tools from signal processing) have also been developed to
build phenotype classifiers from differentially expressed genes. These classification tools are generally
used to discriminate known sample groups from each other using differentially expressed genes
selected by statistical testing.
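As a minimal sketch of the gene-selection step described above, the following computes a two-sample t statistic (Welch's form, chosen here as an assumption) for each gene and ranks genes by its magnitude. The gene names, expression values, and the helper `welch_t` are hypothetical and serve only as illustration; a real analysis would also convert each statistic to a p-value and account for the fact that thousands of genes are tested at once.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Two-sample t statistic that does not assume equal variances."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical expression levels: (condition A replicates, condition B replicates).
expression = {
    "gene_1": ([1.0, 1.2, 0.9, 1.1], [3.0, 3.2, 2.9, 3.1]),  # clearly shifted
    "gene_2": ([2.0, 2.1, 1.9, 2.0], [2.0, 1.9, 2.1, 2.0]),  # essentially unchanged
}

scores = {gene: welch_t(a, b) for gene, (a, b) in expression.items()}

# Genes with the largest |t| are the strongest candidates for differential expression.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
```

Genes at the top of `ranked` would then be the ones passed on to a classifier.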
Other algorithms are necessary because data acquired through microarray technology often have
problems that must be managed prior to use. For example, the quality of microarray data is highly
dependent on the way in which a sample is prepared. Many factors can affect the extent to which a dot
fluoresces, of which the transcription level of the particular gene involved is only one. Such extraneous
factors include the sample’s spatial homogeneity, its cleanliness (i.e., lack of contamination), the
sensitivity of optical detectors in the specific instrument, varying hybridization efficiency between clones,
relative differences between dyes, and so forth. In addition, because different laboratories (and different
technicians) often have different procedures for sample preparation, datasets taken from different
laboratories may not be strictly comparable. Statistical methods of analysis of variance (ANOVA) have been
applied to deal with these problems, using models to estimate the various contributions to relative
signal from the many potential sources. Importantly, these models not only allow researchers to attach
measures of statistical significance to data, but also suggest improved experimental designs.^122
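As an illustration of the kind of model meant here, an ANOVA model for a two-dye array experiment can be written with one additive effect per source of variation. The form below is a sketch of the general structure; the symbol names are chosen for this example, and the terms actually included depend on the experimental design.

```latex
\log y_{ijkg} = \mu + A_i + D_j + V_k + G_g + (AG)_{ig} + (VG)_{kg} + \epsilon_{ijkg}
```

Here $y_{ijkg}$ is the measured signal for gene $g$ on array $i$, labeled with dye $j$, from sample type $k$; $\mu$ is the overall mean; $A_i$, $D_j$, $V_k$, and $G_g$ are the main effects of array, dye, sample type, and gene; $(AG)_{ig}$ is an array-by-gene (spot) effect; and $\epsilon_{ijkg}$ is random error. In a model of this form, the sample-type-by-gene interaction $(VG)_{kg}$ carries the differences in expression that are of biological interest, while the remaining terms absorb the extraneous factors.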
An important analytical task is to identify groups of genes with similar expression patterns. These
groups of genes are more likely to be involved in the same cellular pathways, and many data-driven
hypotheses about cellular regulatory mechanisms (e.g., disease mechanisms) have been drawn under
this assumption. For this purpose, various clustering methods, such as hierarchical clustering methods,
self-organizing maps (trained neural networks), and COSA (Clustering Objects on Subsets of Attributes),
have been developed. The goal of cluster analysis is to partition a dataset of N objects into subgroups
such that each object is more similar to the others in its own subgroup than to objects in other subgroups.
Clustering tools are generally used to identify groups of genes that have similar expression patterns
across samples; it is then reasonable to suppose that the genes in each group (or cluster) are involved in
the same biological pathway. Most clustering methods are iterative and involve the calculation of a
notional distance between any two data points; this distance is used as the measure of similarity. In
many implementations of clustering, the distance is a function of all of the attributes of each sample.
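To make the notion of distance concrete, here is a minimal sketch of two measures that are functions of all attributes of a profile: Euclidean distance and a correlation-based distance, taken here as one minus the Pearson correlation (an assumed convention; others exist). The three expression profiles are hypothetical.

```python
from math import sqrt


def euclidean(x, y):
    """Straight-line distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical shape, 2 for opposite shape.

    Undefined (division by zero) if either profile is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Hypothetical profiles measured across four samples.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend to gene_a

d_euclid_ab = euclidean(gene_a, gene_b)
d_pearson_ab = pearson_distance(gene_a, gene_b)
d_pearson_ac = pearson_distance(gene_a, gene_c)
```

The two measures can disagree: `gene_a` and `gene_b` are far apart in Euclidean terms but nearly identical under the correlation-based distance, because the latter compares the shape of the profiles rather than their magnitude.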
Agglomerative hierarchical clustering begins by treating each of the N samples as its own cluster,
giving N initial clusters. Potential clusters are arranged in a hierarchy
displayed as a binary tree or “dendrogram.” Euclidean distance or Pearson correlation is used with
“average linkage” to develop the dendrogram. For example, the two clusters that are closest to each other in
terms of Euclidean distance are combined to form a new cluster, which is represented as the average of
the two groups combined (average linkage). This process continues until there is one cluster to which all
samples belong. In the process of forming the single cluster, the overall structure of clusters is evaluated
at each merge to determine whether combining two clusters into one decreases both the sum of the similarity
within all of the clusters and the sum of the differences between all of the clusters. The level at which
these two sums are equal is taken as the final clustering.
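The merging loop described above can be sketched as follows. This is an illustrative implementation, not a reference one: it uses Euclidean distance and the standard definition of average linkage (the mean of all pairwise distances between members of two clusters), which differs slightly from representing a merged cluster by its averaged profile. It records the full merge history down to a single cluster and does not implement the stopping criterion. The function name `average_linkage` and the sample profiles are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(points):
    """Agglomerative clustering; returns a list of (cluster_a, cluster_b, distance).

    Clusters are tuples of indices into `points`.
    """
    clusters = [(i,) for i in range(len(points))]  # every sample starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average linkage: mean pairwise distance between members.
            d = sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two closest clusters with their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
        merges.append((a, b, d))
    return merges


# Hypothetical two-attribute profiles forming two obvious groups.
profiles = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
merges = average_linkage(profiles)
```

Each entry of `merges` corresponds to one internal node of the dendrogram, and the recorded distance gives the height at which the two branches join. The brute-force search over all cluster pairs is adequate for a handful of samples but would be too slow for thousands of genes.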
Self-organizing maps (SOMs)^123 are another form of cluster analysis. With SOMs, the number of
desired clusters is decided in advance, and a geometry of nodes (such as an N × M grid) is created,
where each node represents a single cluster. The nodes are randomly placed in the data space. Then, in
(^122) M. Kerr, M. Martin, and G. Churchill, “Analysis of Variance for Gene Expression Microarray Data,” Journal of Computational
Biology 7(6):819-837, 2000.
(^123) T. Kohonen, Self-Organizing Maps, Second Edition, Springer, Berlin, 1997.