Catalyzing Inquiry at the Interface of Computing and Biology


a random order, each data point is selected. At each iteration, the nodes move closer to the selected data
point, with the distance moved influenced by the distance from the data point to the node and the
iteration number. Thus, the closest node will move the most. Over time, the initial geometry of the
nodes will deform and each node will represent the center of an identified cluster. Experimentation is
often necessary to arrive at a useful number of nodes and geometry, but since SOMs are computationally
tractable, it is feasible to run many sessions. The properties of SOMs—partially structured, scalable to
large datasets, unsupervised, easily visualizable—make them well suited for analysis of microarray
data, and they have been used successfully to detect patterns of gene expression.^124
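
The update loop described above can be sketched in a few lines. This is a minimal illustration, not the
procedure of the cited study: the one-dimensional chain of nodes, the Gaussian neighborhood function, the
linearly decaying learning rate, and the names train_som and squared_distance are all assumptions made
for the example.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def train_som(data, n_nodes, n_iters=50, seed=0):
    """Train a 1-D chain of SOM nodes on a list of equal-length vectors."""
    rng = random.Random(seed)
    # Assumption: each node starts on a distinct, randomly chosen data point.
    nodes = [list(p) for p in rng.sample(data, n_nodes)]
    for it in range(n_iters):
        decay = 1.0 - it / n_iters                    # shrinks with the iteration number
        rate = 0.5 * decay                            # learning rate (assumed schedule)
        radius = max(1e-3, (n_nodes / 2.0) * decay)   # neighborhood width on the chain
        order = list(data)
        rng.shuffle(order)                            # visit data points in random order
        for x in order:
            # The node closest to the data point is the "winner".
            best = min(range(n_nodes), key=lambda j: squared_distance(nodes[j], x))
            for j in range(n_nodes):
                # The winner gets weight 1; nodes farther along the chain get less.
                h = math.exp(-((j - best) ** 2) / (2.0 * radius ** 2))
                for k in range(len(x)):
                    nodes[j][k] += rate * h * (x[k] - nodes[j][k])
    return nodes


# Two well-separated groups of points; each node should settle near one group.
points = [(0.1 * i, 0.1 * i) for i in range(8)]
points += [(10 + 0.1 * i, 10 - 0.1 * i) for i in range(8)]
print(train_som(points, n_nodes=2))
```

In practice the number of nodes and their geometry (here a chain; often a two-dimensional grid) are the
parameters one would vary across runs.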
In contrast to the above two methods, COSA is based on the assumption that better clustering can
be achieved if only relevant genes are used in individual clusters. This is consistent with the idea of
identifying differentially expressed genes (relevant genes) and then using only those genes to build a
classifier. The search algorithm in COSA identifies both an optimal set of variables for grouping
individual clusters and the clusters that should be merged when their similarity is assessed using
that set of variables. This idea was implemented by adding weights that reflect the contribution of
each gene to producing a particular set of sample clusters; the search is then formulated as an
optimization problem. The clustering results from COSA indicate that a subset of genes makes a greater
contribution to a particular sample cluster than to other clusters.^125
Clustering methods are being used in many types of studies. For example, they are particularly
useful in modeling cell networks and in clustering disparate kinds of data (e.g., RNA data and non-RNA data;
sequence data and protein data). Clustering can be applied to evaluate how feasible a given
network structure is. Also, clustering is often combined with perturbation analysis to explore a set of
samples or genes for a particular purpose. In general, clustering can be useful in any study in which
local analyses with groups of samples or genes identified by clustering improve the understanding of
the overall system.
Biclustering is an alternative approach to revealing meaningful patterns in the data.^126 It seeks to
identify submatrices in which the set of values has a low mean-squared residue, meaning that each
value is reasonably coherent with the other members of its row and column. (Unfortunately, once
meaningless solutions with zero area are excluded, this problem is NP-complete.) The approach has
several advantages: it can reveal clusters based on a subset of attributes; it simultaneously clusters
genes with similar expression patterns and conditions with similar expression patterns; and, most
importantly, clusters can overlap. Since genes are often involved in multiple biological pathways,
overlapping clusters can reveal linkages that would otherwise be obscured by traditional cluster analysis.
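
To make the mean-squared residue concrete, the sketch below scores one candidate submatrix. It follows the
usual definition, in which the residue of an entry is the entry minus its row mean minus its column mean
plus the overall mean of the submatrix. The function name and the toy matrix are assumptions for
illustration, and the code only scores a given choice of rows and columns; it does not search for biclusters.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean-squared residue of the submatrix picked out by `rows` and `cols`."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Residue: how far the entry is from what its row and column predict.
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (n_rows * n_cols)


# Hypothetical expression matrix (rows: genes, columns: conditions).
expr = [
    [1, 2, 3, 9],
    [2, 3, 4, 0],
    [4, 5, 6, 7],
    [1, 0, 0, 1],
]
coherent = mean_squared_residue(expr, rows=[0, 1, 2], cols=[0, 1, 2])
whole = mean_squared_residue(expr, rows=[0, 1, 2, 3], cols=[0, 1, 2, 3])
print(coherent, whole)
```

The first three genes rise and fall together across the first three conditions (each row is a shifted
copy of the others), so that submatrix scores essentially zero, while the full matrix scores higher.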
Although many analyses of microarray data consider a single snapshot in time, expression levels vary
over time, especially over the course of the cell cycle. A challenge in analyzing microarray time-series
data is that cell cycles may be unsynchronized, making it difficult to correctly identify correlations
between data samples that have similar expression behavior. Statistical techniques can identify
periodicity in a series and look for phase-shifted correlations between pairs of samples,^127 and they
can be used alongside more traditional clustering analysis.
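
One simple way to look for a phase-shifted correlation between two series is to slide one against the
other and compute the Pearson correlation at each lag. The sketch below does only that; it is an
assumption-laden illustration rather than the method of the cited paper, and the names pearson and
best_lag, the lag range, and the synthetic sine-wave data are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_lag(x, y, max_lag):
    """Return (lag, r) maximizing the correlation between x[t] and y[t + lag]."""
    best = (0, -2.0)
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        r = pearson(a, b)
        if r > best[1]:
            best = (lag, r)
    return best


# Synthetic example: y follows the same cycle as x but three samples later.
period, shift = 12, 3
x = [math.sin(2 * math.pi * t / period) for t in range(36)]
y = [math.sin(2 * math.pi * (t - shift) / period) for t in range(36)]
lag, r = best_lag(x, y, max_lag=5)
print(lag, r)
```

The lag range is kept below half the period here because a periodic signal correlates equally well at
lags that differ by a full cycle.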
A separate set of analytic techniques is referred to as supervised methods. Whereas clustering and
similar methods run with no incoming assumptions, supervised methods use existing knowledge of the
dataset to classify data into one of a set of classes. In general, these techniques


(^124) P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E.S. Lander, and T.R. Golub, “Interpreting
Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation,” Proceedings
of the National Academy of Sciences 96(6):2907-2912, 1999.
(^125) J.H. Friedman and J.J. Meulman, “Clustering Objects on Subsets of Attributes,” Journal of the Royal Statistical Society Series B
66(4):815-849, 2004.
(^126) Y. Cheng and G.M. Church, “Biclustering of Expression Data,” Proceedings of the Eighth International Conference on Intelligent
Systems for Molecular Biology 8:93-103, 2000.
(^127) V. Filkov, S. Skiena, and J. Zhi, “Analysis Techniques for Microarray Time-Series Data,” Journal of Computational Biology
9(2):317-330. Available at http://www.cs.ucdavis.edu/~filkov/papers/spellmananalysis.pdf.
