Computational Systems Biology Methods and Protocols.7z

put and decreasing cost. The rise of RNA-seq methodologies has
greatly deepened our understanding of embryonic development
[2], carcinogenesis [3], cell differentiation [4], and many other
research areas.

1.1 An Overview of RNA-seq Workflow


A complete RNA-seq procedure consists of an experimental stage
and an analysis stage. Although several sequencing protocols
exist for RNA-seq, the general steps and outputs of the
experimental stage are similar. Briefly, RNA molecules with poly-A
tails are first isolated by oligo-dT priming [5]. Alternatively,
non-rRNAs are enriched by rRNA depletion [6]. The resulting RNAs
are fragmented and then reverse-transcribed into short
(200–1000 bp) cDNA fragments, which are ligated to sequencing
adaptors and sequenced from one end or both ends. Several NGS
technologies, including Illumina [7] and SOLiD [8], can be used
for RNA-seq to generate millions or billions of short reads
representing DNA segments.
The analysis stage of RNA-seq begins with mapping reads to
the reference genome. Because eukaryotic genomes contain
introns, reads that span exon-exon junctions align to the genome
with gaps of varying lengths, up to hundreds of thousands of base
pairs, which makes DNA sequence mapping tools generally unsuitable
for direct use in RNA-seq. Widely used RNA-seq mapping tools
include TopHat [9], SOAP [10], and GSNAP [11]. Other programs,
such as Sailfish [12] and Kallisto [13], map reads onto a
reference transcriptome rather than a reference genome, which
circumvents the gap problem and reduces computation time. Read
mapping is followed by quantification of each RNA species, which
is either provided by the reference transcriptome or assembled
de novo from the reads. Most mapping software packages also
perform the quantification step. Generally, the final output of
the mapping and quantification steps can be described as a matrix
in which each column is a sample and each row is a gene or a
splice isoform of a transcript. This matrix, often called the
expression profile, is the starting point of downstream analysis
of RNA-seq datasets.
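
As an illustration of the matrix layout described above, the following
minimal Python sketch assembles an expression profile from per-sample
quantification results. The sample names, gene names, counts, and the
helper build_expression_profile are all hypothetical and chosen only
for illustration; real quantification tools each have their own output
formats, which are not modeled here. Treating a gene that is absent
from a sample as a count of zero is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

# Hypothetical per-sample quantification results: gene -> read count.
# All names and numbers are made up for illustration.
sample_counts: Dict[str, Dict[str, int]] = {
    "sample_A": {"gene1": 120, "gene2": 0, "gene3": 35},
    "sample_B": {"gene1": 98, "gene3": 41},  # gene2 not reported
    "sample_C": {"gene1": 150, "gene2": 4, "gene3": 29},
}


def build_expression_profile(
    counts: Dict[str, Dict[str, int]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (genes, samples, matrix) with one row per gene and
    one column per sample. Genes missing from a sample are given
    a count of 0 (an assumption of this sketch)."""
    samples = sorted(counts)
    genes = sorted({g for per_sample in counts.values() for g in per_sample})
    matrix = [[counts[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix


genes, samples, matrix = build_expression_profile(sample_counts)

# Print the profile as a tab-separated table: rows are genes,
# columns are samples.
print("gene\t" + "\t".join(samples))
for gene, row in zip(genes, matrix):
    print(gene + "\t" + "\t".join(str(v) for v in row))
```

The same layout applies when rows are splice isoforms instead of genes;
only the row labels change.
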
The expression profile contains rich transcriptomic information
about the tested samples. How to draw biological meaning from it,
however, depends heavily on the specific research context. A
one-size-fits-all analytical workflow does not exist. For example,
different normalization methods have been proposed to reduce
technical variation and batch effects among samples, each with its
strengths and drawbacks [14]. Predicting on theoretical grounds
which method will give the best results is challenging and
sometimes impossible [14]. RNA-seq data are often analyzed by
clustering methods to discover co-expressed gene groups or sample
subclasses that share expression patterns. Commonly used
clustering algorithms, including hierarchical clustering, principal

168 Chao Zhang et al.
