Computational Systems Biology Methods and Protocols.7z

scaling factor in different contexts. Two classes of methods
are available to calculate normalization factors: the
control-based normalization and the average-bulk normalization.
The former class of methods assumes the total expression level
summed over a small group of genes is approximately the same
across all the samples. The latter class of methods assumes that most
genes are not differentially expressed (DE) across all the
samples. The control-based normalization uses RNA from a
group of internal control genes or external spike-in RNA. The
commonly used internal control genes are housekeeping genes,
and spike-in RNAs are usually artificial RNAs added to the cell lysate.
Since internal control genes and spike-in RNA may not be available
in every dataset, the average-bulk normalization is more commonly
used because of its universality. Five average-bulk normalization methods
designed to normalize bulk RNA-seq data are library size,
trimmed mean of M values (TMM), relative log expression
(RLE), upper quartile, and median of the ratios of observed counts
that is also referred to as the DESeq method (Fig. 2b). The DESeq
method has been included in the Bioconductor package DESeq
[7] for the R environment. TMM, RLE, and upper quartile have been
included in the Bioconductor package edgeR [8] for the R
environment.
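The two classes of normalization factors described above can be illustrated with a minimal sketch. It is not the DESeq or edgeR implementation: the toy count table, the gene names, and all function names are hypothetical, and only three of the approaches are shown (a control-based factor from a designated control gene, the library size, and the median of the ratios of observed counts). Factors are rescaled to a geometric mean of 1 purely for readability.

```python
import math
from statistics import median

# Hypothetical toy counts: rows are genes, columns are two samples.
# Sample 2 is sequenced twice as deeply; geneD is strongly DE.
counts = {
    "geneA": [10, 20],
    "geneB": [100, 200],
    "geneC": [50, 100],
    "geneD": [5, 1000],
    "spike1": [30, 60],  # stands in for an external spike-in RNA
}


def n_samples(counts):
    """Number of samples (columns) in the count table."""
    return len(next(iter(counts.values())))


def rescale(factors):
    """Rescale factors so that their geometric mean is 1."""
    geo_mean = math.exp(sum(math.log(f) for f in factors) / len(factors))
    return [f / geo_mean for f in factors]


def control_based_factors(counts, control_genes):
    """Control-based: summed counts of the control genes in each sample."""
    totals = [sum(counts[g][j] for g in control_genes)
              for j in range(n_samples(counts))]
    return rescale(totals)


def library_size_factors(counts):
    """Average-bulk: total count of each sample."""
    totals = [sum(row[j] for row in counts.values())
              for j in range(n_samples(counts))]
    return rescale(totals)


def median_of_ratios_factors(counts):
    """Average-bulk: median over genes of count / geometric mean across samples."""
    # Log geometric mean per gene; genes with a zero count are skipped.
    log_means = {g: sum(math.log(c) for c in row) / len(row)
                 for g, row in counts.items() if all(c > 0 for c in row)}
    return [math.exp(median(math.log(counts[g][j]) - m
                            for g, m in log_means.items()))
            for j in range(n_samples(counts))]


def normalize(counts, factors):
    """Divide every count by the factor of its sample."""
    return {g: [c / f for c, f in zip(row, factors)]
            for g, row in counts.items()}


for name, factors in [
    ("control-based", control_based_factors(counts, ["spike1"])),
    ("library size", library_size_factors(counts)),
    ("median of ratios", median_of_ratios_factors(counts)),
]:
    print(name, [round(f, 3) for f in factors])
```

In this toy table the control-based and median-of-ratios factors both recover the twofold difference in sequencing depth, whereas the library-size factors differ by roughly sevenfold because the single DE gene dominates the total of sample 2. This illustrates why each class is only as good as its assumption.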
Although many methods have been developed and improved,
RNA-seq data normalization still lacks a satisfactory solution.
Both the control-based normalization and the average-bulk normalization
depend on their assumptions, which cannot be directly
validated by experiments. As for internal control genes, the existence
of housekeeping genes has been investigated in many previous
studies, but none of them sampled human tissues completely.
By integrating these results to remove false-positive genes
caused by inadequate sampling, Zhang et al. found only one gene
common to all 15 examined housekeeping gene datasets, which comprise
187 different tissue and cell types [9]. A shortcoming of this
study, however, is that each dataset was normalized based on the highest gene
expression level for comparison. It is circular reasoning to use another
normalization method to examine whether housekeeping genes are
usable for data normalization. The commonly used spike-in RNA
is the External RNA Controls Consortium (ERCC) RNA set consisting
of 92 polyadenylated transcripts with short 3′ polyA tails but
without 5′ caps [10]. They are designed to have a wide range of
sequence lengths (273–2022 nt) and GC-content percentages
(30.79–52.69%). The same quantity of ERCC RNA should be
spiked into each sample prior to RNA reverse transcription. Risso
et al. evaluated the performance of the ERCC method and concluded
that the ERCC method was not reliable enough to be used
in standard global scaling or regression-based normalization procedures
[11]. Although Risso et al. investigated the ERCC method
in two very different datasets, the measures (PCA plot, RLE

316 Shan Gao
