Computational Systems Biology Methods and Protocols.7z

the first group of methods produced very similar results, it is more
convenient to use read counts for data normalization by simply
computing the column sums of a gene expression matrix. Currently,
most software and R packages (e.g., edgeR and DESeq) compute the
column sums of a gene expression matrix to obtain library sizes.
However, normalization factors calculated using library size (total
RNA) differ significantly from those calculated using cellular
RNA or nuclear RNA (Table 2). Compared with the third group, the
second group of methods was closer to the first group. Within the
second group, the upper quartile method reached the highest
correlation with the ERCC method from the first group. In the third
group, nuclear RNA was closest to pooled size factors. In conclusion,
the normalization of scRNA-seq and RNA-seq data is still
unsettled. Based on our studies, if ERCC data are not available,
library size can be used instead.
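
The library-size calculation described above can be illustrated with a minimal sketch. It assumes a count matrix with genes as rows and samples as columns; the toy counts, the function names, and the choice to scale the factors so that their mean is 1 are all assumptions made for illustration and are not taken from edgeR or DESeq.

```python
# Toy count matrix (hypothetical values): rows are genes, columns are samples.
counts = [
    [10, 20, 0],
    [0, 5, 5],
    [30, 15, 10],
    [60, 60, 5],
]


def library_sizes(matrix):
    """Return the total read count of each sample, i.e., the column sums."""
    return [sum(column) for column in zip(*matrix)]


def size_factors(lib_sizes):
    """Scale library sizes so that their arithmetic mean is 1 (assumed convention)."""
    mean_size = sum(lib_sizes) / len(lib_sizes)
    return [size / mean_size for size in lib_sizes]


def normalize(matrix, factors):
    """Divide every count by the size factor of its sample (column)."""
    return [[value / factor for value, factor in zip(row, factors)]
            for row in matrix]


libs = library_sizes(counts)          # one total per sample
factors = size_factors(libs)          # one normalization factor per sample
normalized = normalize(counts, factors)
```

After this kind of scaling, every sample has the same total, so differences between samples that remain are not due to sequencing depth alone.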

4 Fundamental Problems


Here, we present a schema to generalize four fundamental
problems (Fig. 4). Besides data normalization and cluster analysis,
sample and feature reduction are two other fundamental problems
in scRNA-seq data analysis. The normalized gene expression
matrix is composed of n samples by m features, which can be genes,
transcripts, or exons (Fig. 2a). The scRNA-seq data from SMS (e.g.,
PacBio full-length transcriptome [15]) use transcripts as features,
while the scRNA-seq data from NGS often use genes as features

Fig. 3 Correlation of different normalization methods. The hierarchical clustering
used correlation distances (1 − Pearson correlation coefficient) between
265 samples containing ERCC RNA from Table 2. TMM, RLE, upper quartile,
and DESeq were modified to process scRNA-seq data containing a high
frequency of zeroes. Pooled represents pooled size factors [14]
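
The correlation distance named in the caption of Fig. 3 can be sketched as follows. This is only an illustration under stated assumptions: the normalization factors below are made-up values for four samples, the method labels are placeholders, and the sketch stops at the pairwise distance matrix, which is the kind of input a hierarchical clustering routine would take.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Correlation distance: 1 minus the Pearson correlation coefficient."""
    return 1.0 - pearson(x, y)


# Hypothetical normalization factors for four samples, one vector per method.
factors = {
    "library_size": [1.0, 2.0, 3.0, 4.0],
    "upper_quartile": [1.1, 2.1, 2.9, 4.2],
    "nuclear_rna": [4.0, 1.0, 3.0, 2.0],
}

# Pairwise distance between every pair of methods.
methods = list(factors)
dist = {
    (a, b): correlation_distance(factors[a], factors[b])
    for a in methods
    for b in methods
}
```

With this distance, two methods whose factors rise and fall together across samples lie close to 0, while anticorrelated methods approach 2.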

318 Shan Gao
