Computational Systems Biology Methods and Protocols.7z

due to their short reads. Using the human genome GRCh38 as
reference, 57,992 annotated genes can be used to produce the gene
expression matrix, which is a large sparse matrix. Sample reduction
and then feature reduction therefore need to be performed to remove
as much noise as possible.
Sample reduction is typically based on the library
size or the mitochondrial RNA percentage (Subheading 2). Since
the gene number and the UMI number correlate well with the
library size, they are also used to filter out samples. Using the human
genome GRCh38 as reference, the library size can be calculated by
counting reads aligned to 57,992 annotated genes (cellular RNA)
and 92 ERCC RNA sequences. Cellular RNA can be calculated by
counting reads aligned to 57,955 nuclear genes (nuclear RNA) and
37 mitochondrial genes (mitochondrial RNA). In our previous
studies, we found that sample reduction greatly affected the results
of cluster analysis and the downstream analyses (e.g., differential
expression analysis). Instead of library size or cellular RNA,
a nuclear RNA total of at least 100,000 read counts was suggested
as the criterion for retaining a sample.
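
The bookkeeping described above (nuclear RNA + mitochondrial RNA = cellular RNA; cellular RNA + ERCC = library size; 57,955 + 37 = 57,992 genes) can be sketched in a few lines of Python. This is only a minimal illustration, not the authors' implementation: the function names, the feature labels, and the example counts are hypothetical, and the mitochondrial proportion is assumed here to be taken relative to cellular RNA, which the text does not state explicitly.

```python
MIN_NUCLEAR_READS = 100_000  # threshold quoted in the text


def summarize_sample(counts, feature_class):
    """Summarize one sample's read counts.

    counts        : read counts for one sample, one value per feature
    feature_class : parallel list of labels: "nuclear", "mito", or "ercc"
    """
    totals = {"nuclear": 0, "mito": 0, "ercc": 0}
    for count, label in zip(counts, feature_class):
        totals[label] += count

    cellular = totals["nuclear"] + totals["mito"]  # cellular RNA
    library = cellular + totals["ercc"]            # library size
    # Assumption: the mitochondrial proportion is relative to cellular RNA.
    mito_fraction = totals["mito"] / cellular if cellular > 0 else 0.0

    return {
        "nuclear": totals["nuclear"],
        "cellular": cellular,
        "library": library,
        "mito_fraction": mito_fraction,
    }


def passes_filter(summary, min_nuclear=MIN_NUCLEAR_READS):
    """Keep a sample only if its nuclear RNA reaches the threshold."""
    return summary["nuclear"] >= min_nuclear


# Hypothetical toy data: two nuclear genes, one mitochondrial gene, one ERCC spike-in.
feature_class = ["nuclear", "nuclear", "mito", "ercc"]
samples = {
    "cell_A": [60_000, 50_000, 10_000, 5_000],
    "cell_B": [40_000, 30_000, 30_000, 5_000],
}
for name, counts in samples.items():
    summary = summarize_sample(counts, feature_class)
    print(name, summary, "kept" if passes_filter(summary) else "removed")
```

Filtering on the nuclear total rather than on library size means that a sample dominated by mitochondrial or spike-in reads cannot pass merely because its overall read count is high.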

Table 2
Comparison of scRNA-seq normalization methods

Method            Pearson correlation coefficient
ERCC              1.00                              1.00#
Library size 1    0.82                              0.83#
Library size 2    0.89                              0.89#
Cellular RNA      0.29                              0.39#
Nuclear RNA^a     0.02                              0.36#
Pooled^a          0.09                              0.17#
                  Q1 (25%)   Q2 (50%)   Q3 (75%)
TMM^a             0.26       0.20       0.29        0.56#
RLE^a             0.27       0.42       0.53        0.63#
Upper quartile^a  0.27       0.37       0.50        0.69#
DESeq             0.28       0.43       0.55        0.63#

Pearson correlation coefficients (PCCs) were calculated between the factors from the ERCC method and those from each
bulk normalization method. The PCC calculation used 265 samples containing ERCC RNA from a colon cancer scRNA-seq
dataset (Subheading 2). Nuclear RNA represents the total count of reads aligned to the nuclear genome. Cellular
RNA represents the total count of reads aligned to the nuclear and mitochondrial genomes. Library size 1 and 2 represent
total RNA using read number and read count, respectively (Subheading 3). Pooled represents pooled size factors [14]
with the parameter sizes = c(15, 40, 80, 130). TMM, RLE, upper quartile, and DESeq were modified to fit
scRNA-seq data containing a high frequency of zeroes.
^a These methods use the nuclear RNA as library size for calculation.
# From the 265 samples, 171 samples with a mitochondrial RNA proportion of less than 30% were selected to repeat the calculation.
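
The comparison reported in Table 2 can be illustrated with a short Python sketch: compute per-sample factors from two methods, take their Pearson correlation, then repeat on the samples whose mitochondrial RNA proportion is below 30%. This is an assumption-laden sketch, not the procedure used to produce the table: it assumes each factor is simply a per-sample total scaled to a mean of 1, and the function names and example numbers are hypothetical.

```python
import math


def size_factors(totals):
    """Scale per-sample totals so that the factors average to 1 (assumed definition)."""
    mean_total = sum(totals) / len(totals)
    return [t / mean_total for t in totals]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def compare_to_ercc(ercc_totals, other_totals, mito_fraction, max_mito=0.30):
    """Return (PCC on all samples, PCC on samples with mito_fraction < max_mito)."""
    pcc_all = pearson(size_factors(ercc_totals), size_factors(other_totals))

    keep = [i for i, m in enumerate(mito_fraction) if m < max_mito]
    pcc_subset = pearson(
        size_factors([ercc_totals[i] for i in keep]),
        size_factors([other_totals[i] for i in keep]),
    )
    return pcc_all, pcc_subset


# Hypothetical totals for four samples; the last one is rich in mitochondrial RNA.
ercc = [10, 20, 30, 40]
nuclear = [1, 2, 3, 10]
mito = [0.10, 0.20, 0.25, 0.50]
print(compare_to_ercc(ercc, nuclear, mito))
```

Because the Pearson coefficient is unchanged by rescaling either variable, the choice of scaling in `size_factors` does not affect the result; only the relative ordering and spacing of the per-sample totals matter.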


Data Analysis in Single-Cell Transcriptome Sequencing 319