boxplot, and MA plot) to evaluate the performance were not
convincing. The average-bulk normalization assumes that the total
amount of RNA in each sample is approximately the same, and
that most gene expression changes are less than twofold. However,
Loven et al. found that cells with high levels of c-Myc could amplify
their gene expression programs, producing two to three times more
total RNA and generating cells that were larger than their low-Myc
counterparts [12]. That study recommended spike-in normalization
as the default standard for all gene expression studies.
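
The effect described above can be illustrated with a small, idealized calculation. The following is only a sketch under stated assumptions: the gene names, molecule numbers, the 2.5-fold amplification, and the sequencing depth are hypothetical values chosen for illustration, and read counts are taken as exactly proportional to molecule abundance (no sampling noise). Under these assumptions, library size normalization nearly hides a global amplification, whereas normalization against a constant spike-in amount recovers it.

```python
# Hypothetical molecule numbers per cell; "spike" is a fixed spike-in amount
# added equally to every cell (all values are illustrative assumptions).
low_myc = {"geneA": 100, "geneB": 200, "geneC": 700, "spike": 100}
amplification = 2.5  # assumed global amplification of endogenous RNA
high_myc = {k: (v if k == "spike" else v * amplification)
            for k, v in low_myc.items()}


def expected_counts(molecules, depth):
    """Expected reads when `depth` reads are drawn in proportion to abundance."""
    total = sum(molecules.values())
    return {name: depth * n / total for name, n in molecules.items()}


def library_size_normalize(counts):
    """Divide every count by the total number of reads in the sample."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def spike_in_normalize(counts, spike="spike"):
    """Divide every count by the reads assigned to the spike-in."""
    return {k: v / counts[spike] for k, v in counts.items()}


depth = 1_000_000  # both cells sequenced to the same depth (assumption)
low = expected_counts(low_myc, depth)
high = expected_counts(high_myc, depth)

lib_fold = (library_size_normalize(high)["geneA"]
            / library_size_normalize(low)["geneA"])
spike_fold = (spike_in_normalize(high)["geneA"]
              / spike_in_normalize(low)["geneA"])

print(f"fold change after library size normalization: {lib_fold:.3f}")
print(f"fold change after spike-in normalization:     {spike_fold:.3f}")
```

In this toy setting the true fold change of geneA is 2.5; the library size estimate stays close to 1 because the total RNA content itself has changed between the two cells.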
Spike-in normalization methods for scRNA-seq data typically
use ERCC RNA or unique molecular identifiers (UMIs) [13],
whereas average-bulk normalization methods are difficult to
apply to scRNA-seq data because of the high frequency of
zeroes. Lun et al. assessed the suitability of three average-bulk
normalization methods (library size, TMM, and DESeq) for nor-
malizing scRNA-seq data by simulation [14]. As a result, they
introduced a new method using the pooled size factors and claimed
that their method outperformed the library size method, TMM,
and DESeq. However, this new method introduced additional
assumptions, and its parameters must be set arbitrarily to pool cells
of similar library sizes in each group. In addition, they claimed that
the pooled size factors were closest to the true factors based on the
results using simulated scRNA-seq datasets. In practice, however,
the primary difficulty in validating normalization methods is the
lack of a standard method to estimate the true factors.
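
The difficulty that frequent zeroes cause for average-bulk methods can be made concrete with a simplified median-of-ratios calculation, the idea underlying the DESeq size factors. The function below is a minimal sketch written for illustration, not the DESeq implementation, and both count matrices are hypothetical. Genes with a zero in any cell have a geometric mean of zero and must be excluded; in a sparse matrix this can leave no usable genes at all.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Simplified median-of-ratios size factors.

    counts: list of genes, each a list of counts per cell.
    Returns one size factor per cell, or None if no gene is
    nonzero in every cell.
    """
    n_cells = len(counts[0])
    # Only genes expressed in every cell have a nonzero geometric mean.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        return None
    geo_means = [math.exp(sum(math.log(c) for c in row) / n_cells)
                 for row in usable]
    return [median(row[j] / g for row, g in zip(usable, geo_means))
            for j in range(n_cells)]


# Hypothetical bulk-like data: the second sample is sequenced twice as deeply.
bulk_like = [[10, 20], [100, 200], [30, 60]]
# Hypothetical sparse single-cell data: every gene is zero in some cell.
sparse = [[0, 5], [3, 0], [0, 0], [7, 0]]

bulk_factors = median_of_ratios(bulk_like)
sparse_factors = median_of_ratios(sparse)

print("bulk-like size factors:", bulk_factors)
print("sparse size factors:   ", sparse_factors)
```

For the bulk-like matrix the two size factors differ by a factor of two, as expected; for the sparse matrix the function returns None because no gene survives the filter.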
Library size is not only a commonly used normalization
method but is also used to calculate other normalization factors
(e.g., TMM, RLE, and upper quartile). Library size represents
total RNA including spike-in RNA and cellular RNA. The latter
includes nuclear RNA and mitochondrial RNA. There are two
methods to estimate the library size of a sample. The first
uses the number of all reads that can be aligned to the spike-in
sequences (e.g., ERCC RNA), nuclear genomes, and mitochondrial
genomes. The second uses the read count, which is usually
larger than the read number because of multiple alignments.
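
The difference between the two library size estimates can be sketched as follows. This example assumes that "read number" counts each aligned read once, whereas "read count" counts every alignment record, so that a read aligned to several locations contributes more than once; this interpretation, the record layout, and the read identifiers are illustrative assumptions.

```python
# Hypothetical alignment records: (read identifier, target sequence class).
alignments = [
    ("read1", "nuclear"),
    ("read2", "nuclear"),
    ("read2", "nuclear"),        # second alignment location of read2
    ("read3", "mitochondrial"),
    ("read4", "ERCC"),
    ("read5", "nuclear"),
    ("read5", "mitochondrial"),  # read5 also aligns to a second target
]


def read_number(records):
    """Library size 1: number of distinct reads with at least one alignment."""
    return len({read_id for read_id, _ in records})


def read_count(records):
    """Library size 2: number of alignment records, multi-mappers included."""
    return len(records)


lib_size_1 = read_number(alignments)
lib_size_2 = read_count(alignments)

print("read number (library size 1):", lib_size_1)
print("read count  (library size 2):", lib_size_2)
```

With two multiply aligned reads among five, the read count (7) exceeds the read number (5).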
To assess normalization methods using real scRNA-seq data,
we compared the ERCC method with six modified average-bulk
normalization methods and acquired some new insights into the
scRNA-seq data normalization. In that study, 265 samples contain-
ing ERCC RNA were selected from a colon cancer scRNA-seq
dataset (Subheading 2) to obtain Pearson correlation coefficients
(PCCs) between the factors calculated using the ERCC method
and those calculated using the library size method, pooled size
factors, TMM, RLE, upper quartile, and DESeq. These normalization
methods were classified into three groups by hierarchical
clustering (Fig. 3). The first group included read number (library size
1), read count (library size 2), and the ERCC method. Although