Computational Systems Biology Methods and Protocols.7z


2.2.1 Normalization of scRNA-seq Data Without UMIs


When both UMIs and extrinsic spike-ins are absent, as was common in
the earliest single-cell sequencing studies, there is no choice but to
use a bulk-based normalization strategy. More recent strategies are
based on spike-ins, because without external spike-in controls it is
difficult to determine how much RNA is present in a cell, and this
amount varies from cell to cell. Since the
spike-in RNAs are assumed to be added in the same amount to the
library of every cell, it is possible to accurately estimate relative
differences in the total RNA content between cells. Specifically, the
ratio of the number of reads mapped to the genome of interest to the
number of reads mapped to the spike-ins is easy to calculate.
Comparing this ratio between cells allows differences in the amount of
RNA per cell to be inferred. Therefore, if spike-ins are available,
the read counts associated with each gene can be converted into
absolute numbers of mRNA molecules based on the levels of the
spike-ins, which are of known concentration. However, spike-in
controls are not a perfect solution for scRNA-seq read normalization.
The most common spike-in sets, such as ERCC [120], are 500–2000
nucleotides (nts) in length, which is shorter than the average human
mRNA (~2100 nts including untranslated regions [128]). Because of the
3′ bias of scRNA-seq protocols, a conversion based on the shorter ERCC
spike-ins is potentially problematic. Additionally, the spike-ins have
comparatively short poly(A) tails and lack 5′ caps, so their
degradation and reverse-transcription efficiency may differ from those
of the endogenous mRNAs. Consequently, it is challenging to develop a
generally applicable normalization strategy for scRNA-seq data that
properly accounts for variability in both sequencing depth and cell
size.
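
The spike-in ratio and the conversion to molecule counts described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not part of the protocol description: reads are assumed to scale linearly with molecules, endogenous and spike-in transcripts are assumed to be captured with equal efficiency (which, as noted above, may not hold), and all counts, cell names, and the SPIKEIN_MOLECULES value are made up.

```python
def endogenous_to_spikein_ratio(endogenous_reads, spikein_reads):
    """Reads mapped to the genome of interest divided by reads mapped to spike-ins."""
    if spikein_reads <= 0:
        raise ValueError("spike-in read count must be positive")
    return endogenous_reads / spikein_reads


def reads_to_molecules(gene_reads, spikein_reads, spikein_molecules):
    """Estimate a gene's molecule count from its read count.

    Assumption (illustrative only): the reads-per-molecule rate observed
    for the spike-ins also applies to endogenous transcripts.
    """
    if spikein_reads <= 0 or spikein_molecules <= 0:
        raise ValueError("spike-in reads and molecules must be positive")
    reads_per_molecule = spikein_reads / spikein_molecules
    return gene_reads / reads_per_molecule


# Toy, made-up read totals for two cells that received the same spike-in amount.
cells = {
    "cell_A": {"endogenous": 800_000, "spikein": 200_000},
    "cell_B": {"endogenous": 400_000, "spikein": 400_000},
}
SPIKEIN_MOLECULES = 50_000  # hypothetical number of spike-in molecules added per cell

ratios = {
    name: endogenous_to_spikein_ratio(c["endogenous"], c["spikein"])
    for name, c in cells.items()
}

# Comparing the ratios between cells gives the relative total RNA content.
relative_rna = ratios["cell_A"] / ratios["cell_B"]

# A gene with 400 reads in cell_A, converted to an estimated molecule count.
molecules_a = reads_to_molecules(400, cells["cell_A"]["spikein"], SPIKEIN_MOLECULES)

print(ratios, relative_rna, molecules_a)
```

With these toy numbers, cell_A has a fourfold higher ratio than cell_B, which under the stated assumptions would be read as a fourfold higher total RNA content.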
In many cases, a sensible and pragmatic approach is to calculate
two alternative size factors: one for the spike-ins and one for the
endogenous mRNAs [129]. The former accounts solely for sequencing
depth, whereas the latter, calculated from the endogenous mRNAs,
reflects the relative differences in cell size. This twofold
normalization strategy relies on the assumption that the normalized
spike-ins can be used to estimate the degree of technical variability
across the whole dynamic range of expression, which is the basic
principle of spike-in control. Even if the technical noise is well
accounted for by the spike-in control, transcript length-based
normalization methods, such as FPKM or RPKM, remain problematic. In
particular, although improvements have been made recently [37], there
is still a noticeable 3′ bias in several scRNA-seq protocols,
including the Smart-seq protocol used by the popular Fluidigm
technology. This bias leads to underestimation of the expression of
long transcripts and overestimation of that of short ones. Therefore,
until protocols allow unbiased sampling of reads across the whole
transcript length, FPKMs should be used with caution when comparing
the expression of transcripts of different lengths. To overcome this
shortcoming, UMIs were designed and adopted in scRNA-seq protocols.
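
The twofold size-factor idea described above can be illustrated with a minimal Python sketch. The text does not specify how the size factors in [129] are estimated, so the estimator used here (each cell's total count divided by the mean total across cells) is a simple stand-in chosen for illustration, and the function names and toy counts are hypothetical.

```python
from statistics import mean


def size_factors(counts_per_cell):
    """Total-count size factors: each cell's total divided by the mean total.

    This is an assumed, simplified estimator, not necessarily the one
    used in the cited work.
    """
    totals = [sum(cell) for cell in counts_per_cell]
    average_total = mean(totals)
    if average_total <= 0:
        raise ValueError("counts must not all be zero")
    return [total / average_total for total in totals]


def normalize(counts_per_cell, factors):
    """Divide every count in a cell by that cell's size factor."""
    return [[count / f for count in cell] for cell, f in zip(counts_per_cell, factors)]


# Toy, made-up counts: rows are cells, columns are features.
spikein_counts = [[60, 40], [180, 120]]             # two spike-in species
endogenous_counts = [[30, 10, 20], [120, 40, 80]]   # three genes

# One size factor per cell from the spike-ins, one from the endogenous mRNAs.
sf_spikein = size_factors(spikein_counts)
sf_endogenous = size_factors(endogenous_counts)

# Each feature set is normalized with its own size factors.
norm_spikein = normalize(spikein_counts, sf_spikein)
norm_endogenous = normalize(endogenous_counts, sf_endogenous)

print(sf_spikein, sf_endogenous)
```

In this toy example the normalized spike-in counts are identical in the two cells, which is the behavior one would hope for if the spike-in size factor captured the technical differences between them.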

Applications of Single-Cell Sequencing for Multiomics 353