Computational Systems Biology Methods and Protocols.7z

2.2.1 Normalization of scRNA-seq Data Without UMIs

When both UMIs and extrinsic spike-ins are absent, as was mostly the case in the early days of single-cell sequencing, there is no choice but to use a bulk-based normalization strategy; more recent strategies are based on spike-ins. Without external spike-in controls, it is difficult to determine how much RNA is present in a cell, a quantity that varies from cell to cell. Because the spike-in RNAs are assumed to be added to every cell's library in the same amount, they make it possible to accurately estimate relative differences in total RNA content between cells. Specifically, the ratio of the number of reads mapped to the genome of interest to the number of reads mapped to the spike-ins is easy to calculate. When compared between cells, this ratio allows differences in the amount of RNA per cell to be inferred. Therefore, if spike-ins are available, the read counts associated with each gene can be converted into absolute numbers of mRNA molecules based on the levels of the spike-ins, which are of known concentration. However, spike-in controls are not a perfect solution for scRNA-seq read normalization. The most common sets of spike-ins, such as ERCC [120], are 500–2000 nucleotides (nts) in length, which is shorter than an average human mRNA (~2100 nts including untranslated regions [128]). Given the 3′ bias of scRNA-seq protocols, a conversion based on the shorter ERCC spike-ins is potentially problematic. Additionally, the spike-ins have comparatively short poly(A) tails and lack 5′ caps, so their degradation and reverse-transcription efficiency may differ from those of the endogenous mRNAs. Consequently, it is challenging to develop a generally applicable normalization strategy for scRNA-seq data that properly accounts for variability in both sequencing depth and cell size.
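The ratio and the read-to-molecule conversion described above can be illustrated with a minimal Python sketch. It assumes count matrices laid out as features x cells, the same known total number of spike-in molecules added to every cell, and no difference in capture efficiency or length bias between spike-ins and endogenous mRNAs (which, as noted above, rarely holds in practice). The function names, the toy counts, and the value of `spike_molecules_added` are hypothetical and chosen only for illustration.

```python
import numpy as np

def endogenous_to_spike_ratio(endo_counts, spike_counts):
    """Per-cell ratio of endogenous reads to spike-in reads.

    Both inputs are (features x cells) arrays of read counts.
    """
    endo_total = np.asarray(endo_counts, dtype=float).sum(axis=0)
    spike_total = np.asarray(spike_counts, dtype=float).sum(axis=0)
    if np.any(spike_total == 0):
        raise ValueError("every cell needs at least one spike-in read")
    return endo_total / spike_total

def estimate_molecules(endo_counts, spike_counts, spike_molecules_added):
    """Scale read counts to approximate molecule numbers.

    Assumption: each cell received `spike_molecules_added` spike-in
    molecules in total, and reads per molecule are the same for
    spike-ins and endogenous transcripts within a cell.
    """
    endo = np.asarray(endo_counts, dtype=float)
    spike_total = np.asarray(spike_counts, dtype=float).sum(axis=0)
    if np.any(spike_total == 0):
        raise ValueError("every cell needs at least one spike-in read")
    molecules_per_read = spike_molecules_added / spike_total
    # Broadcasting applies one scaling factor per cell (column).
    return endo * molecules_per_read

# Toy data (hypothetical): 2 genes x 2 cells and 2 spike-ins x 2 cells.
endo = np.array([[100, 400],
                 [300, 1200]])
spike = np.array([[50, 100],
                  [50, 100]])

ratio = endogenous_to_spike_ratio(endo, spike)
molecules = estimate_molecules(endo, spike, spike_molecules_added=1000.0)
print(ratio)
print(molecules)
```

In this toy case the two cells differ in both sequencing depth and RNA content: the second cell has twice as many spike-in reads (greater depth) and a ratio twice as high (more endogenous RNA relative to the constant spike-in input).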
In many cases, a sensible and pragmatic approach is to calculate two alternative size factors: one for the spike-ins and one for the endogenous mRNAs [129]. The former accounts solely for sequencing depth, whereas the latter reflects the relative differences in cell size. This twofold normalization strategy relies on the assumption that the normalized spike-ins can be used to estimate the degree of technical variability across the whole dynamic range of expression, which is the basic principle of spike-in control. Even if the technical noise is well accounted for by the spike-in control, transcript length-based normalization methods, such as FPKM or RPKM, are still problematic. In particular, although improvements have been made recently [37], several scRNA-seq protocols still show a noticeable 3′ bias, including smart set used by the popular Fluidigm technology. This bias leads to underestimation of the expression of long transcripts and overestimation of the expression of short ones. Therefore, until protocols allow unbiased sampling of reads across the whole transcript length, FPKMs should be used with caution when comparing the expression of transcripts of different lengths. To overcome this shortcoming, UMIs were designed and incorporated into scRNA-seq protocols.
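As a rough illustration of the two-size-factor idea, the sketch below computes one size factor per cell from the spike-in counts and a separate one from the endogenous counts, then normalizes each set with its own factor. It uses simple library-size scaling (each cell's total divided by the mean total across cells), which is an assumption made here for brevity and not necessarily the estimator used in [129]. The function names and toy counts are hypothetical.

```python
import numpy as np

def library_size_factors(counts):
    """Size factor per cell: column total divided by the mean column total."""
    totals = np.asarray(counts, dtype=float).sum(axis=0)
    if np.any(totals == 0):
        raise ValueError("every cell needs at least one read in this set")
    return totals / totals.mean()

def normalize_two_sets(endo_counts, spike_counts):
    """Normalize endogenous genes and spike-ins with separate size factors.

    Inputs are (features x cells) arrays of read counts.
    Returns normalized endogenous counts, normalized spike-in counts,
    and the two vectors of size factors.
    """
    endo = np.asarray(endo_counts, dtype=float)
    spike = np.asarray(spike_counts, dtype=float)
    endo_sf = library_size_factors(endo)    # computed from endogenous mRNAs
    spike_sf = library_size_factors(spike)  # computed from spike-ins only
    return endo / endo_sf, spike / spike_sf, endo_sf, spike_sf

# Toy data (hypothetical): 2 genes x 2 cells and 2 spike-ins x 2 cells.
endo = np.array([[100, 400],
                 [300, 1200]])
spike = np.array([[50, 100],
                  [50, 100]])

endo_norm, spike_norm, endo_sf, spike_sf = normalize_two_sets(endo, spike)
print(endo_sf, spike_sf)
```

After normalization, the spread of the spike-in values across cells gives an estimate of technical variability, since the spike-in input is assumed to be identical in every cell.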

Applications of Single-Cell Sequencing for Multiomics 353
