Computational Systems Biology Methods and Protocols.7z

obtained directly from the FastQC output. This metric indicates whether there is a problem with the sequencing library generated from an individual cell. A low fraction might indicate that RNA has degraded, that there is external contamination, or that the cell was inefficiently lysed. The second metric is available when spike-in controls were used. It is the ratio of the number of reads mapped to endogenous RNA to the number of reads mapped to the extrinsic spike-ins, which can be computed from the FastQC output or directly from the table of counts obtained from HTSeq. A high proportion of reads mapped to the spike-ins indicates a low quality of RNA in the cell of interest and might be a reason to exclude that cell from downstream analyses. However, this ratio can vary noticeably from cell to cell because of biological variability (e.g., if the cells are in different phases of the cell cycle). Nevertheless, cells whose ratio is extremely discordant from that of the remaining population are strong candidates for exclusion. The last useful approach for identifying problematic cells is to apply principal component analysis (PCA) to the read count matrix or gene expression matrix. The expectation is that good-quality cells cluster together and poor-quality cells are outliers. Note that, in some cases, poor-quality cells may also cluster together to form a second distinct population. For example, it has been observed that poor-quality cells are often enriched in the expression of mitochondrial genes [34], which can cause them to cluster separately. Outlier analyses must therefore be performed carefully to ensure that cells with physiologically relevant differences are not inadvertently discarded. To prevent this, one useful observation is that poor-quality cells typically display extreme values of the two other metrics described above.
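The spike-in ratio check described above can be sketched as follows. This is a minimal illustration, not the chapter's own procedure: the nested-dictionary input layout, the function names `spike_in_ratios` and `flag_discordant`, the feature and cell IDs, and the cutoff of three median absolute deviations (MADs) are all assumptions made for the example. In practice the counts would be parsed from the HTSeq count table, and the cutoff should be chosen with the cell-to-cell biological variability in mind.

```python
from statistics import median


def spike_in_ratios(counts, spike_ids):
    """Return {cell: endogenous reads / spike-in reads}.

    counts: {feature_id: {cell_id: read_count}} (assumed layout).
    spike_ids: set of feature IDs that are spike-ins.
    """
    endogenous, spike = {}, {}
    for feature, per_cell in counts.items():
        target = spike if feature in spike_ids else endogenous
        for cell, n in per_cell.items():
            target[cell] = target.get(cell, 0) + n
    ratios = {}
    for cell in sorted(set(endogenous) | set(spike)):
        s = spike.get(cell, 0)
        # The ratio is undefined for a cell with no spike-in reads; skip it.
        if s > 0:
            ratios[cell] = endogenous.get(cell, 0) / s
    return ratios


def flag_discordant(ratios, n_mads=3.0):
    """Return cells whose ratio lies more than n_mads MADs from the median.

    The default of 3 MADs is an illustrative assumption, not a standard.
    """
    values = list(ratios.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return set()
    return {c for c, v in ratios.items() if abs(v - center) > n_mads * mad}


# Toy data (hypothetical IDs): cell c6 has very few endogenous reads.
counts = {
    "geneA": {"c1": 500, "c2": 550, "c3": 450, "c4": 525, "c5": 475, "c6": 20},
    "geneB": {"c1": 500, "c2": 550, "c3": 450, "c4": 525, "c5": 475, "c6": 30},
    "spike_1": {"c1": 100, "c2": 100, "c3": 100, "c4": 100, "c5": 100, "c6": 100},
}
ratios = spike_in_ratios(counts, spike_ids={"spike_1"})
flagged = flag_discordant(ratios)
print(ratios)
print(flagged)
```

Using the median and MAD rather than the mean and standard deviation keeps the reference values from being pulled toward the very cells being tested. A flagged cell is only a candidate for exclusion and should be checked against the other QC metrics before it is discarded.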

2.2 Normalization and Quantification

Besides QC, which is extremely important for scRNA-seq data analysis, normalization is another computational challenge for scRNA-seq quantification. For bulk RNA-seq data, read counts from different samples are typically standardized by transcript length and sequencing depth, using measures such as FPKM (fragments per kilobase per million fragments mapped) for paired-end reads and RPKM (reads per kilobase per million reads mapped) for single-end reads. However, these standards for normalizing bulk RNA-seq reads make an implicit assumption that the total amount of RNA processed in each library is approximately the same or that the variation is technical noise. This assumption is generally reasonable when relative expression estimates are compared. In scRNA-seq, the normalization procedure can substantially affect the interpretation of the data, and thus special care should be taken. There are two categories of approaches, depending on whether unique molecular identifiers (UMIs) are used.
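As a small worked illustration of the bulk-style normalization mentioned above, the sketch below computes RPKM as reads × 10^9 / (transcript length in bp × total mapped reads). FPKM uses the same arithmetic with fragments in place of reads. The function name `rpkm`, the gene IDs, and the toy counts and lengths are assumptions for the example only. The second call shows the implicit assumption at work: a library with twice as many reads for every gene yields identical values, so any real difference in total RNA content between the two samples is no longer visible.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads.

    counts: {gene: mapped read count} for one sample (assumed layout).
    lengths_bp: {gene: transcript length in base pairs}.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million).
    return {g: n * 1e9 / (lengths_bp[g] * total) for g, n in counts.items()}


# Toy data with hypothetical gene IDs.
lengths_bp = {"g1": 1000, "g2": 2000, "g3": 3000}
sample_a = {"g1": 100, "g2": 300, "g3": 600}
# Same relative composition, but twice as many reads for every gene.
sample_b = {g: 2 * n for g, n in sample_a.items()}

rpkm_a = rpkm(sample_a, lengths_bp)
rpkm_b = rpkm(sample_b, lengths_bp)
print(rpkm_a)
print(rpkm_b)
```

Because the values are scaled by each sample's own total, they support comparisons of relative expression only, which is why the choice of normalization needs more care when the RNA content differs from cell to cell.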

352 Yungang Xu and Xiaobo Zhou

