Computational Systems Biology Methods and Protocols.7z

the first group of methods produced very similar results, it is more convenient to use read counts for data normalization by simply taking the column sums of a gene expression matrix. Currently, most software tools and R packages (e.g., edgeR and DESeq) take the column sums of a gene expression matrix to obtain library sizes. However, normalization factors calculated using library size (total RNA) differ significantly from those calculated using cellular RNA or nuclear RNA (Table 2). Compared to the third group, the second group of methods was closer to the first group. Upper quartile from the second group reached the highest correlation with the ERCC method from the first group. In the third group, nuclear RNA was closest to pooled size factors. In conclusion, the normalization of scRNA-seq or RNA-seq data is still unsettled. Based on our studies, if ERCC data are not available, library size can be used instead of ERCC.
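The column-sum calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by edgeR or DESeq: the function names `library_sizes` and `size_factors`, the toy count matrix, and the choice to scale each library size by the mean library size are all assumptions made for the example. The matrix is assumed to have genes as rows and samples as columns.

```python
def library_sizes(counts):
    """Return the column sums of a genes-by-samples count matrix.

    `counts` is a list of rows (one row per gene); each column is one sample.
    """
    return [sum(column) for column in zip(*counts)]


def size_factors(counts):
    """Scale each library size by the mean library size (assumed convention)."""
    sizes = library_sizes(counts)
    mean_size = sum(sizes) / len(sizes)
    return [size / mean_size for size in sizes]


def normalize(counts):
    """Divide every count by the size factor of its sample (column)."""
    factors = size_factors(counts)
    return [[value / f for value, f in zip(row, factors)] for row in counts]


# Toy data for illustration only: 3 genes (rows) by 3 samples (columns).
counts = [
    [10, 20, 0],
    [0, 40, 5],
    [30, 60, 15],
]

sizes = library_sizes(counts)    # one total read count per sample
factors = size_factors(counts)   # relative sequencing depth per sample
normalized = normalize(counts)   # counts rescaled to a common depth
```

After this rescaling, every sample has the same total count (the mean library size), which is the sense in which library size acts as a normalization factor here.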

4 Fundamental Problems

Here, we present a schema to generalize four fundamental problems (Fig. 4). Besides data normalization and cluster analysis, sample and feature reduction are two other fundamental problems in the scRNA-seq data analysis. The normalized gene expression matrix is composed of n samples by m features, which can be genes, transcripts or exons (Fig. 2a). The scRNA-seq data from SMS (e.g., PacBio full-length transcriptome [15]) use transcripts as features, while the scRNA-seq data from NGS often use genes as features

Fig. 3 Correlation of different normalization methods. The hierarchical clustering used correlation distances (1 − Pearson correlation coefficient) between 265 samples containing ERCC RNA from Table 2. TMM, RLE, upper quartile, and DESeq were modified to process scRNA-seq data containing a high frequency of zeroes. Pooled represents pooled size factors [14]
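The correlation distance named in the caption of Fig. 3 can be written out as a short Python sketch. It assumes that each normalization method is represented by a vector of normalization factors, one value per sample; the method labels and numbers below are toy values for illustration and are not taken from Table 2. Only the pairwise distance is shown; the hierarchical clustering step itself is omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_distance(x, y):
    """Correlation distance: 1 minus the Pearson correlation coefficient."""
    return 1.0 - pearson(x, y)


# Hypothetical normalization factors for four samples (toy values).
factors_by_method = {
    "method_a": [1.0, 2.0, 3.0, 4.0],
    "method_b": [2.0, 4.0, 6.0, 8.0],  # proportional to method_a
    "method_c": [4.0, 3.0, 2.0, 1.0],  # reversed order of method_a
}

# Pairwise distance matrix between methods, keyed by (method, method).
names = sorted(factors_by_method)
distances = {
    (p, q): correlation_distance(factors_by_method[p], factors_by_method[q])
    for p in names
    for q in names
}
```

The distance ranges from 0 for perfectly correlated factor vectors to 2 for perfectly anti-correlated ones, so two methods that rank and scale samples in the same way fall into the same cluster.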

318 Shan Gao

