Computational Systems Biology Methods and Protocols.7z

scaling factor in different contexts. Two classes of methods are available to calculate normalization factors: control-based normalization and average-bulk normalization. The former assumes that the total expression level summed over a small group of genes is approximately the same across all samples. The latter assumes that most genes are not differentially expressed (DE) across the samples.

Control-based normalization uses RNA from a group of internal control genes or external spike-in RNA. The commonly used internal control genes are housekeeping genes, and spike-in RNA is usually artificial RNA added to the cell lysate. Since internal control genes and spike-in RNA may not be present in some data, average-bulk normalization is more commonly used because of its universality. Five average-bulk methods designed to normalize bulk RNA-seq data are library size, trimmed mean of M values (TMM), relative log expression (RLE), upper quartile, and the median of the ratios of observed counts, which is also referred to as the DESeq method (Fig. 2b). The DESeq method is included in the Bioconductor package DESeq [7] for the R environment. TMM, RLE, and upper quartile are included in the Bioconductor package edgeR [8] for the R environment.

Although many methods have been developed and improved, RNA-seq data normalization still lacks a satisfactory solution. Both control-based normalization and average-bulk normalization depend on assumptions that cannot be directly validated by experiments. As for internal control genes, the existence of housekeeping genes has been investigated in many previous studies, but none of them sampled human tissues completely. By integrating these results to remove false-positive genes caused by inadequate sampling, Zhang et al.
found only one gene common to 15 examined housekeeping gene datasets comprising 187 different tissue and cell types [9]. A shortcoming of that study, however, is that each dataset was normalized to its highest gene expression level for comparison. It is a logical paradox to rely on another normalization method when examining whether housekeeping genes are usable for data normalization.

The commonly used spike-in RNA is the External RNA Control Consortium (ERCC) RNA set, consisting of 92 polyadenylated transcripts with short 3′ poly(A) tails but without 5′ caps [10]. The transcripts are designed to cover a wide range of sequence lengths (273–2022 nt) and GC-content percentages (30.79–52.69%). The same quantity of ERCC RNA should be spiked into each sample prior to RNA reverse transcription. Risso et al. evaluated the performance of the ERCC method and concluded that it was not reliable enough to be used in standard global scaling or regression-based normalization procedures [11]. Although Risso et al. investigated the ERCC method in two very different datasets, the measures (PCA plot, RLE

316 Shan Gao
