Computational Systems Biology Methods and Protocols.7z

obtained directly from the FastQC output. This metric indicates
whether there is a problem with the sequencing library generated
from an individual cell. A low fraction might indicate that RNA has
degraded, that there is external contamination, or that the cell was
inefficiently lysed.
The second metric is available when a spike-in control was
used. It is the ratio of the number of reads mapped to the endogenous
RNA to the number of reads mapped to the extrinsic spike-ins,
which can be computed from the FastQC output or directly
from the table of counts obtained from HTSeq. A high proportion of
reads mapped to the spike-ins would be indicative of a low quality
of RNA in the cell of interest and might be a reason to exclude these
cells from downstream analyses. However, this ratio could vary
noticeably because of cell-to-cell biological variability (e.g., if the
cells are in different phases of the cell cycle). Nevertheless, cells for
which this ratio is extremely discordant from the
remaining population are strong candidates for exclusion.
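
The ratio described above can be sketched in a few lines of Python. This is
only an illustration under stated assumptions, not the chapter's own
procedure: the count table is assumed to be already parsed into a nested
dictionary, spike-in features are assumed to share a name prefix (here
"ERCC-"), and the cutoff of three median absolute deviations (MADs) on the
log10 ratio is an arbitrary choice. The function names are hypothetical.

```python
import math
from statistics import median


def spike_in_ratios(counts, spike_prefix="ERCC-"):
    """Return {cell: endogenous reads / spike-in reads}.

    counts: {cell: {feature: read count}}, e.g. parsed from per-cell
    htseq-count tables. spike_prefix is an assumed naming convention.
    """
    ratios = {}
    for cell, features in counts.items():
        spike = endogenous = 0
        for feature, n in features.items():
            if feature.startswith("__"):
                continue  # htseq-count summary rows such as __no_feature
            if feature.startswith(spike_prefix):
                spike += n
            else:
                endogenous += n
        # No spike-in reads at all: the ratio is undefined, so use infinity.
        ratios[cell] = endogenous / spike if spike else math.inf
    return ratios


def discordant_cells(ratios, n_mads=3.0):
    """Return cells whose log10 ratio is more than n_mads MADs from the median."""
    logs = {c: math.log10(r) for c, r in ratios.items() if 0 < r < math.inf}
    center = median(logs.values())
    mad = median(abs(v - center) for v in logs.values())
    flagged = [c for c, v in logs.items() if abs(v - center) > n_mads * mad]
    # Cells with a zero or undefined ratio are always reported.
    flagged += [c for c in ratios if c not in logs]
    return sorted(flagged)
```

Because the ratio also reflects genuine biology such as cell-cycle phase,
the flagged cells are better treated as candidates for inspection than as
an automatic exclusion list.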
The last useful approach for identifying problematic cells is to
apply principal component analysis (PCA) to the read count matrix
or gene expression matrix. The expectation is that good-quality
cells cluster together and poor-quality cells are outliers. Note
that, in some cases, poor-quality cells may also cluster together to
form a second distinct population. For example, it has been
observed that poor-quality cells are often enriched in the expression
of mitochondrial genes [34], which can cause them to cluster
separately. This, therefore, stresses that outlier analyses must be
performed carefully to ensure that cells with physiologically relevant
differences are not inadvertently discarded. To prevent this,
one useful observation is that poor-quality cells typically display
extreme values of the two other metrics described above.
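
As a rough illustration of the PCA-based check, the sketch below scores
each cell on the first principal component and lists cells that lie far
from the bulk. It is a dependency-free sketch, not a recommended
implementation: the power iteration, its starting vector, the use of a
single component, and the three-MAD cutoff are all assumptions made for
brevity, and the function names are hypothetical. In practice a numerical
library would normally be used.

```python
import math
from statistics import median


def first_pc_scores(matrix, n_iter=200):
    """Score each cell on the first principal component.

    matrix: one row per cell, one column per gene (e.g. log counts).
    """
    n_cells, n_genes = len(matrix), len(matrix[0])
    # Center every gene (column) at zero mean across cells.
    means = [sum(row[j] for row in matrix) / n_cells for j in range(n_genes)]
    centered = [[x - m for x, m in zip(row, means)] for row in matrix]

    # Start from the centered cell with the largest norm (a heuristic choice).
    v = max(centered, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        return [0.0] * n_cells  # all cells identical: no variance to explain
    v = [x / norm for x in v]

    for _ in range(n_iter):
        # Power iteration: v <- X^T (X v), then renormalize.
        scores = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(centered[i][j] * scores[i] for i in range(n_cells))
             for j in range(n_genes)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return [sum(x * y for x, y in zip(row, v)) for row in centered]


def pc_outliers(scores, n_mads=3.0):
    """Return indices of cells more than n_mads MADs from the median score."""
    center = median(scores)
    mad = median(abs(s - center) for s in scores)
    return [i for i, s in enumerate(scores) if abs(s - center) > n_mads * mad]
```

The sign of a principal component is arbitrary, so only the distance from
the median score is used. Cells returned by pc_outliers are worth
cross-checking against the two other metrics before they are discarded,
since an outlying group may also be a genuine subpopulation.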

2.2 Normalization and Quantification


Besides QC, which is extremely important for scRNA-seq data
analyses, normalization is also a computational challenge for
scRNA-seq quantification. For bulk RNA-seq data, read counts
from different samples are typically standardized by transcript
length and sequencing depth, using measures such as FPKM (fragments
per kilobase per million fragments mapped) for paired-end reads and
RPKM (reads per kilobase per million reads mapped) for single-end reads.
However, standard methods for normalizing bulk RNA-seq reads make an
implicit assumption that the total amount of RNA processed in
each library is approximately the same or that the variation is
technical noise. This assumption is generally useful when relative
expression estimates are compared. In scRNA-seq, the normalization
procedure can substantially affect the interpretation of the
data, and thus special attention should be paid to it. There are two
categories of approaches, depending on whether unique molecular
identifiers (UMIs) are used.
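
To make the bulk-style standardization concrete, the following minimal
sketch computes FPKM for one sample. The function name and the dictionary
inputs are hypothetical; RPKM follows the same formula with read counts in
place of fragment counts.

```python
def fpkm(fragment_counts, lengths_bp):
    """Return {gene: FPKM} for a single sample.

    FPKM = fragments * 1e9 / (transcript length in bp * total mapped fragments)

    fragment_counts: {gene: fragments mapped to that gene}
    lengths_bp:      {gene: transcript length in base pairs}
    """
    total = sum(fragment_counts.values())
    if total == 0:
        raise ValueError("no mapped fragments in this sample")
    return {
        gene: n * 1e9 / (lengths_bp[gene] * total)
        for gene, n in fragment_counts.items()
    }
```

Multiplying every count in a sample by the same factor leaves its FPKM
values unchanged, because the library total is divided out. This is where
the implicit assumption enters: a cell that truly contains twice as much
RNA becomes indistinguishable from one that was simply sequenced more
deeply.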

352 Yungang Xu and Xiaobo Zhou
