analysis includes raw data cleaning with quality control [3], read
alignment, generation of read counts, normalization, data filtering
with quality control, and downstream analyses. Currently, one of
the most popular applications of downstream analysis is to identify
cell types or states using cluster analysis. Although many existing tools
for processing bulk RNA-seq data can be used to process scRNA-
seq data with or without modification, scRNA-seq data analysis
poses several unique computational challenges that necessitate the
development of entirely new analytical methods.
In this chapter, we introduce and discuss the current state of
research on scRNA-seq data normalization (Subheading 3) and
cluster analysis (Subheading 5), which are the two most important
challenges in scRNA-seq data analysis. We also present a schema
that generalizes four fundamental problems (Subheading 4).
Preliminary results from our previous studies of these problems
are provided to guide researchers in their future studies. In
particular, we present a protocol to discover and validate cancer
stem cells (CSCs), which was first implemented by Lin Liu et al.
using a colon cancer scRNA-seq dataset (Subheading 2).
2 Experiment Design and Data Quality Control
Six commonly used scRNA-seq protocols are CEL-seq2, Drop-seq,
MARS-seq, SCRB-seq, Smart-seq, and Smart-seq2 (Table 1), the
performance of which has been evaluated in a comparative study
[4]. The performance measures included sensitivity (i.e., the
probability of capturing a particular mRNA transcript present in a
single cell and converting it into a cDNA molecule in the library), accuracy
(i.e., how well the read quantification corresponds to the actual
concentration of mRNA), precision (i.e., the technical variation of
the quantification), and cost. The authors of the comparative study
concluded that Smart-seq2 was the most sensitive and accurate
protocol, with a cost efficiency similar to that of the other five protocols. In
this chapter, we demonstrate all the research results from our
previous studies using a colon cancer scRNA-seq dataset (SRA:
SRP113436) provided by Lin Liu et al. This dataset includes 831
single-cell samples and 18 bulk samples sequenced using the Smart-seq2
scRNA-seq protocol. The 831 single-cell samples comprise 814 single
cells from colon tumor tissues and 17 single cells from distal tissues
(>10 cm) as control. The 18 bulk samples are nine samples from
colon tumor tissues and nine samples from distal tissues.
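The three performance measures defined above (sensitivity, accuracy, and precision) can be illustrated with a minimal Python sketch. It is not the procedure used in the comparative study [4]; it assumes simple proxies chosen for illustration: the detection rate of a transcript across cells for sensitivity, the Pearson correlation between log-scaled measured and known abundances for accuracy, and the coefficient of variation across technical replicates for precision. All function names and toy values are hypothetical.

```python
import math
from statistics import mean, stdev


def detection_rate(counts_per_cell):
    """Sensitivity proxy: fraction of cells in which a transcript is detected."""
    return sum(1 for c in counts_per_cell if c > 0) / len(counts_per_cell)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def accuracy(known_conc, measured):
    """Accuracy proxy: correlation of log-scaled measured vs. known abundance."""
    log_known = [math.log10(k + 1) for k in known_conc]
    log_meas = [math.log10(m + 1) for m in measured]
    return pearson(log_known, log_meas)


def coefficient_of_variation(replicates):
    """Precision proxy: technical variation as standard deviation / mean."""
    return stdev(replicates) / mean(replicates)


# Hypothetical toy values, not taken from the comparative study.
transcript_counts = [0, 2, 5, 0, 1]   # one transcript measured in five cells
known = [1, 10, 100, 1000]            # assumed known input abundances
measured = [0, 4, 35, 410]            # measured expression for the same transcripts
replicates = [98.0, 102.0, 100.0]     # one gene across technical replicates

print(f"sensitivity proxy (detection rate): {detection_rate(transcript_counts):.2f}")
print(f"accuracy proxy (log-log Pearson r): {accuracy(known, measured):.3f}")
print(f"precision proxy (CV): {coefficient_of_variation(replicates):.3f}")
```

A higher detection rate and a correlation closer to 1 indicate better sensitivity and accuracy, respectively, whereas a lower coefficient of variation indicates better precision.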
Beyond the choice among the six protocols, scRNA-seq experiment
design needs to consider other factors such as read length and
sequencing depth. Read length determines alignment quality, which in
turn affects the accuracy of quantitative analysis. In addition, paired-end
(PE) reads have advantages over single-end (SE) reads for genome
312 Shan Gao