5 Cluster Analysis
Currently, clustering in scRNA-seq data analysis relies in practice
on dimension reduction methods that visualize high-dimensional
data in a low-dimensional space. We assume that the gene expression
matrix is composed of $n$ samples by $m$ features ($m \gg n$) after
sample and feature reduction (Fig. 2a). The dimension reduction
methods transform $m$-dimensional points $x_1, x_2, \ldots, x_n$ into
$s$-dimensional points $y_1, y_2, \ldots, y_n$ ($m \gg s$). By observation of
samples in a two- or three-dimensional space, biologists cluster
single cells into different groups. The best-known dimension
reduction method is principal component analysis (PCA) [17];
other methods include independent component analysis (ICA)
[18], linear discriminant analysis (LDA) [19], multidimensional
scaling (MDS) [20], and t-distributed stochastic neighbor
embedding (t-SNE) [21]. The most commonly used method, t-SNE, is a
variation of the SNE method. The basic idea of t-SNE is to
minimize the Kullback-Leibler divergence (Formula 1) between
the joint probability $p_{ij}$ in the high-dimensional space
(Formula 2) and the joint probability $q_{ij}$ in the
low-dimensional space (Formula 4).
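$$\mathrm{KL}(P \,\|\, Q) = \sum_{i,j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{1}$$

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \quad \text{for } i \neq j \text{ and } p_{ii} = 0 \tag{2}$$

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \tag{3}$$

$$q_{ij} = \frac{\left(1 + d_{ij}^2\right)^{-1}}{\sum_{k \neq l} \left(1 + d_{kl}^2\right)^{-1}} \quad \text{for } i \neq j \text{ and } q_{ii} = 0 \tag{4}$$

$$d_{ij} = \lVert y_i - y_j \rVert \tag{5}$$

$$i, j, k, l \in \{1, \ldots, n\} \tag{6}$$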
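As a rough illustration of Formulas 1 to 5, the following plain-Python sketch computes the two joint distributions and their Kullback-Leibler divergence for a toy data set. The function names and the toy points are illustrative assumptions and do not come from any library. Each $\sigma_i$ is simply fixed by hand here, whereas t-SNE tunes it for every point, and no gradient descent is performed.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def conditional_p(xs, sigmas):
    """Formula 3: entry [i][j] holds p_{j|i}; the diagonal stays 0."""
    n = len(xs)
    cond = []
    for i in range(n):
        w = [0.0 if k == i
             else math.exp(-sq_dist(xs[i], xs[k]) / (2 * sigmas[i] ** 2))
             for k in range(n)]
        total = sum(w)
        cond.append([wk / total for wk in w])
    return cond

def joint_p(cond):
    """Formula 2: symmetrize the conditionals, with p_ii = 0."""
    n = len(cond)
    return [[0.0 if i == j else (cond[i][j] + cond[j][i]) / (2 * n)
             for j in range(n)] for i in range(n)]

def joint_q(ys):
    """Formulas 4 and 5: Student-t kernel on low-dimensional distances."""
    n = len(ys)
    w = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(ys[i], ys[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[wij / total for wij in row] for row in w]

def kl_divergence(p, q):
    """Formula 1, skipping zero terms (0 * log 0 is taken as 0)."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0)

# Toy data (assumed): two tight pairs of points in 3-D.
xs = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 5.0, 5.0], [5.1, 5.0, 5.0]]
P = joint_p(conditional_p(xs, sigmas=[1.0] * len(xs)))

# A 2-D layout that keeps the pairs together, and one that splits them.
ys_good = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
ys_bad = [[0.0, 0.0], [5.0, 5.0], [0.1, 0.0], [5.1, 5.0]]
print(kl_divergence(P, joint_q(ys_good)), kl_divergence(P, joint_q(ys_bad)))
```

The layout that preserves the neighborhoods should give the smaller divergence, which is the quantity that gradient descent would reduce further by moving the low-dimensional points.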
In Formulas 1 to 6, $x_1, x_2, \ldots, x_n$ represent columns of the
gene expression matrix (Fig. 2a), and Formulas 3 and 5 use Euclidean
distances. Minimizing Formula 1 by the gradient descent method yields
the final low-dimensional points $y_1, y_2, \ldots, y_n$. The
t-SNE method needs two important user-defined parameters. The
first one is the perplexity, defined as
$\mathrm{Perp}(P_i) = 2^{H(P_i)}$, where
$H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}$. The perplexity can be interpreted as a