Computational Systems Biology Methods and Protocols.7z

smooth measure of the effective number of neighbors. The performance
of t-SNE is fairly robust to changes in the perplexity, and typical
values are between 5 and 50. The second parameter is the low
dimension, s, which is arbitrarily set to two for convenient
visualization. However, no research has been conducted to investigate
what value the low dimension parameter should take.
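
The reading of perplexity as a smooth measure of the effective number of neighbors can be illustrated with a small sketch. It assumes the usual t-SNE definition, perplexity = 2^H(P), where H(P) is the Shannon entropy in bits of the neighbor distribution around one cell. The squared distances and the two bandwidths (sigma) below are hypothetical values chosen only for illustration.

```python
import math

def perplexity(probs):
    """Perplexity 2**H(P) of a discrete distribution, with H in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

def conditional_probs(sq_dists, sigma):
    """Gaussian neighbor probabilities for one point, from squared distances."""
    weights = [math.exp(-d / (2 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

# Ten equally likely neighbors count as ten effective neighbors.
uniform = perplexity([0.1] * 10)

# Hypothetical squared distances from one cell to six other cells.
sq_dists = [1.0, 1.5, 2.0, 8.0, 9.0, 25.0]
narrow = perplexity(conditional_probs(sq_dists, sigma=0.5))  # few effective neighbors
wide = perplexity(conditional_probs(sq_dists, sigma=5.0))    # nearly all six count

print(f"uniform: {uniform:.2f}, narrow: {narrow:.2f}, wide: {wide:.2f}")
```

A wider bandwidth spreads probability over more cells and raises the perplexity, which is why a target perplexity between 5 and 50 amounts to choosing how many neighbors each cell effectively has.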
We proposed that the low dimension parameter can be set to
the intrinsic dimensionality of the sparse gene expression matrix.
For the first time, we estimated the intrinsic dimensionality of
scRNA-seq data using six methods (Table 3): eigenvalue-based
estimation (EigValue) [17], maximum likelihood estimation (MLE)
[22], correlation dimension (CorrDim) [23], nearest neighbor
evaluation (NearNb) [24], geodesic minimum spanning tree (GMST)
[25], and packing numbers (PackingNumbers) [26]. These methods
were implemented using the MATLAB toolbox for dimensionality
reduction (available at http://lvdmaaten.github.io/drtoolbox/).
Although preliminary results were obtained using only a colon
cancer scRNA-seq dataset (Subheading 2), they still suggested that
the intrinsic dimensionality could be six, as estimated by EigValue.
Since the five methods other than EigValue are parameter-dependent,
further estimation needs to be performed with parameters tuned by
grid search rather than set arbitrarily. The final intrinsic
dimensionality must be determined by the combined use of six or
more methods.
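
The parameter dependence mentioned above can be sketched with one of the six estimators. The example below is a minimal sketch only: it assumes the MLE method takes the Levina-Bickel form, in which the estimate at each point is the inverse of the mean log ratio between the distance to the k-th nearest neighbor and the distances to the closer neighbors. It is not the MATLAB toolbox implementation, and the synthetic data, the function name, and the grid of k values are hypothetical.

```python
import math
import random

def mle_intrinsic_dim(points, k):
    """Levina-Bickel style MLE of intrinsic dimension, averaged over all points.

    Assumes there are no duplicate points (all distances are positive).
    """
    estimates = []
    for i, p in enumerate(points):
        # Distances from point i to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        t_k = dists[k - 1]
        # Sum of log ratios between the k-th and each closer neighbor distance.
        log_sum = sum(math.log(t_k / t_j) for t_j in dists[:k - 1])
        estimates.append((k - 1) / log_sum)
    return sum(estimates) / len(estimates)

# Synthetic data: a 2-D plane embedded linearly in 5-D space.
random.seed(0)
points = []
for _ in range(200):
    u, v = random.random(), random.random()
    points.append((u, v, u + v, u - v, 2 * u))

# A small grid over the neighborhood size k, the estimator's free parameter.
estimates = {k: mle_intrinsic_dim(points, k) for k in (10, 15, 20)}
for k, d in estimates.items():
    print(f"k={k}: estimated dimension {d:.2f}")
```

The estimate shifts with k even on clean synthetic data, which is the behavior that motivates scanning a parameter grid and comparing several estimators before settling on one value.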
At the time this chapter was written, the most popular application
of cluster analysis using scRNA-seq data was to identify cells in terms
of known or novel types and then perform downstream analyses
(e.g., differential expression or regulatory network analysis) based

Table 3
Estimating intrinsic dimensionality of scRNA-seq data


Method          Raw    Raw Gaussian  ERCC   ERCC Gaussian  DESeq  DESeq Gaussian
EigValue        7      6             5      6              7      6
MLE             19.00  19.92         15.27  19.92          22.28  19.92
CorrDim         0.47   2.89          0.56   2.89           0.51   2.89
NearNb          0.05   0             0.03   0              0.01   0
GMST            5.21   28.01         4.71   29.78          6.25   17.36
PackingNumbers  0      0             0      0              0      0

EigValue represents eigenvalue-based estimation. MLE represents maximum likelihood estimation. CorrDim represents
correlation dimension. NearNb represents nearest neighbor evaluation. GMST represents geodesic minimum spanning
tree. PackingNumbers represents packing numbers. Raw represents gene expression data using read counts of nuclear
RNA (Table 2). Raw Gaussian represents raw data processed by the standard Gaussianization. ERCC represents
data normalized using the ERCC method. ERCC Gaussian represents ERCC-normalized data processed by the standard
Gaussianization. DESeq represents data normalized using DESeq. DESeq Gaussian represents DESeq-normalized data
processed by the standard Gaussianization. The standard Gaussianization was performed on each row of the gene
expression matrix (Fig. 2a) by subtracting the row mean and dividing by the row standard deviation.
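
The standard Gaussianization described in the note to Table 3 can be sketched as a row-wise standardization. This is a minimal sketch under stated assumptions: rows are taken to be genes and columns cells, the sample standard deviation (n - 1 denominator) is used, and rows with constant expression are mapped to zeros. None of these choices is specified in the text, and the small count matrix is hypothetical.

```python
import statistics

def gaussianize_rows(matrix):
    """Standardize each row to zero mean and unit standard deviation."""
    result = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.stdev(row)  # sample SD (n - 1 denominator); an assumption
        if sd == 0:
            # A row with constant values has no variation; map it to zeros.
            result.append([0.0] * len(row))
        else:
            result.append([(x - mean) / sd for x in row])
    return result

# Hypothetical matrix of read counts: 3 rows (genes) by 4 columns (cells).
counts = [
    [0, 2, 4, 6],
    [10, 10, 10, 10],
    [1, 0, 0, 3],
]
scaled = gaussianize_rows(counts)
for row in scaled:
    print([round(x, 3) for x in row])
```

After this step every non-constant row has the same scale, so rows with large raw counts no longer dominate distance-based estimates.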


322 Shan Gao
