nt12dreuar3esd

(Sean Pound) #1

Article


in blood or more than one cell, respectively. To study the overlap of
clonotypes between blood and tissue, we also classified clonotypes
from blood samples as either blood non-expanded or expanded on the
basis of their blood clone sizes being one or more than one, respectively,
without regard to their NAT or tumour clone sizes.


Processing of scRNA-seq data
scRNA-seq data were processed with Cell Ranger software. Illumina
base call (BCL) files were converted to FASTQ files with the command
‘cellranger mkfastq’. Expression data were processed with ‘cellranger
count’ on the pre-built human reference set of 30,727 genes. Cell Ranger
performed default filtering for quality control, and produced for each
sample a barcodes.tsv, genes.tsv, and matrix.mtx file containing counts
of transcripts for each gene in each single cell, which we parsed using
a script in R to create a count matrix for each sample. Single cells were
identified by barcodes that were consistent with those from the scTCR-
seq dataset. In total, scRNA-seq data were obtained on 541,863,094
transcripts in 200,626 immune cells.


Cluster analysis of combined immune cells
Count matrices for each sample were processed using Seurat using
the SCTransform procedure with default parameters to perform a
regularized negative binomial regression based on the 3,000 most
variable genes. Normalized datasets for the tumour, NAT and blood
(if available) samples for each patient were combined using the Find-
IntegrationAnchors and IntegrateData functions in Seurat with the
default value of 30 dimensions.
The resulting datasets for the 200,626 immune cells from all 14
patients were then combined using the FindIntegrationAnchors and
IntegrateData functions in Seurat with the default value of 30 dimen-
sions. This process required 130 GB of computer memory. The inte-
grated dataset was scaled and processed under principal components
analysis using the ScaleData and RunPCA functions in Seurat. Two-
dimensional map coordinates were generated using the RunUMAP
procedure. Cluster analysis was performed using the FindNeighbours
procedure and the FindClusters procedure at a resolution of 1.6, larger
than the default value of 0.8, in order to obtain a finer resolution of
cell subtypes, and yielded 33 clusters of both T and non-T cells. The
FindMarkers procedure in Seurat was run on each cluster to obtain
biomarker genes upregulated in that cluster. The default value of 0.25
for logfc.threshold was found to generate fewer than 20 genes for some
clusters, so the procedure was run with a value of 0.10, at the expense
of greater compute time.


Computational separation of T and non-T cells
Clusters identified from all 200,626 immune cells were characterized
by their mean expression of CD3E and their fraction of cells having a
clonotype identified from the scTCR-seq assay (Extended Data Fig. 1a).
T cell clusters were identified as having high levels of both measures,
and non-T cell clusters were those that had low levels of both measures.
Some clusters had intermediate values and their biological meaning
was inferred by their biomarker genes from the FindMarkers proce-
dure (Supplementary Table 2). Cluster 28 (Extended Data Fig. 1a, c)
was determined to represent plasma cells based on exclusively high
expression of MZB1 (encoding pERp1, plasma cell-induced resident
endoplasmic reticulum protein)^46 ,^47. Cluster 17 was determined to
represent natural killer cells, which had gene expression similar to
effector T cells, with cytotoxic biomarkers, such as NKG7, GZMB and
PRF1. However, the origin of cells in this cluster was heavily biased
towards CD45-selected samples, and the fraction of TCR clonotypes
was much lower than for other T cell clusters. The 20% of cells with TCR
clonotypes in this cluster were thought to be from NKT cells having
gene expression similar to that of natural killer cells. Cluster 29 was
determined to represent macrophages expressing TCR receptors, a
phenomenon that has been observed by others^48 ,^49. The identity of this


cluster as representing macrophages was evidenced by high expression
of SAT1, which is increased in granulocytes and macrophages, but not
in T, B or natural killer cells, as displayed in the Human Protein Atlas^50
at http://proteinatlas.org/ENSG00000130066-SAT1/tissue. Clusters
11, 16, 19, 21 and 27 with intermediate CD3E expression but clonotype
fractions above 40% were determined to represent T cells (Extended
Data Fig. 1a, b).
In support of this separation process, Extended Data Fig. 1i shows
that CD3-selected samples had relatively few cells classified as non-T
cells, and that CD45-selected samples all contributed to the collection
of T cells. Furthermore, subsequent cluster analyses of T and non-T
cells separately, as described below, confirmed our separation process,
resulting in no outlier clusters that were classified inappropriately.

Cluster analyses of T and non-T cells
On the basis of the above separation of T and non-T cells, the entire
process of data transformation, integration and cluster analysis was
repeated on the two divisions separately. The procedure was the same,
with two differences. First, the parameter features.to.integrate = all.
features was given to IntegrateData, where all.features was the inter-
section of genes from the SCTransform procedure across all patients,
in order to maximize the genes stored by Seurat into the integrated
assay, rather than limiting them to the genes stored as var.features.
Second, the resolution for the FindClusters procedure was set to its
default value of 0.8.
One cluster of 3,182 T cells was found to have two discrete com-
ponents in the UMAP representation, as well as a portion of 60 cells
that clustered with a mixture of cells from other clusters (the region
to the left 4.6b-Treg in Fig. 2a). The latter cells were considered pos-
sible contaminating or low quality cells and assigned no cluster. A
division of the remaining cells was made provisionally, applying a
k-means algorithm on the UMAP two-dimensional coordinates using
the k-means procedure in R. Because subsequent analyses revealed
distinct gene expression patterns for the subclusters, the division was
therefore kept, and constituted the clusters reported as ‘8.4-Chrom’
and ‘8.6-KLRB1’.

Identification of markers
To characterize each cluster, we applied both the FindMarkers proce-
dure in Seurat, which identified markers using log fold changes of mean
expression, as well as a logistic regression procedure, which identified
biomarkers that separated cells of a given cluster from other cells. The
logistic regression procedure was performed on each gene within each
patient for each cluster using the SCTransform assay data, and yielded
a signed deviance value, where the sign depended on the coefficient
for the gene expression term, as well as a t-statistic. For each cluster,
signed deviance values were combined across patients using the Rank
Product procedure in the RankProd package in R, and using the result-
ing one-sided q-values for ranking genes.

Gene set enrichment analysis
Gene sets from MSigDB^51 ,^52 v.7.0 were downloaded as Entrez gene iden-
tifiers in GMT format from http://software.broadinstitute.org/gsea/
downloads.jsp and parsed using the getGmt procedure from the GSEA-
Base package in R. For each cluster from the cluster analysis of T cells
and each of the 22,596 gene sets, we performed a one-sided z-test on
the t-statistics from the logistic regression procedure in ‘Identification
of markers’ section. This statistical test was previously proposed^53 as an
alternative to enrichment scores, provided that genes are given a value
that has zero mean and unit variance, which should be satisfied theo-
retically by t-statistics and verified by inspection of quantile–quantile
plots. To avoid spurious results, we considered only gene sets with five
or more genes matching those in our scRNA-seq dataset. Inspection
of results showed high ranking gene sets from clusters ‘8.4-Mitosis’
and ‘8.5-Chrom’ against Gene Ontology gene sets corresponding to
Free download pdf