Article
Analysis of data from Yost et al.^5
scRNA-seq counts and single-cell TCR sequences were obtained using
the GEOquery library in R and running the command getGEOSupp-
Files for GSE123813. Metadata consisting of the patient, treatment
status, and cluster assignment for each cell were obtained from the
corresponding authors for Yost et al.^5. We analysed expression data
that had matching metadata, resulting in 33,106 cells measured on
23,309 genes. Single-cell TCR sequences were available on 28,371
cells with matching metadata. Bulk TCR-seq sequences and template
counts were obtained from the ImmuneACCESS Web portal by Adap-
tive Biotechnologies (https://clients.adaptivebiotech.com/pub/yost-
2019-natmed).
The scTCR-seq data were provided as TCR-α and TCR-β nucleotide
and amino acid sequences without assigned clonotypes. We therefore
created clonotype groupings from pre- and post-treatment tumour
samples by matching all available TCR chains between cells of a given
patient, using a procedure analogous to our re-grouping of tumour,
NAT and blood clonotypes for our own dataset. Among the 11 patients
in the dataset, this procedure identified 17,369 distinct clonotypes.
Patient bcc.su003 had no clonotype matches between pre-treatment
and post-treatment tumour samples, and could not be analysed further.
The remaining patients had pre-treatment and post-treatment bulk
TCR-seq data from blood for five and six patients, respectively, with
patient bcc.su002 having no pre-treatment blood sample.
Bulk TCR-seq clonotypes were linked to tumour clonotypes by match-
ing the single TCR β-chain to any of the TCR β-chains from a clonotype
in the scTCR-seq data (Extended Data Fig. 7a). The bulk TCR-seq data
reports an 87-bp rearrangement sequence from each TCR β−chain that
includes the CDR3 sequence, so the scRNA-seq CDR3 (sc-CDR3-nt) was
required to match exactly at the nucleotide level to the corresponding
substring (bulk-CDR3-nt) in the bulk TCR-seq rearrangement. This
alignment process was facilitated by the fact that the immunoSEQ
output includes the amino acid CDR3 (bulk-CDR3-aa) sequence, which
we used to identify the corresponding bulk-CDR3-nt. We could then
create a hash table of all sc-CDR3-nt sequences and use that for lookup
from a given bulk-CDR3-nt sequence.
Because scTCR-seq clonotypes can contain TCR α-chains as well
as multiple TCR β-chains, the potential exists for a many-to-one cor-
respondence between a scTCR-seq clonotype from tumour and a bulk
TCR-seq clonotype from blood. For analyses requiring clones to be
classified by the presence or absence of a bulk TCR-seq match (that is,
blood association versus blood independence) (Fig. 3c), all matching
clonotypes were considered to have an association with blood. For
analyses requiring matching clone sizes between tumour and blood
(Fig. 3a, Extended Data Fig. 7b, c), multiple tumour clonotype matches
for a given blood clonotype were combined by summing the scTCR-seq
clone sizes and resolving multiple primary cell clusters by a majority
vote or leaving the primary cell cluster unassigned in case of a tie. For
the analysis comparing pre- and post-treatment bulk TCR sequences
(Extended Data Fig. 7d), clonotypes were matched when the entire
87-bp rearrangement sequences were identical.
Because bulk TCR-seq measures transcripts rather than cells, we
could not discriminate blood-expanded from non-expanded clones,
although we could distinguish between blood-independent and blood-
associated clones by the absence or presence of a matching transcript.
We assumed that if the number of transcripts per cell is relatively
uniform, the number of transcripts was approximately proportional
to the number of cells. Therefore, instead of measuring peripheral
clonal expansion, we measured blood clonal diversity using a Shan-
non entropy measure, the sum of p × log 2 (p), over all transcripts with
a count of 50 or more. This threshold was implemented because of the
heavily skewed distribution of transcript counts, with large numbers
of singleton transcripts that potentially introduced noise into the
calculations.
Completeness curves for Yost et al.^5
We used the transcripts field from each immunoSEQ analysis file for
bulk TCR-seq from a pre- or post-treatment blood sample as input to
the iNEXT procedure in the iNEXT package in R, with ‘abundance’ as
the datatype parameter. The resulting interpolated, observed, and
extrapolated values were then plotted using the m and SC variables in
the iNextEst object for each sample (Extended Data Fig. 7e).
Comparison of TCR sequences with VDJdb
We obtained the June 2018 (2018-06-04) release of VDJdb^54 (https://
github.com/antigenomics/vdjdb-db/releases). We analysed the file
vdjdb.slim.txt to extract entries where the TCR species was ‘Homo-
Sapiens’ and the epitope species was ‘CMV’, ‘EBV’ or ‘InfluenzaA’. We
compared each clonotype from our data against the cdr3 field in the
VDJdb database, requiring an exact match from any of the CDR3 amino
acid sequences for the clonotype.
Comparison of TCR sequences with TCGA
We obtained a set of 716,720 TCR sequences assembled from 9,142 bulk
RNA-seq samples in TCGA using the TRUST algorithm^55. We obtained
those sequences from the authors after gaining authorized access to
sequence-level data in TCGA. Each sequence was labelled with a puta-
tive CDR3 amino acid sequence in the header of the FASTA file. We
compared each clonotype from our data, requiring an exact match
from any of the CDR3 amino acid sequences for the clonotype against
any substring of the TCGA-derived CDR3 sequence. Each FASTA file was
annotated with an aliquot ID, which we linked to a cancer type using
the R library GenomicDataCommons. We considered a TCGA-derived
CDR3 sequence to be virally reactive when it was observed in three or
more individuals, all with different cancer types, and no two individuals
had the same cancer type.
Gene signatures for tissue expansion patterns
We applied a logistic regression procedure similar to that used for
finding characteristic biomarkers of T cell clusters, except that we
classified cells in tumour tissue on the basis of their parent clones
being tumour singletons, tumour multiplets, or dual-expanded clones.
The logistic regression procedure was performed on each gene within
each patient for each tissue expansion pattern using the scRNA-seq
assay data to test all available genes. The library size was included as a
covariate to help remove artefacts related to sequencing depth. Each
logistic regression fit yielded a t-statistic for a given gene in a given
tissue expansion pattern. For each tissue expansion pattern, t-statistics
were combined across patients using the Rank Product procedure in
the RankProd package in R, and using the resulting one-sided q-values
for ranking genes.
A total of 30 genes were selected for each gene expression signature.
Because gene signatures are intended for use in bulk RNA-seq data from
tumours, which contain non-T immune cells and non-immune cells, we
considered the problem that our genes were selected within a restricted
universe of T cells and that expression of our genes by non-T cells could
introduce noise. We therefore examined the co-expression patterns of
our gene signatures in bulk RNA-seq data from pre-treatment tumours,
obtained from three clinical trials, as discussed below (Extended Data
Fig. 9a). We computed the Pearson’s correlation coefficient of z-scores
for each pair of genes in each signature across patients in all three tri-
als, and performed hierarchical clustering of the correlation matrix.
We cut the hierarchical clustering at the first level and used the larger
of the two groups for the gene signatures. To compute a composite
score, we used the t-statistics from the logistic regression signatures
as weights. However, the relatively high degree of correlation of gene
expression meant that weighting had relatively minor effects, and
we obtained similar results by giving equal weighting to genes in our
gene signatures.