Article RESEARCH
[Panel a: schematic of amplicon reads. Linear amplification (GoT reads) and circular amplification (circGoT reads; PCR#2 Fw and PCR#2 Rv primers). In both designs the labelled read elements are Barcode, UMI, Primer, Shared and Target, sequenced as R1.fastq and R2.fastq.]
[Panel b: flow chart of the analysis pipeline.]
(1) Identification of reads with proper priming. Include reads with the known primer and 'shared' sequences, allowing a mismatch ratio ≤ m.
(2) Identification of cell barcodes (CB) within the whitelists (CB lists published by 10x).
(3) Replacement of CB that are not identical to the whitelist CB. Among candidate CBs that are at a Hamming distance of 1 from the whitelisted CB, compute the probability that the observed CB deviated from the whitelisted CB due to a sequencing error at the differing base, and replace the observed CB with the whitelisted CB when the probability exceeds 0.99.
(4) Deduplication of reads. Assess genotyping agreement among inter-duplicate reads (i.e. reads with the same CB and UMI).
(5) Analyze reads with CB that are also in the 10x scRNA-seq data.
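Steps (1), (3) and (4) of the flow chart can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the IronThrone implementation. All function names are hypothetical. The sketch assumes that the shared sequence directly follows the primer in the read, that base qualities are Phred+33 encoded, and that the replacement probability is a posterior over all whitelist candidates weighted by whitelist CB abundance and by the error probability at the differing base. The figure does not specify how that probability is computed, nor how duplicate disagreement is resolved, so the majority vote below is also an assumption.

```python
from collections import Counter, defaultdict


def mismatch_ratio(observed: str, expected: str) -> float:
    """Fraction of positions at which `observed` differs from `expected`."""
    if len(observed) < len(expected):
        return 1.0  # a truncated read cannot match the expected sequence
    mismatches = sum(o != e for o, e in zip(observed, expected))
    return mismatches / len(expected)


def has_proper_priming(read: str, primer: str, shared: str, m: float) -> bool:
    """Step (1): keep reads whose primer and shared sequences each have a
    mismatch ratio <= m. Assumes the shared sequence follows the primer."""
    start = len(primer)
    return (mismatch_ratio(read[:start], primer) <= m
            and mismatch_ratio(read[start:start + len(shared)], shared) <= m)


def phred_to_error(q_char: str, offset: int = 33) -> float:
    """Convert one FASTQ quality character to a base-call error probability."""
    return 10 ** (-(ord(q_char) - offset) / 10)


def correct_barcode(cb, qual, whitelist_counts, threshold=0.99):
    """Step (3): return a whitelisted CB for `cb`, or None if no candidate is
    supported strongly enough. `whitelist_counts` maps whitelisted CBs to their
    observed read counts (used as a prior; an assumption of this sketch)."""
    if cb in whitelist_counts:
        return cb
    weights = {}
    for i, base in enumerate(cb):
        p_err = phred_to_error(qual[i])
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = cb[:i] + alt + cb[i + 1:]  # Hamming distance 1
            if candidate in whitelist_counts:
                weights[candidate] = (whitelist_counts[candidate] + 1) * p_err
    total = sum(weights.values())
    if total == 0:
        return None
    best = max(weights, key=weights.get)
    return best if weights[best] / total > threshold else None


def deduplicate(reads, min_duplicates=1):
    """Step (4): collapse reads sharing a (CB, UMI) pair into one call.
    `reads` is an iterable of (cb, umi, genotype) tuples. Returns
    {(cb, umi): (majority_genotype, agreement_fraction)} for groups with at
    least `min_duplicates` reads. Ties go to the genotype seen first."""
    groups = defaultdict(list)
    for cb, umi, genotype in reads:
        groups[(cb, umi)].append(genotype)
    calls = {}
    for key, genotypes in groups.items():
        if len(genotypes) < min_duplicates:
            continue
        genotype, n = Counter(genotypes).most_common(1)[0]
        calls[key] = (genotype, n / len(genotypes))
    return calls
```

Here `m` corresponds to the mismatch ratio threshold and `min_duplicates` to the duplicate threshold varied in the other panels. Returning the agreement fraction rather than applying a cut-off leaves the handling of discordant duplicates to the caller.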
[Panel c: grid of plots. x axes: mutant CALR UMI fraction (0 to 1). y axes: total UMI, hg38 and mm10 (0 to 120k each). Column labels: no duplicate threshold, duplicate ≥ 2, duplicate ≥ 3. Row labels: mismatch ratio = 0, 0.2, 0.6. Legend: murine cell (MUT CALR), human cell (WT CALR), multiplets. Cell counts as extracted, three per row: 1291 / 1291, 1255 / 1291, 1251 / 1291; 1291 / 1291, 1259 / 1291, 1255 / 1291; 1291 / 1291, 1255 / 1291, 1251 / 1291.]
[Panel d: heat maps of precision, recall and F1 score. x axes: mismatch ratio threshold (0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6). y axes: duplicate threshold (1 to 10).]
[Panel e: mean decrease accuracy (0 to 75) for each variable: ratio of barcodes replaced with whitelist; averaged base error in amplicon reads; averaged base error in primer sequence; averaged base error in shared sequence; averaged base error in target sequence; number of total duplicates.]
[Panel f: cumulative errors (z-score) against mismatch ratio (0 to 0.6). Series: ratio of cell loss; out-of-bag errors of prediction.]
[Panel g: heat maps of ratio of cell loss (z-score) and out-of-bag errors (z-score). x axes: mismatch ratio threshold (0 to 0.6). y axes: duplicate threshold (1 to 10).]
Extended Data Fig. 2 | Optimization of parameters in processing
targeted amplicon sequences in the IronThrone GoT pipeline.
a, Representation of amplicon reads. b, Flow chart of the GoT analysis
pipeline (Methods). CB, cell barcode. c, Mouse (green) and human (blue)
genome alignment of 10x data (y axes) with genotyping data by GoT
(x axes) with various thresholds for minimum duplicate reads (across)
and maximum mismatch ratio (down). d, Results of precision, recall
and F1 score analysis for combinations of minimum duplicate reads and
maximum mismatch ratios. e, Measure of the importance of each variable
used for the calculation of splits in trees in random-forest classification
test. f, Ratio of cell loss and genotyping errors (z-score on y axis) based on
mismatch ratio thresholds (x axis). The area of intersection is highlighted
in grey around the mismatch ratio 0.2. g, Heat maps showing z-scores of
the number of filtered cells (left) and predicted error rates (right) from
random-forest classification tests for combinations of minimum duplicate
reads and maximum mismatch ratio thresholds.
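The precision, recall and F1 score reported in panel d can be illustrated with a short Python sketch. It assumes that the expected genotype of each cell follows from its species (panel c labels murine cells as MUT CALR and human cells as WT CALR) and that MUT is treated as the positive class. The caption does not state the positive class, and the function name is hypothetical.

```python
def precision_recall_f1(truth, predicted, positive="MUT"):
    """Compute precision, recall and F1 for per-cell genotype calls.

    `truth` and `predicted` are equal-length sequences of genotype labels.
    The choice of `positive` is an assumption of this sketch.
    """
    pairs = list(zip(truth, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example: three murine (MUT) cells and one human (WT) cell,
# with one murine cell miscalled as WT.
truth = ["MUT", "MUT", "MUT", "WT"]
predicted = ["MUT", "MUT", "WT", "WT"]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

Evaluating this function once per combination of duplicate threshold and mismatch ratio threshold would fill a grid of the kind shown in the heat maps.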