Article RESEARCH

[Figure, panels a-g; labels recovered from the image]
a, Amplicon read schematics for GoT reads (linear amplification) and circGoT reads (circular amplification; PCR#2 Fw, PCR#2 Rv). Labels: Barcode, UMI, Primer, Shared, Target, R1.fastq, R2.fastq.
b, Flow chart:
(1) Identification of reads with proper priming. Include reads with the known primer and 'shared' sequences, allowing a mismatch ratio ≤ m.
(2) Identification of cell barcodes (CB) within the whitelists (CB lists published by 10x).
(3) Replacement of CB that are not identical to the whitelist CB. Among candidate CBs that are 1 Hamming distance away from the whitelisted CB, compute the probability that the observed CB deviated from the whitelisted CB due to a sequencing error at the differing base, and replace the observed CB with the whitelisted CB when the probability exceeds 0.99.
(4) Deduplication of reads. Assess genotyping agreement among inter-duplicate reads (i.e. reads with the same CB and UMI).
(5) Analyze reads with CB that are also in the 10x scRNA-seq data.
c, Scatter plots for no duplicate threshold, duplicate ≥ 2 and duplicate ≥ 3, at mismatch ratio = 0, 0.2 and 0.6. Axis labels: Total UMI (hg38, mm10; 0 to 120k); Mutant CALR UMI fraction (0 to 1). Legend: Murine cell, MUT CALR; Human cell, WT CALR; Multiplets. Cell counts per panel, in extracted order: 1291/1291, 1255/1291, 1251/1291; 1291/1291, 1259/1291, 1255/1291; 1291/1291, 1255/1291, 1251/1291.
d, Heat maps: Precision, Recall, F1 score. Axes: Mismatch ratio threshold (0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6); Duplicate threshold (1 to 10).
e, Mean decrease accuracy for: Ratio of barcodes replaced with whitelist; Averaged base error in amplicon reads; Averaged base error in primer sequence; Averaged base error in shared sequence; Averaged base error in target sequence; Number of total duplicates.
f, Cumulative errors (z-score) against Mismatch ratio. Curves: Ratio of cell loss; Out-of-bag errors of prediction.
g, Heat maps: Ratio of cell loss (z-score); Out-of-bag errors (z-score). Axes: Mismatch ratio threshold; Duplicate threshold.

Extended Data Fig. 2 | Optimization of parameters in processing
targeted amplicon sequences in the IronThrone GoT pipeline.
a, Representation of amplicon reads. b, Flow chart of the GoT analysis
pipeline (Methods). CB, cell barcode. c, Mouse (green) and human (blue)
genome alignment of 10x data (y axes) with genotyping data by GoT
(x axes) with various thresholds for minimum duplicate reads (across)
and maximum mismatch ratio (down). d, Results of precision, recall
and F1 score analysis for combinations of minimum duplicate reads and
maximum mismatch ratios. e, Measure of the importance of each variable
used for the calculation of splits in trees in random-forest classification
test. f, Ratio of cell loss and genotyping errors (z-score on y axis) based on
mismatch ratio thresholds (x axis). The area of intersection is highlighted
in grey around the mismatch ratio 0.2. g, Heat maps showing z-scores of
the number of filtered cells (left) and predicted error rates (right) from
random-forest classification tests for combinations of minimum duplicate
reads and maximum mismatch ratio thresholds.
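
The two thresholds varied in this figure, and the scores in panel d, can be illustrated with a minimal Python sketch. It is not the IronThrone implementation. The function names are hypothetical, the mismatch ratio is assumed to be the fraction of differing positions between an observed and an expected sequence of equal length, and the default values (0.2 and 2) are taken from the panels for illustration only. Precision, recall and F1 follow their standard definitions; how true and false positives were assigned in the species-mixing experiment is not specified here.

```python
from collections import Counter


def mismatch_ratio(observed: str, expected: str) -> float:
    """Fraction of positions at which `observed` differs from `expected`.

    Assumption: both sequences have equal length and are compared
    position by position (no indels).
    """
    if len(observed) != len(expected) or not expected:
        raise ValueError("sequences must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(observed, expected))
    return mismatches / len(expected)


def passes_priming(observed: str, expected: str, max_ratio: float = 0.2) -> bool:
    """Keep a read only if its mismatch ratio is at most `max_ratio`."""
    return mismatch_ratio(observed, expected) <= max_ratio


def umis_meeting_duplicate_threshold(reads, min_duplicates: int = 2):
    """Return the (cell barcode, UMI) pairs seen at least `min_duplicates` times.

    `reads` is an iterable of (cell_barcode, umi) tuples, one per read.
    """
    counts = Counter(reads)
    return {key for key, n in counts.items() if n >= min_duplicates}


def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Under these assumptions, raising `max_ratio` admits more reads at the cost of more base errors, and raising `min_duplicates` discards UMIs supported by few reads, which is the trade-off that panels d, f and g explore.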