Nature - USA (2019-07-18)

Article RESEARCH


[Panel a: schematic of amplicon reads. GoT reads (linear amplification) and circGoT reads (circular amplification, with PCR#2 Fw and PCR#2 Rv primers). In both layouts the Barcode and UMI segments are labelled with R1.fastq, and the Primer, Shared and Target segments with R2.fastq.]

[Panel b: flow chart of the GoT analysis pipeline. Decision branches in the chart are labelled YES/NO.]

(1) Identification of reads with proper priming. Include reads with the known primer and 'shared' sequences, allowing a mismatch ratio ≤ m.
(2) Identification of cell barcodes (CBs) within the whitelists (CB lists published by 10x).
(3) Replacement of CBs that are not identical to a whitelist CB. Among candidate CBs that are at a Hamming distance of 1 from a whitelisted CB, compute the probability that the observed CB deviated from the whitelisted CB due to a sequencing error at the differing base, and replace the observed CB with the whitelisted CB when the probability exceeds 0.99.
(4) Deduplication of reads. Assess genotyping agreement among inter-duplicate reads (i.e. reads with the same CB and UMI).
(5) Analyze reads with CBs that are also in the 10x scRNA-seq data.
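Steps (1) and (3) of the flow chart can be illustrated with a minimal Python sketch. It is not the IronThrone GoT implementation. The function names, the Phred-based error model and the abundance prior with a pseudocount of 1 are assumptions made for illustration; the flow chart only states that a mismatch ratio ≤ m is allowed and that a CB is replaced when the probability exceeds 0.99.

```python
from typing import Dict, List, Optional

BASES = "ACGT"


def mismatch_ratio(observed: str, expected: str) -> float:
    """Fraction of positions at which `observed` differs from `expected`."""
    if not expected or len(observed) != len(expected):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(o != e for o, e in zip(observed, expected)) / len(expected)


def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def correct_barcode(
    cb: str,
    quals: List[int],
    whitelist_counts: Dict[str, int],
    min_posterior: float = 0.99,
) -> Optional[str]:
    """Return a whitelisted CB for `cb`, or None if no confident match exists.

    Assumed model (hypothetical): each whitelisted CB at Hamming distance 1 is
    weighted by (its read count + 1) times the error probability of the
    differing base; weights are normalised over all such candidates.
    """
    if len(quals) != len(cb):
        raise ValueError("one quality score per barcode base is required")
    if cb in whitelist_counts:
        return cb  # already whitelisted, nothing to replace
    weights: Dict[str, float] = {}
    for i, base in enumerate(cb):
        for alt in BASES:
            if alt == base:
                continue
            candidate = cb[:i] + alt + cb[i + 1:]
            if candidate in whitelist_counts:
                prior = whitelist_counts[candidate] + 1
                weights[candidate] = prior * phred_to_error(quals[i])
    total = sum(weights.values())
    if total == 0:
        return None
    best = max(weights, key=weights.get)
    return best if weights[best] / total > min_posterior else None
```

Under this assumed model, a read would pass step (1) when `mismatch_ratio` of its primer and shared segments is at most m. Note that an observed CB with a single whitelisted neighbour always receives a posterior of 1, so the 0.99 cut-off only matters when several whitelisted CBs compete.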

[Panel c: grid of scatter plots. x axis: Mutant CALR UMI fraction (0 to 1); y axes: Total UMI aligned to hg38 and to mm10 (0 to 120k). Columns: No duplicate threshold, Duplicate ≥ 2, Duplicate ≥ 3. Rows: mismatch ratio of 0, 0.2 or 0.6. Legend: Murine cell (MUT CALR), Human cell (WT CALR), Multiplets. Cell counts annotated on the plots, in the order extracted: 1291 / 1291, 1255 / 1291, 1251 / 1291; 1291 / 1291, 1259 / 1291, 1255 / 1291; 1291 / 1291, 1255 / 1291, 1251 / 1291.]

[Panel d: heat maps of Precision, Recall and F1 score. Axes: Mismatch ratio threshold (0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6) and Duplicate threshold (1 to 10).]

[Panel e: Mean decrease accuracy (0 to 75) for the variables: Ratio of barcodes replaced with whitelist; Averaged base error in amplicon reads; Averaged base error in primer sequence; Averaged base error in shared sequence; Averaged base error in target sequence; Number of total duplicates.]

[Panel f: Cumulative errors (Z-score) against Mismatch ratio (0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6). Series: Ratio of cell loss; Out-of-bag errors of prediction.]

[Panel g: heat maps of Ratio of cell loss (Z-score) and Out-of-bag errors (Z-score). Axes: Mismatch ratio threshold (0, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6) and Duplicate threshold (1 to 10).]

Extended Data Fig. 2 | Optimization of parameters in processing
targeted amplicon sequences in the IronThrone GoT pipeline.
a, Representation of amplicon reads. b, Flow chart of the GoT analysis
pipeline (Methods). CB, cell barcode. c, Mouse (green) and human (blue)
genome alignment of 10x data (y axes) with genotyping data by GoT
(x axes) with various thresholds for minimum duplicate reads (across)
and maximum mismatch ratio (down). d, Results of precision, recall
and F1 score analysis for combinations of minimum duplicate reads and
maximum mismatch ratios. e, Measure of the importance of each variable
used for the calculation of splits in trees in random-forest classification
test. f, Ratio of cell loss and genotyping errors (z-score on y axis) based on
mismatch ratio thresholds (x axis). The area of intersection is highlighted
in grey around the mismatch ratio 0.2. g, Heat maps showing z-scores of
the number of filtered cells (left) and predicted error rates (right) from
random-forest classification tests for combinations of minimum duplicate
reads and maximum mismatch ratio thresholds.
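The metrics named in panels d, f and g can be made concrete with a short Python sketch. It is illustrative only. The labelling convention (a species-derived truth label of "MUT" or "WT" per cell, and a genotype call that may be `None` when a cell is filtered out) is an assumption, as is the use of the population standard deviation for the z-scores; the caption does not specify either.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence, Tuple


def precision_recall_f1(
    truth: Sequence[str],
    calls: Sequence[Optional[str]],
    positive: str = "MUT",
) -> Tuple[float, float, float]:
    """Precision, recall and F1 score for the `positive` genotype label.

    A cell whose call is None (not genotyped) counts as a false negative
    when its truth label is `positive` (assumed convention).
    """
    tp = fp = fn = 0
    for t, c in zip(truth, calls):
        if c == positive and t == positive:
            tp += 1
        elif c == positive:
            fp += 1
        elif t == positive:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardise values to zero mean and unit (population) variance."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]
```

Evaluating `precision_recall_f1` once per combination of duplicate threshold and mismatch ratio threshold would give the kind of grid shown in panel d. Applying `z_scores` to cell loss and to error rates across thresholds would put both on a common scale, as in panels f and g.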