Science - USA (2021-12-17)


where no haplotype variation is known. To overcome these issues,
Giraffe includes mechanisms for creating synthetic haplotype paths.
When real haplotypes are available, these synthetic haplotype paths
represent local haplotype variation sampled according to haplotype
frequency, and we call the result a sampled GBWT (17). When no
haplotypes are available, we call the result a path cover GBWT. In
this case, the synthetic haplotypes represent random walks through
the graph. We evaluated the effects of running our mapping
evaluations with sampled and path cover GBWTs [fig. S8 and tables S7
and S8; (17)]. The mapping benefit of sampling more haplotypes
plateaued at 64 haplotypes for the 1000GP graph (which contains
around 5000 haplotypes), with higher accuracy than that achieved by
mapping to the full haplotype set. We used the HGSVC graph (which
contains just six haplotypes) for an experiment on generating path
covers without known haplotypes. Path covers alone did not outperform
the full underlying haplotype set for the HGSVC graph but came close
to matching its performance. We selected the 64-haplotype sampled
GBWT for the 1000GP graph and the full GBWT for the HGSVC graph as
the best-performing GBWTs, which we use in the rest of the analysis.
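
The two kinds of synthetic haplotype paths described above can be illustrated with a toy sketch. This is not Giraffe's or the GBWT library's implementation; the function names, the dictionary-based graph, and the single-bubble example are assumptions made purely for illustration. The first function draws local haplotype paths in proportion to their observed frequency (the idea behind a sampled GBWT), and the second generates random walks through the graph (the idea behind a path cover GBWT, used when no haplotypes are known).

```python
import random
from collections import Counter


def sample_by_frequency(local_haplotypes, n_samples, rng):
    """Draw n_samples local haplotype paths, each chosen with probability
    proportional to how often it occurs among the real haplotypes."""
    counts = Counter(local_haplotypes)   # path (tuple of node ids) -> count
    paths = list(counts)
    weights = [counts[p] for p in paths]
    return rng.choices(paths, weights=weights, k=n_samples)


def random_walk_cover(graph, start, n_paths, max_len, rng):
    """Generate n_paths random walks from `start` through a directed graph
    given as {node: [successor, ...]}. A walk stops at a node with no
    successors or once it reaches max_len nodes."""
    walks = []
    for _ in range(n_paths):
        node, walk = start, [start]
        while graph.get(node) and len(walk) < max_len:
            node = rng.choice(graph[node])
            walk.append(node)
        walks.append(tuple(walk))
    return walks


# Hypothetical toy graph: one bubble, 1 -> (2 or 3) -> 4.
graph = {1: [2, 3], 2: [4], 3: [4], 4: []}
# Hypothetical haplotypes: three take the 2-branch, one takes the 3-branch.
haplotypes = [(1, 2, 4)] * 3 + [(1, 3, 4)]

rng = random.Random(0)
sampled = sample_by_frequency(haplotypes, n_samples=8, rng=rng)
cover = random_walk_cover(graph, start=1, n_paths=8, max_len=10, rng=rng)
```

In this sketch the sampled paths can only be paths that real haplotypes take, whereas the random walks may follow any edge in the graph, which is the distinction the text draws between the two GBWT types.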

Giraffe improves pangenome mapping speed
We measured the runtime (Fig. 3, A and B) and memory usage (Fig. 3,
C and D) of Giraffe and competing tools when mapping real reads (17).
Giraffe was more than an order of magnitude faster than VG-MAP in all
conditions.

Sirén et al., Science 374, eabg8871 (2021) 17 December 2021 3 of 11


[Figure 2 image. Panel titles as labeled in the plot: (A) 1000GP/GRCh38
Single End; (B) 1000GP/GRCh38 Paired End; (C) HGSVC/GRCh38 Single End;
(D) HGSVC/GRCh38 Paired End; (E) Five-strain yeast/S.c. S288C Single
End; (F) Five-strain yeast/S.c. S288C Paired End. Axes: Log10 False
Discovery Rate [log10(1 - Precision)] and True Positive Rate (Recall).
Legend entries recovered: VG-MAP, BWA-MEM, Bowtie2, Minimap2, HISAT2,
GraphAligner.]

Fig. 2. Simulated read mapping. (A to F) Each panel shows recall versus
false discovery rate (or 1 minus precision) for a simulated read-mapping
experiment, comparing Giraffe with linear genome mappers (BWA-MEM,
Bowtie2, and Minimap2) and other genome graph mappers (VG-MAP,
GraphAligner, and HISAT2). Reads were simulated to match ~150-bp
Illumina NovaSeq (for human) or HiSeq 2500 (for yeast) reads, either as
single-end reads [(A) to (C)] or as paired-end reads [(D) to (F)] (17).
Results for each mapper are shown stratified by reported read-mapping
quality; the size of each point represents the log-scaled number of
reads with the corresponding mapping quality. Three different mapping
scenarios are assessed: [(A) and (D)] comparing mapping to a graph
derived from the 1000GP data to mapping to the linear reference genome
assembly upon which it is based (GRCh38); [(B) and (E)] comparing
mapping to a graph containing larger structural variants from the HGSVC
project to mapping to the GRCh38 assembly upon which it is based; and
[(C) and (F)] comparing mapping to a multiple sequence alignment–based
yeast graph to mapping to the single S.c. S288C linear reference, for
reads from the DBVPG6044 strain. For mapping with Giraffe, we used the
full GBWT that contains six haplotypes to map to the HGSVC graph and the
64-haplotype sampled GBWT to map to the 1000GP graph. "Giraffe primary"
represents mapping with Giraffe to the linear reference.
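
As a rough illustration of how curves like those in Fig. 2 can be computed, the sketch below derives one (recall, false discovery rate) point per mapping-quality threshold from simulated reads whose true positions are known. It rests on assumptions that the caption does not spell out: each point is taken to be cumulative over reads with mapping quality at or above the threshold, recall is taken as correctly mapped reads over all simulated reads, and the function name and input format are hypothetical.

```python
def recall_fdr_by_mapq(reads):
    """reads: iterable of (mapq, is_correct) pairs, one per simulated read.
    Returns a list of (threshold, recall, fdr), highest threshold first.
    Assumption: each point covers all reads with mapq >= threshold."""
    reads = list(reads)
    total = len(reads)
    points = []
    for t in sorted({m for m, _ in reads}, reverse=True):
        kept = [ok for m, ok in reads if m >= t]   # reads confident enough
        tp = sum(kept)                             # correctly mapped
        fp = len(kept) - tp                        # incorrectly mapped
        points.append((t, tp / total, fp / len(kept)))
    return points


# Hypothetical data: 10 simulated reads at two mapping-quality levels.
reads = ([(60, True)] * 7 + [(60, False)]
         + [(10, True)] + [(10, False)])
points = recall_fdr_by_mapq(reads)
```

For plotting on axes like those of the figure, the false discovery rate would then be log10-transformed, with zero values handled separately because their logarithm is undefined.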

RESEARCH | RESEARCH ARTICLE
