using default Giraffe and VG-MAP. For comparison, we mapped the same
reads to GRCh38 with BWA-MEM.
Genotyping accuracy
We compared the performance of Giraffe, VG-MAP, Illumina’s Dragen
platform, and BWA-MEM for genotyping SNVs and short indels. The design
of each calling pipeline is described in section S4 of the
supplementary materials (17), and the parameters and indexes for each
experiment are described in table S22. The variants produced by each
pipeline were compared against the GIAB v4.2.1 HG002 high-confidence
variant-calling benchmark (24) using the RealTimeGenomics vcfeval tool
(46) and Illumina’s hap.py tool (47). This benchmark set covers 92.2%
of the GRCh38 sequence.
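Benchmark comparisons of this kind are usually summarized as precision, recall, and F1 computed from counts of true-positive, false-positive, and false-negative calls. The sketch below shows that arithmetic only. The function name and the counts are hypothetical, and the exact way each comparison tool matches calls to benchmark variants is not reproduced here.

```python
def benchmark_metrics(true_positives: int, false_positives: int,
                      false_negatives: int) -> dict:
    """Precision, recall, and F1 from variant-comparison counts."""
    called = true_positives + false_positives   # all calls made by the pipeline
    truth = true_positives + false_negatives    # all variants in the benchmark
    precision = true_positives / called if called else 0.0
    recall = true_positives / truth if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for illustration; these are not results from the study.
metrics = benchmark_metrics(true_positives=990, false_positives=10,
                            false_negatives=20)
```

Because F1 is the harmonic mean of precision and recall, a pipeline cannot score well on it by trading one entirely for the other.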
We also evaluated a DeepVariant (25) pipeline that uses Giraffe
mappings (17). Using the default DeepVariant 1.1.0–trained model, we
tested genotyping of the HG003 sample across the entire genome. This
sample was not used in training the model.
Generalization to yeast
To evaluate Giraffe’s performance on more diverged, nonhuman data, we
used a yeast graph built from a Cactus multiple sequence alignment for
five strains of the S. cerevisiae and S. paradoxus yeasts (8). For the
corresponding negative-control primary graph, we used the S.c. S288C
assembly. We collected basic statistics about the yeast graph and
decomposed the graph for analysis using the method of (27). We
simulated 500,000 read pairs from a held-out S. cerevisiae yeast
strain, DBVPG6044, not included in the yeast graph, using an error and
length model for Illumina HiSeq 2500 reads (17).
SV genotyping
We built an SV pangenome from the HGSVC (22), GIAB (1), and SVPOP (28)
sequence-resolved catalogs. After filtering out erroneous duplicates
using a remapping approach, the SVs were iteratively inserted in the
genome graph to minimize the effect of errors and redundancy in the
catalog. The SVs were then genotyped across 5202 genomes by aligning
short-read sequencing data using Giraffe with a workflow description
language (WDL) workflow that we deposited in Dockstore (48). Two
thousand samples were selected from the MESA cohort to maximize sample
diversity. The remaining 3202 samples are from the 1000 Genomes
Project and include 2504 unrelated individuals. The trios available in
this latter dataset were used to compute the rate of Mendelian
concordance in the genotypes.
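One common way to define Mendelian concordance is the fraction of trio genotypes in which the child's two alleles can be explained by one allele from each parent. The sketch below implements that definition for diploid autosomal sites. It is an illustration under that assumption, not the procedure used in the study, and the function names and example genotypes are hypothetical.

```python
from itertools import product


def is_mendelian_consistent(child, mother, father):
    """True if the child's unordered genotype can be formed by taking
    one allele from the mother and one from the father."""
    child_sorted = tuple(sorted(child))
    return any(tuple(sorted((m, f))) == child_sorted
               for m, f in product(mother, father))


def mendelian_concordance(trios):
    """Fraction of fully genotyped (child, mother, father) genotype
    triples that are Mendelian-consistent; missing calls are skipped."""
    evaluated = [trio for trio in trios
                 if all(allele is not None for gt in trio for allele in gt)]
    if not evaluated:
        return float("nan")
    consistent = sum(is_mendelian_consistent(*trio) for trio in evaluated)
    return consistent / len(evaluated)


# Hypothetical genotypes at one SV site (0 = reference, 1 = SV allele).
trios = [
    ((0, 1), (0, 0), (0, 1)),        # consistent
    ((1, 1), (0, 1), (0, 1)),        # consistent
    ((1, 1), (0, 0), (0, 1)),        # inconsistent: mother carries no SV allele
    ((0, 1), (None, None), (0, 1)),  # skipped: mother not genotyped
]
rate = mendelian_concordance(trios)
```

Skipping trios with missing calls keeps the rate from being diluted by sites that could not be evaluated.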
The different SV alleles observed in the population were clustered
into SV sites based on their reciprocal overlap (for deletions) and
sequence similarity (for insertions). We used the frequency profile
across alleles within an SV site to identify the major allele and to
fine-tune variants with near duplicates in the combined catalog that
may have been due to errors. Each variant was then annotated with its
presence in existing SV databases (28, 32, 33), its repeat content,
and its location relative to gene annotations. We also compared the
frequency distributions across the SV databases and how well the
frequency estimates matched for variants shared across databases.
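Reciprocal overlap between two deletions is the length of their shared interval divided by the length of the longer deletion, so a high value means each deletion covers most of the other. The sketch below shows the calculation on half-open coordinates. The 0.8 threshold, the function names, and the coordinates are hypothetical; the threshold used in the study is not stated in this section.

```python
def reciprocal_overlap(start_a, end_a, start_b, end_b):
    """Reciprocal overlap of two half-open intervals [start, end):
    shared length divided by the length of the longer interval."""
    shared = max(0, min(end_a, end_b) - max(start_a, start_b))
    longest = max(end_a - start_a, end_b - start_b)
    return shared / longest if longest > 0 else 0.0


def same_deletion_site(del_a, del_b, min_reciprocal_overlap=0.8):
    """Decide whether two deletions, given as (start, end) tuples, would
    be grouped into one site. The 0.8 default is an assumed value."""
    return reciprocal_overlap(*del_a, *del_b) >= min_reciprocal_overlap


# Hypothetical coordinates on the same chromosome.
near_duplicates = reciprocal_overlap(100, 200, 110, 210)
different_sizes = reciprocal_overlap(100, 200, 150, 400)
```

Dividing by the longer interval is equivalent to requiring the overlap fraction to hold in both directions, which prevents a short deletion nested inside a much longer one from being merged with it.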
PCA was performed on the SV genotypes, and principal components were
compared with those produced from SNV-indel genotypes. We defined
strong intercluster or inter-superpopulation frequency patterns by a
frequency in any cluster or superpopulation differing by more than 10%
from the median frequency across all of them. For the 2000 MESA
samples, the clusters were defined using hierarchical clustering on
the first three principal components. For the 1000 Genomes Project, we
used their “superpopulation” assignments. Permutations were used to
contrast the number of SVs with such patterns with an expected
baseline.
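The frequency-pattern criterion above can be written as a short check on per-group allele frequencies. This sketch assumes that "10%" means an absolute difference of 0.10 in allele frequency, which is one reading of the text. The function name and the frequencies are hypothetical.

```python
from statistics import median


def has_strong_frequency_pattern(freq_by_group, min_difference=0.10):
    """True if the allele frequency in any group differs from the median
    frequency across all groups by more than min_difference.
    Assumes the threshold is an absolute frequency difference."""
    center = median(freq_by_group.values())
    return any(abs(freq - center) > min_difference
               for freq in freq_by_group.values())


# Hypothetical SV allele frequencies per 1000 Genomes superpopulation.
population_specific = {"AFR": 0.35, "AMR": 0.12, "EAS": 0.10,
                       "EUR": 0.11, "SAS": 0.09}
evenly_spread = {"AFR": 0.10, "AMR": 0.12, "EAS": 0.15,
                 "EUR": 0.11, "SAS": 0.09}

flag_specific = has_strong_frequency_pattern(population_specific)
flag_even = has_strong_frequency_pattern(evenly_spread)
```

Comparing against the median rather than the mean keeps a single outlying group from shifting the reference value it is being compared with.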
Finally, we examined the SV genotypes in a subset of the samples that
had gene-expression data available from the GEUVADIS consortium (34).
MatrixEQTL (49) identified SV-eQTLs while controlling for sex and
population structures, as summarized by the first four principal
components. Separate analyses of the four European-ancestry
populations together and the YRI population alone were performed
similarly. In addition, we performed a joint eQTL analysis with
publicly available SNVs and indels (31). We used permutation to
compute enrichment of SV-eQTLs in gene regions, gene families, or
among lead-eQTLs (those with the strongest association for a gene).
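A permutation-based enrichment test of the kind mentioned above can be sketched by shuffling the eQTL labels across variants and counting how often the shuffled overlap with a region annotation reaches the observed overlap. The permutation scheme used in the study is not described in this section, so the label-shuffling design, the function name, and the toy data below are all assumptions.

```python
import random


def permutation_enrichment(is_eqtl, in_region, n_permutations=999, seed=0):
    """Return the observed number of eQTL variants inside the region
    annotation and a one-sided permutation p-value obtained by shuffling
    the eQTL labels across variants (an assumed null model)."""
    rng = random.Random(seed)
    observed = sum(int(e and r) for e, r in zip(is_eqtl, in_region))
    shuffled = list(is_eqtl)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        permuted = sum(int(e and r) for e, r in zip(shuffled, in_region))
        if permuted >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the p-value is never zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy data: 100 variants, of which the first 10 are eQTLs and the same 10
# lie in the annotated regions (an extreme, hypothetical enrichment).
is_eqtl = [i < 10 for i in range(100)]
in_region = [i < 10 for i in range(100)]
observed, p_value = permutation_enrichment(is_eqtl, in_region)
```

Shuffling labels preserves both the number of eQTLs and the number of annotated variants, so the null distribution reflects only their chance co-occurrence.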
REFERENCES AND NOTES
1. J. M. Zook et al., A robust benchmark for detection of germline large deletions and insertions. Nat. Biotechnol. 38, 1347–1355 (2020). doi:10.1038/s41587-020-0538-8; pmid: 32541955
2. M. Mahmoud et al., Structural variant calling: The long and the short of it. Genome Biol. 20, 246 (2019). doi:10.1186/s13059-019-1828-7; pmid: 31747936
3. J. Ebler, A. Schönhuth, T. Marschall, Genotyping inversions and tandem duplications. Bioinformatics 33, 4015–4023 (2017). doi:10.1093/bioinformatics/btx020; pmid: 28169394
4. D. M. Church et al., Modernizing reference genome assemblies. PLOS Biol. 9, e1001091 (2011). doi:10.1371/journal.pbio.1001091; pmid: 21750661
5. The Computational Pan-Genomics Consortium, Computational pan-genomics: status, promises and challenges. Brief. Bioinform. 19, 118–135 (2016). doi:10.1093/bib/bbw089
6. R. M. Sherman, S. L. Salzberg, Pan-genomics in the human genome era. Nat. Rev. Genet. 21, 243–254 (2020). doi:10.1038/s41576-020-0210-7; pmid: 32034321
7. S. Ballouz, A. Dobin, J. A. Gillis, Is it time to change the reference genome? Genome Biol. 20, 159 (2019). doi:10.1186/s13059-019-1774-4; pmid: 31399121
8. G. Hickey et al., Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35 (2020). doi:10.1186/s13059-020-1941-7; pmid: 32051000
9. J. M. Eizenga et al., Pangenome graphs. Annu. Rev. Genomics Hum. Genet. 21, 139–162 (2020). doi:10.1146/annurev-genom-120219-080406; pmid: 32453966
10. E. Garrison et al., Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879 (2018). doi:10.1038/nbt.4227; pmid: 30125266
11. D. Kim, J. M. Paggi, C. Park, C. Bennett, S. L. Salzberg, Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019). doi:10.1038/s41587-019-0201-4; pmid: 31375807
12. M. Rautiainen, T. Marschall, GraphAligner: Rapid and versatile sequence-to-graph alignment. Genome Biol. 21, 253 (2020). doi:10.1186/s13059-020-02157-2; pmid: 32972461
13. G. Rakocevic et al., Fast and accurate genomic analyses using genome graphs. Nat. Genet. 51, 354–362 (2019). doi:10.1038/s41588-018-0316-4; pmid: 30643257
14. Illumina, Accuracy improvements in germline small variant calling with the DRAGEN platform; https://science-docs.illumina.com/documents/Informatics/dragen-v3-accuracy-appnote-html-970-2019-006/Content/Source/Informatics/Dragen/dragen-v3-accuracy-appnote-970-2019-006/dragen-v3-accuracy-appnote-970-2019-006.html.
15. J. Sirén, E. Garrison, A. M. Novak, B. Paten, R. Durbin, Haplotype-aware graph indexes. Bioinformatics 36, 400–407 (2020). pmid: 31406990
16. M. Schirmer, R. D’Amore, U. Z. Ijaz, N. Hall, C. Quince, Illumina error profiles: Resolving fine-scale variation in metagenomic sequencing data. BMC Bioinformatics 17, 125 (2016). doi:10.1186/s12859-016-0976-y; pmid: 26968756
17. Materials and methods are available as supplementary
materials.
18. B. Langmead, S. L. Salzberg, Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012). doi:10.1038/nmeth.1923; pmid: 22388286
19. H. Li, Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio.GN] (2013).
20. H. Li, Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018). doi:10.1093/bioinformatics/bty191; pmid: 29750242
21. A. Auton et al., A global reference for human genetic variation. Nature 526, 68–74 (2015). doi:10.1038/nature15393; pmid: 26432245
22. M. J. P. Chaisson et al., Multi-platform discovery of haplotype-resolved structural variation in human genomes. Nat. Commun. 10, 1784 (2019). doi:10.1038/s41467-018-08148-z; pmid: 30992455
23. J. Pritt, N.-C. Chen, B. Langmead, FORGe: Prioritizing variants for graph genomes. Genome Biol. 19, 220 (2018). doi:10.1186/s13059-018-1595-x; pmid: 30558649
24. J. Wagner et al., Benchmarking challenging small variants with linked and long reads. bioRxiv 2020.07.24.212712 [Preprint] (2020); doi:10.1101/2020.07.24.212712
25. R. Poplin et al., A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018). doi:10.1038/nbt.4235; pmid: 30247488
26. H. P. Eggertsson et al., GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat. Commun. 10, 5402 (2019). doi:10.1038/s41467-019-13341-9; pmid: 31776332
27. B. Paten et al., Superbubbles, Ultrabubbles, and Cacti. J. Comput. Biol. 25, 649–663 (2018). doi:10.1089/cmb.2017.0251; pmid: 29461862
28. P. A. Audano et al., Characterizing the major structural variant alleles of the human genome. Cell 176, 663–675.e19 (2019). doi:10.1016/j.cell.2018.12.019; pmid: 30661756
29. National Heart, Lung, and Blood Institute, National Institutes of Health, US Department of Health and Human Services, The NHLBI BioData catalyst. Zenodo (2020); https://doi.org/10.5281/zenodo.3822858.
30. D. E. Bild et al., Multi-ethnic study of atherosclerosis: Objectives and design. Am. J. Epidemiol. 156, 871–881 (2002). doi:10.1093/aje/kwf113; pmid: 12397006
31. M. Byrska-Bishop et al., High coverage whole genome sequencing of the expanded 1000 Genomes Project cohort including 602 trios. bioRxiv 2021.02.06.430068 [Preprint] (2021); https://doi.org/10.1101/2021.02.06.430068.
32. P. H. Sudmant et al., An integrated map of structural variation in 2,504 human genomes. Nature 526, 75–81 (2015). doi:10.1038/nature15394; pmid: 26432246
33. R. L. Collins et al., A structural variation reference for medical and population genetics. Nature 581, 444–451 (2020). doi:10.1038/s41586-020-2287-8; pmid: 32461652
34. T. Lappalainen et al., Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511 (2013). doi:10.1038/nature12531; pmid: 24037378
Sirén et al., Science 374, eabg8871 (2021) 17 December 2021 10 of 11