Science - USA (2021-12-17)


using default Giraffe and VG-MAP. For comparison, we mapped the same
reads to GRCh38 with BWA-MEM.


Genotyping accuracy


We compared the performance of Giraffe, VG-MAP, Illumina’s Dragen
platform, and BWA-MEM for genotyping SNVs and short indels. The design
of each calling pipeline is described in section S4 of the
supplementary materials (17), and the parameters and indexes for each
experiment are described in table S22. The variants produced by each
pipeline were compared against the GIAB v4.2.1 HG002 high-confidence
variant-calling benchmark (24) using the RealTimeGenomics vcfeval tool
(46) and Illumina’s hap.py tool (47). This benchmark set covers 92.2%
of the GRCh38 sequence.

We also evaluated a DeepVariant (25) pipeline that uses Giraffe
mappings (17). Using the default DeepVariant 1.1.0–trained model, we
tested genotyping of the HG003 sample across the entire genome. This
sample was not used in training the model.
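
Benchmark comparisons of this kind are commonly summarized as precision, recall, and F1 computed from true-positive, false-positive, and false-negative variant counts. The sketch below shows that arithmetic only; it assumes these three metrics are the summary of interest, the function name `benchmark_metrics` is hypothetical, and the counts are invented for illustration rather than taken from the paper.

```python
def benchmark_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return precision, recall, and F1 for one call set versus a truth set.

    tp: calls that match the benchmark
    fp: calls absent from the benchmark
    fn: benchmark variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, NOT results from the paper.
metrics = benchmark_metrics(tp=90, fp=10, fn=30)
print(metrics)
```

With these illustrative counts, precision is 0.9 and recall is 0.75, so a pipeline that misses many true variants is penalized even if most of its calls are correct.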


Generalization to yeast


To evaluate Giraffe’s performance on more diverged, nonhuman data, we
used a yeast graph built from a Cactus multiple sequence alignment for
five strains of the S. cerevisiae and S. paradoxus yeasts (8). For the
corresponding negative-control primary graph, we used the S.c. S288C
assembly. We collected basic statistics about the yeast graph and
decomposed the graph for analysis using the method of (27). We
simulated 500,000 read pairs from a held-out S. cerevisiae yeast
strain, DBVPG6044, not included in the yeast graph, using an error and
length model for Illumina HiSeq 2500 reads (17).


SV genotyping


We built an SV pangenome from the HGSVC (22), GIAB (1), and SVPOP (28)
sequence-resolved catalogs. After filtering out erroneous duplicates
using a remapping approach, the SVs were iteratively inserted in the
genome graph to minimize the effect of errors and redundancy in the
catalog. The SVs were then genotyped across 5202 genomes by aligning
short-read sequencing data using Giraffe with a workflow description
language (WDL) workflow that we deposited in Dockstore (48). Two
thousand samples were selected from the MESA cohort to maximize sample
diversity. The remaining 3202 samples are from the 1000 Genomes Project
and include 2504 unrelated individuals. The trios available in this
latter dataset were used to compute the rate of Mendelian concordance
in the genotypes.
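
A minimal sketch of a Mendelian concordance calculation follows. It assumes diploid, biallelic genotypes encoded as pairs of allele indices (0 for reference, 1 for alternate) and counts a trio as concordant when the child could have inherited one allele from each parent. The encoding, the function names, and the example trios are illustrative assumptions, not the pipeline used in the study.

```python
from itertools import product


def is_mendelian_consistent(child, father, mother):
    """True if the child's two alleles can be split into one allele
    from the father and one from the mother."""
    a, b = child
    return any(
        (a == f and b == m) or (a == m and b == f)
        for f, m in product(father, mother)
    )


def mendelian_concordance(trios):
    """Fraction of (child, father, mother) genotype triples that are
    consistent with Mendelian inheritance."""
    if not trios:
        raise ValueError("at least one trio genotype is required")
    consistent = sum(is_mendelian_consistent(c, f, m) for c, f, m in trios)
    return consistent / len(trios)


# Hypothetical genotypes at four SV sites for a single trio.
trios = [
    ((0, 1), (0, 0), (1, 1)),  # consistent
    ((1, 1), (0, 0), (0, 1)),  # inconsistent: father carries no alt allele
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 1), (0, 0), (0, 0)),  # inconsistent: neither parent carries alt
]
rate = mendelian_concordance(trios)
print(rate)  # 0.5
```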
The different SV alleles observed in the population were clustered into
SV sites based on their reciprocal overlap (for deletions) and sequence
similarity (for insertions). We used the frequency profile across
alleles within an SV site to identify the major allele and to fine-tune
variants with near duplicates in the combined catalog that may have
been due to errors. Each variant was then annotated with its presence
in existing SV databases (28, 32, 33), its repeat content, and its
location relative to gene annotations. We also compared the frequency
distributions across the SV databases and how well the frequency
estimates matched for variants shared across databases.
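
To illustrate clustering deletions by reciprocal overlap, the sketch below groups half-open (start, end) intervals on one chromosome into sites. Two deletions are linked when the overlap covers at least a fraction `min_ro` of the longer one, and sites are the connected components of those links. The 0.8 threshold, the single-linkage grouping, and the function names are assumptions made for this example; the source does not state the threshold or linkage rule that was used.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two half-open intervals (start, end):
    the overlap length divided by the length of the longer interval."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[1] - a[0], b[1] - b[0])


def cluster_deletions(deletions, min_ro=0.8):
    """Group deletions into sites using union-find over all pairs
    whose reciprocal overlap is at least min_ro (assumed threshold)."""
    parent = list(range(len(deletions)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(deletions)):
        for j in range(i + 1, len(deletions)):
            if reciprocal_overlap(deletions[i], deletions[j]) >= min_ro:
                parent[find(i)] = find(j)

    sites = {}
    for i, deletion in enumerate(deletions):
        sites.setdefault(find(i), []).append(deletion)
    return sorted(sites.values())


# Hypothetical deletion coordinates.
deletions = [(100, 200), (105, 205), (1000, 1500), (110, 195)]
sites = cluster_deletions(deletions)
print(sites)
```

The all-pairs loop is quadratic and is kept only for clarity; a catalog-scale implementation would first sort by position and compare nearby intervals.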
PCA was performed on the SV genotypes, and principal components were
compared with those produced from SNV-indel genotypes. We defined
strong intercluster or inter-superpopulation frequency patterns by a
frequency in any cluster or superpopulation differing by more than 10%
from the median frequency across all of them. For the 2000 MESA
samples, the clusters were defined using hierarchical clustering on the
first three principal components. For the 1000 Genomes Project, we used
their “superpopulation” assignments. Permutations were used to contrast
the number of SVs with such patterns with an expected baseline.
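
The frequency-pattern criterion can be sketched as follows. The example reads "more than 10%" as an absolute difference of 0.10 in allele frequency, which is an interpretation rather than something the text specifies. The group labels are the 1000 Genomes Project superpopulation codes; the frequencies and the function name `has_strong_pattern` are hypothetical.

```python
from statistics import median


def has_strong_pattern(freqs, threshold=0.10):
    """True if any group's allele frequency differs from the median
    across groups by more than `threshold` (absolute difference, an
    assumed reading of the 10% criterion)."""
    med = median(freqs.values())
    return any(abs(f - med) > threshold for f in freqs.values())


# Hypothetical allele frequencies for two SVs.
sv_a = {"AFR": 0.42, "AMR": 0.20, "EAS": 0.18, "EUR": 0.22, "SAS": 0.21}
sv_b = {"AFR": 0.20, "AMR": 0.22, "EAS": 0.18, "EUR": 0.21, "SAS": 0.25}

print(has_strong_pattern(sv_a))  # True: AFR is 0.21 above the median of 0.21
print(has_strong_pattern(sv_b))  # False: all groups within 0.04 of the median
```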
Finally, we examined the SV genotypes in a subset of the samples that
had gene-expression data available from the GEUVADIS consortium (34).
MatrixEQTL (49) identified SV-eQTLs while controlling for sex and
population structures, as summarized by the first four principal
components. Separate analyses of the four European-ancestry populations
together and the YRI population alone were performed similarly. In
addition, we performed a joint eQTL analysis with publicly available
SNVs and indels (31). We used permutation to compute enrichment of
SV-eQTLs in gene regions, gene families, or among lead-eQTLs (those
with the strongest association for a gene).
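
One generic way to compute a permutation-based enrichment is sketched below: the eQTL labels are shuffled across SVs, and the observed number of SV-eQTLs falling in a region class is compared with the shuffled counts. This is a simplified illustration under assumed inputs (one boolean per SV for each property) and does not reproduce the permutation scheme of the study; the function name and the toy data are hypothetical.

```python
import random


def permutation_enrichment(is_eqtl, in_region, n_perm=1000, seed=0):
    """Return (observed, p_value) for the number of SVs that are both
    eQTLs and inside the region class, using label permutation."""
    rng = random.Random(seed)
    observed = sum(e and r for e, r in zip(is_eqtl, in_region))
    shuffled = list(is_eqtl)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between eQTL status and region
        null_count = sum(e and r for e, r in zip(shuffled, in_region))
        if null_count >= observed:
            at_least_as_extreme += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Toy data: 20 SVs, 5 eQTLs, all of which lie in a region class of 6 SVs.
is_eqtl = [True] * 5 + [False] * 15
in_region = [True] * 6 + [False] * 14
observed, p_value = permutation_enrichment(is_eqtl, in_region)
print(observed, p_value)
```

A real analysis would usually permute within strata, for example matching SV size or frequency, so that the null distribution respects properties that affect eQTL discovery.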

REFERENCES AND NOTES


  1. J. M. Zook et al., A robust benchmark for detection of germline
     large deletions and insertions. Nat. Biotechnol. 38, 1347–1355
     (2020). doi:10.1038/s41587-020-0538-8; pmid: 32541955

  2. M. Mahmoud et al., Structural variant calling: The long and the
     short of it. Genome Biol. 20, 246 (2019).
     doi:10.1186/s13059-019-1828-7; pmid: 31747936

  3. J. Ebler, A. Schönhuth, T. Marschall, Genotyping inversions and
     tandem duplications. Bioinformatics 33, 4015–4023 (2017).
     doi:10.1093/bioinformatics/btx020; pmid: 28169394

  4. D. M. Church et al., Modernizing reference genome assemblies.
     PLOS Biol. 9, e1001091 (2011).
     doi:10.1371/journal.pbio.1001091; pmid: 21750661

  5. The Computational Pan-Genomics Consortium, Computational
     pan-genomics: status, promises and challenges. Brief. Bioinform.
     19, 118–135 (2016). doi:10.1093/bib/bbw089

  6. R. M. Sherman, S. L. Salzberg, Pan-genomics in the human genome
     era. Nat. Rev. Genet. 21, 243–254 (2020).
     doi:10.1038/s41576-020-0210-7; pmid: 32034321

  7. S. Ballouz, A. Dobin, J. A. Gillis, Is it time to change the
     reference genome? Genome Biol. 20, 159 (2019).
     doi:10.1186/s13059-019-1774-4; pmid: 31399121

  8. G. Hickey et al., Genotyping structural variants in pangenome
     graphs using the vg toolkit. Genome Biol. 21, 35 (2020).
     doi:10.1186/s13059-020-1941-7; pmid: 32051000

  9. J. M. Eizenga et al., Pangenome graphs. Annu. Rev. Genomics Hum.
     Genet. 21, 139–162 (2020).
     doi:10.1146/annurev-genom-120219-080406; pmid: 32453966

 10. E. Garrison et al., Variation graph toolkit improves read mapping
     by representing genetic variation in the reference. Nat.
     Biotechnol. 36, 875–879 (2018). doi:10.1038/nbt.4227;
     pmid: 30125266

 11. D. Kim, J. M. Paggi, C. Park, C. Bennett, S. L. Salzberg,
     Graph-based genome alignment and genotyping with HISAT2 and
     HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
     doi:10.1038/s41587-019-0201-4; pmid: 31375807

 12. M. Rautiainen, T. Marschall, GraphAligner: Rapid and versatile
     sequence-to-graph alignment. Genome Biol. 21, 253 (2020).
     doi:10.1186/s13059-020-02157-2; pmid: 32972461

 13. G. Rakocevic et al., Fast and accurate genomic analyses using
     genome graphs. Nat. Genet. 51, 354–362 (2019).
     doi:10.1038/s41588-018-0316-4; pmid: 30643257

 14. Illumina, Accuracy improvements in germline small variant calling
     with the DRAGEN platform;
     https://science-docs.illumina.com/documents/Informatics/dragen-v3-accuracy-appnote-html-970-2019-006/Content/Source/Informatics/Dragen/dragen-v3-accuracy-appnote-970-2019-006/dragen-v3-accuracy-appnote-970-2019-006.html.

 15. J. Sirén, E. Garrison, A. M. Novak, B. Paten, R. Durbin,
     Haplotype-aware graph indexes. Bioinformatics 36, 400–407 (2020).
     pmid: 31406990

 16. M. Schirmer, R. D’Amore, U. Z. Ijaz, N. Hall, C. Quince, Illumina
     error profiles: Resolving fine-scale variation in metagenomic
     sequencing data. BMC Bioinformatics 17, 125 (2016).
     doi:10.1186/s12859-016-0976-y; pmid: 26968756

 17. Materials and methods are available as supplementary materials.

 18. B. Langmead, S. L. Salzberg, Fast gapped-read alignment with
     Bowtie 2. Nat. Methods 9, 357–359 (2012).
     doi:10.1038/nmeth.1923; pmid: 22388286

 19. H. Li, Aligning sequence reads, clone sequences and assembly
     contigs with BWA-MEM. arXiv:1303.3997 [q-bio.GN] (2013).

 20. H. Li, Minimap2: Pairwise alignment for nucleotide sequences.
     Bioinformatics 34, 3094–3100 (2018).
     doi:10.1093/bioinformatics/bty191; pmid: 29750242

 21. A. Auton et al., A global reference for human genetic variation.
     Nature 526, 68–74 (2015). doi:10.1038/nature15393;
     pmid: 26432245

 22. M. J. P. Chaisson et al., Multi-platform discovery of
     haplotype-resolved structural variation in human genomes. Nat.
     Commun. 10, 1784 (2019). doi:10.1038/s41467-018-08148-z;
     pmid: 30992455

 23. J. Pritt, N.-C. Chen, B. Langmead, FORGe: Prioritizing variants
     for graph genomes. Genome Biol. 19, 220 (2018).
     doi:10.1186/s13059-018-1595-x; pmid: 30558649

 24. J. Wagner et al., Benchmarking challenging small variants with
     linked and long reads. bioRxiv 2020.07.24.212712 [Preprint]
     (2020); doi:10.1101/2020.07.24.212712

 25. R. Poplin et al., A universal SNP and small-indel variant caller
     using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
     doi:10.1038/nbt.4235; pmid: 30247488

 26. H. P. Eggertsson et al., GraphTyper2 enables population-scale
     genotyping of structural variation using pangenome graphs. Nat.
     Commun. 10, 5402 (2019). doi:10.1038/s41467-019-13341-9;
     pmid: 31776332

 27. B. Paten et al., Superbubbles, Ultrabubbles, and Cacti. J. Comput.
     Biol. 25, 649–663 (2018). doi:10.1089/cmb.2017.0251;
     pmid: 29461862

 28. P. A. Audano et al., Characterizing the major structural variant
     alleles of the human genome. Cell 176, 663–675.e19 (2019).
     doi:10.1016/j.cell.2018.12.019; pmid: 30661756

 29. National Heart, Lung, and Blood Institute, National Institutes of
     Health, US Department of Health and Human Services, The NHLBI
     BioData catalyst. Zenodo (2020);
     https://doi.org/10.5281/zenodo.3822858.

 30. D. E. Bild et al., Multi-ethnic study of atherosclerosis:
     Objectives and design. Am. J. Epidemiol. 156, 871–881 (2002).
     doi:10.1093/aje/kwf113; pmid: 12397006

 31. M. Byrska-Bishop et al., High coverage whole genome sequencing of
     the expanded 1000 Genomes Project cohort including 602 trios.
     bioRxiv 2021.02.06.430068 [Preprint] (2021);
     https://doi.org/10.1101/2021.02.06.430068.

 32. P. H. Sudmant et al., An integrated map of structural variation in
     2,504 human genomes. Nature 526, 75–81 (2015).
     doi:10.1038/nature15394; pmid: 26432246

 33. R. L. Collins et al., A structural variation reference for medical
     and population genetics. Nature 581, 444–451 (2020).
     doi:10.1038/s41586-020-2287-8; pmid: 32461652

 34. T. Lappalainen et al., Transcriptome and genome sequencing
     uncovers functional variation in humans. Nature 501, 506–511
     (2013). doi:10.1038/nature12531; pmid: 24037378


Sirén et al., Science 374, eabg8871 (2021) 17 December 2021 10 of 11


RESEARCH | RESEARCH ARTICLE
