Science - USA (2021-12-17)


RESEARCH ARTICLE SUMMARY



GENOMICS


Pangenomics enables genotyping of known structural variants in 5202 diverse genomes


Jouni Sirén†, Jean Monlong†, Xian Chang†, Adam M. Novak†, Jordan M. Eizenga†, Charles Markello,
Jonas A. Sibbesen, Glenn Hickey, Pi-Chuan Chang, Andrew Carroll, Namrata Gupta, Stacey Gabriel,
Thomas W. Blackwell, Aakrosh Ratan, Kent D. Taylor, Stephen S. Rich, Jerome I. Rotter,
David Haussler, Erik Garrison, Benedict Paten*


INTRODUCTION: Modern genomics depends on inexpensive short-read sequencing. Sequenced reads up to a few hundred base pairs in length are computationally mapped to estimated source locations in a reference genome. These read mappings are used in myriad sequencing-based assays. For example, through a process called genotyping, mapped reads from a DNA sample can be used to infer the combination of alleles present at each site in the reference genome.
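
As a minimal sketch of the genotyping idea described above, the following Python picks the most likely diploid genotype at one biallelic site from counts of mapped reads supporting each allele. The binomial model, the fixed sequencing error rate, and the function names are illustrative assumptions; this is not the genotyping method used in the article.

```python
from math import comb


def genotype_likelihoods(ref_reads, alt_reads, error_rate=0.01):
    """Return binomial likelihoods for the genotypes 0/0, 0/1, and 1/1."""
    total = ref_reads + alt_reads
    # Assumed probability that a single read shows the alternate allele.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        genotype: comb(total, alt_reads) * p ** alt_reads * (1.0 - p) ** ref_reads
        for genotype, p in p_alt.items()
    }


def call_genotype(ref_reads, alt_reads, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    likelihoods = genotype_likelihoods(ref_reads, alt_reads, error_rate)
    return max(likelihoods, key=likelihoods.get)


print(call_genotype(ref_reads=6, alt_reads=5))
```

A roughly even split of reads between the two alleles favors the heterozygous call, whereas reads that almost all support one allele favor the corresponding homozygous call.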


RATIONALE: A single reference genome cannot capture the diversity within even a single person (who gets a genome copy from each parent), let alone in the whole human population. Genomes differ not only by point variations, where one or a few bases are different, but also by structural variations, where differences can be much larger than an individual read. When a person's genome differs from the reference by a structural variation, the reference may contain no location to correctly map the corresponding reads. Although newer long-read sequencing allows structural variation to be more directly observed in sequencing reads, short-read sequencing is still less expensive and more widely available.
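
The mapping problem described above can be illustrated with a toy example. In this sketch, a made-up sample genome carries an insertion that is absent from a made-up reference, and a read drawn from inside the insertion has no matching location in the reference. All sequences are invented, and exact substring search is an assumption standing in for real read alignment.

```python
# Made-up reference sequence and a made-up inserted sequence.
reference = "ACGTTGCAAGGCTTACCGAT"
insertion = "GGGTTTAAACCC"

# The sample genome carries the insertion after reference position 10.
sample = reference[:10] + insertion + reference[10:]

# A short read that falls entirely inside the inserted sequence.
read = sample[12:20]

# str.find returns -1 when the read occurs nowhere in the sequence.
print("position in reference:", reference.find(read))
print("position in sample:", sample.find(read))
```

A reference that also contained the inserted sequence, as a pangenome can, would offer the read a correct place to map.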

RESULTS: We present a short read–mapping tool, Giraffe. Giraffe maps to a pangenome reference that describes many genomes and the differences between them. Giraffe can accurately map reads to thousands of genomes embedded in a pangenome reference as quickly as existing tools map to a single reference genome. Simulations in which the true mapping for each read is known show that Giraffe is as accurate as the most accurate previously published tool. Giraffe achieves this speed and accuracy by using a variety of algorithmic techniques. In particular, and in contrast to previous tools, it focuses on mapping to the paths in the pangenome that are observed in individuals' genomes: the reference haplotypes. This has two key benefits. First, it prioritizes alignments that are consistent with known sequences, avoiding combinations of alleles that are biologically unlikely. Second, it reduces the size of the problem by limiting the sequence space to which the reads could be aligned. This deals effectively with complex graph regions where most paths represent rare or nonexistent sequences.
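
The two benefits of restricting mapping to reference haplotypes can be sketched with a toy pangenome. The example below is a simplified illustration, not Giraffe's algorithm: the segments, variant sites, and haplotypes are made up, and exact substring search is an assumption standing in for real alignment.

```python
from itertools import product

# Toy pangenome: shared segments separated by variant sites.
# All sequences and haplotypes here are made up for illustration.
segments = ["ACGT", "TTGA", "CCAT", "GGTA"]
sites = [["A", "G"], ["C", "CTTT"], ["T", "A"]]


def spell(alleles):
    """Spell the path that takes the given allele at each site."""
    parts = [segments[0]]
    for allele, segment in zip(alleles, segments[1:]):
        parts.extend([allele, segment])
    return "".join(parts)


# Every combination of alleles is a valid path through the graph.
all_paths = {spell(choice) for choice in product(*sites)}

# Only these combinations were observed in (hypothetical) individuals.
haplotypes = {spell(("A", "C", "T")), spell(("G", "CTTT", "A"))}


def has_exact_hit(read, sequences):
    """Exact substring search stands in for real alignment."""
    return any(read in sequence for sequence in sequences)


consistent_read = "GTTGACTTTCC"  # follows an observed haplotype
mixed_read = "GTTGACCC"          # pairs alleles never observed together

print(len(all_paths), len(haplotypes))
print(has_exact_hit(consistent_read, haplotypes))
print(has_exact_hit(mixed_read, all_paths), has_exact_hit(mixed_read, haplotypes))
```

With three two-allele sites the graph already contains eight paths but only two observed haplotypes. The number of paths grows multiplicatively with every added site, while the haplotype set grows only with the number of genomes included.
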
Using Giraffe in place of a single reference genome reduces mapping bias, which is the tendency to incorrectly map reads that differ from the reference genome. Combining Giraffe with state-of-the-art genotyping algorithms demonstrates that Giraffe mappings produce accurate genotyping results.
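
One common way to quantify mapping bias is allele coverage balance at heterozygous sites, where an unbiased mapper is expected to place about half of the reads on each allele. The sketch below pools read counts over sites and reports the fraction that supports the reference allele. The function name and all counts are hypothetical and are not results from the article.

```python
def reference_fraction(site_counts):
    """Return the pooled fraction of reads supporting the reference allele.

    site_counts holds (ref_reads, alt_reads) pairs at heterozygous sites.
    """
    counts = list(site_counts)
    ref_total = sum(ref for ref, _ in counts)
    alt_total = sum(alt for _, alt in counts)
    total = ref_total + alt_total
    if total == 0:
        raise ValueError("no reads at the given sites")
    return ref_total / total


# Hypothetical counts for the same three sites under two mappers.
linear_counts = [(18, 12), (20, 10), (16, 14)]
graph_counts = [(15, 15), (16, 14), (14, 16)]

print(reference_fraction(linear_counts))
print(reference_fraction(graph_counts))
```

A value above 0.5 indicates that reads carrying the alternate allele are being lost or misplaced more often than reads carrying the reference allele.
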
Using mappings from Giraffe, we genotyped 167,000 recently discovered structural variations in short-read samples for 5202 people at an average computational cost of $1.50 per sample. We present estimates for the frequency of different versions of these structural variations in the human population as a whole and within individual subpopulations. We identify thousands of these structural variations as expression quantitative trait loci (eQTLs), which are associated with gene-expression levels.
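
Once every sample has a genotype for a structural variation, frequency estimates follow from counting alleles. The sketch below computes the alternate-allele frequency overall and per subpopulation. The sample names, population labels, and genotypes are hypothetical, and the encoding (0 for the reference version, 1 for the alternate version) is an assumption.

```python
from collections import defaultdict


def allele_frequencies(genotypes, populations):
    """Return alternate-allele frequency overall and per population.

    genotypes maps sample -> (allele, allele), with 0 = reference, 1 = alternate.
    populations maps sample -> population label.
    """
    alt_count = defaultdict(int)
    allele_count = defaultdict(int)
    for sample, alleles in genotypes.items():
        # Count each sample once overall and once in its own population.
        for group in ("ALL", populations[sample]):
            alt_count[group] += sum(alleles)
            allele_count[group] += len(alleles)
    return {group: alt_count[group] / allele_count[group] for group in allele_count}


# Hypothetical genotypes for one structural variation in four samples.
genotypes = {"S1": (0, 1), "S2": (1, 1), "S3": (0, 0), "S4": (0, 1)}
populations = {"S1": "A", "S2": "A", "S3": "B", "S4": "B"}

print(allele_frequencies(genotypes, populations))
```

In this toy cohort the overall frequency hides a difference between the two groups, which is why per-subpopulation estimates are reported alongside the overall one.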

CONCLUSION: Giraffe demonstrates the practicality of a pangenomic approach to short-read mapping. This approach allows short-read data to genotype single-nucleotide variations, short insertions and deletions, and structural variations more accurately. For structural variations, this allowed the estimation of population frequencies across a diverse cohort of 5000 individuals. A single reference genome must choose one version of any variation to represent, leaving the other versions unrepresented. By making more broadly representative pangenome references practical, Giraffe attempts to make genomics more inclusive.

RESEARCH

SCIENCE science.org 17 DECEMBER 2021 • VOL 374 ISSUE 6574 1461


The list of author affiliations is available in the full article online.
*Corresponding author. Email: [email protected]
†These authors contributed equally to this work.
Cite this article as J. Sirén et al., Science 374, eabg8871 (2021). DOI: 10.1126/science.abg8871

READ THE FULL ARTICLE AT
https://doi.org/10.1126/science.abg8871

Overview of the experiments. Variant calls from long read–based and large-scale sequencing studies were used to construct pangenome reference graphs (top). Giraffe (and competing mappers) mapped reads to the graph or to linear references, and mapping accuracy, allele coverage balance, and speed were evaluated (middle). Then, mapped reads were used for variant calling, and variant call accuracy was evaluated (bottom). Structural variant calls were analyzed alongside expression data to identify eQTLs and to estimate population frequencies.
