Building the human pangenome Benedict Paten - UC Santa Cruz Genomics Institute bpaten@ucsc.edu
Now the $1,000 individual genome is here… but
[Figure: log-scale plot of sequencing cost per genome, 2002–2015, falling from roughly $100M to about $1K]
Sources: NIH: www.genome.gov/sequencingcosts; UC San Diego, 1/14/14: Illumina breaks genome cost barrier
All variants are currently detected relative to a single human reference genome. A typical person is not the reference. A typical person has:
• An average of 5 million isolated single DNA base variations different from the reference (out of 3 billion)
• An average of 20 million DNA bases in large segments of DNA that are not present in the same form in the reference genome
• Many of these variants are not currently assayed accurately: reference allele bias
Vision - The Human Pangenome
Instead, imagine mapping to a reference structure that contains all common variation: a pangenome graph
This Talk
● Part 1: How do we make long-read, reference-quality assembly efficient and routine, so that we can create the genomes for the human pangenome?
● Part 2: How do we build the pangenome and use it?
Genome assembly bottlenecks
• We need a revolution in the generation of high-quality genomes to ensure all variation is captured. The bottlenecks:
○ Sequencing cost for high quality
○ Sequencing speed for high quality
○ Scalable and cheaper informatics
Solution
• Nanopore 100kb+ sequencing
• Scalable algorithms and informatics
Nanopore sequencing Data acquisition for 11 genomes in 9 days (>60x total coverage)
7x enrichment of reads >100kb using the Circulomics Short Read Eliminator (SRE) kit (https://www.circulomics.com)
Read N50 improvement is reproducible
[Figure: read N50 (kb) per individual genome; N50s around 42kb]
https://github.com/human-pangenomics/hpgp-data
PromethION sequencing throughput
[Figure: total throughput (Gb) per individual genome]
Median alignment identity is 90%
[Figure: alignment identity vs. GRCh38 per individual genome (00733, 01109, 01243, 02055, 24143, 24149, 24385, 02080, 02723, 03098); mode: 93%, median: 90%; Guppy 2.3.5 flip-flop basecaller]
Alignment identity = matches / (matches + mismatches + insertions + deletions)
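To make the identity formula above concrete, here is a minimal sketch (not the pipeline's actual evaluation code) that computes it from an extended CIGAR string, which distinguishes matches (=) from mismatches (X):

```python
import re

def alignment_identity(cigar: str) -> float:
    """Compute matches / (matches + mismatches + insertions + deletions)
    from an extended CIGAR string, e.g. "120=1X30=2D50=1I40="."""
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in re.findall(r"(\d+)([=XIDMSHNP])", cigar):
        if op in counts:
            counts[op] += int(length)
    aligned = sum(counts.values())
    return counts["="] / aligned if aligned else 0.0

print(alignment_identity("120=1X30=2D50=1I40="))  # ~0.984
```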
Scalable assembly and polishing tools https://upload.wikimedia.org/wikipedia/commons/2/22/MtShasta_aerial.JPG
Pipeline
Shasta – a nanopore de novo long read assembler
• A new de novo assembler tailored for long reads and parameterized for ONT data - principally developed by Paolo Carnevali at CZI
• Beautiful new algorithms (https://chanzuckerberg.github.io/shasta/ComputationalMethods.html):
○ Uses run-length encoding (RLE) throughout to compress homopolymer confusion - the dominant source of error in ONT reads
○ Uses a novel high-cardinality marker-space representation for super-efficient overlap alignment
○ Does everything in memory (requires 1.5TB of memory for a 60x human genome)
○ Outputs GFA; the intent is for the whole pipeline to use GFA to represent ambiguities
https://github.com/chanzuckerberg/shasta
Run Length Encoding (RLE)
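The slide above is figure-only, so here is a minimal sketch of run-length encoding as this kind of assembler uses it (illustrative only, not Shasta's implementation): each homopolymer run collapses to one base plus a repeat count, so sequences that differ only in homopolymer length become identical in RLE space.

```python
from itertools import groupby

def run_length_encode(seq: str):
    """Collapse each homopolymer run to (base, run_length)."""
    return [(base, len(list(run))) for base, run in groupby(seq)]

def rle_bases(seq: str) -> str:
    """The RLE base sequence: one copy of each base per run, counts dropped."""
    return "".join(base for base, _ in run_length_encode(seq))

# Two reads that disagree only in a homopolymer length (a common ONT error)
# have identical RLE base sequences, so they still align cleanly:
print(run_length_encode("GATTTACA"))  # [('G',1), ('A',1), ('T',3), ('A',1), ('C',1), ('A',1)]
print(rle_bases("GATTTACA"))          # GATACA
print(rle_bases("GATTACA"))           # GATACA
```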
Marker Representation
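The marker-representation slides are also figure-only; the sketch below illustrates the underlying idea under my own simplified assumptions (it is not Shasta's code): a fixed subset of short k-mers in RLE space is designated as markers, and each read is reduced to its ordered sequence of marker occurrences, which makes finding candidate overlaps between reads much cheaper.

```python
# A fixed, shared subset of k-mers (in RLE space) acts as the marker set.
# The set here is hardcoded purely for illustration; in practice the subset is
# chosen pseudo-randomly once and then reused for every read.
K = 3
MARKERS = {"ACG", "GAT", "TAC", "CGT"}

def to_markers(rle_seq: str):
    """Represent an RLE sequence as an ordered list of (position, marker)."""
    return [(i, rle_seq[i:i + K])
            for i in range(len(rle_seq) - K + 1)
            if rle_seq[i:i + K] in MARKERS]

read_a = "GATACGTACGATCA"
read_b = "GTACGATCATGC"   # overlaps the end of read_a

print(to_markers(read_a))  # [(0,'GAT'), (2,'TAC'), (3,'ACG'), (4,'CGT'), (6,'TAC'), (7,'ACG'), (9,'GAT')]
print(to_markers(read_b))  # [(1,'TAC'), (2,'ACG'), (4,'GAT')]
# The shared ordered run TAC, ACG, GAT reveals the candidate overlap without
# aligning the reads base by base.
```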
Assembly at a fraction of time and cost
Shasta GPU Acceleration
Comparable contig NG50 and lower misassemblies
[Figure: contig NG50 and number of misassemblies for shasta, flye, canu + 10X, and wtdbg2]
Number of misassemblies: shasta 1160, flye 5580, canu + 10X 6093, wtdbg2 4164
Shasta assemblies are reproducible
Median contig NG50 = 23 Mb
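For readers unfamiliar with the NG50 metric quoted above, here is a minimal sketch (illustrative, not the evaluation code used for these assemblies): NG50 is the length of the contig at which the cumulative length of contigs, taken longest first, reaches half the genome size, so unlike N50 it is comparable across assemblies of the same genome.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which cumulative contig length (longest first)
    first reaches half the genome size; None if the assembly never gets there."""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return None

# Toy example with a 100 kb "genome":
print(ng50([40_000, 30_000, 20_000, 5_000, 5_000], genome_size=100_000))  # 30000
```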
Two-step polishing of assemblies
1. MarginPolish - a graph-based alignment polisher (https://github.com/UCSC-nanopore-cgl/marginPolish)
2. HELEN - a DNN-based consensus sequence polisher (https://github.com/kishwarshafin/helen)
Polishing at a fraction of time and cost
MarginPolish and HELEN outperform other polishers

Assembler  Polisher                Diploid (HG00733)  Haploid (CHM13)
Shasta     -                       98.78%             99.37%
Shasta     Racon 4x                99.16%             99.50%
Shasta     Racon 4x + Medaka       99.42%             99.58%
Shasta     MarginPolish            99.41%             99.62%
Shasta     MarginPolish + HELEN    99.47%             99.70%
Improvements in homopolymer length predictions
[Figure: homopolymer length predictions for the Guppy basecaller, Shasta, Shasta + MarginPolish, and Shasta + MarginPolish + HELEN]
Chromosome-level scaffolding using HiC data
[Figure: assembly contiguity with HiC vs. without HiC]
Near-term future https://upload.wikimedia.org/wikipedia/commons/2/22/MtShasta_aerial.JPG
The near future: A reference-quality human-scale genome in ~7 days for < $10K
Key next steps
• Faster basecalling (ONT)
• Haplotype phasing (UCSC, CZI)
• Exploring real-time applications
• Integrating into human reference pangenomes
Acknowledgements
David Haussler, Ed Green, Sofie Salama, Mark Akeson, Sidney Bell, Daniel Garalde, Adam Novak, Adam Phillippy (NHGRI), Charlotte Weaver, Kristof Tigyi, Rosemary Dokos, Glenn Hickey, Fritz Sedlazeck (Baylor), Michael Barrientos, Nicholas Maurer, Simon Mayes, Jordan Eizenga, Ryan King, Yatish Turakhia, Chris Seymour, Erik Garrison, Bruce Martin, Kishwar Shafin, Chris Wright, Jean Monlong, Phil Smoot, Marina Haukness, David Stoddart, Xian Chang, Cori Bargmann, Trevor Pesout, Dan Turner, Colleen Bosworth, Karen Miga, Ryan Lorig-Roach, Kelvin Liu, Miten Jain, Duncan Kilburn, Hugh Olsen
Mapping everybody’s genome to one reference genome creates significant bias
• Mapping is biased against variation
• Structural variants are particularly hard to map
• Risk that some genetic variants from other subpopulation groups are inaccurately represented
• Bias is unacceptable for global biomedicine

Korean reference genome project: De novo assembly and phasing of a Korean human genome (Jeong-Sun Seo et al. 2016)
Danish reference genome project: Sequencing and de novo assembly of 150 genomes from Denmark as a population reference (Lasse Maretty et al. 2017)
...
Human Pangenome Project
Goals:
• Develop a next-generation human genetic reference that includes known variation from all human ethnic populations
• Build the software required to switch biomedicine over to using this new human genetic reference
CREDIT: Kiran Garimella and Benedict Paten
Merging diverse genomes into one mathematical map
The major histocompatibility complex
Kiran Garimella and Benedict Paten
Zooming in, you start to see the structure of local genetic variants
At the base level, we assign unique identifiers to genetic variants to enable precision
Variation Graphs – The Essentials
• Joins can connect either side of a sequence (bidirected edges)
• Walks encode DNA strings, with the side of entry determining the strand
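To make those two properties concrete, here is a minimal sketch of a variation graph in which nodes carry sequences, walks are lists of (node, orientation) steps, and a step entered on its reverse side contributes the reverse complement. The names and toy graph are my own illustration, not vg's internal representation.

```python
# Node sequences of a tiny variation graph: a SNP site (A/G) between two anchors.
NODES = {1: "CAAATAAG", 2: "A", 3: "G", 4: "TTTCTG"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def spell(walk):
    """Decode a walk of (node_id, orientation) steps into a DNA string.
    '+' enters the node on its left side (forward strand); '-' enters on its
    right side and contributes the reverse complement."""
    out = []
    for node_id, orientation in walk:
        seq = NODES[node_id]
        out.append(seq if orientation == "+" else reverse_complement(seq))
    return "".join(out)

ref_walk = [(1, "+"), (2, "+"), (4, "+")]   # reference allele: ...A...
alt_walk = [(1, "+"), (3, "+"), (4, "+")]   # alternate allele: ...G...
print(spell(ref_walk))   # CAAATAAGATTTCTG
print(spell(alt_walk))   # CAAATAAGGTTTCTG

# The same walk read on the other strand: nodes in reverse order, orientations flipped.
print(spell([(4, "-"), (2, "-"), (1, "-")]))  # CAGAAATCTTATTTG, the reverse complement
```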
The VG group is building a software ecosystem for pangenomics
• Addresses all essential operations on genome graphs
[Figure: operations between one variation graph and another]
https://github.com/vgteam/vg
doi.org/10.1101/234856
The first human genome variation map combines information from 1000 human genomes
View of genomes (gray to black) in an actual genome map, and DNA sequencing reads (colored worms) from a newly sequenced individual mapped to it
Genome Graph Models Naturally Represent All Variant Types Substitution
Genome Graph Models Naturally Represent All Variant Types Insertion or deletion
Genome Graph Models Naturally Represent All Variant Types Duplication (top path traverses same nodes multiple times)
Genome Graph Models Naturally Represent All Variant Types Inversion (red path traverses reverse complement)
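The four variant types above can all be written as alternative walks over the same kind of graph. Reusing the walk-decoding idea from the Variation Graphs sketch, here is an illustrative toy encoding of each (my own example, not vg output):

```python
NODES = {1: "CAAATAAG", 2: "A", 3: "G", 4: "TTTCTG", 5: "ACGT"}

def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def spell(walk):
    """Decode a walk of (node_id, orientation) steps into DNA."""
    return "".join(NODES[n] if o == "+" else reverse_complement(NODES[n]) for n, o in walk)

walks = {
    "reference":    [(1, "+"), (2, "+"), (4, "+")],
    "substitution": [(1, "+"), (3, "+"), (4, "+")],            # alt base on node 3
    "deletion":     [(1, "+"), (4, "+")],                      # node 2 skipped
    "insertion":    [(1, "+"), (2, "+"), (5, "+"), (4, "+")],  # extra node 5 on the path
    "duplication":  [(1, "+"), (5, "+"), (5, "+"), (4, "+")],  # node 5 traversed twice
    "inversion":    [(1, "+"), (5, "-"), (4, "+")],            # node 5 traversed reverse-complemented
}
for name, walk in walks.items():
    print(f"{name:>12}: {spell(walk)}")
```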
Human Read Mapping with VG
● Simulation study: mapping to GRCh38 vs. a graph built from the 1000 Genomes Project (80 million variants)
● 10 million read pairs (2x150mers)
● Reads sampled from an Ashkenazi Jewish sample not in the 1000 Genomes set
● ROC stratified by MAPQ
Garrison et al., bioRxiv: doi.org/10.1101/234856
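One plausible way to compute a "ROC stratified by MAPQ" from such a simulation is sketched below. This is my own reconstruction of the evaluation idea, not the actual benchmarking code: sweep a MAPQ threshold from strict to permissive and, at each cutoff, record the fraction of reads retained and the error rate among them.

```python
from collections import defaultdict

def mapq_roc(results):
    """results: list of (mapq, correct) pairs, one per simulated read, where
    'correct' means the read mapped back to (near) its simulated position.
    Returns one (mapq, fraction_of_reads, error_rate) point per threshold,
    from the strictest cutoff to the most permissive."""
    by_mapq = defaultdict(lambda: [0, 0])      # mapq -> [n_reads, n_wrong]
    for mapq, correct in results:
        by_mapq[mapq][0] += 1
        by_mapq[mapq][1] += 0 if correct else 1

    total = len(results)
    points, n, wrong = [], 0, 0
    for mapq in sorted(by_mapq, reverse=True):  # include reads at or above each cutoff
        n += by_mapq[mapq][0]
        wrong += by_mapq[mapq][1]
        points.append((mapq, n / total, wrong / n))
    return points

# Tiny made-up example: high-MAPQ reads are mostly right, low-MAPQ ones less so.
sim = [(60, True)] * 90 + [(60, False)] * 1 + [(30, True)] * 7 + [(0, False)] * 2
for mapq, frac_mapped, err in mapq_roc(sim):
    print(f"MAPQ>={mapq}: {frac_mapped:.2f} of reads, error rate {err:.3f}")
```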
Human Read Mapping with VG - Indel Allele Balance
[Figure: allele balance at deletion and insertion sites]
Garrison et al., bioRxiv: doi.org/10.1101/234856
Yeast Mapping with VG - A More Polymorphic Example
[Figure: sample genome reads mapped to the pangenome vs. the reference genome]
Garrison et al., bioRxiv: doi.org/10.1101/234856
VG - Take Homes
● VG is practical for mapping human-genome-scale samples against a graph with 80 million point variants
● First tool to work with arbitrary graphs (cycles and copy number variants are possible)
● Provides interchange formats and many, many utilities
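On the interchange-format point: GFA is one widely used graph exchange format that vg can import and export. The sketch below parses a hand-written toy GFA 1.0 fragment into nodes, edges, and an embedded path; the specific graph is invented here for illustration.

```python
TOY_GFA = """\
H\tVN:Z:1.0
S\t1\tCAAATAAG
S\t2\tA
S\t3\tG
S\t4\tTTTCTG
L\t1\t+\t2\t+\t0M
L\t1\t+\t3\t+\t0M
L\t2\t+\t4\t+\t0M
L\t3\t+\t4\t+\t0M
P\tref\t1+,2+,4+\t*
"""

def parse_gfa(text):
    """Collect segments (nodes), links (edges), and paths from a GFA 1.0 string."""
    nodes, edges, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            nodes[fields[1]] = fields[2]
        elif fields[0] == "L":
            edges.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            paths[fields[1]] = fields[2].split(",")
    return nodes, edges, paths

nodes, edges, paths = parse_gfa(TOY_GFA)
print(len(nodes), "nodes,", len(edges), "edges")  # 4 nodes, 4 edges
# All steps in this toy path are forward ('+'), so orientation can be ignored here.
print("ref path spells:", "".join(nodes[s[:-1]] for s in paths["ref"]))  # CAAATAAGATTTCTG
```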
THANKS! UC Santa Cruz Adam Novak Wolfgang Beyer Glenn Hickey Karen Miga Yohei Rosen Jouni Siren Jordan Eizenga Charles Markello David Haussler Xian Chang Yatish Turakhia The Rest of Team VG Erik Garrison Richard Durbin Eric Dawson Mike Lin (& many more) GA4GH collaborators Andres Kahles Heng Li Join us: https://cgl.genomics.ucsc.edu/opportunities/ Ben Murray Stephen Keenan Goran Rakocevic Gil McVean Alex Dilthey (& many more) Simons Foundation
Summary
• Mapping is central to genomics, and reference genomes are perhaps the most important data structure in genomics
• With vg we can generalize reference genomes to reference genome graphs, and practically map to a population cohort instead, alleviating bias
• It’s not about replacing the reference with a graph, but with a population cohort
Embedding Haplotypes
• Genome graphs do not encode linkage
• To restrict linkage, the natural solution is to duplicate paths:
• But duplication creates mapping ambiguity
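An alternative to duplicating nodes is to keep one shared graph and store each haplotype as a named walk over it; in the vg ecosystem this role is played by dedicated haplotype indexes such as the GBWT. The toy sketch below (my own illustration, not the GBWT data structure) shows how stored walks recover linkage between sites without creating duplicate sequence to map against.

```python
# Toy sketch: encode linkage by storing each haplotype as a walk over the shared
# graph instead of duplicating nodes. Node and haplotype names are invented.
NODES = {1: "CAAATAAG", 2: "A", 3: "G", 4: "TTT", 5: "C", 6: "T", 7: "CTG"}

# Two SNP sites (nodes 2/3 and 5/6). The graph alone would allow all four allele
# combinations; the haplotype walks record which combinations were observed.
HAPLOTYPES = {
    "hap1": [1, 2, 4, 5, 7],
    "hap2": [1, 3, 4, 6, 7],
}

def observed_combinations(site_a, site_b):
    """Which pairs of alleles at two sites co-occur on some stored haplotype?"""
    combos = set()
    for walk in HAPLOTYPES.values():
        a = next(n for n in walk if n in site_a)
        b = next(n for n in walk if n in site_b)
        combos.add((a, b))
    return combos

# Only 2 of the 4 possible allele combinations are supported by the haplotypes,
# so a mapper or genotyper can downweight the unobserved recombinants.
print(observed_combinations({2, 3}, {5, 6}))  # {(2, 5), (3, 6)}
```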