Genotyping structural variants in TOPMed using pangenome graphs Jean Monlong February 12-13, 2020 GSP-TOPMed Analysis Workshop
Pangenome graphs and variant-aware read mapping
[Figure: sequencing reads mapped to a linear reference genome, compared with the same reads mapped to a variation graph that incorporates known variants (1KGP).]
Introduction 2
Mapping reads across structural variants
Structural variants (SVs) are genomic variants larger than 50 bp, e.g. insertions, deletions, inversions, and translocations.
[Figure: a deletion shown on a linear reference genome, compared with a deletion and an insertion represented in a variation graph.]
The vg toolkit is a complete, open-source solution for graph construction, read mapping, and variant calling.
https://github.com/vgteam/vg
Garrison et al. Nature Biotech 2018
vg can genotype structural variants from short-read sequencing datasets, starting from public SV catalogs or de novo assemblies.
Hickey et al. bioRxiv 2019, in press at Genome Biology
Genotyping SVs from long-read sequencing studies

Ref.                   Project                                                 Samples
Chaisson et al. 2019   Human Genome Structural Variation Consortium (HGSVC)    3
Audano et al. 2019     SVPOP                                                   15
Zook et al. 2019       Genome in a Bottle (GIAB)                               1

[Figure: evaluation workflow. vg builds a graph from GRCh38 and the HGSVC SV catalog (VCF); HG00514 short reads are processed with vg to produce HG00514 genotyped SVs, which are evaluated against the HG00514 VCF.]
SV genotyping with vg 5
SV genotyping accuracy for deletions and insertions
[Figure: F1 score (0 to 0.8) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for Paragraph, vg, BayesTyper, Delly, and SVTyper, grouped into graph-based and traditional SV genotypers.]
Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
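The F1 score reported on this slide is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch of that computation follows; the function name and the example counts are hypothetical, and the slides do not describe how calls were matched to the truth set.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Return the F1 score from counts of matched and unmatched calls."""
    called = true_pos + false_pos    # all SVs genotyped as present
    truth = true_pos + false_neg     # all SVs present in the truth set
    precision = true_pos / called if called else 0.0
    recall = true_pos / truth if truth else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not from the slides).
example_f1 = f1_score(true_pos=80, false_pos=20, false_neg=20)
```

With equal precision and recall, as in the hypothetical counts above, F1 equals both of them.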
Combined SV catalogs from 3 long-read studies

Ref.                   Project                                                 Samples
Chaisson et al. 2019   Human Genome Structural Variation Consortium (HGSVC)    3
Audano et al. 2019     SVPOP                                                   15
Zook et al. 2019       Genome in a Bottle (GIAB)                               1

[Figure: number of variants (0 to 15,000) by size (50 to 100,000 bp), for deletions (DEL) and insertions (INS).]
The combined catalog contains 71K deletions and 70K insertions, which include most of the common deletions and insertions in the population.
760 TOPMed samples genotyped in 5 days
Using BioData Catalyst as an alpha user.
Workflow in Dockstore.
TOPMed data imported from Gen3.
Genotyping and exploratory analysis on Terra using workflows and notebooks.
~$12 per sample (soon < $4 with new read mapper).
SV genotyping in the BioData Catalyst ecosystem 8
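As a rough back-of-the-envelope check of the per-sample figures above, the sketch below multiplies them out for the 760 samples. The per-sample costs are approximate on the slide, so the totals are only indicative; the function name is hypothetical.

```python
def cohort_cost(n_samples, cost_per_sample_usd):
    """Total compute cost for a cohort at a flat per-sample cost."""
    return n_samples * cost_per_sample_usd


N_SAMPLES = 760
# About $12 per sample with the read mapper used for this run.
current_total = cohort_cost(N_SAMPLES, 12)
# The slide says "< $4" with the new read mapper, so this is an upper bound.
projected_upper_bound = cohort_cost(N_SAMPLES, 4)

print(f"current: ~${current_total:,}; projected: < ${projected_upper_bound:,}")
```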
TOPMed data available in Gen3. I selected the MESA cohort and exported the CRAM files to Terra.
WDL workflow for vg in Dockstore
Genotyping and analysis on Terra
SVs genotyped in 760 diverse genomes
SVs across 760 samples 12
Frequency estimates
Insertions are slightly more frequent than deletions, especially for larger variants.
Hundreds of SVs are fixed, especially insertions.
Fixed insertions
736 insertions with allele frequency > 0.99.
Two repeat expansions in coding regions of SAMD1 and FOXO6.
Screenshots from https://gnomad.broadinstitute.org/
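The allele frequency threshold above can be illustrated with a short sketch that estimates the frequency from diploid genotype strings such as 0/1. The slides do not say how frequencies were computed, so the handling of missing calls and the function names here are assumptions.

```python
def allele_frequency(genotypes):
    """Alternate-allele frequency from genotype strings such as '0/1' or '1|1'.

    Missing alleles ('.') are skipped; returns None if nothing was called.
    """
    alt_alleles = 0
    called_alleles = 0
    for genotype in genotypes:
        for allele in genotype.replace("|", "/").split("/"):
            if allele == ".":
                continue
            called_alleles += 1
            if allele != "0":
                alt_alleles += 1
    if called_alleles == 0:
        return None
    return alt_alleles / called_alleles


def is_fixed(genotypes, threshold=0.99):
    """True if the alternate allele frequency exceeds the threshold."""
    frequency = allele_frequency(genotypes)
    return frequency is not None and frequency > threshold
```

For example, 99 homozygous carriers and one heterozygous carrier give a frequency of 199/200 = 0.995, which passes the 0.99 threshold.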
Fine-tuning breakpoints of deletions
Although sequence-resolved, many deletions are extremely similar and likely near-duplicates of the same real deletion.
[Figure: near-duplicate deletions in the SV catalog, and the variant supported in each sample by Paragraph and by vg, compared with the truth set.]
In > 9K clusters, the 760 samples supported mostly one variant.
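One way to picture this step: within a cluster of near-duplicate deletions, count how many samples carry each candidate and keep the one with the most support. The sketch below is only an illustration under that assumption; the slides do not describe the actual clustering or selection procedure, and the variant identifiers are hypothetical.

```python
from collections import Counter


def carries_variant(genotype):
    """True if a genotype string such as '0/1' has at least one non-reference allele."""
    alleles = genotype.replace("|", "/").split("/")
    return any(allele not in ("0", ".") for allele in alleles)


def best_supported(cluster):
    """Pick the most supported variant in a cluster of near-duplicates.

    cluster maps a variant id to the list of genotypes across samples.
    Returns (variant_id, share of carrier calls going to that variant).
    """
    support = Counter(
        {vid: sum(carries_variant(gt) for gt in gts) for vid, gts in cluster.items()}
    )
    total = sum(support.values())
    if total == 0:
        return None, 0.0
    variant_id, count = support.most_common(1)[0]
    return variant_id, count / total


# Hypothetical cluster of two near-duplicate deletions across three samples.
cluster = {
    "del_96bp": ["0/1", "1/1", "0/0"],
    "del_97bp": ["0/0", "0/0", "0/1"],
}
winner, share = best_supported(cluster)
```

A share close to 1 would correspond to a cluster where the samples supported mostly one variant.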
Coding deletions with fine-tuned breakpoints
95 of the fine-tuned deletions overlap coding regions.
Two near-duplicate deletions overlapped the DRD4 gene, within a long tandem repeat. Is it a 96 bp or a 97 bp deletion?
All samples supported the 96 bp deletion, which matches the known 2-copy version of the 48 nt repeat (DRD4-2R).
[Figure: genome browser view of hg38 chr11:639,600-640,400 showing DRD4 (GENCODE v32), OMIM gene 126452, and the RepeatMasker and Simple Tandem Repeats (TRF) tracks.]
Conclusions
The vg toolkit can integrate and genotype SVs.
760 TOPMed samples genotyped in 5 days using the BioData Catalyst ecosystem.
SV catalog from long-read studies annotated with frequencies and better breakpoint resolution.
Future directions
Documented workflows for the BioData Catalyst community (and GSP through NHGRI AnVIL).
More SVs genotyped in more TOPMed samples for association studies.
[Figure: overview. SVs from long-read studies (HGSVC, SVPOP, GIAB), short-read studies (gnomAD, TOPMed SV-WG), and high-quality phased assemblies (Human Pangenome) are combined with GRCh38 using vg; TOPMed short reads are genotyped with vg; the resulting SV genotypes and phenotypes feed an association study and an annotated SV catalog.]
Conclusions and future directions 17
Acknowledgment
vg Team: Benedict Paten, Glenn Hickey, David Heller, Adam Novak, Erik Garrison, Jouni Siren, Jordan Eizenga, Charles Markello, Xian Chang, Robin Rounthwaite, Jonas Sibbesen, Eric T. Dawson
BioData Catalyst Team: Beth Sheets (talk to her!), Michael Baumann, Brian Hannafious
Genotyping variants in vg
[Figure: a deletion and an insertion shown on the linear reference and in the graph. On the linear reference, reads from the insertion are not mapped. In the graph, each variant site (Snarl 1, Snarl 2) has a reference path and a variant path, and reads map to one or the other. A path coverage ratio of 1:1.6 gives a heterozygous call; a ratio of 0:2 gives a homozygous call.]
Genotyping is based on the path coverage.
A snarl is a variant site in the graph, a "bubble".
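To make the idea of coverage-based genotyping concrete, here is a deliberately simplified sketch that calls a genotype at one snarl from the read coverage of its reference and variant paths. It is not vg's actual genotyping model, which the slide does not describe; the thresholds, the function name, and the reading of the ratios as reference:variant are all assumptions.

```python
def genotype_from_coverage(ref_cov, alt_cov, hom_fraction=0.9, min_total=1.0):
    """Call a diploid genotype from the coverage of two paths through a snarl.

    hom_fraction and min_total are illustrative thresholds, not vg parameters.
    """
    total = ref_cov + alt_cov
    if total < min_total:
        return "./."                  # too little coverage to make a call
    alt_fraction = alt_cov / total
    if alt_fraction >= hom_fraction:
        return "1/1"                  # nearly all reads on the variant path
    if alt_fraction <= 1 - hom_fraction:
        return "0/0"                  # nearly all reads on the reference path
    return "0/1"                      # both paths are covered


# The two ratios shown on the slide, read as reference:variant coverage.
snarl_1 = genotype_from_coverage(1, 1.6)   # both paths covered
snarl_2 = genotype_from_coverage(0, 2)     # only the variant path covered
```

Under these assumed thresholds the first snarl comes out heterozygous and the second homozygous for the variant, in line with the het and hom labels on the slide.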
Results on HGSVC - Simulated reads
[Figure: F1 score (0 to 1) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for Paragraph, vg, BayesTyper, Delly, and SVTyper, grouped into graph-based and traditional SV genotypers.]
Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
Deletion correctly genotyped by vg
[Figure: reads, GRCh38 chr2, and the graph across a deletion in the 3’ UTR of the LONRF2 gene.]
51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.
Simple repeat/low complexity regions are challenging
[Figure: recall versus precision (both 0 to 1) by SV type (INS, DEL) and repeat class/family: SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat.]
SV sequence annotated with RepeatMasker. A class is assigned if the sequence is covered ≥ 80% by a repeat element.
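The 80% coverage rule can be sketched as follows, assuming repeat annotations are given as 0-based half-open intervals on the SV sequence. Summing coverage per repeat class (rather than per individual element) is an assumption, as are the function names and the input format; this does not parse real RepeatMasker output.

```python
def covered_length(intervals):
    """Number of bases covered by a list of (start, end) intervals, merging overlaps."""
    total = 0
    previous_end = 0
    for start, end in sorted(intervals):
        start = max(start, previous_end)   # skip bases already counted
        if end > start:
            total += end - start
            previous_end = end
    return total


def assign_repeat_class(sv_length, hits, min_fraction=0.8):
    """Return the repeat class covering at least min_fraction of the SV, or None.

    hits is a list of (start, end, repeat_class) on the SV sequence.
    """
    by_class = {}
    for start, end, repeat_class in hits:
        clipped = (max(0, start), min(sv_length, end))
        by_class.setdefault(repeat_class, []).append(clipped)
    best = None
    for repeat_class, intervals in by_class.items():
        fraction = covered_length(intervals) / sv_length
        if fraction >= min_fraction and (best is None or fraction > best[1]):
            best = (repeat_class, fraction)
    return best[0] if best else None
```

For instance, a 300 bp insertion with a single Alu hit spanning 280 bp is covered at about 93% and would be labelled SINE/Alu, while a 100 bp SV split evenly between two classes would get no label.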
Frequency distribution vs variant size
UMAP
Genotype quality and samples with genotype calls