Genotyping structural variants in pangenome graphs using the vg toolkit
Jean Monlong
November 7, 2019, Genome Informatics
Introduction
Pangenome graphs and variant-aware read mapping
[Figure: sequencing reads mapped to a linear reference genome and to a variation graph that incorporates 1KGP variants.]
Mapping reads across structural variants
Structural variants are genomic variants larger than 50 bp, e.g. insertions, deletions, inversions, and translocations.
[Figure: a deletion and an insertion represented in a linear reference genome and in a variation graph.]
SV catalogs from long-read sequencing studies
Ref.                  Project                                                Samples
Chaisson et al. 2019  Human Genome Structural Variation Consortium (HGSVC)   3
Audano et al. 2019    SVPOP                                                  15
Zook et al. 2019      Genome in a Bottle (GIAB)                              1
[Figure: number of variants by size (bp, from 10 to 100,000) in HGSVC and gnomAD-SV, split by SV type (DEL, INS).]
Goal
The vg toolkit is a complete, open source solution for graph construction, read mapping, and variant calling.
https://github.com/vgteam/vg
Garrison et al. Nature Biotech 2018
Can we genotype SVs from short-read sequencing datasets with the vg toolkit, starting from public SV catalogs or de novo assemblies?
Hickey et al. bioRxiv 2019
From SV catalogs in human
Genotyping public SV catalogs in human
Evaluate genotype predictions for a sample from the truth set (e.g. HG00514).
[Figure: workflow. GRCh38 and the HGSVC SV catalog VCF are combined into a graph with vg; HG00514 short reads are mapped and genotyped with vg; the genotyped SVs are evaluated against the HG00514 VCF.]
Genotyping variants in vg
Genotyping is based on the path coverage. A snarl is a variant site in the graph, a “bubble”.
[Figure: reads mapped to a linear reference and to a graph containing a deletion and an insertion. Reads covering the insertion are not mapped on the linear reference. In the graph, reads map to the reference path or to the variant path at each of two snarls. Example path coverage ratios: 1:1.6 → het, 0:2 → hom.]
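As a rough illustration of genotyping from path coverage at a snarl, the sketch below turns the read support on the reference and variant paths into a genotype. The function name, the `min_fraction` threshold and the simple fraction rule are assumptions made for this example; the slide does not describe the actual model used by vg.

```python
def genotype_from_path_coverage(ref_cov, alt_cov, min_fraction=0.2):
    """Toy genotyper for one snarl (hypothetical rule, not vg's model).

    ref_cov / alt_cov: read coverage on the reference / variant path.
    min_fraction: assumed minimum share of reads needed to call an allele.
    """
    total = ref_cov + alt_cov
    if total == 0:
        return "./."  # no reads on either path: no call
    alt_fraction = alt_cov / total
    if alt_fraction < min_fraction:
        return "0/0"  # almost all reads follow the reference path
    if alt_fraction > 1 - min_fraction:
        return "1/1"  # almost all reads follow the variant path
    return "0/1"      # both paths are supported


# The two coverage ratios shown on the slide.
print(genotype_from_path_coverage(1.0, 1.6))  # 0/1 (het)
print(genotype_from_path_coverage(0.0, 2.0))  # 1/1 (hom)
```

A real genotyper would likely model sequencing depth and mapping errors rather than apply a fixed cut-off; the threshold here only makes the idea concrete.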
Evaluating SV genotypes with a truth set
Deletions/Inversions: at least 50% coverage and 10% reciprocal overlap.
Insertions: at least 50% of the inserted sequence matching nearby insertions.
R package: https://github.com/jmonlong/sveval
[Figure: calls compared with truth-set SVs, with examples rejected for <50% coverage or <10% reciprocal overlap; for insertions, 20 bp distances are marked between truth and call insertions.]
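The deletion/inversion criteria can be sketched as follows. This is one plausible reading of the slide, not the sveval implementation: truth variants with at least 10% reciprocal overlap are kept, and a call is matched if they jointly cover at least 50% of it. Intervals are assumed to be half-open `(start, end)` pairs on the same chromosome with positive length, and all function names are hypothetical.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions between intervals a and b."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return min(inter / (a[1] - a[0]), inter / (b[1] - b[0]))


def covered_fraction(call, truths, min_ro=0.1):
    """Fraction of `call` covered by truth intervals passing `min_ro`."""
    # Keep truth intervals with enough reciprocal overlap, clipped to the call.
    segments = sorted(
        (max(call[0], t[0]), min(call[1], t[1]))
        for t in truths
        if reciprocal_overlap(call, t) >= min_ro
    )
    covered, cursor = 0, call[0]
    for start, end in segments:
        start = max(start, cursor)  # do not count overlapping bases twice
        if end > start:
            covered += end - start
            cursor = end
    return covered / (call[1] - call[0])


def is_match(call, truths, min_cov=0.5, min_ro=0.1):
    """True if the call satisfies both thresholds against the truth set."""
    return covered_fraction(call, truths, min_ro) >= min_cov


print(is_match((100, 200), [(90, 160)]))    # True: 60% of the call is covered
print(is_match((100, 200), [(190, 1000)]))  # False: reciprocal overlap below 10%
```

The same function could be applied in the other direction (truth variant against calls) to count false negatives.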
Results on HGSVC - Simulated reads
Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
[Figure: F1 (0 to 1) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for Paragraph, vg, BayesTyper, Delly and SVTyper, grouped into graph-based SV genotypers and traditional SV genotypers.]
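For reference, the F1 score reported in these results is the harmonic mean of precision and recall. The sketch below computes it from counts of true positives, false positives and false negatives; the function name and the convention of returning 0 when a denominator is empty are choices made for this example.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from TP, FP and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, for illustration only.
precision, recall, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```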
Results on HGSVC - Real reads
Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
[Figure: F1 (0 to 0.8) for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, for Paragraph, vg, BayesTyper, Delly and SVTyper, grouped into graph-based SV genotypers and traditional SV genotypers.]
Simple repeat/low complexity regions are challenging
SV sequence annotated with RepeatMasker. Class assigned if covered ≥ 80% by a repeat element.
[Figure: precision (x-axis) versus recall (y-axis), both from 0 to 1, by repeat class/family (SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat) and SV type (INS, DEL).]
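The class assignment rule can be sketched as below, assuming the RepeatMasker annotation has already been summarized as the number of SV bases covered by each repeat class. The function name and the input layout are hypothetical; only the 80% threshold comes from the slide.

```python
def assign_repeat_class(sv_length, covered_bp_by_class, min_fraction=0.8):
    """Return the repeat class covering at least `min_fraction` of the SV.

    covered_bp_by_class: dict mapping a repeat class/family to the number
    of SV bases it covers (assumed precomputed from RepeatMasker output).
    Returns None when no class reaches the threshold.
    """
    best = None
    for repeat_class, covered_bp in covered_bp_by_class.items():
        fraction = covered_bp / sv_length
        if fraction >= min_fraction and (best is None or fraction > best[1]):
            best = (repeat_class, fraction)
    return best[0] if best else None


# Made-up annotations, for illustration only.
print(assign_repeat_class(300, {"SINE/Alu": 280, "Simple_repeat": 20}))  # SINE/Alu
print(assign_repeat_class(1000, {"LINE/L1": 500}))                       # None
```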
From de novo assemblies in yeast
Challenges with the VCF format
Multiple equivalent representations, over-simplification, impractical.
[Figure: VCF v4.2 specs.]
Why not start directly from de novo assemblies?
Analysis of 12 yeast strains from 2 clades
Selected 5 strains to build graph: one reference + 2 per clade.
[Figure: workflow. Assemblies for 5 yeast strains are used to build two graphs with vg: a VCF graph, from pairwise alignment with the reference, and a Cactus graph, from the Cactus aligner. Illumina reads are mapped to each graph, SVs are genotyped with vg, and mapping metrics are compared.]
Evaluating SV genotyping using mapping statistics
No gold-standard to compare with.
Map reads to a sample graph built from the SV calls:
Mapping quality ∼ Sample graph quality ∼ SV calls quality.
[Figure: workflow. Assemblies for 5 yeast strains are used to build a VCF graph (pairwise alignment with the reference) and a Cactus graph (Cactus aligner). Short reads are mapped to each graph and genotyped with vg; for the evaluation, reads are mapped again and mapping metrics are compared.]
Better mapping for SVs called in the cactus graph
Analysis restricted to reads at variation sites.
[Figure: average mapping identity per strain, VCF (x-axis) versus Cactus (y-axis), both from 0.4 to 0.8. Strains shown: YPS128, Y12, SK1, UFRJ50816, DBVPG6765, DBVPG6044, UWOPS919171, CBS432, N44, UWOPS034614, YPS138. Points are marked by clade (cerevisiae, paradoxus) and by whether the strain was included in or excluded from graph construction.]
Conclusions and future directions
Conclusions
The vg toolkit can integrate and genotype SVs.
Graphs built from alignments of de novo assemblies perform better.
Hickey et al. bioRxiv 2019 https://jmonlong.github.io/manu-vgsv/
Future directions
Experiment with high-quality human de novo assemblies (e.g. the Human PanGenome Project).
Combine public SV catalogs and genotype SVs in a large and diverse cohort.
Acknowledgment
Benedict Paten, Glenn Hickey, David Heller, Adam Novak, Erik Garrison, Jouni Siren, Jordan Eizenga, Charles Markello, Xian Chang, Robin Rounthwaite, Jonas Sibbesen, Eric T. Dawson
Universal genome graph
Some methods “over-genotype” similar variants
[Figure: average number of genotyped calls per truth call (x-axis, 0.9 to 1.2) for each method (SMRT-SV v2 Genotyper, Delly Genotyper, SVTyper, BayesTyper, Paragraph, vg), by experiment (HGSVC real reads, GIAB, CHM-PD, SVPOP) and SV type (INS, DEL). An inset illustrates similar variants S1, S2 and S3 in the truth set and in the Paragraph and vg calls.]
Deletion correctly genotyped by vg
51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene.
[Figure: reads aligned to GRCh38 chr2 and to the graph containing the deletion, in the 3’ UTR of the LONRF2 gene.]
Simulation experiment
[Figure: best F1 (0 to 1) versus depth (1, 3, 7, 13, 20) for INS, DEL and INV, with true SVs in the VCF or errors in the VCF, for each method (vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper).]
SV catalog summary results
[Figure: best F1 (0 to 1) for INS and DEL in all and non-repeat genomic regions, for each dataset (HGSVC simulated reads, HGSVC real reads, GIAB, CHM-PD, SVPOP), method (vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper, SMRT-SV v2 Genotyper) and SV evaluation (presence, genotype).]
Precision-recall curve
[Figure: precision versus recall (both from 0 to 1) for INS and DEL, by method (vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper) and genomic regions (all, non-repeat).]
Evaluation per SV size
[Figure: F1 score (0 to 1) by SV size (bp) for INS and DEL, in all regions and non-repeat regions, for HGSVC simulated reads, HGSVC real reads and GIAB.]