
Genotyping structural variants in pangenome graphs using the vg toolkit

Jean Monlong. November 7, 2019. Genome Informatics.


  1. Genotyping structural variants in pangenome graphs using the vg toolkit. Jean Monlong, November 7, 2019, Genome Informatics.

  2. Pangenome graphs and variant-aware read mapping. [Figure: a sequencing read (TGGGC) mapped to a linear reference genome (ACATTGGC) and to a variation graph containing variants labeled 1KGP.] (Section: Introduction)

  3. Mapping reads across structural variants. Structural variants are genomic variants larger than 50 bp, e.g. insertions, deletions, inversions, and translocations. [Figure: a deletion and an insertion shown on a linear reference genome and in a variation graph.]

  4. SV catalogs from long-read sequencing studies. Chaisson et al. 2019: Human Genome Structural Variation Consortium (HGSVC), 3 samples. Audano et al. 2019: SVPOP, 15 samples. Zook et al. 2019: Genome in a Bottle (GIAB), 1 sample. [Figure: number of variants by size (bp, 10 to 100,000) for deletions (DEL) and insertions (INS) in HGSVC and gnomAD-SV.]

  5. The vg toolkit is a complete, open source solution for graph construction, read mapping, and variant calling. https://github.com/vgteam/vg (Garrison et al., Nature Biotech 2018). Goal: can we genotype SVs from short-read sequencing datasets with the vg toolkit, starting from public SV catalogs or de novo assemblies? (Hickey et al., bioRxiv 2019)

  6. Genotyping public SV catalogs in human. [Figure: workflow. A vg graph is built from GRCh38 and the HGSVC SV catalog VCF; HG00514 short reads are mapped and genotyped with vg; the HG00514 genotyped SVs are evaluated against the HG00514 VCF.] Evaluate genotype predictions for a sample from the truth set (e.g. HG00514). (Section: From SV catalogs in human)

  7. Genotyping variants in vg. Genotyping is based on the path coverage. A snarl is a variant site in the graph, a “bubble”. [Figure: a deletion and an insertion on the linear reference, where reads from the insertion are not mapped, and in the graph, where reads map to either the reference path or the variant path. Snarl 1: path coverage ratio 1:1.6 → het. Snarl 2: path coverage ratio 0:2 → hom.]
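
The coverage-ratio idea on this slide can be illustrated with a toy function. This is a minimal sketch, not vg's actual genotyping model: the function name, the `het_fraction` threshold, and the reading of the slide's ratios as reference:variant coverage are all assumptions made for illustration.

```python
def genotype_from_path_coverage(ref_cov, alt_cov, het_fraction=0.2, min_total=1.0):
    """Toy genotype call at one snarl from read coverage on two paths.

    ref_cov / alt_cov: read coverage on the reference and variant paths.
    het_fraction and min_total are hypothetical thresholds, not vg parameters.
    """
    total = ref_cov + alt_cov
    if total < min_total:
        return "./."  # too little coverage to make a call
    alt_fraction = alt_cov / total
    if alt_fraction < het_fraction:
        return "0/0"  # almost all reads support the reference path
    if alt_fraction > 1 - het_fraction:
        return "1/1"  # almost all reads support the variant path
    return "0/1"      # both paths are supported


# The two ratios shown on the slide, read as reference:variant coverage.
print(genotype_from_path_coverage(1, 1.6))  # heterozygous
print(genotype_from_path_coverage(0, 2))    # homozygous for the variant
```

A fixed threshold keeps the example short; a real genotyper would more plausibly weigh read support, base qualities, and expected depth in a likelihood model.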

  8. Evaluating SV genotypes with a truth set. Deletions/inversions: at least 50% coverage and 10% reciprocal overlap. Insertions: at least 50% of the inserted sequence matching nearby insertions. [Figure: truth and call intervals illustrating <50% coverage and <10% reciprocal overlap for deletions, and 20 bp flanks around insertions.] R package: https://github.com/jmonlong/sveval
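
The deletion criterion can be sketched in a few lines. This is one possible reading of the slide, not the sveval implementation: calls are first filtered to those with at least 10% reciprocal overlap with the truth deletion, and the truth deletion counts as matched if the remaining calls cover at least 50% of it. The function names and the half-open `(start, end)` interval convention are assumptions.

```python
def reciprocal_overlap(a, b):
    """Overlap length divided by the longer of the two intervals.

    Intervals are half-open (start, end) tuples with end > start.
    The result is >= x exactly when the overlap is >= x of both intervals.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[1] - a[0], b[1] - b[0])


def covered_fraction(truth, calls, min_ro=0.1):
    """Fraction of the truth interval covered by calls passing the overlap filter."""
    kept = sorted(c for c in calls if reciprocal_overlap(truth, c) >= min_ro)
    covered = 0
    cursor = truth[0]  # right edge of the region already counted
    for start, end in kept:
        start = max(start, cursor)
        end = min(end, truth[1])
        if end > start:
            covered += end - start
            cursor = end
    return covered / (truth[1] - truth[0])


def is_matched(truth, calls, min_cov=0.5, min_ro=0.1):
    """True if the truth deletion is covered at least min_cov by eligible calls."""
    return covered_fraction(truth, calls, min_ro) >= min_cov


truth = (100, 200)
print(is_matched(truth, [(90, 160)]))   # 60% covered
print(is_matched(truth, [(0, 5000)]))   # call fails the reciprocal overlap filter
```

The cursor avoids double-counting when two calls overlap each other, so the coverage is that of the union of calls rather than their sum.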

  9. Results on HGSVC, simulated reads. [Figure: F1 for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, comparing graph-based and traditional SV genotypers: Paragraph, vg, BayesTyper, Delly, SVTyper.] Non-repeat regions: regions not overlapping segmental duplications or simple repeats.

  10. Results on HGSVC, real reads. [Figure: F1 for insertions (INS) and deletions (DEL), whole-genome and in non-repeat regions, comparing graph-based and traditional SV genotypers: Paragraph, vg, BayesTyper, Delly, SVTyper.] Non-repeat regions: regions not overlapping segmental duplications or simple repeats.
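
The F1 score plotted in the two results slides is the harmonic mean of precision and recall. The sketch below uses the standard definition from true positive, false positive, and false negative counts; how the evaluation in the talk counts those for SVs is not specified here, and the function name is hypothetical.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only, not results from the talk.
print(f1_score(tp=80, fp=20, fn=20))
```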

  11. Simple repeat/low complexity regions are challenging. [Figure: precision versus recall for INS and DEL by repeat class/family: SINE/Alu, LTR/ERV1, LINE/L1, Retroposon/SVA, Low_complexity, Satellite, Satellite/centr, Simple_repeat.] SV sequence annotated with RepeatMasker. Class assigned if covered ≥ 80% by a repeat element.
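
The 80% rule for assigning a repeat class can be sketched as follows. The input format (per-class covered bases, already summarized from a RepeatMasker annotation) and the function name are assumptions for illustration; parsing RepeatMasker output is left out.

```python
def assign_repeat_class(sv_length, repeat_hits, min_fraction=0.8):
    """Return the repeat class covering at least min_fraction of the SV sequence.

    repeat_hits: list of (repeat_class, covered_bases) pairs, assumed to be
    summarized beforehand from a RepeatMasker annotation of the SV sequence.
    Returns None when no single class reaches the threshold.
    """
    best_class, best_bases = None, 0
    for repeat_class, covered_bases in repeat_hits:
        if covered_bases > best_bases:
            best_class, best_bases = repeat_class, covered_bases
    if sv_length > 0 and best_bases / sv_length >= min_fraction:
        return best_class
    return None


# Hypothetical 300 bp insertion mostly made of an Alu element.
print(assign_repeat_class(300, [("SINE/Alu", 280), ("Simple_repeat", 20)]))
```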

  12. Challenges with the VCF format: multiple equivalent representations, over-simplification, impractical (VCF v4.2 specs). Why not start directly from de novo assemblies? (Section: From de novo assemblies in yeast)

  13. Analysis of 12 yeast strains from 2 clades. Selected 5 strains to build graph: one reference + 2 per clade. [Figure: workflow. Two graphs are built from the assemblies for 5 yeast strains: a VCF graph (pairwise alignment with reference → VCF → vg) and a Cactus graph (Cactus aligner → vg). Illumina reads are mapped to each graph and genotyped with vg to produce a VCF, and mapping metrics are compared.]

  14. Evaluating SV genotyping using mapping statistics. No gold-standard to compare with. [Figure: the VCF graph and Cactus graph workflow for the 5 yeast assemblies, with short reads mapped and genotyped with vg, followed by an evaluation step.]

  15. Evaluating SV genotyping using mapping statistics. No gold-standard to compare with. Map reads to a sample graph built from the SV calls: mapping quality ∼ sample graph quality ∼ SV calls quality. [Figure: the VCF graph and Cactus graph workflow for the 5 yeast assemblies, with short reads mapped and genotyped with vg, followed by an evaluation step.]

  16. Better mapping for SVs called in the cactus graph. Analysis restricted to reads at variation sites. [Figure: scatter plot of average mapping identity, Cactus versus VCF (both axes 0.4 to 0.8), one point per strain (YPS128, Y12, SK1, UFRJ50816, DBVPG6765, DBVPG6044, UWOPS919171, CBS432, N44, UWOPS034614, YPS138), marked by whether the strain was included in or excluded from graph construction and by clade (cerevisiae, paradoxus).]

  17. Conclusions. The vg toolkit can integrate and genotype SVs. Graphs built from alignments of de novo assemblies perform better. Hickey et al. bioRxiv 2019, https://jmonlong.github.io/manu-vgsv/ (Section: Conclusions and future directions)

  18. Conclusions. The vg toolkit can integrate and genotype SVs. Graphs built from alignments of de novo assemblies perform better. Hickey et al. bioRxiv 2019, https://jmonlong.github.io/manu-vgsv/ Future directions. Experiment with high-quality human de novo assemblies (e.g. the Human PanGenome Project). Combine public SV catalogs and genotype SVs in a large and diverse cohort.

  19. Acknowledgment: Benedict Paten, Glenn Hickey, David Heller, Adam Novak, Erik Garrison, Jouni Siren, Jordan Eizenga, Charles Markello, Xian Chang, Robin Rounthwaite, Jonas Sibbesen, Eric T. Dawson.

  20. Universal genome graph.

  21. Some methods “over-genotype” similar variants. [Figure: average number of genotyped calls per truth call (0.9 to 1.2) for INS and DEL, by method (vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper, SMRT-SV v2 Genotyper) and experiment (HGSVC real reads, GIAB, CHM-PD, SVPOP), with a schematic of calls from Paragraph and vg against the truth set.]

  22. Deletion correctly genotyped by vg: 51 bp homozygous deletion in the 3’ UTR of the LONRF2 gene. [Figure: reads on GRCh38 chr2 and the deletion in the graph.]

  23. Simulation experiment. [Figure: best F1 versus depth (1, 3, 7, 13, 20) for INS, DEL, and INV, with true SVs in the VCF and with errors in the VCF, for vg, Paragraph, BayesTyper, SVTyper, and Delly Genotyper.]

  24. SV catalog summary results. [Figure: best F1 for INS and DEL in all and non-repeat genomic regions, for HGSVC simulated reads, HGSVC real reads, GIAB, CHM-PD, and SVPOP. Methods: vg, Paragraph, BayesTyper, SVTyper, Delly Genotyper, SMRT-SV v2 Genotyper. SV evaluation: presence or genotype.]

  25. Precision-recall curve. [Figure: precision versus recall for INS and DEL, in all and non-repeat genomic regions, for vg, Paragraph, BayesTyper, SVTyper, and Delly Genotyper.]

  26. Evaluation per SV size. [Figure: F1 score by SV size (bp) bin for INS and DEL, in all regions and non-repeat regions, for HGSVC simulated reads, HGSVC real reads, and GIAB.]
