Analysis of structural genome variation in whole genome and exome sequencing data
Victor Guryev
November 15, 2017
14th SNP’s and human diseases course, Erasmus MC, Rotterdam
Our genomes: base and structural variants
NGS: how do we get our genomes?
1000 Genomes Project (1kG)
Low-coverage whole genome and deep exome sequencing of 2,500 individuals to discover 95% of variants at 1% frequency
Small variants: The 1000 Genomes Project Consortium, 2015. Nature 526:68-74
Structural variants: Sudmant et al, 2015. Nature 526:75-81
Genome of the Netherlands (GoNL)
Position paper: Boomsma et al, 2013
Small variants: Francioli et al, 2014
Structural variants: Hehir-Kwa et al, 2016

                  1000G                 GoNL
DNA source        Cell lines            Blood
Coverage          3-4x                  >12x
Data generation   Mult. platforms       BGI/Illumina
Population        Multiple, unrelated   Dutch only, trios, twins
Phenotype info    None                  Multiple

[Figure labels: 500 bp, 90 bp; median base coverage: 12x]
SV classes and detection methods
Structural Genome Variations (SVs); reference order: ABCD
Copy-number variants
• Deletion: ABD
• Duplication: ABCCCD
Copy-balanced variants
• Inversion: ADCB
• Translocation: AB CD
Detection methods: aCGH; di-tag fosmid and NGS sequencing; Fibre-FISH
Single-end vs paired-end sequencing
[Figure labels: chromosome]
Advantages of paired-end sequencing
1) Twice as many bases per slide!
2) Structural information!
• Genome assembly
• Molecular haplotyping (phasing)
• Better repeat coverage
• Structural variants
• Profiling of transcript isoforms
[Figure labels: SINE, CONTIG 1, CONTIG 2]
Paired-end vs mate-pair sequencing
Paired-end library: fragmentation, size selection; adaptor ligation. Insert size: 200-400 bp
Mate-pair library: fragmentation, size selection; internal adaptor ligation, circularization; adaptor ligation. Insert size: 0.6-25 kb
Method 1: Read depth analysis (RD)
Scope: copy-number changes
Expected distribution of tags: average coverage of 5 WGS/site
Distribution over a duplicated site: 5 WGS/site, 10 WGS/site, 5 WGS/site
Tool examples:
• CNV-Seq (Xie & Tammi, 2009)
• CNVnator (Abyzov et al, 2011)
• SegSeq (Chiang et al, 2009)
• DWAC-Seq (our tool)
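The read-depth idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of CNV-Seq, CNVnator, SegSeq or DWAC-Seq: the function names, the fixed windows and the gain/loss thresholds (1.5 and 0.5) are assumptions chosen for the example. The input mimics the slide's picture of 5 reads per site with 10 over the duplicated site.

```python
from statistics import median

def depth_ratios(window_counts):
    """Divide each window's read count by the median count over all windows."""
    baseline = median(window_counts)
    if baseline == 0:
        raise ValueError("median depth is zero; use larger windows")
    return [count / baseline for count in window_counts]

def call_windows(window_counts, gain=1.5, loss=0.5):
    """Label each window as 'gain', 'loss' or 'normal' (thresholds are assumptions)."""
    labels = []
    for ratio in depth_ratios(window_counts):
        if ratio >= gain:
            labels.append("gain")
        elif ratio <= loss:
            labels.append("loss")
        else:
            labels.append("normal")
    return labels

# Hypothetical counts: 5 reads per window, 10 over the duplicated window.
print(call_windows([5, 5, 10, 5, 5]))
```

Real tools additionally correct for GC content and mappability and merge adjacent windows into segments; none of that is attempted here.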
Method 2: Discordant pairs (DP)
Scope: copy-number and copy-neutral SVs at close to base-pair resolution
Pair signatures (sequenced/mapped vs reference): normal, inversion, tandem duplication, insertion, deletion, translocation (e.g. Chr 7 and Chr 5)
Tool examples: Breakdancer (Chen et al, 2009); 123SV (our tool)
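The pair signatures listed on this slide can be turned into a toy classifier. The sketch below assumes a standard paired-end library in which concordant mates face each other (left mate on "+", right mate on "-") and uses the 200-400 bp insert range from the library slide. The `Pair` record, the function name and the use of mate start positions as the span are simplifications introduced for illustration; this is not how Breakdancer or 123SV are implemented, and real callers cluster many pairs before reporting an event.

```python
from dataclasses import dataclass

@dataclass
class Pair:
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str

def classify_pair(pair, min_insert=200, max_insert=400):
    """Return the SV signature suggested by a single read pair."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    # Order the two mates by mapping position.
    left, right = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    if left[1] == right[1]:
        return "inversion"           # both mates on the same strand
    if left[1] == "-" and right[1] == "+":
        return "tandem duplication"  # mates face outward
    span = right[0] - left[0]
    if span > max_insert:
        return "deletion"            # mates map further apart than expected
    if span < min_insert:
        return "insertion"           # mates map closer together than expected
    return "normal"

print(classify_pair(Pair("chr7", 1000, "+", "chr5", 5000, "-")))
print(classify_pair(Pair("chr1", 1000, "+", "chr1", 6000, "-")))
```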
Method 3: Split-read mapping (SR)
Scope: prediction of copy-number and copy-neutral SVs at nucleotide resolution
Tool examples: Pindel (Ye et al, 2009); SRiC (Zhang et al, 2011)
• Evidence from multiple reads
• Advantage of paired reads: one mate serves as the anchor, the other is the split read
• Unmapped reads are good candidates for split-mapping
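To make the split-read idea concrete, here is a deliberately naive sketch that looks for a deletion by cutting a read at every position and checking whether the two parts match the reference exactly, with a gap between them. The function name, the exact-match search and the `min_part` parameter are assumptions for illustration. Pindel uses an anchored pattern-growth search and tolerates mismatches, which this toy does not attempt.

```python
def find_deletion_split(reference, read, min_part=5):
    """Return (start, end) of a deleted reference interval, 0-based half-open,
    if the read can be split into two exact matches separated by a gap."""
    for cut in range(min_part, len(read) - min_part + 1):
        prefix, suffix = read[:cut], read[cut:]
        prefix_start = reference.find(prefix)
        if prefix_start == -1:
            continue
        left_end = prefix_start + len(prefix)
        # Require at least one skipped base between the two parts.
        suffix_start = reference.find(suffix, left_end + 1)
        if suffix_start != -1:
            return (left_end, suffix_start)
    return None

# Hypothetical reference; the read skips the run of eight C bases.
reference = "GATTACAGGCTTCCCCCCCCAGTCGATGCA"
split_read = "ACAGGCTT" + "AGTCGATG"
print(find_deletion_split(reference, split_read))
```

A read that matches the reference contiguously yields `None`, because no split leaves a gap between the two parts.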
Method 4: Genome assembly (AS)
Scope: various types of SVs, including large inserts
Tool examples: de novo assemblers SOAPdenovo, ABySS, ALLPATHS-LG
BLAST/BLAT/BWA-SW search to compare contigs with the genome reference
[Figure: imperfect alignment of contig to reference]
Method applicability: base and physical coverage
[Figure: read pairs along a chromosome; base coverage ~1x, physical coverage ~4x]

Approach            Base coverage   Physical coverage
Depth of coverage   yes
Discordant pairs                    yes
Split-mapping       yes
De novo assembly    yes             yes
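The difference between the two coverage measures is simple arithmetic: base coverage counts sequenced bases, while physical coverage counts the whole fragment spanned by each pair. A small sketch, assuming "insert size" means the full fragment length; the function names and the example numbers (50 bp reads, 400 bp inserts) are hypothetical and chosen to reproduce the slide's ~1x vs ~4x example.

```python
def base_coverage(n_pairs, read_length, genome_size):
    """Average number of sequenced bases covering each genome position."""
    return n_pairs * 2 * read_length / genome_size

def physical_coverage(n_pairs, insert_size, genome_size):
    """Average number of fragments (read pair plus unsequenced middle) spanning each position."""
    return n_pairs * insert_size / genome_size

# Hypothetical library: 1,000 pairs of 50 bp reads, 400 bp inserts, 100 kb genome.
print(base_coverage(1000, 50, 100_000))       # 1.0
print(physical_coverage(1000, 400, 100_000))  # 4.0
```

Longer inserts (for example mate-pair libraries) raise physical coverage without any extra sequencing, which is why discordant-pair methods can work at low base coverage.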
GoNL pipeline for SV discovery
GoNL SV detection [Hehir-Kwa et al., 2016]
Popular SV mining tools
• PINDEL (http://gmt.genome.wustl.edu/packages/pindel/): split-read mapping (very specific for short and mid-size variants)
• DELLY (https://github.com/dellytools/delly): discordant read and split-read methods
• LUMPY-SV (https://github.com/arq5x/lumpy-sv): multi-method tool
Genome sequencing: what do we get?
GoNL variant list
SNPs                        20.4M
Short indels (1-20 bp)      1.7M
Deletions (20-99 bp)        31.5k
Deletions (100+ bp)         20k
Mobile element insertions   13k
Insertions                  2.2k
Duplications                1.8k
Inversions                  90
Interchromosomal events     60
Per individual genome (compared to reference genome)
• 3.7M SNPs
• 360k short indels (1-20 bp)
• 5.2k medium deletions (20-100 bp)
• 3.3k large deletions (100+ bp)
Impact of Structural Variants
GoNL: bases affected
Variant type   Megabases
SNVs           20.4
Indels         4.3
SVs            75.3
Alu Ya4 insertion in PRAMEF4 gene
PRAMEF4: PRAME family member 4
• Insertion in a constitutive exon
• Observed in 21 samples
• Mutations in the gene are associated with melanoma
[Hehir-Kwa et al., 2016]
Complex variants: gene retrotransposition insertion polymorphism (GRIP)
[Figure: read pairs linking Chr15: 40.85 Mb and Chr7: 26.24 Mb, with a region labelled "deletion"]
Prevalence: about 40 cases in GoNL
Mechanism: (retro)transposition
Tools: discordant pairs (1-2-3-SV)
[Hehir-Kwa et al., 2016]
Genomes full of ‘knock-outs'? SKA3 cDNA
Complex variants: MNPs, complex indels
Mechanism: polymerase errors
Tool example: GATK HaplotypeCaller
Prevalence: ~3% of all indels are non-simple
[Hehir-Kwa et al., 2016]
Complex variants: Non-allelic conversion
[Figure tracks: Father, Mother, Child]
Mechanism: gene conversion
Tool examples: assembly, discordant pairs
Prevalence: only a few cases so far
Complex variants
KRAB box domain containing 4 (aka ZNF673), transcription regulator
[Hehir-Kwa et al., 2016]
New genomic segments
New segments
New segments: example Allele frequency in GoNL: 28%. 50% of Dutch population have it as
Change in expression level
Change in transcript structure
Complex variants: Chromothripsis
[Figure: breakpoint coordinates on chr1, chr4 and chr10; legend marks DNA double strand breaks]
How dynamic are our genomes?
[Figure: family icon, x 250]
Candidates:
• 1,169 de novo candidate indels (1-20 bp; 99 children)
• 601 de novo candidate SVs (20+ bp; 250 families, 258 non-identical children)
Validation by PCR, sequencing:
• 291 de novo indels: 203 small deletions, 74 insertions, 14 complex indels
• 41 de novo SVs: 27 deletions, 8 duplications, 5 Alu insertions, 1 complex event
Genome Res (2015) 25:792–801
De novo SVs: size distribution
De novo mutations: parental and familial bias
Non-uniform distribution of SVs, p = 0.0074
[Figure panels: Indels, SVs]
What about targeted re-sequencing?
[Figure tracks: WGS Father, WGS Mother, WGS Child, WES Father, WES Mother, WES Child]
• The same methodologies are applicable to WES
• RD analysis needs additional correction to account for variation in enrichment
• Very limited sensitivity if the SV breakpoint lies outside the enriched area
Tool examples: GATK HaplotypeCaller, CoNIFER, ExomeCNV
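One simple way to picture the enrichment correction for exome read depth is to compare each target's share of reads in a sample against the same target in a panel of other samples, so that targets which always capture poorly are not mistaken for deletions. The sketch below is an assumption-laden illustration (hypothetical function name, made-up counts, median-of-panel baseline) and is not the method used by CoNIFER or ExomeCNV.

```python
from statistics import median

def normalised_target_ratios(sample_counts, reference_panel):
    """Ratio of each target's read fraction in the sample to the panel median."""
    def fractions(counts):
        total = sum(counts)
        return [count / total for count in counts]

    sample = fractions(sample_counts)
    panel = [fractions(counts) for counts in reference_panel]
    ratios = []
    for index, value in enumerate(sample):
        baseline = median(row[index] for row in panel)
        ratios.append(value / baseline if baseline > 0 else float("nan"))
    return ratios

# Hypothetical counts for four targets; target 2 has lost half of its reads.
panel = [[100, 400, 50, 450], [200, 800, 100, 900]]
sample = [100, 200, 50, 450]
print(normalised_target_ratios(sample, panel))
```

Because the sample is normalised by its own total, the unaffected targets come out slightly above 1 while the affected target drops well below 1. The uneven raw counts between targets (50 vs 450) cancel out, which is the point of the correction.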
Catching SVs from targeted sequencing
[Figure tracks: Father, WGS; Mother, WGS; Child, WGS; Father, WES; Mother, WES; Child, WES; gene annotation. Label: Del]
Not catching SVs with targeted sequencing
Heterozygous deletion in Father, inherited by Child
[Figure tracks: Father, WGS; Mother, WGS; Child, WGS; Father, WES; Mother, WES; Child, WES; gene annotation]
SV imputation
SV imputation (2)
SV imputation (3)
PacBio and OxNano: true long reads
Moleculo, 10xGenomics: synthetic long reads
Take home message: importance of SVs

Variant class                    Single base changes   Structural changes   SNV:CNV ratio
Human vs chimp                   1.23% of genome       3% of genome         1 : 2
Common variants (AF > 5%)        5.948 Mb              10.916 Mb            1 : 2
Rare variants                    6.625 Mb              28.507 Mb            1 : 4
Individual/family-specific       6.989 Mb              43.317 Mb            1 : 6
De novo variants (avg per kid)   45 bp                 4,084 bp             1 : 91
Somatic, ageing-related          ?                     ?                    1 : ?

Sources: [Chimp genome consortium, 2005]; [Hehir-Kwa, ..., Guryev, 2016]; [Kloosterman, ..., Guryev, 2015]
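The SNV:CNV ratios in this table follow from dividing the structural figure by the single-base figure in each column. The short check below assumes that "6,989 Mb" and "43,317 Mb" on the slide are decimal-comma values (6.989 Mb and 43.317 Mb); with that reading, the rounded ratios match the slide. The variable and function names are illustrative only.

```python
# (column, single base changes, structural changes); units match within each row.
columns = [
    ("Human vs chimp (% of genome)", 1.23, 3.0),
    ("Common variants, AF > 5% (Mb)", 5.948, 10.916),
    ("Rare variants (Mb)", 6.625, 28.507),
    ("Individual/family-specific (Mb)", 6.989, 43.317),
    ("De novo variants, avg per kid (bp)", 45, 4084),
]

def sv_per_snv_base(single_base, structural):
    """Bases affected by structural changes per base affected by single base changes."""
    return round(structural / single_base)

for name, single_base, structural in columns:
    print(f"{name}: 1 : {sv_per_snv_base(single_base, structural)}")
```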
Acknowledgements

GoNL SV Team
• Wigard Kloosterman, UMCU
• Laurent C. Francioli, UMCU
• Jayne Y. Hehir-Kwa, UMCN
• Djie Tjwan Thung, UMCN
• Tobias Marschall, CWI/MPI
• Alexander Schoenhuth, CWI
• Matthijs Moed, LUMC
• Eric-Wubbo Lameijer, LUMC
• Abdel Abdellaoui, VU
• Slavik Koval, EMC/LUMC
• Joep de Ligt, UMCN
• Najaf Amin, EMC
• Freerk van Dijk, UMCG
• Lennart Karssen, EM/Polyomica
• Leon Mei, LUMC
• Kai Ye, LUMC/WASHU

GoNL steering committee
• Paul de Bakker, UMCU
• Dorret Boomsma, VU
• Cornelia van Duin, EMC
• Gert-Jan van Ommen, LUMC
• Eline Slagboom, LUMC
• Morris Swertz, UMCG
• Cisca Wimenga, UMCG

University of Washington
• Fereydoun Hormozdiari
• Evan E. Eichler

BGI Shenzen
• Jun Wang

ERIBA, RuG, UMC Groningen
• Diana Spierings
• Marianna Bevova
• Rene Wardenaar
• Tristan de Jong
• Peter Lansdorp