Global patterns of copy number variation in humans from a population-based analysis.
  ICHG Kyoto
  Jean Monlong
  April 5, 2016
  Bourque Lab, McGill University, Human Genetics Dept.
Disclosure Information
  I have no financial relationships to disclose.
Copy-Number Variation
Copy Number Variation (CNV)
  Imbalanced genetic variation involving more than 500 bp.
CNV detection from High-Throughput Sequencing
  (Baker 2012, Nature Methods)
Low-mappability regions
  Repeat-rich regions, centromeres, telomeres.
  ∼13% of the human genome.
  More prone to CNV.
    Enriched in Segmental Duplications (Sharp Annual Review 2006).
    Short Tandem Repeats highly polymorphic (Warburton BMC Genomics 2008).
    Transposons involved in CNV formation (Sen AJHG 2006).
  Involved in phenotype and disease.
    Short Tandem Repeats and gene expression (Gymrek Nat. Genetics 2016).
    Repeat CNVs involved in ∼30 genetic disorders (Mirkin Nature 2007).
    Retrotransposition in cancer (Lee Science 2012).
PopSV approach
PopSV approach
  Objective
    Test the entire genome, including low-mappability regions, and detect subtle abnormal coverage.
  PopSV: Population-based approach
    Use a set of reference experiments to detect abnormal patterns.
  [Figure: number of reads mapped in the tested genomic window; labels: sample, reference.]
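The idea of testing a genomic window against a set of reference experiments can be illustrated with a per-bin z-score. This is only a minimal sketch, not PopSV's actual implementation: the function name `bin_z_scores`, the choice of mean and standard deviation as summary statistics, and the toy counts are all assumptions, and the counts are assumed to have been normalized for sequencing depth beforehand.

```python
from statistics import mean, stdev

def bin_z_scores(sample_counts, reference_counts):
    """Z-score of a sample's read count in each bin against reference samples.

    sample_counts: read counts of the tested sample, one per genomic bin.
    reference_counts: one list of per-bin counts per reference sample.
    Assumes counts were already normalized for depth (hypothetical step).
    """
    z_scores = []
    for i, count in enumerate(sample_counts):
        ref = [r[i] for r in reference_counts]
        mu, sd = mean(ref), stdev(ref)
        # A bin with no variation across references cannot be tested this way.
        z_scores.append((count - mu) / sd if sd > 0 else float("nan"))
    return z_scores

# Toy data (made up): 3 bins, 4 reference samples.
references = [[100, 98, 50], [102, 101, 52], [99, 100, 48], [101, 99, 51]]
tested = [100, 52, 50]
z = bin_z_scores(tested, references)
```

In this toy data the second bin has roughly half the reads seen in the references, so its z-score is strongly negative, while the third bin has low coverage in every sample and is not flagged. That is the point of comparing against references rather than against a genome-wide expectation.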
Benchmark and validation
  Existing methods
    FREEC: LASSO-based segmentation; GC and mappability correction.
    cn.MOPS: Multi-sample Bayesian-based segmentation.
  Whole-Genome Sequencing data
    45 samples, including 10 twin families (i.e. 2 twins + 2 parents).
    95 pairs of normal/tumor samples from Renal Cell Carcinoma (CageKid).
Benchmark and validation
  Replication in the twins.
  Concordance with pedigree.
  Replication in the paired tumor.
  Concordance of different bin sizes.
  PCR validation.
  Overall performance and in different repeat contexts.
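Replication in twins can be summarized as the fraction of one twin's calls that are also found in the other twin. The sketch below assumes calls are `(chrom, start, end)` tuples with half-open coordinates and that any overlap counts as replication. The slides do not state the criterion actually used, so both the overlap rule and the names `overlaps` and `replication_rate` are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def replication_rate(calls, other_calls):
    """Fraction of `calls` overlapping at least one region in `other_calls`."""
    if not calls:
        return float("nan")
    replicated = sum(any(overlaps(c, o) for o in other_calls) for c in calls)
    return replicated / len(calls)

# Toy calls (made up): only the chr1 call of twin 1 is seen in twin 2;
# the chr2 regions are adjacent but share no base.
twin1 = [("chr1", 1000, 6000), ("chr2", 500, 1500), ("chr3", 0, 500)]
twin2 = [("chr1", 2000, 7000), ("chr2", 1500, 2500)]
rate = replication_rate(twin1, twin2)
```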
Validation conclusions
  PopSV detects 3-5x more variants.
  Wider genomic range.
  Robust across challenging regions:
    Low-coverage.
    Segmental duplications.
    DNA satellites.
    Short tandem repeats.
    GC-rich/poor.
  Resolution down to half the bin size.
CNV patterns in normal genomes
CNV in normal genomes
  640 normal genomes
    45 samples from the Twin study (∼40X).
    95 normal samples from Renal Cell Carcinoma (∼54X).
    500 unrelated samples from GoNL (∼14X).
  Where are CNVs located?
    In Centromere? Telomere? Segmental duplication? DNA satellites? Short tandem repeats? Transposable Elements? Exons? Promoters?
  Control regions
    Same size distribution.
    Randomly distributed.
Enriched close to Centromere/Telomere/Gap (CTG)
  [Figure: cumulative proportion of regions (CNV vs. control) as a function of distance to centromere/telomere/gap (bp).]
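The comparison plotted on this slide relies on two quantities: the distance from each region to the nearest centromere/telomere/gap, and the cumulative proportion of regions within a given distance. A minimal sketch, assuming half-open `(start, end)` coordinates on a single chromosome; the function names and the toy coordinates are hypothetical and not taken from the analysis.

```python
from bisect import bisect_right

def distance_to_nearest(region, features):
    """Distance in bp from a (start, end) region to the nearest feature.

    Returns 0 if the region overlaps or touches a feature, None if no features.
    """
    start, end = region
    best = None
    for f_start, f_end in features:
        if start < f_end and f_start < end:
            return 0
        gap = f_start - end if f_start >= end else start - f_end
        best = gap if best is None else min(best, gap)
    return best

def cumulative_proportion(distances, thresholds):
    """Proportion of distances that are <= each threshold (empirical CDF)."""
    ordered = sorted(distances)
    return [bisect_right(ordered, t) / len(ordered) for t in thresholds]

# Toy annotation (made up): one telomeric block and one centromeric block.
ctg = [(0, 10_000), (50_000_000, 53_000_000)]
cnvs = [(20_000, 25_000), (49_000_000, 49_010_000), (5_000, 12_000)]
distances = [distance_to_nearest(c, ctg) for c in cnvs]
ecdf = cumulative_proportion(distances, [0, 20_000, 1_000_000])
```

Computing the same curve for control regions and overlaying the two gives the kind of comparison shown in the figure.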
Enriched in SD and low-coverage regions
Going further
  1. Control for the SD and CTG patterns.
  2. Look at other repeat classes.
  Control regions
    Randomly distributed.
    Same size distribution.
    Same proportion overlapping a segmental duplication.
    Similar distance to CTG.
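One way to obtain control regions with the same size distribution and the same proportion overlapping a segmental duplication is rejection sampling: for each CNV, draw random regions of the same size until one has the same SD-overlap status. The sketch below is an illustration under simplifying assumptions (a single chromosome, half-open coordinates, no matching on distance to CTG) and is not the procedure used in the study; `overlaps_any`, `matched_controls` and the toy data are hypothetical.

```python
import random

def overlaps_any(start, end, features):
    """True if the half-open region [start, end) overlaps any (start, end) feature."""
    return any(start < f_end and f_start < end for f_start, f_end in features)

def matched_controls(cnvs, seg_dups, chrom_length, rng, max_tries=10000):
    """One control region per CNV, with the same size and SD-overlap status."""
    controls = []
    for start, end in cnvs:
        size = end - start
        target = overlaps_any(start, end, seg_dups)
        for _ in range(max_tries):
            c_start = rng.randrange(0, chrom_length - size + 1)
            if overlaps_any(c_start, c_start + size, seg_dups) == target:
                controls.append((c_start, c_start + size))
                break
        else:
            raise RuntimeError("no matching control region found")
    return controls

# Toy data (made up): two SD blocks on a 1 Mbp chromosome.
seg_dups = [(100_000, 200_000), (600_000, 650_000)]
cnvs = [(150_000, 155_000), (400_000, 420_000), (640_000, 660_000)]
controls = matched_controls(cnvs, seg_dups, 1_000_000, random.Random(1))
```

Matching on distance to CTG could be added as a further acceptance condition in the inner loop, at the cost of more rejected draws.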
Controlling for SD and distance to CTG
  Satellites
    Enrichment driven by ALR/Alpha, (GAATG)n/(CATTC)n families.
  Short Tandem Repeats
    Enrichment distributed across families...
    ... but stronger for larger STR.
  Transposable elements (TE)
    SVA class enriched.
    Expected: L1HS, L1PA2 to L1PA5.
    Surprises: HERVH, LTR38, LTR4.
Repeat CNVs and protein-coding genes
                                      Genes with CNVs
  Set                       CNVs    Exon   + Promoter   + Intron
  All CNVs                 91733    7206        11341      13259
  Low coverage             26888     682         1151       1977
  Extremely low coverage   10010     347          465        521
  STR                       4286      45          286        748
  Satellite                 1822       2           21         33
  TE                       20491     164         1747       3998
  STR/Satellite/TE         22313     166         1760       4014
  Repeat CNV: more than 90% of the CNV is annotated as repeat.
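The definition of a repeat CNV (more than 90% of the CNV annotated as repeat) amounts to computing the fraction of a region covered by a set of annotations. A minimal sketch, assuming half-open `(start, end)` coordinates on one chromosome; overlapping annotations are merged so that no base is counted twice. The function names are hypothetical; only the 90% threshold comes from the slide.

```python
def covered_fraction(region, annotations):
    """Fraction of the half-open (start, end) region covered by annotations."""
    start, end = region
    # Clip annotations to the region and drop those that fall outside it.
    clipped = sorted(
        (max(a_start, start), min(a_end, end))
        for a_start, a_end in annotations
        if a_start < end and start < a_end
    )
    covered = 0
    cur_start, cur_end = None, None
    for a_start, a_end in clipped:
        if cur_end is None or a_start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = a_start, a_end
        else:
            cur_end = max(cur_end, a_end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (end - start)

def is_repeat_cnv(region, repeats, min_fraction=0.9):
    """Slide definition: more than 90% of the CNV is annotated as repeat."""
    return covered_fraction(region, repeats) > min_fraction
```

For example, a 1000 bp CNV with repeat annotations at `(0, 500)` and `(400, 950)` has 950 covered bases once the overlap is merged, so it would qualify as a repeat CNV under this definition.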
Conclusion
Summary
  PopSV
    uses reference samples.
    detects more CNVs.
    is robust across the entire genome.
  In normal genomes:
    CNVs enriched in low coverage regions.
    Specific enrichment in satellites, simple repeats, TEs.
    Not due to segmental duplication enrichment.
    Replicated across datasets but different from somatic patterns.
    Some CNVs in low coverage regions or repeats hit exonic sequence.
Guillaume Bourque, Simon Gravel, Mathieu Bourgey, Mathieu Blanchette, Louis Letourneau, Francois Lefebvre, Eric Audemard, Toby Hocking, Simon Girard, Patrick Cossette, Guy Rouleau, Caroline Meloche
Workflow
Replication in twins
Robust across challenging regions
  [Figure, four panels: proportion of regions with concordant samples for PopSV (sets: call, null), by coverage class (low, expected, high), GC content, segmental duplication proportion and simple repeat proportion.]
Robust across challenging regions
  Using only CNVs in extremely low coverage regions!
  [Figure: PopSV samples labeled by family (1121, 1207, 1286, 1301, 1323, 1389, 1443, 1480, 1490, 1652, other1 to other5) and by role (Mother, Father, Twin); axis from 0.0 to 0.8.]
Resolution - 500 bp bins vs. 5 kbp bins
  [Figure: proportion of 500 bp-bin calls overlapping 5 kbp-bin calls, as a function of the size of the 500 bp-bin call (0 to 20,000).]
Control regions QC - SD, low-coverage and CTG distance control
  [Figure: proportion overlapping each feature, CNV vs. control.]
Control regions QC - SD, low-coverage and CTG distance control
  [Figure: cumulative proportion of regions (CNV vs. control) as a function of distance to centromere/telomere/gap (bp).]
Control regions
  [Schematic: a random region of size S is built around a random base (shown in green), with S/2 on each side.]
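The construction in this schematic, a region of size S built around a randomly drawn base, can be written in a few lines. This is a sketch under assumptions: coordinates are half-open, odd sizes put the extra base downstream, and the center is drawn so that the region stays within the chromosome. How chromosome ends were actually handled is not stated on the slide, and the name `random_region` is hypothetical.

```python
import random

def random_region(size, chrom_length, rng):
    """Region of `size` bp centered on a uniformly drawn base.

    Extends size // 2 bp upstream of the base and the remainder downstream.
    """
    half = size // 2
    center = rng.randrange(half, chrom_length - (size - half) + 1)
    start = center - half
    return start, start + size

# Control set with the same size distribution as a list of CNV sizes (toy values).
rng = random.Random(0)
cnv_sizes = [500, 2501, 10_000]
controls = [random_region(s, 1_000_000, rng) for s in cnv_sizes]
```

Drawing one control per observed CNV size reproduces the size distribution by construction.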
Controlling for SD and distance to CTG
  [Figure: significance (−log10 P-value) of depletion or enrichment for TE classes (SINE, SVA, LTR, LINE, DNA) and top TE families (SVA_F, SVA_E, SVA_D, MER65A, LTR4, LTR38-int, L1PA5, L1PA4, L1PA3, L1PA2, L1HS, HERVH-int, AluY) across cohorts (Twins, CK Normal, GoNL, CK Somatic).]