Population-based Detection of Structural Variants in Normal and Aberrant Genomes.
Jean Monlong, Guillaume Bourque’s group, Human Genetics Dept.
Genome Informatics, September 21-24, 2014
Structural variation
Genetic variation involving more than 500 bp.
Structural Variant: SV; Copy Number Variation: CNV.
Sources: Baker 2012, Nature Methods; Raphael Lab, Brown University.
SV detection using High-Throughput Sequencing
Source: Baker 2012, Nature Methods.
Limitation
Low mappability
◮ Noisy or reduced signal in repeat-rich regions, centromeres, telomeres.
◮ Unpredictable segmentation → reduced sensitivity/specificity.
◮ Filtering problematic regions reduces the genome range tested.
[Figure: number of reads mapped per genomic window (two panels).]
Objective
Test the entire genome, including low-mappability regions, and detect subtle abnormal coverage.
PopSV: Population-based approach
Use a set of reference experiments to detect abnormal patterns.
[Figure: number of reads mapped per genomic window, reference samples versus tested sample.]
PopSV: Population-based approach
[Figure: number of reads mapped per genomic window, reference samples versus tested sample.]
Workflow
1. The genome is fragmented into bins.
2. Reads in each bin are counted, for each sample.
3. The bin counts are normalized.
4. Each sample and each bin is tested for divergence from the reference samples (Z-score).
5. P-values are estimated and corrected for multiple testing.
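Steps 1, 2 and 5 of the workflow can be sketched in Python as follows. This is a minimal illustration, not PopSV's implementation: the function names, the use of 0-based read start positions, and the choice of Benjamini-Hochberg as the multiple test correction are all assumptions (the slide does not name the correction used).

```python
def count_reads_per_bin(read_starts, genome_length, bin_size):
    """Steps 1-2: cut [0, genome_length) into fixed-size bins and count reads.

    read_starts holds hypothetical 0-based read start positions.
    """
    n_bins = -(-genome_length // bin_size)  # ceiling division
    counts = [0] * n_bins
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // bin_size] += 1
    return counts


def benjamini_hochberg(pvalues):
    """Step 5 (assumed method): Benjamini-Hochberg adjusted P-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy example: a 5 kb "genome" cut into 2 kb bins.
    print(count_reads_per_bin([0, 10, 1999, 2000, 4500], 5000, 2000))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Normalization and the Z-score test (steps 3 and 4) are left out of this sketch on purpose, since they are the method-specific part of the workflow.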
PopSV: importance of normalization
◮ Naive normalization (linear, quantile) is often not enough.
◮ Experiment-specific technical bias.
[Figure: per-sample coverage, from lowest to highest, against the proportion of the studied genome (0.00 to 0.20).]
PopSV: importance of normalization
◮ PCA-based normalization (Krumm, 2012; Boeva, 2014).
◮ Targeted normalization: linear, using a subset of the genome.
[Figure: reference samples (Ref1 to Ref4) and test sample.]
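One way to read "linear, using a subset of the genome" is sketched below in Python. The slide does not say how the subset is chosen, so the selection rule here (the k bins whose coverage across reference samples correlates best with the target bin) and every function name are assumptions made for illustration only.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def supporting_bins(ref_counts, target, k):
    """Assumed rule: pick the k bins most correlated with the target bin.

    ref_counts[i][b] is the count of reference sample i in bin b.
    """
    target_profile = [sample[target] for sample in ref_counts]
    scored = []
    for b in range(len(ref_counts[0])):
        if b != target:
            profile = [sample[b] for sample in ref_counts]
            scored.append((pearson(target_profile, profile), b))
    scored.sort(reverse=True)
    return [b for _, b in scored[:k]]


def targeted_normalize(sample_counts, ref_counts, target, k=1):
    """Linearly rescale one bin using only its supporting bins."""
    support = supporting_bins(ref_counts, target, k)
    ref_level = mean(mean(s[b] for s in ref_counts) for b in support)
    sample_level = mean(sample_counts[b] for b in support)
    return sample_counts[target] * ref_level / sample_level


if __name__ == "__main__":
    refs = [[100, 200, 50, 10], [110, 220, 40, 10], [120, 240, 60, 10]]
    half_depth_sample = [55, 110, 25, 5]
    print(targeted_normalize(half_depth_sample, refs, target=0, k=1))
```

In the toy example the sample was sequenced at half the depth of the references, and the scaling factor is learned only from the bin that behaves like the target bin rather than from the whole genome.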
PopSV: Z-score and test
For a sample s:
◮ For each bin b: $z = \dfrac{BC^{b}_{s} - \overline{BC}^{b}_{\text{reference}}}{sd^{b}_{\text{reference}}}$
◮ $pv = P(|z| \le |Z|)$ with $Z \sim N(0, \sigma)$, where $\sigma$ is estimated from the z distribution across all bins.
[Figure: density of Z-scores (-5 to 5) by normalization: targeted, median, median+variance, quantile.]
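The test above can be sketched in Python as follows. Two points are assumptions, since the slide leaves them open: the sample standard deviation is used for the reference samples, and σ is estimated with a scaled median absolute deviation (a common robust choice, not necessarily the one used by PopSV). Function names are illustrative.

```python
import math
from statistics import mean, median, stdev


def z_score(sample_count, reference_counts):
    """z for one bin: distance to the reference mean, in reference sd units."""
    return (sample_count - mean(reference_counts)) / stdev(reference_counts)


def estimate_sigma(z_values):
    """Assumed estimator of sigma: 1.4826 * median absolute deviation."""
    center = median(z_values)
    mad = median(abs(z - center) for z in z_values)
    return 1.4826 * mad


def two_sided_pvalue(z, sigma):
    """P(|Z| >= |z|) for Z ~ N(0, sigma), with sigma a standard deviation."""
    return math.erfc(abs(z) / (sigma * math.sqrt(2)))


if __name__ == "__main__":
    z = z_score(130, [90, 100, 110])
    print(z, two_sided_pvalue(z, sigma=1.0))
```

Estimating σ from the sample's own Z-scores, rather than fixing it at 1, lets the null distribution absorb residual over- or under-dispersion left after normalization.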
Application
CageKid: Renal Cell Cancer
Whole-Genome Sequencing of 100 individuals, ∼40X coverage, Illumina paired-end 100 bp, paired normal and tumor samples.
◮ Normal samples → reference samples.
◮ 2 kb bins.
Read-Depth measure: 2 strategies
◮ concordant reads: only properly paired and mapped read pairs.
◮ discordant reads: improperly mapped read pairs or low mapping quality.
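The two read classes can be told apart from the SAM FLAG and MAPQ fields of each alignment. The Python sketch below uses the standard SAM flag bits; the mapping-quality threshold of 20 and the function name are assumptions, since the slide does not give the cut-off used.

```python
# Standard SAM FLAG bits.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8


def classify_read(flag, mapq, min_mapq=20):
    """Label one alignment as concordant or discordant.

    min_mapq is a hypothetical threshold for "low mapping quality".
    """
    if not flag & PAIRED:
        return "unpaired"
    if flag & UNMAPPED:
        return "unmapped"
    properly_paired = bool(flag & PROPER_PAIR) and not flag & MATE_UNMAPPED
    if properly_paired and mapq >= min_mapq:
        return "concordant"
    # Improperly mapped pair, or a confident pairing with low mapping quality.
    return "discordant"


if __name__ == "__main__":
    for flag, mapq in [(99, 60), (97, 60), (99, 3), (77, 0)]:
        print(flag, mapq, classify_read(flag, mapq))
```

Counting each class per 2 kb bin then gives the two read-depth measures used in the rest of the talk.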
Using concordant reads
[Figure: “funky snowman” plot of tumor sample Z-score versus normal sample Z-score (both -20 to 20), colored by number of bins.]
Example: Telomeric region
[Figure: read coverage versus position (135.11 to 135.15 Mb) for normal sample D000GQ9, with bins marked abnormal or normal, against the normal samples.]
Chr.10, overlapping genes (PRAP1, CALY), not detected by other approaches.
Example: Partial tumoral event
[Figure: read coverage versus position (100.75 to 101.00 Mb) for tumor sample D000GMU, with bins marked abnormal or normal, against the normal samples.]
Chr.1, overlapping CDC14A gene (cell division cycle), not detected by other approaches.
Validation and benchmark
◮ Germline events detected in tumor samples?
◮ Consistent with SNP-array calls?
◮ Twin dataset: consistent with the pedigree?
Germline events detected in tumor samples
[Figure: number and proportion of germline events in tumor for PopSV, FREEC and cn.MOPS, for all events and for low-mappability events.]
PopSV detected more consistent calls than other methods with similar specificity.
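The first check rests on the idea that a germline event called in a normal sample should also be called in the paired tumor. A minimal Python sketch of that proportion is given below. The matching rule (any overlap on the same chromosome), the tuple layout and the function names are assumptions; the slide does not state how calls were matched.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def proportion_in_tumor(normal_calls, tumor_calls):
    """Fraction of calls from the normal sample also found in its paired tumor."""
    if not normal_calls:
        return None  # undefined when the normal sample has no call
    found = sum(
        1 for call in normal_calls
        if any(overlaps(call, other) for other in tumor_calls)
    )
    return found / len(normal_calls)


if __name__ == "__main__":
    normal = [("chr1", 0, 2000), ("chr1", 10000, 12000), ("chr2", 0, 2000)]
    tumor = [("chr1", 1000, 3000), ("chr2", 5000, 7000)]
    print(proportion_in_tumor(normal, tumor))
```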
Centromere/telomere/gap and systematic errors
[Figure: CNV frequency in normals versus distance to centromere/telomere/gap (0 to 20 Mb), for cn.MOPS, FREEC and PopSV.]
PopSV using discordant reads
◮ Discordant reads support SVs.
◮ Goal: robust detection of an excess of discordant reads genome-wide.
◮ Challenging to estimate a background/expected model.
Usage
PopSV flags abnormal regions for further characterization using orthogonal approaches.
Discordant versus concordant reads
◮ Heterogeneous coverage ⇒ hybrid Poisson-Normal Z-score.
◮ Targeted normalization from PopSV on concordant reads.
PopSV and BreakDancer
[Figure: proportion of BreakDancer calls (BreakDancer only versus BreakDancer + PopSV) by number of supporting reads in BreakDancer, from (0,2] to (100,Inf].]
BreakDancer: SV caller using paired-end mapping information (Chen, 2009).
Conclusion
PopSV: Robust and sensitive approach
◮ Superior to other Read-Depth methods.
◮ Wider range of the genome tested.
◮ Detection in low-mappability regions and of partial tumoral signal.
Work in progress
◮ More than a CNV caller.
◮ Excess of discordant read pairs.
◮ Combination with orthogonal approaches (PEM, Assembly).
◮ Custom binning: repeat annotation, Whole-Exome Sequencing.
Acknowledgment
◮ Guillaume Bourque
◮ Mathieu Bourgey
◮ Simon Gravel
◮ Louis Letourneau
◮ Mathieu Blanchette
◮ Francois Lefebvre
◮ Eric Audemard
◮ Mehran Karimzadeh Reghbati
◮ Toby Hocking
◮ Simon Girard
Thank You!
SNP-array concordance
[Figure: proportion of SNP-array GS events also in WGS calls (0 to 1) for PopSV, FREEC and cn.MOPS; panels: loose, stringent.]
Copy-number distribution
[Figure: number of events (0 to 800) versus copy number estimate (0 to 5).]
PCA vs Targeted normalization in tumor samples
[Figure: count versus z (-20 to 20) under PCA (pca) and targeted normalization (tn), for tumor samples D000GNY, D000GO1, D000GOC and D000GQK.]
PopSV and BreakDancer
[Figure (DEL): proportion of BreakDancer calls (BreakDancer only versus BreakDancer + PopSV) by class of the repeat overlapping the BreakDancer call: None, Simple_repeat, Satellite, DNA, LTR, SINE, LINE.]
BreakDancer: SV caller using paired-end mapping information (Chen, 2009).