Discovering Individual Human Genetic Variation with CLEVER + SMART
Alexander Schönhuth
Joint work with Tobias Marschall, Ivan Costa, Stefan Canzar, Markus Bauer, Gunnar Klau, Alexander Schliep
CWI Scientific Meeting, March 30, 2012
Structural Variations
The human reference genome:
  ...ACCGGAGTAGTATATTTCAGG...
Assumption until 2006: only single nucleotide polymorphisms (SNPs)
  ...ACCGGAGTAGTATATTTCAGG...
  ...ACTGGAGTACTATATATCAGG...
Since 2006: also insertions and deletions (indels), inversions, translocations, ...
  ...ACCGGAGTAGTATATTT---CAGG...
  ...AC----GTAGATATTTTTTTCAGG...
Structural Variation Discovery 2
Next-Generation Sequenced Genomes
Figure: MRC National Center for Medical Research, London
Insert Size Distribution
Paired-End Reads
  Read ends of known length
  Insert of unknown length
Figure: a paired-end read (end, insert, end)
Figure: insert size distribution, fragments from a Yoruban individual
Discovering Insertions and Deletions
Current Challenges
  Small/mid-size deletions
  Repetitive regions
  Multiply mapped reads
Possible Approaches
  Coverage based
  Insert size based
  Split read based
Figure: IGV screenshot of a deletion. Red reads: insert ≥ µ + 2.5σ
Insertions and Deletions: Alignments
Deletion: I(A) too large, where I(A) = y_A - x_A - 1
Insertion: I(B) too small, where I(B) = y_B - x_B - 1
Figure: paired-end read alignments A (coordinates x_A, y_A) and B (coordinates x_B, y_B) on the reference genome
Indels: alignment length deviates from the insert size distribution
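The insert-size idea above can be sketched for a single alignment as follows. This is a minimal illustration, not CLEVER itself: the function names are hypothetical, x and y are assumed to be the coordinates used in I = y - x - 1, the cutoff of 2.5 standard deviations is borrowed from the "red reads" threshold on the earlier slide, and applying the same cutoff symmetrically for insertions is an assumption.

```python
def internal_segment_length(x: int, y: int) -> int:
    """Length of the internal segment, I = y - x - 1, as defined on the slide."""
    return y - x - 1


def classify_alignment(x: int, y: int, mu: float, sigma: float, k: float = 2.5) -> str:
    """Label one alignment by comparing I to the insert size distribution.

    mu and sigma are the mean and standard deviation of the insert size
    distribution. k = 2.5 is taken from the 'red reads' threshold; using the
    same k on the lower side is an assumption made for this sketch.
    """
    length = internal_segment_length(x, y)
    if length >= mu + k * sigma:
        return "deletion"    # internal segment too large
    if length <= mu - k * sigma:
        return "insertion"   # internal segment too small
    return "concordant"


if __name__ == "__main__":
    for x, y in [(100, 401), (100, 231), (100, 301)]:
        print(x, y, classify_alignment(x, y, mu=200.0, sigma=20.0))
```

A single-read cutoff like this only flags strongly deviating alignments; the clique-based approach on the following slides pools evidence from groups of alignments instead.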
The Read Alignment Graph
Figure: alignments A1 to A7 on the reference genome and the corresponding read alignment graph on nodes 1 to 7, with maximal cliques C1 = (A1,A2,A3) and C2 = (A5,A6,A7)
Idea: Find all maximal cliques.
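To make "find all maximal cliques" concrete, here is a small sketch using the textbook Bron-Kerbosch recursion (without pivoting). This is a generic stand-in, not the enumeration scheme CLEVER uses, and the toy graph is hypothetical: only the triangles on nodes 1-3 and 5-7 mirror C1 and C2 from the figure, while the edges at node 4 are invented for illustration.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each node to the set of its neighbours.
    Plain Bron-Kerbosch recursion: r is the clique under construction,
    p the candidates that can still extend it, x the already processed nodes.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)   # nothing can extend r: it is maximal
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}          # v is done: move it from candidates ...
            x = x | {v}          # ... to the excluded set

    yield from expand(set(), set(adj), set())


# Hypothetical toy graph: two triangles joined through node 4.
toy_graph = {
    1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4},
    4: {3, 5},
    5: {4, 6, 7}, 6: {5, 7}, 7: {5, 6},
}

if __name__ == "__main__":
    for clique in sorted(sorted(c) for c in maximal_cliques(toy_graph)):
        print(clique)
```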
Incompatible Alignments (NO edge):
(1) Too large length difference: I(A) ≈ μ, I(B) > μ
(2) Long internal segments, but small overlap O(A,B): I(A) > μ, I(B) > μ, I(A,B) > μ
Compatible Alignments (edge):
(3) Average internal segment lengths, small overlap O(A,B): I(A) ≈ μ, I(B) ≈ μ, I(A,B) ≈ μ
(4) Long internal segments, sufficient overlap O(A,B): I(A) > μ, I(B) > μ, I(A,B) > μ
Significantly incompatible?
Notations
  Difference of internal segment lengths: $\Delta_{12}$
  Overlap of internal segments: $\cap_{12}$
  Mean internal segment length: $\bar{I}_{12}$
  Length compatibility: $U_{12} := \bar{I}_{12} - \cap_{12}$
Statistical tests
1. Mean compatibility: $P\left(X \ge \frac{\Delta_{12}}{\sqrt{2}\,\sigma}\right) \le \alpha = 0.1$
2. Intersection compatibility: $P\left(X \ge \frac{\sqrt{2}\,(U_{12} - \mu)}{\sigma}\right) \le \alpha = 0.1$
$X$ is an $N(0,1)$ distributed random variable.
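The two tests can be evaluated with the standard normal distribution from Python's standard library. The sketch below is illustrative only: the function name and argument names are hypothetical, and treating a pair as incompatible as soon as either test falls below α (so that an edge is drawn only if neither does) is an assumption about how the two tests are combined.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # N(0, 1)


def edge_tests(delta, overlap, mean_len, mu, sigma, alpha=0.1):
    """Evaluate both compatibility tests for a pair of alignments.

    delta    : difference of the two internal segment lengths (Delta_12)
    overlap  : overlap of the two internal segments (cap_12)
    mean_len : mean internal segment length (I-bar_12)
    Returns the two tail probabilities and the combined verdict.
    """
    u = mean_len - overlap                                   # U_12
    p_mean = 1.0 - STD_NORMAL.cdf(delta / (sqrt(2) * sigma))
    p_intersection = 1.0 - STD_NORMAL.cdf(sqrt(2) * (u - mu) / sigma)
    # Assumption: either test alone is enough to call the pair incompatible.
    incompatible = p_mean <= alpha or p_intersection <= alpha
    return p_mean, p_intersection, incompatible


if __name__ == "__main__":
    # Similar lengths, U_12 equal to mu: no evidence against an edge.
    print(edge_tests(delta=0, overlap=50, mean_len=250, mu=200, sigma=20))
    # Large length difference: the mean compatibility test rejects.
    print(edge_tests(delta=100, overlap=50, mean_len=250, mu=200, sigma=20))
```

The factor $\sqrt{2}$ follows from basic properties of independent normal variables: the difference of two insert lengths has standard deviation $\sqrt{2}\,\sigma$, and their mean has standard deviation $\sigma/\sqrt{2}$.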
CLEVER: Workflow
1. Compute all maximal cliques.
2. Evaluate all maximal cliques statistically.
3. Output: maximal cliques which statistically significantly indicate insertions or deletions.
Short Read Alignments
Figure: Integrative Genomics Viewer (IGV) screenshot
CLEVER: Clique-Enumerating Variant Finder
Outline
1. Iterate over all alignments, sorted by position
2. Maintain a set of active cliques (and active alignments)
3. Output a clique once it "goes out of scope" (and free memory)
For each alignment
1. Find set of adjacent nodes
2. Intersect with all active cliques and either
     add to existing clique,
     split clique, or
     create new clique
3. Eliminate duplicate and non-maximal cliques
Fast Implementation Techniques
  Active alignments: binary search tree (sorted by insert length)
  Cliques: store as bit-vectors over active alignments
  Clique intersection bit-parallel
  Reorganize storage now and then
Runtime
  30× coverage, all reads, up to ≈ 650 alignments per read: around 20 minutes for whole chromosome 1
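The per-alignment step (intersect with the active cliques, then add, split, or create) combines naturally with the bit-vector representation. The following is a simplified sketch under stated assumptions, not the CLEVER implementation: Python integers stand in for bit-vectors, all helper names are hypothetical, and cliques never "go out of scope", so the position-sorted sweep and memory management are left out.

```python
def bits(indices):
    """Pack a collection of alignment indices into an integer bit-vector."""
    v = 0
    for i in indices:
        v |= 1 << i
    return v


def members(v):
    """Unpack a bit-vector into a sorted list of alignment indices."""
    return [i for i in range(v.bit_length()) if (v >> i) & 1]


def add_alignment(cliques, new_index, neighbours):
    """Update the maximal cliques after inserting one alignment.

    cliques    : list of bit-vectors, the current maximal cliques
    new_index  : index of the new alignment
    neighbours : bit-vector of existing alignments adjacent to the new one
    """
    me = 1 << new_index
    candidates = set()                    # a set removes duplicate cliques
    for c in cliques:
        common = c & neighbours           # one bit-parallel intersection
        if common == c:
            candidates.add(c | me)        # adjacent to all members: extend clique
        else:
            candidates.add(c)             # old clique stays as it is
            candidates.add(common | me)   # split: shared part plus the new node
    if not cliques:
        candidates.add(me)                # first alignment: create a new clique
    # Drop every candidate that is a proper subset of another candidate.
    return [c for c in candidates
            if not any(c != d and c & d == c for d in candidates)]


if __name__ == "__main__":
    cliques = []
    # Hypothetical adjacency: alignment i is listed with its earlier neighbours.
    for index, earlier in enumerate([[], [0], [0, 1], [2]]):
        cliques = add_alignment(cliques, index, bits(earlier))
    print(sorted(members(c) for c in cliques))
```

Each intersection is a single integer AND regardless of clique size, which is the point of the bit-parallel representation.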
CLEVER: Clique-Enumerating Variant Finder
Each max-clique $\mathcal{C}$ (accounting for multiply mapped reads):
$P(H_0 \mid \mathcal{C}) = \sum_{C \subset \mathcal{C}} P(C \text{ correct and } \mathcal{C} \setminus C \text{ incorrect}) \cdot P(H_0 \mid C \text{ correct})$
where $H_0$ is the null hypothesis of no variation, and
$P(H_0 \mid C \text{ correct}) = P\left(X_{N(0,1)} \le \frac{\sqrt{|C|} \cdot |\bar{I} - \mu|}{\sigma}\right)$
reflects a Z-test for a sample of size $|C|$.
After correction for multiple hypothesis testing: predict indels from all significant cliques $\mathcal{C}$
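The Z-test ingredient can be illustrated in a few lines. This sketch covers only the statistic $\sqrt{|C|} \cdot |\bar{I} - \mu| / \sigma$ for one set of alignments assumed correct; it leaves out the sum over subsets for multiply mapped reads and the multiple-testing correction. The function names are hypothetical, and reporting the upper-tail probability under N(0,1), the usual one-sided Z-test p-value, is an assumption about how the probability is read.

```python
from math import sqrt
from statistics import NormalDist, mean


def clique_z_statistic(insert_lengths, mu, sigma):
    """Z statistic sqrt(|C|) * |I-bar - mu| / sigma for one clique.

    insert_lengths : internal segment lengths of the alignments in the clique
    mu, sigma      : mean and standard deviation of the insert size distribution
    """
    n = len(insert_lengths)
    return sqrt(n) * abs(mean(insert_lengths) - mu) / sigma


def upper_tail(z):
    """P(X >= z) for a standard normal X (assumed reading of the p-value)."""
    return 1.0 - NormalDist().cdf(z)


if __name__ == "__main__":
    lengths = [300, 310, 290, 300]   # hypothetical clique of four alignments
    z = clique_z_statistic(lengths, mu=200, sigma=20)
    print(z, upper_tail(z))
```

Because the statistic grows with $\sqrt{|C|}$, a clique of several moderately deviating alignments can be significant even when no single alignment is.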
Prior Approaches
Issues
  Discard massive amounts of read alignments (all alignments from concordant reads).
  Statistically less principled definition of variant-related alignment groups
  No correction for multiple hypothesis testing
Evaluation
Benchmarks
1. Simulated data: Venter’s Genome
2. Real data: Yoruban individual (NA18507)
Results (Hit Statistics)
Simulation Study: Deletions in Venter’s Genome

Length range      10–49           50–99           100–499
Algorithm         Prec. / Rec.    Prec. / Rec.    Prec. / Rec.
PINDEL            42.0 / 42.0     52.0 / 35.3     85.8 / 40.5
CLEVER            51.9 / 22.7     51.1 / 76.5     82.5 / 72.4
BreakDancer       15.1 / 0.3      43.5 / 20.1     48.6 / 56.5
GASV              1.1 / 10.5      29.6 / 26.1     0.8 / 53.6
HYDRA             – / 0.0         – / 0.0         85.7 / 61.3
VariationHunter   15.2 / 0.8      29.3 / 20.5     49.2 / 59.4
MoDIL¹            18.6 / 16.0     22.3 / 68.5     41.7 / 41.7

¹ MoDIL: Run only on Chromosome 1.
Results (Hit Statistics)
Real Data: Individual NA18507

Length range      10–49           50–99           100–499
Algorithm         Rec. / Excl.    Rec. / Excl.    Rec. / Excl.
PINDEL            45.5 / 38.5     33.7 / 1.0      52.7 / 0.0
CLEVER            9.4 / 2.1       73.2 / 26.3     78.1 / 5.2
BreakDancer       0.2 / 0.0       22.1 / 0.5      63.7 / 0.4
GASV              5.0 / 1.8       27.4 / 0.5      61.7 / 2.3
HYDRA             0.0 / 0.0       0.0 / 0.0       67.2 / 0.4
VariationHunter   0.3 / 0.0       18.9 / 0.0      69.8 / 2.3

Used annotations: Mills et al., Genome Research, 2011.
Room for improvements
Issues
1. Integrating split-read information
2. Multiply mapped reads
3. Discovering other types of variants
4. Overlapping cliques → Cluster Editing
Split-Read Information
Figure: Integrative Genomics Viewer (IGV) screenshot
Read Mapping Ambiguities
Figure: Red(dish): misplaced alignments
Goal: Determine correct alignment for multiply mapped reads.
Other Variations
Figure: Feuk et al., 2006
Software
Clique Enumeration + Significance Testing
  CLEVER - CLique Enumerating Variant findER
  Availability: http://clever-sv.googlecode.com
Fred Clever and Jeff Smart
Resolving ambiguities with EM algorithm
  SMART - Sparse Mixture Ambiguity Resolving Tool
  Availability: Coming Soon
Conclusions
Summary
  Statistically sound criterion for edges
  Enumerating all maximal cliques is feasible
  Significance test for cliques
  Results are good (even without SMART)
Future work
  Finish and benchmark SMART
  Try other read mappers / re-map reads
  Integrate split-read information into CLEVER