Cost Partitioning Techniques for Multiple Sequence Alignment
Mirko Riesterer, Universität Basel, 10.09.18

Agenda
1 Introduction
2 Formal Definition
3 Solving MSA
4 Combining Multiple Pattern Databases
5 Cost Partitioning
6 Experiments
7 Conclusion

Introduction
Multiple Sequence Alignment
− Biological sequences mutate during evolution
− Mutation types: insertion, deletion, substitution
− Some mutations are more likely than others (A ↔ G, C ↔ T)
− Alignments let us observe phylogenetic relationships

Introduction
Multiple Sequence Alignment
− Insert gaps within sequences
− Maximize correspondence between letters in columns

Sequences    Alignment
ACGTG        ACGT-G
ACTAG        AC-TAG
CGTAG        -CGTAG

Introduction
Judging the alignment quality
− Count matches/mismatches
− Score matrix
− Point accepted mutation ($PAM_n$) matrix (Dayhoff et al., 1978)
− Blocks substitution matrix (BLOSUM) (Henikoff and Henikoff, 1992)

Score matrix (symmetric, upper triangle shown; "-" is the gap symbol):
     A  C  T  G  -
A    0  4  2  2  3
C       1  4  3  3
T          0  6  3
G             1  3
-                0

Formal Definition
Family of sequences $S = \{s_1, \dots, s_n\}$ over alphabet $\Sigma$, and $\Sigma' = \Sigma \cup \{-\}$.
Alignment matrix $A_{n \times m} = (a_{ij})$, where
− $a_{ij} \in \Sigma'$
− row $a_i$ without $-$ is exactly $s_i$
− no column contains only $-$

Score matrix (symmetric, upper triangle shown):
     A  C  T  G  -
A    0  4  2  2  3
C       1  4  3  3
T          0  6  3
G             1  3
-                0

Sequences: ACT, CTG
Alignment $A$:
A C T -
- C T G
$C(A) = 3 + 1 + 0 + 3 = 7$

Formal Definition
The score matrix can be viewed as a function $sub : \Sigma' \times \Sigma' \to \mathbb{N}$.
Given an alignment $A$ and a score matrix $sub$:

Pair score
$C^{A}_{ij} = \sum_{k=1}^{m} sub(a_{ik}, a_{jk})$

Sum of pairs score
$C(A) = \sum_{1 \le i < j \le n} C^{A}_{ij}$

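The two definitions above can be checked against the two-sequence example (ACT, CTG) from the formal definition. The following is a minimal sketch; the names UPPER, sub, pair_score and sum_of_pairs are illustrative choices, not identifiers from the talk, and the dictionary simply transcribes the score matrix shown on the slides.

```python
from itertools import combinations

# Upper triangle of the score matrix from the slides; "-" is the gap symbol.
UPPER = {
    ("A", "A"): 0, ("A", "C"): 4, ("A", "T"): 2, ("A", "G"): 2, ("A", "-"): 3,
    ("C", "C"): 1, ("C", "T"): 4, ("C", "G"): 3, ("C", "-"): 3,
    ("T", "T"): 0, ("T", "G"): 6, ("T", "-"): 3,
    ("G", "G"): 1, ("G", "-"): 3,
    ("-", "-"): 0,
}


def sub(a, b):
    """Symmetric lookup into the upper-triangular score matrix."""
    return UPPER[(a, b)] if (a, b) in UPPER else UPPER[(b, a)]


def pair_score(row_i, row_j):
    """Pair score: sum of sub over all columns of two alignment rows."""
    return sum(sub(a, b) for a, b in zip(row_i, row_j))


def sum_of_pairs(alignment):
    """Sum of pairs score: pair scores summed over all row pairs i < j."""
    return sum(pair_score(r1, r2) for r1, r2 in combinations(alignment, 2))


alignment = ["ACT-", "-CTG"]
print(sum_of_pairs(alignment))  # 3 + 1 + 0 + 3 = 7
```

With only two rows the sum of pairs score equals the single pair score, which is why the example reduces to one column-wise sum.
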
Formal Definition
Shortest Path Problem
Directed acyclic graph $G = (V, E)$ with
$V = \{(x_1, \dots, x_n) \mid x_i \in \{0, \dots, l_i\}\}$
$E = \bigcup_{e \in \{0,1\}^n} \{(v, v + e) \mid v, v + e \in V,\ e \neq 0\}$

Figure: 3D edge structure
Figure: 2D graph alignment (http://www.csbio.unc.edu/mcmillan/Comp555S16/Lecture14.html)

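The edge set can be illustrated by enumerating the successors of a vertex. This is a sketch under the definition above; the function name successors and the argument name lengths (the tuple of sequence lengths $l_1, \dots, l_n$) are assumptions made for the example.

```python
from itertools import product


def successors(v, lengths):
    """Yield every w = v + e with e in {0,1}^n, e != 0, that stays inside V."""
    n = len(v)
    for e in product((0, 1), repeat=n):
        if not any(e):
            continue  # e = 0 is excluded by the definition of E
        w = tuple(x + d for x, d in zip(v, e))
        if all(x <= l for x, l in zip(w, lengths)):
            yield w


# An interior vertex of a 3-dimensional lattice has 2^3 - 1 = 7 successors.
print(len(list(successors((0, 0, 0), (5, 5, 5)))))
```

Vertices on the boundary have fewer successors, and the goal corner $(l_1, \dots, l_n)$ has none.
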
Solving MSA
Needleman-Wunsch algorithm
− Dynamic programming approach
− Fills a zero-indexed table with optimal scores
− For $n$ sequences of length $l$: complexity $O(l^n)$

Figure 2: Needleman-Wunsch score table using a score matrix

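For two sequences the table can be filled as sketched below. The sketch assumes that scores are treated as costs to be minimized, in line with the shortest-path formulation of the talk (the classic formulation of the algorithm maximizes a similarity score instead). The names UPPER, sub and needleman_wunsch_cost are illustrative, and the dictionary transcribes the first score matrix of the slides.

```python
# Upper triangle of the first score matrix from the slides; "-" is the gap.
UPPER = {
    ("A", "A"): 0, ("A", "C"): 4, ("A", "T"): 2, ("A", "G"): 2, ("A", "-"): 3,
    ("C", "C"): 1, ("C", "T"): 4, ("C", "G"): 3, ("C", "-"): 3,
    ("T", "T"): 0, ("T", "G"): 6, ("T", "-"): 3,
    ("G", "G"): 1, ("G", "-"): 3,
    ("-", "-"): 0,
}


def sub(a, b):
    """Symmetric lookup into the upper-triangular score matrix."""
    return UPPER[(a, b)] if (a, b) in UPPER else UPPER[(b, a)]


def needleman_wunsch_cost(s, t, gap="-"):
    """Return the table whose entry [i][j] is the cheapest cost of
    aligning the prefixes s[:i] and t[:j]."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):  # first column: s[:i] against gaps only
        table[i][0] = table[i - 1][0] + sub(s[i - 1], gap)
    for j in range(1, cols):  # first row: t[:j] against gaps only
        table[0][j] = table[0][j - 1] + sub(gap, t[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = min(
                table[i - 1][j - 1] + sub(s[i - 1], t[j - 1]),  # align letters
                table[i - 1][j] + sub(s[i - 1], gap),           # gap in t
                table[i][j - 1] + sub(gap, t[j - 1]),           # gap in s
            )
    return table


table = needleman_wunsch_cost("ACT", "CTG")
print(table[-1][-1])  # 7, the cost of the alignment ACT- / -CTG
```

The table has $(l_1 + 1)(l_2 + 1)$ entries, which is the two-dimensional case of the $O(l^n)$ bound.
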
Solving MSA
Pattern databases
Family of sequences $S = \{s_1, \dots, s_n\}$.
A pattern is a subset $P \subseteq S$ with $|P| \ge 2$.
A pattern database (PDB) is the perfect heuristic $h^*$ for the subproblem induced by pattern $P$.

Solving MSA
Heuristic search estimators
Family of sequences $S = \{s_1, \dots, s_n\}$.

$h_{pair}$ (Ikeda and Imai, 1994):
$h_{pair}(v) = \sum_{1 \le i < j \le n} h_{ij}(v)$
− Uses the information of every 2-dimensional PDB

Solving MSA
Heuristic search estimators
Family of sequences $S = \{s_1, \dots, s_n\}$.

$h_{all,k}$ (Kobayashi and Imai, 1998):
$h_{all,k}(v) = \frac{1}{\binom{n-2}{k-2}} \sum_{1 \le x_1 < \dots < x_k \le n} h_{x_1, \dots, x_k}(v)$
− Uses the information of every $k$-dimensional PDB (e.g. $k = 3$)
− Every pair of sequences appears $\binom{n-2}{k-2}$ times → normalize
− If $k = 3$ and lengths are ~500, each PDB contains on the order of $10^8$ vertices ($500^3 \approx 1.25 \cdot 10^8$)
− Branching factor $2^n - 1$

Solving MSA
Heuristic search estimators
Family of sequences $S = \{s_1, \dots, s_n\}$.

$h_{one,k}$ (Kobayashi and Imai, 1998):
$h_{one,k}(v) = h_{x_1, \dots, x_k}(v) + h_{x_{k+1}, \dots, x_n}(v) + \sum_{i=1}^{k} \sum_{j=k+1}^{n} h_{x_i, x_j}(v)$
− 1 or 2 higher-dimensional PDBs + remaining 2-dimensional PDBs
− Avoids normalization by choosing PDBs carefully

$h_{pair} \le h_{one,k} \le h_{all,k}$

Combining Multiple Pattern Databases
Additivity
− A pattern collection of $S = \{s_1, \dots, s_n\}$ is a collection $P = \{P_1, \dots, P_m\}$ with $P_i \subseteq S$.
− $P$ is non-conflicting if no two patterns in $P$ conflict.
− Then the sum of the PDBs is additive.

Pattern collection heuristic
$h^{P}(v) = \sum_{i=1}^{m} h^{P_i}(v)$
− Admissible if $P$ is non-conflicting

Combining Multiple Pattern Databases
− Conflicting pattern collections may violate admissibility
− Parts may still be useful?

Canonical PDB heuristic (Haslum et al., 2007)
$h^{CAN}(v) = \max_{S \in MNS} \sum_{P \in S} h^{P}(v)$

Combining Multiple Pattern Databases
Post-hoc optimization (Pommerening et al., 2013)
− Use linear programming to solve a constrained problem
− A pattern collection is strictly conflicting if $\left| \bigcap_{i=1}^{m} P_i \right| > 1$

Let $w_1, \dots, w_m$ be the solution to the linear program that maximizes
$h^{PHO}(v) = \sum_{i} w_i \, h^{P_i}(v)$
s.t. $\sum_{i : P_i \in S'} w_i \le 1$ for all strictly conflicting pattern collections $S' \subseteq P$
s.t. $0 \le w_i \le 1$ for all $P_i$

Combining Multiple Pattern Databases
Post-hoc optimization (Pommerening et al., 2013)
$h_{all,k}$ equals $h^{PHO}$ if we choose the same patterns.

Score matrix (symmetric, upper triangle shown):
     A  C  T  G  -
A    1  1  1  0  1
C       1  1  0  1
T          1  0  1
G             1  1
-                1

Proof sketch:
Four sequences $S = \{s_1, s_2, s_3, s_4\}$ of length 1:
$s_1 = A,\ s_2 = C,\ s_3 = T,\ s_4 = G$
$P = \{P_1 = \{s_1, s_2, s_3\},\ P_2 = \{s_1, s_2, s_4\},\ P_3 = \{s_1, s_3, s_4\},\ P_4 = \{s_2, s_3, s_4\}\}$
$h^{P_1}(s) = 3$
$h^{P_2}(s) = h^{P_3}(s) = h^{P_4}(s) = 1$
→ $h^{PHO}(s) = 1 \cdot 3 + 0 \cdot 1 + 0 \cdot 1 + 0 \cdot 1 = 3 = \frac{3 + 1 + 1 + 1}{2} = h_{all,3}(s)$

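The numbers in the proof sketch can be reproduced with a small brute-force computation. The sketch below is not the LP-based method: it computes each PDB value at the start state by exhaustive shortest-path search over the alignment lattice, then applies the $h_{all,k}$ normalization and the weights $(1, 0, 0, 0)$ stated on the slide. The names sub, optimal_cost, pdb_values and weights are illustrative, and sub encodes one reading of the score matrix above (cost 0 exactly when G is paired with A, C or T, cost 1 otherwise).

```python
from functools import lru_cache
from itertools import combinations, product
from math import comb


def sub(a, b):
    """Score matrix of this slide: 0 for G paired with A, C or T, else 1."""
    return 0 if "G" in (a, b) and a != b and "-" not in (a, b) else 1


def optimal_cost(seqs):
    """Cheapest path from the origin to the goal corner of the lattice."""
    n = len(seqs)
    goal = tuple(len(s) for s in seqs)

    @lru_cache(maxsize=None)
    def h(v):
        if v == goal:
            return 0
        best = None
        for e in product((0, 1), repeat=n):
            if not any(e):
                continue
            w = tuple(x + d for x, d in zip(v, e))
            if any(x > g for x, g in zip(w, goal)):
                continue
            # The edge emits one column: a letter where e is 1, a gap otherwise.
            column = [s[x] if d else "-" for s, x, d in zip(seqs, v, e)]
            cost = sum(sub(a, b) for a, b in combinations(column, 2)) + h(w)
            best = cost if best is None else min(best, cost)
        return best

    return h(tuple(0 for _ in seqs))


S = {"s1": "A", "s2": "C", "s3": "T", "s4": "G"}
k, n = 3, len(S)

# One PDB value per 3-element pattern, evaluated at the start state.
pdb_values = {p: optimal_cost([S[name] for name in p]) for p in combinations(S, k)}
h_all = sum(pdb_values.values()) / comb(n - 2, k - 2)

# Weights as stated on the slide (not obtained from an LP solver here).
weights = (1, 0, 0, 0)
h_pho = sum(w * value for w, value in zip(weights, pdb_values.values()))

print(list(pdb_values.values()), h_all, h_pho)
```

Every two of the four patterns share two sequences, so each pair of weights is constrained to sum to at most 1; the slide's weights satisfy this.
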
Combining Multiple Pattern Databases
A factored representation of MSA with operators
$O = \{ o^{i,j}_{x,y \to x',y'} \mid 1 \le i < j \le n,\ 0 \le x \le l_i,\ 0 \le y \le l_j \}$
An operator $o^{i,j}_{x,y \to x',y'}$ affects heuristic $h^{P}$ if $s_i, s_j \in P$.

Example: the edge $(3,3,5) \to (4,3,6)$ is factored into 3 operators:
$\{ o^{1,2}_{3,3 \to 4,3},\ o^{1,3}_{3,5 \to 4,6},\ o^{2,3}_{3,5 \to 3,6} \}$
− Basic factors for operators in higher dimensions
− Fewer operators than defining every operator explicitly

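The factoring in the example amounts to projecting an edge onto every pair of sequences. A minimal sketch follows; the function name factor_edge and the tuple encoding of an operator as ((i, j), (x, y), (x', y')) are assumptions for illustration. Skipping projections in which neither coordinate moves is also an assumption, since the slides do not say how such pairs are handled.

```python
from itertools import combinations


def factor_edge(v, w):
    """Project the edge v -> w onto every pair of sequences.

    Each operator is encoded as ((i, j), (x, y), (x', y')) with 1-based
    sequence indices i < j, as in the operator notation above.
    """
    operators = []
    for i, j in combinations(range(len(v)), 2):
        source = (v[i], v[j])
        target = (w[i], w[j])
        if source == target:
            continue  # assumption: a projection that does not move is dropped
        operators.append(((i + 1, j + 1), source, target))
    return operators


for op in factor_edge((3, 3, 5), (4, 3, 6)):
    print(op)
# ((1, 2), (3, 3), (4, 3))
# ((1, 3), (3, 5), (4, 6))
# ((2, 3), (3, 5), (3, 6))
```

An edge between $n$-dimensional vertices yields at most $\binom{n}{2}$ pairwise operators, which is where the reuse across higher-dimensional patterns comes from.
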