Sample Complexity of Algorithm Configuration for Sequence Alignment
Travis Dick, Nina Balcan, Dan DeBlasio, Carl Kingsford, Tuomas Sandholm, Ellen Vitercik

Sequence alignment
Goal: Line up pairs of strings (DNA, RNA, protein, …)
Uncover functional, structural, or evolutionary relationships
$S_1$ = GRTCPKPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
$S_2$ = EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGPEEIECTKLGNWSAMPSCKA
Aligned:
GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

Sequence alignment algorithms
Typically optimize for alignment features: number of matching characters, number of gaps, … [Needleman and Wunsch '70; Gotoh '82]
Standard algorithms solve for the alignment maximizing a weighted sum of these features
How to tune the feature weights?

Sequence alignment algorithms
Can sometimes access ground-truth alignment
Requires extensive manual alignments
Given a set of an application's "typical" alignment problems, together with ground-truth alignments, can we learn parameters that recover ground truth?

Model
1. Fix a parameterized alignment optimization function
2. Receive sample problems from an unknown distribution: sequence pairs $(S_1, S_1'), \dots, (S_N, S_N')$, each with its alignment
3. Find parameter values with best performance over the samples (closest to ground truth, for example)
Model studied from an empirical perspective: Kim and Kececioglu '07; Xu, Hutter, Hoos, Leyton-Brown '08; Dai, Khalil, Zhang, Dilkina, Song '17; …
Model studied from a theoretical perspective: Gupta and Roughgarden '16; Kleinberg, Leyton-Brown, Lucier '17; Weisz, György, Szepesvári '18; …

Questions
Focus of this talk: Will those parameters have high performance in expectation, that is, on a new sequence pair $(S, S')$ rather than only on the samples $(S_1, S_1'), \dots, (S_N, S_N')$?
Focus of prior work [e.g., Kim and Kececioglu '07]: Algorithmically, how to find good parameters over the training set

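Model
$\mathcal{D}$: Distribution over sequence pairs $(S, S')$
$\mathbb{R}^d$: Set of parameters
For any sequence pair $(S, S')$: $u_{\boldsymbol{\rho}}(S, S')$ = utility of using params $\boldsymbol{\rho} \in \mathbb{R}^d$ to align $S, S'$ (similarity between algorithm's output & ground truth)
Generalization: Given samples $(S_1, S_1'), \dots, (S_N, S_N') \sim \mathcal{D}$, for any $\boldsymbol{\rho} \in \mathbb{R}^d$,
$\left| \frac{1}{N} \sum_{i=1}^{N} u_{\boldsymbol{\rho}}(S_i, S_i') - \mathbb{E}_{(S, S') \sim \mathcal{D}}\left[u_{\boldsymbol{\rho}}(S, S')\right] \right| \le \;?$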
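The slides do not fix a particular similarity measure, so the following is only a minimal sketch under one assumed choice: utility is the fraction of ground-truth aligned position pairs that the predicted alignment also contains. An alignment is represented as two equal-length gapped strings; the names `aligned_pairs`, `utility`, and `empirical_utility` are hypothetical.

```python
def aligned_pairs(tau, tau_prime, gap="-"):
    """Return the set of index pairs (i, j) where S[i] is aligned with S'[j]."""
    pairs, i, j = set(), 0, 0
    for a, b in zip(tau, tau_prime):
        if a != gap and b != gap:
            pairs.add((i, j))
        i += a != gap  # advance in S only when the column consumes a character of S
        j += b != gap  # advance in S' only when the column consumes a character of S'
    return pairs


def utility(predicted, truth):
    """Assumed utility: fraction of ground-truth aligned pairs that are recovered."""
    truth_pairs = aligned_pairs(*truth)
    if not truth_pairs:
        return 1.0
    return len(truth_pairs & aligned_pairs(*predicted)) / len(truth_pairs)


def empirical_utility(predictions, truths):
    """Average utility over N samples (the empirical term in the generalization bound)."""
    return sum(utility(p, t) for p, t in zip(predictions, truths)) / len(truths)


truth = ("AC-T", "ACGT")
predicted = ("ACT-", "ACGT")
print(utility(predicted, truth))
print(empirical_utility([predicted, truth], [truth, truth]))
```

The empirical average is the quantity on the left of the generalization question; the expectation over the distribution is what it is meant to approximate.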

Primary challenge: Algorithmic performance is a volatile function of parameters
[Figure: similarity to ground truth as a function of $\rho_1$ and $\rho_2$]
For well-understood functions in machine learning: close connection between function parameters and value

Outline
1. Pairwise sequence alignment algorithms
2. Sample complexity for pairwise alignment
3. Multiple-sequence alignment algorithms
4. Sample complexity for multiple-sequence alignments
5. Additional applications

Pairwise sequence alignment
Input: Two sequences $S, S' \in \Sigma^n$
Alignment: Sequences $\tau, \tau' \in (\Sigma \cup \{-\})^*$ such that deleting "-" yields $S$ from $\tau$ and $S'$ from $\tau'$
$S$ = A C T G, $\tau$ = A - - C T G
$S'$ = G T C A, $\tau'$ = - G T C A -
[Figure annotations: gap, mismatch, match, insertion/deletion (indel)]
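As an illustration of the four annotated quantities, here is a small sketch that counts them for an aligned pair. It assumes that a gap is a maximal run of "-" within one row, which is the usual convention but is not spelled out on the slide; the function name `alignment_features` is hypothetical.

```python
def alignment_features(tau, tau_prime, gap="-"):
    """Count (matches, mismatches, indels, gaps) for two aligned rows."""
    if len(tau) != len(tau_prime):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = gaps = 0
    prev_top = prev_bottom = False  # was the previous column a gap in each row?
    for a, b in zip(tau, tau_prime):
        top, bottom = a == gap, b == gap
        if top and bottom:
            raise ValueError("a column may not contain two gap characters")
        if top or bottom:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
        # A new gap starts whenever a gap character does not extend a run in its row.
        gaps += (top and not prev_top) + (bottom and not prev_bottom)
        prev_top, prev_bottom = top, bottom
    return matches, mismatches, indels, gaps


# The example from the slide: tau = A--CTG, tau' = -GTCA-
print(alignment_features("A--CTG", "-GTCA-"))  # (1, 1, 4, 3)
```

For the slide's example this gives one match (C/C), one mismatch (T/A), four indel columns, and three gaps (one run in the first row, two in the second).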

Pairwise sequence alignment algorithms
Standard algorithm with parameters $\rho_1, \rho_2, \rho_3 \ge 0$: Use dynamic programming to find alignment $A$ maximizing:
$(\#\text{ matches}) - \rho_1 \cdot (\#\text{ mismatches}) - \rho_2 \cdot (\#\text{ indels}) - \rho_3 \cdot (\#\text{ gaps})$
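The slide does not give the dynamic program itself. A minimal sketch of one standard way to optimize this objective is a three-table recurrence in the style of Gotoh's affine-gap algorithm, shown below. It returns only the optimal objective value (traceback is omitted for brevity), treats a gap as a maximal run of gap characters within one row, and uses the hypothetical name `best_alignment_score`.

```python
import math


def best_alignment_score(s, t, rho1, rho2, rho3):
    """Max over alignments of matches - rho1*mismatches - rho2*indels - rho3*gaps."""
    n, m = len(s), len(t)
    neg = -math.inf
    # Tables are indexed by prefix lengths (i, j) and split by the last column:
    #   M: s[i-1] aligned with t[j-1] (match or mismatch)
    #   X: s[i-1] aligned with a gap character in the second row
    #   Y: t[j-1] aligned with a gap character in the first row
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                column = 1.0 if s[i - 1] == t[j - 1] else -rho1
                M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + column
            if i > 0:
                # Extending a run in the same row costs only rho2; otherwise a new gap opens.
                X[i][j] = max(X[i - 1][j], M[i - 1][j] - rho3, Y[i - 1][j] - rho3) - rho2
            if j > 0:
                Y[i][j] = max(Y[i][j - 1], M[i][j - 1] - rho3, X[i][j - 1] - rho3) - rho2
    return max(M[n][m], X[n][m], Y[n][m])


print(best_alignment_score("ACTG", "GTCA", 1, 1, 1))
```

Splitting by the type of the last column is what lets the recurrence charge $\rho_3$ once per gap while charging $\rho_2$ for every indel column.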

Pairwise sequence alignment algorithms
More generally, given parameters $\boldsymbol{\rho} \in \mathbb{R}^d$: Use dynamic programming to find alignment $A$ maximizing:
$\rho_1 \cdot f_1(A) + \dots + \rho_d \cdot f_d(A)$
$f_1(A), \dots, f_d(A)$: features of alignment $A$ (e.g., # matches, …)

Pairwise sequence alignment algorithms
Ground-truth alignment:
-GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA
Alignment by algorithm with poorly-tuned parameters:
GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA
Alignment by algorithm with well-tuned parameters:
GRTCPKPDDLPFSTV-VPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGY-SLDGPEEIECTKLGNWSA-MPSCKA

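Piecewise-constant utility functions
[Figure: utility $u_{\boldsymbol{\rho}}(x)$ as a function of $(\rho_1, \rho_2)$ for a fixed problem $x = (S, S')$]
Theorem: If for any problem $x$, the function $\boldsymbol{\rho} \mapsto u_{\boldsymbol{\rho}}(x)$ is piecewise constant and the boundaries between pieces are defined by $k$ hyperplanes, then:
• Pseudo-dimension of $\{u_{\boldsymbol{\rho}} : \boldsymbol{\rho} \in \mathbb{R}^d\}$ is $O(d \log k)$
• An optimal $\boldsymbol{\rho}$ on $O\left(\frac{d \log k}{\epsilon^2}\right)$ samples is $\epsilon$-optimal on $\mathcal{D}$
Need to show piecewise-constant utilities and bound $\log k$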

Key structural property
Lemma:
• For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that for any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant
• Partition induced by $(\text{total \# alignments})^2$ hyperplanes
Proof:
• For any pair of alignments $A, A'$, prefer $A$ over $A'$ when $\sum_i \rho_i \cdot f_i(A) > \sum_i \rho_i \cdot f_i(A')$
• Preference for $A$ vs $A'$ determined by the hyperplane $H_{A,A'} = \{\boldsymbol{\rho} : \sum_i \rho_i \cdot f_i(A) = \sum_i \rho_i \cdot f_i(A')\}$
• Let $\mathcal{H} = \{H_{A,A'} \mid A, A' \text{ alignments}\}$
• On any region $R$ in $\mathbb{R}^d \setminus \mathcal{H}$, the alignment ordering is fixed
• If the DP solver breaks ties reasonably, the output is constant
Corollary:
• For fixed $S, S'$, the algorithm's utility is a piecewise-constant function of $\boldsymbol{\rho}$
Total # alignments when $|S|, |S'| \le n$: at most $2^n n^{2n+1}$
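The structure in the lemma can be checked by brute force on a tiny instance. The sketch below enumerates every alignment of two length-4 strings, collects their feature vectors, and records which feature vector is optimal at each point of a coarse parameter grid. It uses the three-parameter objective (matches minus weighted mismatches, indels, and gaps) as an assumed special case, and all function names are hypothetical.

```python
from itertools import product
from math import comb


def enumerate_alignments(s, t, gap="-"):
    """Yield every alignment (tau, tau_prime) of s and t without all-gap columns."""
    if not s and not t:
        yield "", ""
        return
    if s and t:  # first column aligns s[0] with t[0]
        for a, b in enumerate_alignments(s[1:], t[1:], gap):
            yield s[0] + a, t[0] + b
    if s:  # first column aligns s[0] with a gap
        for a, b in enumerate_alignments(s[1:], t, gap):
            yield s[0] + a, gap + b
    if t:  # first column aligns a gap with t[0]
        for a, b in enumerate_alignments(s, t[1:], gap):
            yield gap + a, t[0] + b


def count_gaps(row, gap="-"):
    """Number of maximal runs of the gap character in one row."""
    return sum(1 for i, c in enumerate(row) if c == gap and (i == 0 or row[i - 1] != gap))


def features(tau, tau_prime, gap="-"):
    """(matches, mismatches, indels, gaps) of one alignment."""
    indels = sum(a == gap or b == gap for a, b in zip(tau, tau_prime))
    matches = sum(a == b for a, b in zip(tau, tau_prime))  # no column is gap/gap
    mismatches = len(tau) - indels - matches
    return matches, mismatches, indels, count_gaps(tau, gap) + count_gaps(tau_prime, gap)


def objective(f, rho):
    """matches - rho1*mismatches - rho2*indels - rho3*gaps for feature vector f."""
    return f[0] - rho[0] * f[1] - rho[1] * f[2] - rho[2] * f[3]


alignments = list(enumerate_alignments("ACTG", "GTCA"))
feature_vectors = sorted({features(a, b) for a, b in alignments})
# Each pair of distinct feature vectors contributes at most one boundary hyperplane.
hyperplane_bound = comb(len(feature_vectors), 2)

# Evaluate a coarse grid of parameters and record the optimal feature vector at each point.
grid = [x / 2 for x in range(9)]  # 0.0, 0.5, ..., 4.0
optima = {max(feature_vectors, key=lambda f: objective(f, rho)) for rho in product(grid, repeat=3)}
print(len(alignments), len(feature_vectors), hyperplane_bound, len(optima))
```

Only alignments with distinct feature vectors can be separated by a hyperplane, so the number of pieces actually observed is typically far smaller than the worst-case count over all pairs of alignments.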

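Generalization for pairwise alignment
For any sequence pair $(S, S')$: $u_{\boldsymbol{\rho}}(S, S')$ = utility of using params $\boldsymbol{\rho} \in \mathbb{R}^d$ to align $S, S'$ (similarity between algorithm's output & ground truth)
Theorem: Pseudo-dimension of $\{u_{\boldsymbol{\rho}} \mid \boldsymbol{\rho} \in \mathbb{R}^d\}$ is $\tilde{O}(dn)$, where $n = \max |S|$
Proof: Pseudo-dimension is $O(d \log k)$, where the number of hyperplanes satisfies $k \le (2^n n^{2n+1})^2$, so $\log k = O(n \log n)$
Corollary: An optimal $\boldsymbol{\rho}$ on a sample of size $\tilde{O}\left(\frac{dn}{\epsilon^2}\right)$ is $\epsilon$-optimal for $\mathcal{D}$ w.h.p.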
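Spelled out, the proof step is a short calculation. It assumes, as in the structural lemma, that the number of hyperplanes $k$ is at most the square of the total number of alignments:

```latex
\begin{align*}
k &\le \left(2^n n^{2n+1}\right)^2, \\
\log k &\le 2\left(n \log 2 + (2n+1)\log n\right) = O(n \log n), \\
\mathrm{Pdim}\left(\{u_{\boldsymbol{\rho}} : \boldsymbol{\rho} \in \mathbb{R}^d\}\right)
  &= O(d \log k) = O(dn \log n) = \tilde{O}(dn).
\end{align*}
```

The $\tilde{O}$ notation hides the $\log n$ factor, which is why the exponentially many alignments still yield a pseudo-dimension that is only linear in $n$ up to logarithms.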

Improvement for a special case
Special case widely used in practice: Given parameters $\rho_1, \rho_2, \rho_3 \ge 0$, find alignment maximizing:
$(\#\text{ matches}) - \rho_1 \cdot (\#\text{ mismatches}) - \rho_2 \cdot (\#\text{ indels}) - \rho_3 \cdot (\#\text{ gaps})$
Theorem [Gusfield, Balasubramanian, Naor '94; Fernández-Baca, Seppäläinen, Slutzki '04]:
• For any sequence pair $S, S'$, there exists a partition of $\mathbb{R}^3$ such that for any region $R$, across all $\boldsymbol{\rho} \in R$, the algorithm's output is invariant
• Partition induced by polynomially many (in $n$) hyperplanes
Improvement from $\approx n^n$ hyperplanes to polynomially many in $n$
