Sample Complexity of Algorithm Configuration for Sequence Alignment
Travis Dick, Nina Balcan, Dan DeBlasio, Carl Kingsford, Tuomas Sandholm, Ellen Vitercik
Sequence alignment
Goal: Line up pairs of strings (DNA, RNA, protein, …)
Uncover functional, structural, or evolutionary relationships
$S_1$ = GRTCPKPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
$S_2$ = EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGPEEIECTKLGNWSAMPSCKA
Aligned:
GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA
Sequence alignment algorithms
Typically optimize for alignment features: number of matching characters, number of gaps, … [Needleman and Wunsch '70; Gotoh '82]
Standard algorithms solve for the alignment maximizing a weighted sum of these features
How to tune the feature weights?
Sequence alignment algorithms
Can sometimes access a ground-truth alignment (requires extensive manual alignment)
Given a set of an application's "typical" alignment problems, together with ground-truth alignments, can we learn parameters that recover the ground truth?
Model
1. Fix a parameterized alignment optimization function
2. Receive sample problems from an unknown distribution: sequence pairs $(S_1, S'_1), \dots, (S_N, S'_N)$, each with an alignment
3. Find parameter values with the best performance over the samples (closest to ground truth, for example)
Model: studied from an empirical perspective
Kim and Kececioglu '07; Xu, Hutter, Hoos, Leyton-Brown '08; Dai, Khalil, Zhang, Dilkina, Song '17; …
Model: studied from a theoretical perspective
Gupta and Roughgarden '16; Kleinberg, Leyton-Brown, Lucier '17; Weisz, György, Szepesvári '18; …
Questions
Focus of this talk: Will those parameters have high performance in expectation, that is, on a fresh sequence pair $(S, S')$ rather than only on the samples $(S_1, S'_1), \dots, (S_N, S'_N)$?
Focus of prior work [e.g., Kim and Kececioglu '07]: Algorithmically, how to find good parameters over the training set
Model
$\mathcal{D}$: Distribution over sequence pairs $(S, S')$
$\mathbb{R}^d$: Set of parameters
For any sequence pair $(S, S')$: $u_\rho(S, S')$ = utility of using params $\rho \in \mathbb{R}^d$ to align $S, S'$ (similarity between the algorithm's output & ground truth)
Generalization: Given samples $(S_1, S'_1), \dots, (S_N, S'_N) \sim \mathcal{D}$, for any $\rho \in \mathbb{R}^d$,
$\left| \frac{1}{N} \sum_{i=1}^{N} u_\rho(S_i, S'_i) - \mathbb{E}_{(S, S') \sim \mathcal{D}}\left[u_\rho(S, S')\right] \right| \le \; ?$
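The empirical average in the generalization question, and step 3 of the model (pick the parameters that do best on the samples), can be sketched as follows. This is only an illustration: the names `empirical_utility` and `best_on_sample` are hypothetical, the utility is passed in as an arbitrary callable, and the search runs over a finite candidate list, which is an assumption the slides do not make.

```python
from statistics import mean

def empirical_utility(utility, rho, samples):
    """Average of utility(rho, S, S') over the sampled sequence pairs."""
    return mean(utility(rho, s, s_prime) for s, s_prime in samples)

def best_on_sample(utility, candidates, samples):
    """Return the candidate parameter setting with the highest empirical utility.

    `candidates` is a finite list of parameter settings (an assumption of this
    sketch; the slides consider all of R^d).
    """
    return max(candidates, key=lambda rho: empirical_utility(utility, rho, samples))
```

The generalization question is then whether the value returned by `empirical_utility` for the chosen parameters is close to the expected utility over $\mathcal{D}$.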
Primary challenge: Algorithmic performance is a volatile function of the parameters
[Figure: similarity to ground truth plotted over parameters $\rho_1$, $\rho_2$]
For well-understood functions in machine learning: close connection between function parameters and value
Outline
1. Pairwise sequence alignment algorithms
2. Sample complexity for pairwise alignment
3. Multiple-sequence alignment algorithms
4. Sample complexity for multiple-sequence alignments
5. Additional applications
Pairwise sequence alignment
Input: Two sequences $S, S' \in \Sigma^n$
Alignment: Sequences $\tau, \tau' \in (\Sigma \cup \{-\})^*$ such that deleting every "-" yields $S$ from $\tau$ and $S'$ from $\tau'$
$S$ = A C T G
$S'$ = G T C A
$\tau$ = A - - C T G
$\tau'$ = - G T C A -
[Figure labels: gap, mismatch, match, insertion/deletion (indel)]
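A small sketch of how the labeled quantities could be counted for the example above. It assumes that a gap is a maximal run of "-" within one row and that no column pairs two "-" characters; neither convention is spelled out on the slide. The function names are hypothetical.

```python
import re

GAP = "-"

def alignment_features(tau, tau_prime):
    """Return (# matches, # mismatches, # indels, # gaps) for an aligned pair."""
    if len(tau) != len(tau_prime):
        raise ValueError("aligned rows must have equal length")
    matches = mismatches = indels = 0
    for a, b in zip(tau, tau_prime):
        if a == GAP and b == GAP:
            raise ValueError("a column cannot pair two gap characters")
        if a == GAP or b == GAP:
            indels += 1          # one character paired with a gap character
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    # Assumption: a gap is a maximal run of '-' inside a single row.
    gaps = sum(len(re.findall(r"-+", row)) for row in (tau, tau_prime))
    return matches, mismatches, indels, gaps

def strip_gaps(row):
    """Deleting every '-' recovers the original sequence."""
    return row.replace(GAP, "")

print(alignment_features("A--CTG", "-GTCA-"))  # (1, 1, 4, 3)
```

For the slide's example, `strip_gaps` recovers ACTG and GTCA from the two aligned rows, which is the defining condition of an alignment.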
Pairwise sequence alignment algorithms
Standard algorithm with parameters $\rho_1, \rho_2, \rho_3 \ge 0$:
Use dynamic programming to find the alignment $A$ maximizing
$(\#\,\text{matches}) - \rho_1 \cdot (\#\,\text{mismatches}) - \rho_2 \cdot (\#\,\text{indels}) - \rho_3 \cdot (\#\,\text{gaps})$
Example alignment: $\tau$ = A - - C T G, $\tau'$ = - G T C A - (for $S$ = A C T G, $S'$ = G T C A)
[Figure labels: gap, mismatch, match, insertion/deletion (indel)]
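One way to realize this dynamic program is a three-table recurrence in the style of Gotoh's affine-gap algorithm. The sketch below returns only the optimal objective value, not the alignment itself, and it assumes a gap is a maximal run of "-" in one row, so the first indel of a run pays $\rho_2 + \rho_3$ and each further indel pays $\rho_2$. The function name is hypothetical, and this is not necessarily the implementation the authors used.

```python
import math

def best_alignment_score(s, t, rho1, rho2, rho3):
    """Max over alignments of matches - rho1*mismatches - rho2*indels - rho3*gaps."""
    neg = -math.inf
    n, m = len(s), len(t)
    open_cost = rho2 + rho3  # first indel of a new gap: indel penalty plus gap penalty
    # M: last column pairs two characters.
    # X: last column is s[i-1] over '-'.  Y: last column is '-' over t[j-1].
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                pair = 1.0 if s[i - 1] == t[j - 1] else -rho1
                M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            if i > 0:
                # Extending a run in the same row costs rho2; anything else opens a gap.
                X[i][j] = max(M[i - 1][j] - open_cost,
                              Y[i - 1][j] - open_cost,
                              X[i - 1][j] - rho2)
            if j > 0:
                Y[i][j] = max(M[i][j - 1] - open_cost,
                              X[i][j - 1] - open_cost,
                              Y[i][j - 1] - rho2)
    return max(M[n][m], X[n][m], Y[n][m])
```

With all three parameters set to zero the objective reduces to the number of matches, so the returned value equals the length of a longest common subsequence.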
Pairwise sequence alignment algorithms
More generally, given parameters $\rho \in \mathbb{R}^d$:
Use dynamic programming to find the alignment $A$ maximizing
$\rho_1 \cdot f_1(A) + \dots + \rho_d \cdot f_d(A)$
where $f_1(A), \dots, f_d(A)$ are features of the alignment $A$ (e.g., # matches, …)
Pairwise sequence alignment algorithms
Ground-truth alignment:
-GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA
Alignment by algorithm with poorly tuned parameters:
GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA
Alignment by algorithm with well-tuned parameters:
GRTCPKPDDLPFSTV-VPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGY-SLDGPEEIECTKLGNWSA-MPSCKA
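Piecewise-constant utility functions
[Figure: $u_\rho(x)$ plotted over parameters $\rho_1$, $\rho_2$ for a fixed problem $x = (S, S')$]
Theorem: If for any problem $x$, the function $\rho \mapsto u_\rho(x)$ is piecewise constant and the boundaries between pieces are defined by $m$ hyperplanes, then:
• The pseudo-dimension of $\{u_\rho \mid \rho \in \mathbb{R}^d\}$ is $O(d \log m)$
• An optimal $\rho$ on $\tilde{O}\left(\frac{d \log m}{\epsilon^2}\right)$ samples is $\epsilon$-optimal on $\mathcal{D}$
Need to show that the utilities are piecewise constant, and to bound $\log m$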
Key structural property
Lemma:
• For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that, for any region $R$, the algorithm's output is invariant across all $\rho \in R$
• The partition is induced by $(\text{total \# alignments})^2$ hyperplanes
[Figure: partition of the $(\rho_1, \rho_2)$ plane into regions]
Key structural property: proof of the lemma
• For any pair of alignments $A, A'$, the algorithm prefers $A$ over $A'$ when $\sum_i \rho_i \cdot f_i(A) > \sum_i \rho_i \cdot f_i(A')$
• The preference for $A$ vs. $A'$ is determined by the hyperplane $H_{A,A'}$ on which the two sums are equal
• Let $\mathcal{H} = \{H_{A,A'} \mid A, A' \text{ alignments}\}$
• On any region $R$ of $\mathbb{R}^d \setminus \mathcal{H}$, the ordering of the alignments is fixed
• If the DP solver breaks ties reasonably, the output is constant on $R$
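The proof idea can be checked by brute force on tiny inputs. The sketch below enumerates every alignment of two short strings, scores each one with the three-parameter objective (matches minus weighted mismatches, indels, and gaps), and reports which feature vector wins for a given parameter setting. Everything here is illustrative: the function names are hypothetical, a gap is assumed to be a maximal run of "-" in one row, ties are broken by the larger feature tuple, and the enumeration is exponential, so it is only usable on very short sequences.

```python
import re

def all_alignments(s, t):
    """Yield every alignment (tau, tau_prime) of s and t, using '-' as the gap character."""
    if not s and not t:
        yield "", ""
        return
    if s and t:   # pair the two leading characters
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:         # leading character of s over a gap character
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:         # gap character over the leading character of t
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b

def features(tau, tau_prime):
    """(# matches, # mismatches, # indels, # gaps) of one alignment."""
    cols = list(zip(tau, tau_prime))
    indels = sum("-" in col for col in cols)
    matches = sum(a == b and a != "-" for a, b in cols)
    mismatches = len(cols) - indels - matches
    gaps = len(re.findall(r"-+", tau)) + len(re.findall(r"-+", tau_prime))
    return matches, mismatches, indels, gaps

def score(rho, f):
    matches, mismatches, indels, gaps = f
    return matches - rho[0] * mismatches - rho[1] * indels - rho[2] * gaps

def optimal_features(s, t, rho):
    """Feature vector of a best alignment; ties go to the larger tuple (an arbitrary rule)."""
    candidates = {features(a, b) for a, b in all_alignments(s, t)}
    return max(candidates, key=lambda f: (score(rho, f), f))

def boundary(f, g):
    """Coefficients (c0, c1, c2, c3) with score(rho, f) - score(rho, g) = c0 + c1*rho1 + c2*rho2 + c3*rho3."""
    return (f[0] - g[0], g[1] - f[1], g[2] - f[2], g[3] - f[3])
```

For example, sweeping the indel weight with `optimal_features("AC", "CA", (0, r, 0))` returns the same feature vector for every `r` on one side of the plane given by `boundary`, and a different one on the other side, which is the piecewise-constant behavior the lemma describes.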
Key structural property: corollary
• For fixed $S, S'$, the algorithm's utility (similarity to ground truth) is a piecewise-constant function of $\rho$
[Figure: similarity to ground truth plotted over parameters $\rho_1$, $\rho_2$]
Key structural property: counting alignments
Total # alignments when $|S|, |S'| \le n$: at most $2^n n^{2n+1}$
Generalization for pairwise alignment
For any sequence pair $(S, S')$: $u_\rho(S, S')$ = utility of using params $\rho \in \mathbb{R}^d$ to align $S, S'$ (similarity between the algorithm's output & ground truth)
Theorem: The pseudo-dimension of $\{u_\rho \mid \rho \in \mathbb{R}^d\}$ is $\tilde{O}(nd)$, where $n = \max |S|$
Proof: The pseudo-dimension is $O(d \log m)$, where $m = O(2^n n^{2n+1})$
Corollary: An optimal $\rho$ on a sample of size $\tilde{O}\left(\frac{nd}{\epsilon^2}\right)$ is $\epsilon$-optimal for $\mathcal{D}$ w.h.p.
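The step from $O(d \log m)$ to $\tilde{O}(nd)$ is a short calculation. A sketch, using the bound on $m$ stated in the proof line and reading $\tilde{O}$ as hiding logarithmic factors in $n$:

```latex
\begin{align*}
\log m &= O\!\left(\log\left(2^{n}\, n^{2n+1}\right)\right)
        = O\!\left(n \log 2 + (2n+1)\log n\right)
        = O(n \log n), \\
O(d \log m) &= O(d\, n \log n) = \tilde{O}(nd).
\end{align*}
```

Squaring the count, as in the lemma's $(\text{total \# alignments})^2$ hyperplanes, only doubles $\log m$, so the conclusion is unchanged.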
Improvement for a special case
Special case widely used in practice: Given parameters $\rho_1, \rho_2, \rho_3 \ge 0$, find the alignment maximizing
$(\#\,\text{matches}) - \rho_1 \cdot (\#\,\text{mismatches}) - \rho_2 \cdot (\#\,\text{indels}) - \rho_3 \cdot (\#\,\text{gaps})$
Theorem [Gusfield, Balasubramanian, Naor '94; Fernández-Baca, Seppäläinen, Slutzki '04]:
• For any sequence pair $S, S'$, there exists a partition of $\mathbb{R}^3$ such that, for any region $R$, the algorithm's output is invariant across all $\rho \in R$
• The partition is induced by polynomially many (in $n$) hyperplanes
Improvement: the number of hyperplanes drops from exponential in $n$ to polynomial in $n$