
Sample Complexity of Algorithm Configuration for Sequence Alignment



  1. Sample Complexity of Algorithm Configuration for Sequence Alignment
     Travis Dick, Nina Balcan, Dan DeBlasio, Carl Kingsford, Tuomas Sandholm, Ellen Vitercik

  2. Sequence alignment
     Goal: Line up pairs of strings (DNA, RNA, protein, …)
     Uncover functional, structural, or evolutionary relationships
     $S_1$ = GRTCPKPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     $S_2$ = EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGPEEIECTKLGNWSAMPSCKA
     GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

  3. Sequence alignment algorithms
     Typically optimize for alignment features: number of matching characters, number of gaps, … [Needleman and Wunsch '70; Gotoh '82]
     Standard algorithms solve for the alignment maximizing a weighted sum of these features
     How to tune the feature weights?

  4. Sequence alignment algorithms
     Can sometimes access ground-truth alignments, but these require extensive manual alignment
     Given a set of the application's "typical" alignment problems, together with ground-truth alignments, can we learn parameters that recover the ground truth?

  5. Model
     1. Fix a parameterized alignment optimization function
     2. Receive sample problems from unknown distribution:
        (Sequence $S_1$, Sequence $S_1'$, Alignment), …, (Sequence $S_N$, Sequence $S_N'$, Alignment)
     3. Find parameter values with best performance over samples (closest to ground truth, for example)

  6. Model (same three steps as slide 5)
     Model studied from empirical perspective: Kim and Kececioglu '07; Xu, Hutter, Hoos, Leyton-Brown '08; Dai, Khalil, Zhang, Dilkina, Song '17; …

  7. Model (same three steps as slide 5)
     Model studied from theoretical perspective: Gupta and Roughgarden '16; Kleinberg, Leyton-Brown, Lucier '17; Weisz, György, Szepesvári '18; …

  8. Questions
     Focus of this talk: Will those parameters have high performance in expectation?
     [Figure: training samples (Sequence $S_1$, Sequence $S_1'$, Alignment), …, (Sequence $S_N$, Sequence $S_N'$, Alignment), and a new pair (Sequence $S$, Sequence $S'$) whose alignment is marked "?"]
     Focus of prior work [e.g., Kim and Kececioglu '07]: Algorithmically, how to find good parameters over the training set

  9. Model
     $\mathcal{D}$: Distribution over sequence pairs $(S, S')$
     $\mathbb{R}^d$: Set of parameters
     For any sequence pair $(S, S')$: $u_{\rho}(S, S')$ = utility of using params $\rho \in \mathbb{R}^d$ to align $S, S'$ (similarity between algorithm's output & ground truth)
     Generalization: Given samples $(S_1, S_1'), \dots, (S_N, S_N') \sim \mathcal{D}$, for any $\rho \in \mathbb{R}^d$,
     $\left| \frac{1}{N} \sum_{i=1}^{N} u_{\rho}(S_i, S_i') - \mathbb{E}_{(S,S') \sim \mathcal{D}}\left[u_{\rho}(S, S')\right] \right| \le\ ?$

  10. Primary challenge: Algorithmic performance is a volatile function of parameters
     [Plot: similarity to ground truth as a function of $\rho_1$ and $\rho_2$]
     For well-understood functions in machine learning: close connection between function parameters and value

  11. Outline
     1. Pairwise sequence alignment algorithms
     2. Sample complexity for pairwise alignment
     3. Multiple-sequence alignment algorithms
     4. Sample complexity for multiple-sequence alignments
     5. Additional applications

  12. Pairwise sequence alignment
     Input: Two sequences $S, S' \in \Sigma^n$
     Alignment: Sequences $\tau, \tau' \in (\Sigma \cup \{-\})^*$ such that deleting "$-$" yields $S$ from $\tau$ and $S'$ from $\tau'$
     $S$  = A C T G      $\tau$  = A - - C T G
     $S'$ = G T C A      $\tau'$ = - G T C A -
     Column labels in the figure: match, mismatch, insertion/deletion (indel); a gap is a run of consecutive "$-$" characters in one row
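The definition on this slide can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, and the requirements that both rows have equal length and that no column pairs two gap characters are standard assumptions that the slide does not state explicitly.

```python
def is_alignment(S, S_prime, tau, tau_prime, gap="-"):
    """Return True if (tau, tau_prime) is an alignment of (S, S_prime).

    Assumptions (not stated on the slide): rows have equal length and
    no column pairs a gap with a gap.
    """
    if len(tau) != len(tau_prime):
        return False
    if any(a == gap and b == gap for a, b in zip(tau, tau_prime)):
        return False
    # Deleting the gap character must recover the original sequences.
    return tau.replace(gap, "") == S and tau_prime.replace(gap, "") == S_prime


def column_types(tau, tau_prime, gap="-"):
    """Label each column as 'match', 'mismatch', or 'indel'."""
    labels = []
    for a, b in zip(tau, tau_prime):
        if a == gap or b == gap:
            labels.append("indel")
        elif a == b:
            labels.append("match")
        else:
            labels.append("mismatch")
    return labels


# The example from the slide: S = ACTG, S' = GTCA.
print(is_alignment("ACTG", "GTCA", "A--CTG", "-GTCA-"))
print(column_types("A--CTG", "-GTCA-"))
```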

  13. Pairwise sequence alignment algorithms
     Standard algorithm with parameters $\rho_1, \rho_2, \rho_3 \ge 0$: use dynamic programming to find the alignment $A$ maximizing
     $(\#\,\text{matches}) - \rho_1 \cdot (\#\,\text{mismatches}) - \rho_2 \cdot (\#\,\text{indels}) - \rho_3 \cdot (\#\,\text{gaps})$
     [Figure: the example alignment of $S$ = A C T G and $S'$ = G T C A from slide 12, with match, mismatch, indel, and gap labeled]
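A minimal sketch of the dynamic program named on this slide, assuming a gap is a maximal run of consecutive "-" characters in one row (so opening a run costs $\rho_2 + \rho_3$ and extending it costs $\rho_2$). This is a Gotoh-style three-state recurrence written for illustration; the function names are hypothetical and this is not the authors' implementation.

```python
NEG_INF = float("-inf")


def features(tau, tau_prime, gap="-"):
    """Count (# matches, # mismatches, # indels, # gaps) of an alignment."""
    matches = mismatches = indels = gaps = 0
    prev_row = None  # row that held the gap character in the previous column
    for a, b in zip(tau, tau_prime):
        if a == gap or b == gap:
            indels += 1
            row = "top" if a == gap else "bottom"
            if row != prev_row:  # a new run of gap characters starts here
                gaps += 1
            prev_row = row
        else:
            matches += a == b
            mismatches += a != b
            prev_row = None
    return matches, mismatches, indels, gaps


def score(feats, rho1, rho2, rho3):
    """Objective from the slide, evaluated on a feature tuple."""
    matches, mismatches, indels, gaps = feats
    return matches - rho1 * mismatches - rho2 * indels - rho3 * gaps


def best_score(S, Sp, rho1, rho2, rho3):
    """Maximum objective value over all alignments of S and Sp."""
    n, m = len(S), len(Sp)
    # M: last column pairs two characters; X: S[i-1] over '-'; Y: '-' over Sp[j-1].
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                s = 1.0 if S[i - 1] == Sp[j - 1] else -rho1
                M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            if i > 0:  # extend a bottom-row gap run, or open a new one
                X[i][j] = max(M[i - 1][j] - rho3, X[i - 1][j], Y[i - 1][j] - rho3) - rho2
            if j > 0:  # extend a top-row gap run, or open a new one
                Y[i][j] = max(M[i][j - 1] - rho3, Y[i][j - 1], X[i][j - 1] - rho3) - rho2
    return max(M[n][m], X[n][m], Y[n][m])


# The slide's example alignment versus the optimum for rho = (1, 1, 1).
print(score(features("A--CTG", "-GTCA-"), 1, 1, 1))
print(best_score("ACTG", "GTCA", 1, 1, 1))
```

The recurrence returns only the optimal value; recovering the alignment itself would need the usual traceback, omitted here for brevity.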

  14. Pairwise sequence alignment algorithms
     More generally, given parameters $\rho \in \mathbb{R}^d$: use dynamic programming to find the alignment $A$ maximizing
     $\rho_1 \cdot f_1(A) + \dots + \rho_d \cdot f_d(A)$
     where $f_1(A), \dots, f_d(A)$ are features of alignment $A$ (e.g., # matches, …)

  15. Pairwise sequence alignment algorithms
     Ground-truth alignment:
     -GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA

  16. Pairwise sequence alignment algorithms
     Ground-truth alignment:
     -GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA
     Alignment by algorithm with poorly-tuned parameters:
     GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA

  17. Pairwise sequence alignment algorithms
     Ground-truth alignment:
     -GRTCPKPDDLPFSTVVP-LKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     E-VKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGYSLDGP-EEIECTKLGNWSAMPSC-KA
     Alignment by algorithm with poorly-tuned parameters:
     GRTCP---KPDDLPFSTVVPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDN-GFVNYPAKPTLYYK-DKATFGCHDGY-SLDGPEEIECTKLGNWS-AMPSCKA
     Alignment by algorithm with well-tuned parameters:
     GRTCPKPDDLPFSTV-VPLKTFYEPGEEITYSCKPGYVSRGGMRKFICPLTGLWPINTLKCTP
     EVKCPFPSRPDNGFVNYPAKPTLYYKDKATFGCHDGY-SLDGPEEIECTKLGNWSA-MPSCKA

  18. Outline
     1. Pairwise sequence alignment algorithms
     2. Sample complexity for pairwise alignment
     3. Multiple-sequence alignment algorithms
     4. Sample complexity for multiple-sequence alignments
     5. Additional applications

  19. Piecewise-constant utility functions
     [Plot: $u_x(\rho)$ over $(\rho_1, \rho_2)$ for a fixed problem $x = (S, S')$]
     Theorem: If for any problem $x$, the function $\rho \mapsto u_{\rho}(x)$ is piecewise constant and the boundaries between pieces are defined by $k$ hyperplanes, then:
     • Pseudo-dimension of $\{u_{\rho} : \rho \in \mathbb{R}^d\}$ is $O(d \log k)$
     • An optimal $\rho$ on $O\left(\frac{d \log k}{\epsilon^2}\right)$ samples is $\epsilon$-optimal on $\mathcal{D}$
     Need to show piecewise-constant utilities and bound $\log(k)$

  20. Key structural property
     Lemma:
     • For any sequence pair $S, S' \in \Sigma^n$, there exists a partition of $\mathbb{R}^d$ such that for any region $R$, across all $\rho \in R$, the algorithm's output is invariant
     • Partition induced by $(\text{total \# alignments})^2$ hyperplanes
     [Figure: partition of the $(\rho_1, \rho_2)$ plane into regions]

  21. Key structural property (proof of the lemma on slide 20)
     Proof:
     • For any pair of alignments $A, A'$, prefer $A$ over $A'$ when $\sum_i \rho_i \cdot f_i(A) > \sum_i \rho_i \cdot f_i(A')$
     • Preference for $A$ vs $A'$ is determined by the hyperplane $H_{A,A'} = \{\rho : \sum_i \rho_i \cdot f_i(A) = \sum_i \rho_i \cdot f_i(A')\}$
     • Let $\mathcal{H} = \{H_{A,A'} \mid A, A' \text{ alignments}\}$
     • On any region $R$ in $\mathbb{R}^d \setminus \mathcal{H}$, the alignment ordering is fixed
     • If the DP solver breaks ties reasonably, the output is constant
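The proof can be illustrated by brute force on tiny inputs. The sketch below assumes the four features (# matches, # mismatches, # indels, # gaps) with a gap counted as a run of consecutive "-" in one row. It enumerates every alignment, records on which side of each hyperplane $H_{A,A'}$ a parameter vector falls, and picks the best feature vector. All names are hypothetical, and the enumeration is exponential, so it is only meant for very short sequences.

```python
from itertools import combinations


def all_alignments(S, Sp, gap="-"):
    """Yield every alignment (tau, tau') of S and Sp as a pair of strings."""
    if not S and not Sp:
        yield "", ""
        return
    if S and Sp:  # pair the two leading characters
        for a, b in all_alignments(S[1:], Sp[1:]):
            yield S[0] + a, Sp[0] + b
    if S:  # leading character of S against a gap
        for a, b in all_alignments(S[1:], Sp):
            yield S[0] + a, gap + b
    if Sp:  # gap against the leading character of Sp
        for a, b in all_alignments(S, Sp[1:]):
            yield gap + a, Sp[0] + b


def features(tau, tau_prime, gap="-"):
    """(# matches, # mismatches, # indels, # gaps) of one alignment."""
    matches = mismatches = indels = gaps = 0
    prev_row = None
    for a, b in zip(tau, tau_prime):
        if a == gap or b == gap:
            indels += 1
            row = "top" if a == gap else "bottom"
            gaps += row != prev_row
            prev_row = row
        else:
            matches += a == b
            mismatches += a != b
            prev_row = None
    return matches, mismatches, indels, gaps


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feature_vectors(S, Sp):
    """Distinct feature vectors over all alignments, in a fixed order."""
    return sorted({features(t, tp) for t, tp in all_alignments(S, Sp)})


def sign_pattern(rho, vectors):
    """Side of each hyperplane rho . (f(A) - f(A')) = 0 that rho lies on."""
    signs = []
    for fa, fb in combinations(vectors, 2):
        v = dot(rho, [a - b for a, b in zip(fa, fb)])
        signs.append((v > 0) - (v < 0))
    return tuple(signs)


def best_features(rho, vectors):
    """Feature vector of an alignment maximizing rho . f(A)."""
    return max(vectors, key=lambda f: dot(rho, f))


vecs = feature_vectors("AC", "CA")
print(best_features((1, -1, -1, -1), vecs))
print(best_features((10, -100, -1, -1), vecs))
```

Two parameter vectors with the same sign pattern lie in the same region of the partition and therefore select the same feature vector, which is the invariance the lemma states.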

  22. Key structural property (corollary of the lemma on slide 20)
     Corollary:
     • For fixed $S, S'$, the algorithm's utility is a piecewise-constant function of $\rho$
     [Plot: similarity to ground truth as a function of $\rho_1$ and $\rho_2$]

  23. Key structural property (size of the partition in the lemma on slide 20)
     • Partition induced by $(\text{total \# alignments})^2$ hyperplanes
     • Total # alignments when $|S|, |S'| \le n$: at most $2^n n^{2n+1}$
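For a rough feel of how loose this count is, the sketch below counts alignments exactly for small lengths and compares against the slide's bound. It assumes an alignment is a sequence of columns with no gap-over-gap column, which gives the Delannoy recurrence. The slide's counting convention may differ, so the comparison starts at $n = 2$, and the function names are hypothetical.

```python
def count_alignments(n, m):
    """Number of alignments of sequences of lengths n and m.

    Each alignment ends in one of three column types (pair, gap below,
    gap above), which gives the three-term recurrence below.
    """
    D = [[1] * (m + 1) for _ in range(n + 1)]  # one all-indel alignment if a side is empty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = D[i - 1][j] + D[i][j - 1] + D[i - 1][j - 1]
    return D[n][m]


def slide_bound(n):
    """The bound 2^n * n^(2n+1) quoted on the slide."""
    return 2 ** n * n ** (2 * n + 1)


for n in range(2, 7):
    print(n, count_alignments(n, n), slide_bound(n))
```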

  24. Generalization for pairwise alignment
     For any sequence pair $(S, S')$: $u_{\rho}(S, S')$ = utility of using params $\rho \in \mathbb{R}^d$ to align $S, S'$ (similarity between algorithm's output & ground truth)
     Theorem: Pseudo-dimension of $\{u_{\rho} \mid \rho \in \mathbb{R}^d\}$ is $\tilde{O}(dn)$, where $n = \max |S|$
     Proof: Pseudo-dimension is $O(d \log k)$, where $k = O(2^n n^{2n+1})$
     Corollary: An optimal $\rho$ on a sample of size $\tilde{O}\left(\frac{dn}{\epsilon^2}\right)$ is $\epsilon$-optimal for $\mathcal{D}$ w.h.p.
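The step from $O(d \log k)$ to $\tilde{O}(dn)$ is short enough to spell out. The lines below use only the quantities on this slide and read $\tilde{O}$ as hiding logarithmic factors in $n$, which is the usual convention and an assumption here.

```latex
\begin{align*}
\log k &= \log O\!\left(2^{n}\, n^{2n+1}\right) \\
       &= n \log 2 + (2n+1) \log n + O(1) \\
       &= O(n \log n), \\
d \log k &= O(d\, n \log n) = \tilde{O}(dn).
\end{align*}
```

Squaring the hyperplane count, as in the lemma's $(\text{total \# alignments})^2$, only doubles $\log k$, so the bound is unchanged up to constants.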

  25. Improvement for a special case
     Special case widely used in practice: given parameters $\rho_1, \rho_2, \rho_3 \ge 0$, find the alignment maximizing
     $(\#\,\text{matches}) - \rho_1 \cdot (\#\,\text{mismatches}) - \rho_2 \cdot (\#\,\text{indels}) - \rho_3 \cdot (\#\,\text{gaps})$
     Theorem [Gusfield, Balasubramanian, Naor '94; Fernández-Baca, Seppäläinen, Slutzki '04]:
     • For any sequence pair $S, S'$, there exists a partition of $\mathbb{R}^3$ such that for any region $R$, across all $\rho \in R$, the algorithm's output is invariant
     • Partition induced by $O(n^{c})$ hyperplanes, for a constant $c$ [exponent illegible in the extracted slide]
     Improvement from $\approx n^{n}$ to $n^{c}$
