Learning meets Sequencing: a Generality Framework for Read-Sets
Filip Železný, Karel Jalovec, Jakub Tolar
Czech Technical University in Prague, University of Minnesota
Železný, Jalovec, Tolar (CTU Prague) Learning Meets Sequencing 1 / 11
Sequencing
expensive:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
cheaper:
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
Assembly
Reads: gcgatgcatg, gtacgtca, gtacg, cagtacgtcagt, acgtca, catgacgta, gtgtggg, gaa, gaacgtacatg, tactgt, tgcgc
Assembled string: gaacgtacatgcgcgatgcatgacgtactgtacgtcagtacgtcagtgtggg
Overlap graph. Here, an edge means ≥ 2 shared letters.
Searching for a Hamiltonian path (NP-complete).
Reads highlighted one by one along the path: gaacgtacatg, tgcgc, gcgatgcatg, catgacgta, tactgt, gtacgtca, cagtacgtcagt, gtgtggg
Shorter reads ⇒ harder task
Classification Learning
Controls | Cases
agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
Find a string consistent with examples of only one class
Example = read set
Consistent with a read set = substring of a string assembled from the reads
Classification Learning (cont'd)
agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
⇓ assembly of each read set
⇓ learning (searching for discriminative substrings)
Baseline approach: first assemble, then learn
Can use existing algorithms
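The learning half of the baseline can be sketched as follows, assuming each read set has already been assembled into one string. The assembled strings below are made-up toy data, not derived from the reads on the slide, and "discriminative" is read here as "a substring of every case and of no control", which is only one possible reading of the task.

```python
def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def discriminative_substrings(positives, negatives):
    """Substrings shared by all positive strings and absent from every negative one."""
    if not positives:
        return set()
    common = set.intersection(*(substrings(s) for s in positives))
    excluded = set().union(*(substrings(s) for s in negatives))
    return common - excluded


# Hypothetical assembled strings, one per example.
cases = ["acgtta", "ggacgt"]
controls = ["acgaat", "ttgacc"]
print(sorted(discriminative_substrings(cases, controls), key=len))
```

Enumerating all substrings is quadratic in the string length per example; for realistic genome sizes a suffix-based index would be the usual replacement, which this sketch does not attempt.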
Classification Learning (cont'd)
agcgacc cgcg gcaacg taaaaagct ccacgacgt accattg atcgatcg gtca gggc ttctcggct gctgctt aaaagcaaa
⇓ learning from read sets directly
Proposed approach: blend assembly with learning
No existing algorithm (?)
Generality of Read Sets
An ILP-inspired approach: search in the generality lattice of read sets
Extension $\mathrm{Ext}(S)$ of read set $S$: the set of all strings consistent with $S$
Extensions may be infinite due to loops: $\mathrm{Ext}(\{ab, ba, abc\}) \supseteq \{a, ab, aba, abab, \ldots\}$
Read set $S_1$ is more general than $S_2$, written $S_1 \succeq S_2$ (equivalently $S_2 \preceq S_1$), iff $\mathrm{Ext}(S_1) \supseteq \mathrm{Ext}(S_2)$
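Because an extension can be infinite, any concrete check has to truncate it. The sketch below computes all members of Ext(S) up to a chosen length and compares two read sets on that truncation. It rests on an assumption the slides do not spell out: a string is "assembled from the reads" if it is obtained by walking through reads, gluing each next read (different from the current one) along a proper suffix/prefix overlap of at least `min_overlap` letters. The names `bounded_ext` and `more_general` are hypothetical.

```python
def bounded_ext(reads, max_len, min_overlap=1):
    """Strings of length <= max_len that are substrings of some assembled string."""
    reads = set(reads)
    if not reads:
        return set()
    # A substring of length <= max_len lies inside a sub-walk whose string is
    # at most this long, so longer walks add nothing new.
    limit = max_len + 2 * (max(map(len, reads)) - 1)
    seen = {(r, r) for r in reads}          # states: (assembled text, last read)
    stack = list(seen)
    ext = set()
    while stack:
        text, last = stack.pop()
        ext |= {text[i:j]
                for i in range(len(text))
                for j in range(i + 1, min(len(text), i + max_len) + 1)}
        for nxt in reads - {last}:
            for k in range(min_overlap, min(len(last), len(nxt))):
                if last.endswith(nxt[:k]):
                    state = (text + nxt[k:], nxt)
                    if len(state[0]) <= limit and state not in seen:
                        seen.add(state)
                        stack.append(state)
    return ext


def more_general(s1, s2, max_len):
    """Truncated test of Ext(s1) ⊇ Ext(s2); a necessary condition only."""
    return bounded_ext(s1, max_len) >= bounded_ext(s2, max_len)


print(sorted(bounded_ext({"ab", "ba", "abc"}, 4), key=len))
print(more_general({"ab", "ba"}, {"aba"}, 6))
```

Passing the truncated test for one `max_len` does not prove generality, since two extensions may first differ on longer strings; failing it does refute generality under the assumed notion of assembly.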
Intuitive Analogy to ILP
lifted:   clause                  read set
          p(x) ← q(x)             {ab, ba, abc}
ground:   models                  "models"
          {p(a)}                  a
          {p(a), q(a)}            ab
          {p(f(a)), q(f(a))}      aba
Least General Generalization
$\mathrm{Lgg}(S_1, S_2) = S$ iff $S_1 \preceq S$ and $S_2 \preceq S$ and there is no $S'$ such that $S_1 \preceq S'$ and $S_2 \preceq S'$ and at the same time $S' \preceq S$ and $S \not\preceq S'$, where $S \preceq S'$ means $\mathrm{Ext}(S) \subseteq \mathrm{Ext}(S')$.
Is it simply $\mathrm{Lgg}(S_1, S_2) = S_1 \cup S_2$?
$S_1 = \{ab, bc\}$, $\mathrm{Ext}(S_1) = \{a, b, c, ab, bc, abc\}$
$S_2 = \{bc, cd\}$, $\mathrm{Ext}(S_2) = \{b, c, d, bc, cd, bcd\}$
$S = S_1 \cup S_2 = \{ab, bc, cd\}$, $\mathrm{Ext}(S) = \{a, b, c, d, ab, bc, cd, abc, bcd, abcd\} = \mathrm{Ext}(S_1) \cup \mathrm{Ext}(S_2) \cup \{abcd\}$
Is it really least, given the "extra" string abcd?
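For this particular example the question can be probed by brute force. The sketch below recomputes the three extensions and then enumerates every read set built from members of Ext(S), looking for one that still covers Ext(S1) ∪ Ext(S2) but avoids abcd. It relies on an assumed notion of assembly (walks through distinct consecutive reads glued along a proper overlap of at least one letter), which the slides do not fix, so the outcome says nothing about other definitions or other examples. All names are hypothetical.

```python
from itertools import combinations


def bounded_ext(reads, max_len=8, min_overlap=1):
    """Substrings (length <= max_len) of strings assembled by overlap walks."""
    reads = set(reads)
    if not reads:
        return set()
    limit = max_len + 2 * (max(map(len, reads)) - 1)
    seen = {(r, r) for r in reads}          # states: (assembled text, last read)
    stack = list(seen)
    ext = set()
    while stack:
        text, last = stack.pop()
        ext |= {text[i:j]
                for i in range(len(text))
                for j in range(i + 1, min(len(text), i + max_len) + 1)}
        for nxt in reads - {last}:
            for k in range(min_overlap, min(len(last), len(nxt))):
                if last.endswith(nxt[:k]):
                    state = (text + nxt[k:], nxt)
                    if len(state[0]) <= limit and state not in seen:
                        seen.add(state)
                        stack.append(state)
    return ext


S1, S2 = {"ab", "bc"}, {"bc", "cd"}
S = S1 | S2
target = bounded_ext(S1) | bounded_ext(S2)   # what any upper bound must cover
extra = bounded_ext(S) - target              # strings added by taking the union

# Any S' with Ext(S') inside Ext(S) can only use members of Ext(S) as reads,
# so it suffices to try all non-empty subsets of Ext(S).
candidates = sorted(bounded_ext(S))
smaller_upper_bounds = [
    set(c)
    for n in range(1, len(candidates) + 1)
    for c in combinations(candidates, n)
    if target <= bounded_ext(c) < bounded_ext(S)
]
print(extra, smaller_upper_bounds)
```

Under these assumptions the search finds no read set strictly between the two, because every way of making both abc and bcd consistent also lets the reads chain into abcd. Whether the same holds in general, or under a different assembly rule, is exactly the open question on the slide.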
Most General Specialization
$\mathrm{Mgs}(S_1, S_2)$ is a read set $S$ such that $S \preceq S_1$ and $S \preceq S_2$ and there is no $S'$ such that $S' \preceq S_1$ and $S' \preceq S_2$ and at the same time $S \preceq S'$ and $S' \not\preceq S$, where $S \preceq S'$ means $\mathrm{Ext}(S) \subseteq \mathrm{Ext}(S')$.
Is it simply $\mathrm{Mgs}(S_1, S_2) = S_1 \cap S_2$?
$S_1 = \{ab, ba\}$, $\mathrm{Ext}(S_1) = \{a, b, ab, ba, aba, bab, \ldots\}$
$S_2 = \{aba\}$, $\mathrm{Ext}(S_2) = \{a, b, ab, ba, aba\}$
$S = S_1 \cap S_2 = \emptyset$, $\mathrm{Ext}(S) = \emptyset \neq \mathrm{Ext}(S_1) \cap \mathrm{Ext}(S_2) \ni aba$
Is it really most general, given "missing" strings such as aba?
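For this example the comparison can be worked out directly from the two extensions listed on the slide. The derivation below uses only the definition of generality via extensions, writing $S \preceq S'$ for $\mathrm{Ext}(S) \subseteq \mathrm{Ext}(S')$; it is a check of this one example, not a general characterization of Mgs.

```latex
\begin{gather*}
\mathrm{Ext}(S_2) = \{a, b, ab, ba, aba\} \subseteq \mathrm{Ext}(S_1)
  \quad\Longrightarrow\quad
  \mathrm{Ext}(S_1) \cap \mathrm{Ext}(S_2) = \mathrm{Ext}(S_2), \\
S_2 \preceq S_1 \ \text{and}\ S_2 \preceq S_2
  \quad\text{($S_2$ is itself a common specialization)}, \\
S' \preceq S_1 \ \text{and}\ S' \preceq S_2
  \quad\Longrightarrow\quad
  \mathrm{Ext}(S') \subseteq \mathrm{Ext}(S_2)
  \quad\Longrightarrow\quad
  S' \preceq S_2 .
\end{gather*}
```

So in this example $S_2 = \{aba\}$ satisfies the Mgs definition, while the set intersection $S_1 \cap S_2 = \emptyset$ has an empty extension and is strictly less general than $S_2$.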
Concluding question
Relevant work?
Erratum: Several errors in the submission PDF were corrected on Sep 13. Apologies to reviewers.