Polynomial-Time Approximation Algorithms for Weighted LCS Problem
Marek Cygan (1), Marcin Kubica (1), Jakub Radoszewski (1), Wojciech Rytter (1, 2) and Tomasz Waleń (1)
(1) University of Warsaw, Poland; (2) Copernicus University, Toruń, Poland
CPM 2011, 2011-06-29
Definitions
Definition of a weighted sequence. A weighted sequence $X = x_1 x_2 \ldots x_n$ of length $|X| = n$ over an alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_K\}$ is a sequence of sets of pairs of the form
$$x_i = \{(\sigma_j, p_i^{(X)}(\sigma_j)) : j = 1, 2, \ldots, K\}.$$
Here $p_i^{(X)}(\sigma_j)$ is the occurrence probability of the character $\sigma_j$ at position $i$; these values are non-negative and sum to 1 for a given $i$. $WS(\Sigma)$ is the set of all weighted sequences over the alphabet $\Sigma$. We assume that $|\Sigma| = O(1)$.
Definitions
Example

            x_1    x_2    x_3    x_4
  p_i(a)    1/3    1      0      1/2
  p_i(b)    1/3    0      1/2    1/4
  p_i(c)    1/3    0      1/2    1/4

A weighted sequence $X = x_1 x_2 x_3 x_4$ over the alphabet $\Sigma = \{a, b, c\}$.
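One possible way to hold such a sequence in code is a list with one letter-to-probability mapping per position. The sketch below is only an illustration of the definition: the list-of-dicts layout and the helper name `is_weighted_sequence` are assumptions, not something the slides prescribe. It encodes the example above with exact fractions and checks that each position is a probability distribution.

```python
from fractions import Fraction

# Hypothetical representation: one dict per position, mapping each letter
# of the alphabet to its occurrence probability at that position.
X = [
    {"a": Fraction(1, 3), "b": Fraction(1, 3), "c": Fraction(1, 3)},
    {"a": Fraction(1), "b": Fraction(0), "c": Fraction(0)},
    {"a": Fraction(0), "b": Fraction(1, 2), "c": Fraction(1, 2)},
    {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)},
]


def is_weighted_sequence(seq, alphabet):
    """Check that every position is a probability distribution over `alphabet`."""
    return all(
        set(pos) == set(alphabet)
        and all(p >= 0 for p in pos.values())
        and sum(pos.values()) == 1
        for pos in seq
    )


print(is_weighted_sequence(X, "abc"))  # True
```

Exact fractions are used here only so that the "sums to 1" check needs no floating-point tolerance.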
Background
Weighted sequences are also referred to in the literature as p-weighted sequences or Position Weighted Matrices (PWM) [Amir et al. 2010, Thompson et al. 1994]. The notion of a weighted sequence was introduced as a tool for motif discovery and local alignment, and is extensively used in computational molecular biology. Multiple algorithmic results related to combinatorics of weighted sequences, i.e., repetitions, regularities and pattern matching, have already been presented.
Definitions
Definition (Occurrence of a subsequence $s$ in a weighted sequence $X$). Let $|s| = d$ and $\pi = (i_1, i_2, \ldots, i_d)$ with $1 \le i_1 < i_2 < \ldots < i_d \le |X|$. Then
$$P_X(\pi, s) = \prod_{k=1}^{d} p_{i_k}^{(X)}(s_k).$$
Writing $\mathrm{Seq}_d^{|X|}$ for the set of all such increasing sequences of $d$ positions in $X$,
$$SUBS(X, \alpha) = \{\, s \in \Sigma^* : \exists\, \pi \in \mathrm{Seq}_{|s|}^{|X|},\ P_X(\pi, s) \ge \alpha \,\}.$$
In other words, $SUBS(X, \alpha)$ is the set of deterministic strings which match a subsequence of $X$ with probability at least $\alpha$.
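The two definitions can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and the list-of-dicts representation are assumptions, positions are 1-based as on the slides, and $\alpha > 0$ is assumed. Membership in $SUBS(X, \alpha)$ is decided by a small dynamic program that tracks the best probability of matching each prefix of $s$.

```python
from math import prod

# Hypothetical representation: one dict per position (letter -> probability).
X = [
    {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
    {"a": 1.0, "b": 0.0, "c": 0.0},
    {"a": 0.0, "b": 0.5, "c": 0.5},
    {"a": 0.5, "b": 0.25, "c": 0.25},
]


def occurrence_probability(X, pi, s):
    """P_X(pi, s): product of p_{i_k}(s_k); `pi` holds 1-based increasing positions."""
    assert len(pi) == len(s)
    assert all(i < j for i, j in zip(pi, pi[1:]))
    return prod(X[i - 1][ch] for i, ch in zip(pi, s))


def best_probability(X, s):
    """Maximum of P_X(pi, s) over all increasing position sequences pi (0.0 if none)."""
    # best[k] = best probability of matching s[:k] inside the part of X read so far
    best = [1.0] + [0.0] * len(s)
    for pos in X:
        # go through k downwards so that one position of X matches at most one letter
        for k in range(len(s), 0, -1):
            best[k] = max(best[k], best[k - 1] * pos[s[k - 1]])
    return best[len(s)]


def in_subs(X, s, alpha):
    """Is s in SUBS(X, alpha)? Assumes alpha > 0."""
    return best_probability(X, s) >= alpha


print(occurrence_probability(X, (2, 3), "ab"))  # 0.5
print(in_subs(X, "ab", 0.5))                    # True
```

The dynamic program takes $O(|X| \cdot |s|)$ time per string, so it only tests one candidate $s$; it does not search for the longest one.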
Problems
α-LCWS problem
Input: Two weighted sequences $X, Y \in WS(\Sigma)$ and a cut-off probability $\alpha$.
Output: The longest string $s \in \Sigma^*$ such that
$$\exists\, \pi \in \mathrm{Seq}_{|s|}^{|X|},\ \pi' \in \mathrm{Seq}_{|s|}^{|Y|} :\ P_X(\pi, s) \cdot P_Y(\pi', s) \ge \alpha.$$
Equivalently, $s$ is the longest string in $SUBS(X, \alpha_1) \cap SUBS(Y, \alpha_2)$ for some $\alpha_1 \cdot \alpha_2 \ge \alpha$.
(α1, α2)-LCWS2 problem
Input: Two weighted sequences $X, Y$ and two cut-off probabilities $\alpha_1, \alpha_2$.
Output: The longest string $s \in SUBS(X, \alpha_1) \cap SUBS(Y, \alpha_2)$.
Example: α-LCWS problem
$(s, \pi, \pi')$ is the solution to the α-LCWS problem for $\alpha = 0.23$.

  X     1     2     3     4     5
  a    0.9   0.2   1.0   0.3   0.9
  b    0.1   0.8   0.0   0.7   0.1

  Y     1     2     3     4     5
  a    0.9   0.5   0.1   0.2   0.8
  b    0.1   0.5   0.9   0.8   0.2

$s = abba$, $\pi = (1, 2, 4, 5)$, $\pi' = (1, 3, 4, 5)$
$P_X(\pi, s) = 0.9 \cdot 0.8 \cdot 0.7 \cdot 0.9 = 0.4536$
$P_Y(\pi', s) = 0.9 \cdot 0.9 \cdot 0.8 \cdot 0.8 = 0.5184$
$P_X(\pi, s) \cdot P_Y(\pi', s) = 0.23514624$
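The example can be cross-checked with an exponential-time reference search. The sketch below is not the $O(n^3)$ algorithm of Amir et al.; it is a brute-force check with hypothetical function names, using the values of $X$ and $Y$ as read from the example. It relies on the fact that $\pi$ and $\pi'$ are chosen independently, so the best product is the product of the two best probabilities.

```python
from itertools import product

# Values as read from the example (positions 1..5 of X and Y).
X = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 1.0, "b": 0.0},
     {"a": 0.3, "b": 0.7}, {"a": 0.9, "b": 0.1}]
Y = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9},
     {"a": 0.2, "b": 0.8}, {"a": 0.8, "b": 0.2}]


def best_probability(W, s):
    """Maximum of P_W(pi, s) over all increasing position sequences pi."""
    best = [1.0] + [0.0] * len(s)
    for pos in W:
        for k in range(len(s), 0, -1):
            best[k] = max(best[k], best[k - 1] * pos[s[k - 1]])
    return best[len(s)]


def brute_force_lcws(X, Y, alpha, alphabet="ab"):
    """Longest s with max_pi P_X(pi, s) * max_pi' P_Y(pi', s) >= alpha (exponential time)."""
    for d in range(min(len(X), len(Y)), 0, -1):
        for letters in product(alphabet, repeat=d):
            s = "".join(letters)
            if best_probability(X, s) * best_probability(Y, s) >= alpha:
                return s
    return ""


s = brute_force_lcws(X, Y, 0.23)
print(len(s))  # 4
```

No string of length 5 can qualify here, because it would have to use every position of both sequences, and the resulting product is far below 0.23.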
Example: (α1, α2)-LCWS2 problem
Solution for (α1, α2)-LCWS2 for $\alpha_1 = 0.7$, $\alpha_2 = 0.6$.

  X     1     2     3     4     5
  a    0.9   0.2   1.0   0.3   0.9
  b    0.1   0.8   0.0   0.7   0.1

  Y     1     2     3     4     5
  a    0.9   0.5   0.1   0.2   0.8
  b    0.1   0.5   0.9   0.8   0.2

$s = aba$, $\pi = (1, 2, 3)$, $\pi' = (1, 3, 5)$
$P_X(\pi, s) = 0.9 \cdot 0.8 \cdot 1.0 = 0.72$
$P_Y(\pi', s) = 0.9 \cdot 0.9 \cdot 0.8 = 0.648$
Results summary
Previous results for α-LCWS [Amir et al. 2010]: The α-LCWS problem can be solved in $O(n^3)$ time and $O(n^2)$ space. If we are only interested in the length of the output, the problem can be solved in $O(Ln^2)$ time, where $L$ is the length of the solution.

NP-hardness for the integer version of (α1, α2)-LCWS2
  Previous work:  unbounded alphabet
  Our results:    $|\Sigma| = 2$

Approximation results for (α1, α2)-LCWS2
  Previous work:  $1/|\Sigma|$
  Our results:    0.5 ($O(n^5)$ time, $O(n^2)$ space); PTAS ($O(n^5)$ space)
(α1, α2)-LCWS2 and α-LCWS2 problems
Definition (α-LCWS2 problem)
Input: Two weighted sequences $X, Y \in WS(\Sigma)$ and a cut-off probability $\alpha$.
Output: The longest string $s \in SUBS(X, \alpha) \cap SUBS(Y, \alpha)$.
The following lemma shows that the (α1, α2)-LCWS2 and α-LCWS2 problems are equivalent.
Lemma. The (α1, α2)-LCWS2 problem can be reduced in linear time to the α-LCWS2 problem (with $\alpha = \min(\alpha_1, \alpha_2)$).
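(α1, α2)-LCWS2 and α-LCWS2 problems
Proof. Rescale the probabilities and add a special symbol # that absorbs the remaining probability mass, so that the new probabilities again sum to 1. Let $\alpha_1 < \alpha_2$ and $\gamma = \log_{\alpha_2} \alpha_1$. Define
$$p_i^{(X')}(\sigma_j) = p_i^{(X)}(\sigma_j), \qquad p_i^{(X')}(\#) = 0,$$
$$p_i^{(Y')}(\sigma_j) = \left(p_i^{(Y)}(\sigma_j)\right)^{\gamma}, \qquad p_i^{(Y')}(\#) = 1 - \sum_{j=1}^{K} p_i^{(Y')}(\sigma_j).$$
Since $\alpha_2^{\gamma} = \alpha_1$, we have $P_Y(\pi', s) \ge \alpha_2$ iff $P_{Y'}(\pi', s) \ge \alpha_1$ for every $s \in \Sigma^*$, and # never occurs in a solution because its probability in $X'$ is 0. Hence $X', Y'$ with the single cut-off $\alpha = \alpha_1$ has the same solutions.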
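The rescaling in the proof can be sketched as follows. This is a minimal illustration under stated assumptions: cut-offs satisfy $0 < \alpha_1, \alpha_2 < 1$, sequences are lists of letter-to-probability dicts, and the function name is hypothetical. When $\alpha_1 > \alpha_2$ the roles of $X$ and $Y$ are swapped, which the proof covers by symmetry.

```python
import math

# Values as read from the (alpha1, alpha2)-LCWS2 example.
X = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 1.0, "b": 0.0},
     {"a": 0.3, "b": 0.7}, {"a": 0.9, "b": 0.1}]
Y = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9},
     {"a": 0.2, "b": 0.8}, {"a": 0.8, "b": 0.2}]


def reduce_to_single_cutoff(X, Y, alpha1, alpha2, pad="#"):
    """Return (X', Y', alpha) with alpha = min(alpha1, alpha2).

    Assumes 0 < alpha1, alpha2 < 1. The sequence with the larger cut-off is
    rescaled; the padding symbol `pad` takes up the leftover probability.
    """
    if alpha1 > alpha2:
        # symmetric case: rescale X instead of Y
        Y2, X2, alpha = reduce_to_single_cutoff(Y, X, alpha2, alpha1, pad)
        return X2, Y2, alpha
    # gamma = log base alpha2 of alpha1, so that alpha2 ** gamma == alpha1
    gamma = math.log(alpha1) / math.log(alpha2)
    X2 = [{**pos, pad: 0.0} for pos in X]
    Y2 = []
    for pos in Y:
        scaled = {ch: p ** gamma for ch, p in pos.items()}
        scaled[pad] = 1.0 - sum(scaled.values())
        Y2.append(scaled)
    return X2, Y2, alpha1


X2, Y2, alpha = reduce_to_single_cutoff(X, Y, 0.7, 0.6)
print(alpha)  # 0.6
```

Because $\gamma \ge 1$, raising a probability to the power $\gamma$ never increases it, so the padding symbol always receives a non-negative share (up to floating-point rounding).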
NP-hardness
Definition. Define an I-weighted sequence $X$ over the alphabet $\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_K\}$ as a sequence of sets of pairs of the form
$$x_i = \{(\sigma_j, w_i^{(X)}(\sigma_j)) : j = 1, 2, \ldots, K\}, \quad \text{where } w_i^{(X)}(\sigma_j) \in \mathbb{Z}_+.$$
Definition. For an I-weighted sequence $X$ and $s \in \Sigma^d$, define
$$W_X(\pi, s) = \sum_{k=1}^{d} w_{i_k}^{(X)}(s_k) \quad \text{for } \pi = (i_1, \ldots, i_d) \in \mathrm{Seq}_d^{|X|}.$$
For an I-weighted sequence $X$ and $\alpha \in \mathbb{Z}_+$, denote
$$SUBS(X, \alpha) = \{\, s \in \Sigma^* : \exists\, \pi \in \mathrm{Seq}_{|s|}^{|X|},\ W_X(\pi, s) \le \alpha \,\}.$$
NP-hardness
Definition (α-LCIWS2 problem)
Input: Two I-weighted sequences $X, Y$ and a cut-off value $\alpha \in \mathbb{Z}_+$.
Output: The longest string $s \in SUBS(X, \alpha) \cap SUBS(Y, \alpha)$.
Definition (Partition problem)
Input: A finite set $S$, $S \subseteq \mathbb{Z}_+$.
Binary output: Is there a subset $S' \subseteq S$ such that $\sum S' = \sum (S \setminus S')$?
NP-hardness
Theorem. The α-LCIWS2 problem over a binary alphabet is NP-hard.
Proof. For an instance $S = \{q_1, q_2, \ldots, q_n\}$ of the Partition problem, we construct I-weighted sequences $X = x_1 x_2 \ldots x_n$ and $Y = y_1 y_2 \ldots y_n$ over the alphabet $\Sigma = \{a, b\}$ with the following weights of letters from $\Sigma$:
$$w_i^{(X)}(a) = q_i + c, \quad w_i^{(X)}(b) = c, \quad w_i^{(Y)}(a) = c, \quad w_i^{(Y)}(b) = q_i + c.$$
Here $c > 0$ is an arbitrary positive integer. Finally, let
$$\alpha = \tfrac{1}{2} \sum S + nc.$$
The Partition problem for an instance $S$ has a positive answer iff the length of the solution to α-LCIWS2 for $X$ and $Y$ is $n$.
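The construction can be tried out on small instances. The sketch below builds $X$, $Y$ and $\alpha$ from a Partition instance and checks by brute force whether a string of length $n$ exists; a string of length $n$ must use all positions, so $\pi = \pi' = (1, \ldots, n)$. Function names are hypothetical, and $\alpha$ is kept as an exact fraction so that an odd $\sum S$ (where Partition trivially fails) needs no special case; that choice is an assumption of this sketch.

```python
from fractions import Fraction
from itertools import product


def partition_to_lciws2(S, c=1):
    """Build I-weighted sequences X, Y and the cut-off alpha from a Partition instance S."""
    X = [{"a": q + c, "b": c} for q in S]
    Y = [{"a": c, "b": q + c} for q in S]
    alpha = Fraction(sum(S), 2) + len(S) * c
    return X, Y, alpha


def has_full_length_solution(X, Y, alpha):
    """Brute force: is there s of length n with W_X(s) <= alpha and W_Y(s) <= alpha?"""
    n = len(X)
    for s in product("ab", repeat=n):
        w_x = sum(X[i][ch] for i, ch in enumerate(s))
        w_y = sum(Y[i][ch] for i, ch in enumerate(s))
        if w_x <= alpha and w_y <= alpha:
            return True
    return False


X, Y, alpha = partition_to_lciws2([1, 2, 3])
print(has_full_length_solution(X, Y, alpha))  # True: {3} and {1, 2} balance
```

In a qualifying string, the positions labelled `a` form one side of the partition and the positions labelled `b` form the other.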
Approximation results
Theorem (Amir et al. 2010). The α-LCWS problem can be solved in $O(n^3)$ time and $O(n^2)$ space. If we are only interested in the length of the output, the problem can be solved in $O(Ln^2)$ time, where $L$ is the length of the solution.
Theorem. We can compute a solution to the α-LCWS2 problem for $X, Y \in WS(\Sigma)$ of length at least $\lfloor OPT(X, Y, \alpha)/2 \rfloor$ in $O(n^3)$ time and $O(n^2)$ space.
Proof idea. Solve $\alpha^2$-LCWS in $O(n^3)$ time, and then extract a solution for α-LCWS2 of size $\lfloor OPT(X, Y, \alpha)/2 \rfloor$.
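Approximation results
Proof sketch. Let $(s, \pi, \pi')$ be the solution of $\alpha^2$-LCWS, so that
$$P_X(\pi, s) \cdot P_Y(\pi', s) \ge \alpha^2. \qquad (1)$$
We can split this solution into two parts. Let $d = |s|$, $g = \lfloor d/2 \rfloor$, $\pi = (i_1, \ldots, i_d)$ and $\pi' = (i'_1, \ldots, i'_d)$. This gives the partial probabilities
$$A = \prod_{j=1}^{g} p_{i_j}^{(X)}(s_j), \qquad B = \prod_{j=1}^{g} p_{i'_j}^{(Y)}(s_j),$$
$$C = \prod_{j=g+1}^{d} p_{i_j}^{(X)}(s_j), \qquad D = \prod_{j=g+1}^{d} p_{i'_j}^{(Y)}(s_j).$$
Observe that at most one of $A, B, C, D$ can be smaller than $\alpha$: all four are at most 1 and $A \cdot B \cdot C \cdot D \ge \alpha^2$ by (1), so two factors below $\alpha$ would force the product below $\alpha^2$. Hence either $(A, B)$ or $(C, D)$ forms a solution with both weights $\ge \alpha$.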
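The splitting step can be sketched in Python. This is an illustration of the argument only, with a hypothetical function name and the list-of-dicts representation assumed; it takes an $\alpha^2$-LCWS solution as given and does not compute one. The longer half is tried first, which still meets the $\lfloor d/2 \rfloor$ bound.

```python
from math import prod

# Values as read from the alpha-LCWS example.
X = [{"a": 0.9, "b": 0.1}, {"a": 0.2, "b": 0.8}, {"a": 1.0, "b": 0.0},
     {"a": 0.3, "b": 0.7}, {"a": 0.9, "b": 0.1}]
Y = [{"a": 0.9, "b": 0.1}, {"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9},
     {"a": 0.2, "b": 0.8}, {"a": 0.8, "b": 0.2}]


def split_solution(X, Y, s, pi, pi_prime, alpha):
    """Given P_X(pi, s) * P_Y(pi', s) >= alpha**2, return one half of the solution
    whose X-probability and Y-probability are both >= alpha."""
    d = len(s)
    g = d // 2
    # second half (C, D) first, then first half (A, B)
    for lo, hi in [(g, d), (0, g)]:
        p_x = prod(X[i - 1][ch] for i, ch in zip(pi[lo:hi], s[lo:hi]))
        p_y = prod(Y[i - 1][ch] for i, ch in zip(pi_prime[lo:hi], s[lo:hi]))
        if p_x >= alpha and p_y >= alpha:
            return s[lo:hi], pi[lo:hi], pi_prime[lo:hi]
    raise ValueError("input does not satisfy P_X * P_Y >= alpha**2")


# 0.48**2 = 0.2304 <= 0.23514624, so (abba, pi, pi') is feasible for alpha**2
half, pi_half, pi_prime_half = split_solution(
    X, Y, "abba", (1, 2, 4, 5), (1, 3, 4, 5), 0.48
)
print(half, pi_half, pi_prime_half)  # ba (4, 5) (4, 5)
```

In this instance all four partial probabilities (0.72, 0.81, 0.63, 0.64) exceed 0.48, so either half would do.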
Approximation results
Theorem. There exists a $(1/2)$-approximation algorithm for the α-LCWS2 problem which runs in $O(n^5)$ time and $O(n^2)$ space.
Proof. This is essentially a consequence of the previous result. To obtain the exact approximation ratio, we have to deal with the odd $n$ case; this increases the time complexity by a factor of $O(n^2)$, from $O(n^3)$ to $O(n^5)$.