The Closest Substring problem with small distances

Dániel Marx
dmarx@informatik.hu-berlin.de
June 10, 2005

The Closest Substring problem with small distances – p.1/28
The Closest String problem

CLOSEST STRING
Input: Strings $s_1, \dots, s_k$ of length $L$
Solution: A string $s$ of length $L$ (center string)
Minimize: $\max_{i=1}^{k} d(s, s_i)$

$d(w_1, w_2)$: the number of positions where $w_1$ and $w_2$ differ (Hamming distance).

Applications: computational biology (e.g., finding common ancestors).

The problem is NP-hard even with a binary alphabet [Frances and Litman, 1997].
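To make the objective concrete, here is a minimal Python sketch of the Hamming distance and of the quantity being minimized. The function names (`hamming`, `radius`, `closest_string_bruteforce`) are illustrative and not from the talk, and the exhaustive search is only a baseline for tiny inputs, since it tries all $|\Sigma|^L$ candidate centers.

```python
from itertools import product


def hamming(w1, w2):
    """Number of positions where two equal-length strings differ."""
    if len(w1) != len(w2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(w1, w2))


def radius(center, strings):
    """Objective value max_i d(center, s_i) for a candidate center."""
    return max(hamming(center, s) for s in strings)


def closest_string_bruteforce(strings, alphabet="01"):
    """Try every string of length L over the alphabet (exponential in L)."""
    length = len(strings[0])
    candidates = ("".join(chars) for chars in product(alphabet, repeat=length))
    return min(candidates, key=lambda c: radius(c, strings))


if __name__ == "__main__":
    inputs = ["0000", "0011", "1100"]
    best = closest_string_bruteforce(inputs)
    print(best, radius(best, inputs))
```

No center can do better than distance 2 on this input, because "0011" and "1100" differ in all four positions.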
The Closest Substring problem

CLOSEST SUBSTRING
Input: Strings $s_1, \dots, s_k$, an integer $L$
Solution:
- a string $s$ of length $L$ (center string),
- a length-$L$ substring $s'_i$ of $s_i$ for every $i$
Minimize: $\max_{i=1}^{k} d(s, s'_i)$

Remark: For a given $s$, it is easy to find the best $s'_i$ for every $i$.

Applications: finding common patterns, drug design.

The problem is NP-hard even with a binary alphabet (CLOSEST STRING is the special case $|s_i| = L$).

CLOSEST SUBSTRING admits a PTAS [Li, Ma, & Wang, 2002]: for every $\epsilon > 0$ there is an $n^{O(1/\epsilon^4)}$ algorithm that produces a $(1 + \epsilon)$-approximation.
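The remark that the best $s'_i$ is easy to find for a fixed center can be illustrated with a sliding window over each input string. This is a sketch under that reading of the remark; the helper names `best_substring` and `substring_radius` are assumptions introduced here.

```python
def hamming(w1, w2):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(w1, w2))


def best_substring(center, text):
    """Return (distance, start) of the length-L window of text closest to center."""
    length = len(center)
    if len(text) < length:
        raise ValueError("text is shorter than the center string")
    return min(
        (hamming(center, text[j:j + length]), j)
        for j in range(len(text) - length + 1)
    )


def substring_radius(center, strings):
    """Objective value max_i d(center, s'_i) with the best window in every s_i."""
    return max(best_substring(center, s)[0] for s in strings)


if __name__ == "__main__":
    print(best_substring("101", "0010100"))             # closest window and its start
    print(substring_radius("101", ["0010100", "111"]))  # worst case over all inputs
```

Each call scans all windows of one string, so evaluating a fixed center takes time polynomial in the total input length.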
Parameterized Complexity

Goal: restrict the exponential growth of the running time to one parameter of the input.

Definition: A problem is fixed-parameter tractable (FPT) with parameter $k$ if there is an algorithm with running time $f(k) \cdot n^c$, where $c$ is a fixed constant not depending on $k$.

Definition: A problem is fixed-parameter tractable (FPT) with parameters $k_1$ and $k_2$ if there is an algorithm with running time $f(k_1, k_2) \cdot n^c$, where $c$ is a fixed constant not depending on $k_1$ and $k_2$.
Parameterized intractability

We expect that MAXIMUM INDEPENDENT SET is not fixed-parameter tractable; no $n^{o(k)}$ algorithm is known.

W[1]-complete ≈ "as hard as MAXIMUM INDEPENDENT SET"

Parameterized reductions: $L_1$ is reducible to $L_2$ if there is a function $f$ that transforms $(x, k)$ to $(x', k')$ such that
- $(x, k) \in L_1$ if and only if $(x', k') \in L_2$,
- $f$ can be computed in $g(k) \cdot |x|^c$ time for some function $g$,
- $k'$ depends only on $k$.

If $L_1$ is reducible to $L_2$, and $L_2$ is in FPT, then $L_1$ is in FPT as well.

Most NP-completeness proofs do not yield parameterized reductions.
Parameterized Closest Substring

CLOSEST SUBSTRING
Input: Strings $s_1, \dots, s_k$ over $\Sigma$, integers $L$ and $d$
Possible parameters: $k$, $L$, $d$, $|\Sigma|$
Find:
- a string $s$ of length $L$ (center string),
- a length-$L$ substring $s'_i$ of $s_i$ for every $i$
such that $d(s, s'_i) \le d$ for every $i$

Possible parameters:
- $k$: might be small
- $d$: might be small
- $L$: usually large
- $|\Sigma|$: usually a small constant
Closest Substring: Results

parameter   $|\Sigma|$ is constant   $|\Sigma|$ is parameter   $|\Sigma|$ is unbounded
d           ?                        ?                         W[1]-hard
k           W[1]-hard                W[1]-hard                 W[1]-hard
d,k         ?                        ?                         W[1]-hard
L           FPT                      FPT                       W[1]-hard
d,k,L       FPT                      FPT                       W[1]-hard

(Hardness results by [Fellows, Gramm, Niedermeier 2002].)
Closest Substring: Results

parameter   $|\Sigma|$ is constant   $|\Sigma|$ is parameter   $|\Sigma|$ is unbounded
d           W[1]-hard                W[1]-hard                 W[1]-hard
k           W[1]-hard                W[1]-hard                 W[1]-hard
d,k         W[1]-hard                W[1]-hard                 W[1]-hard
L           FPT                      FPT                       W[1]-hard
d,k,L       FPT                      FPT                       W[1]-hard

(Hardness results by [Fellows, Gramm, Niedermeier 2002].)

Theorem: [D.M.] CLOSEST SUBSTRING is W[1]-hard with parameters $k$ and $d$, even if $|\Sigma| = 2$.

(In the rest of the talk, $\Sigma$ is always $\{0, 1\}$.)
Hardness of Closest Substring

Theorem: [D.M.] CLOSEST SUBSTRING is W[1]-hard with parameters $k$ and $d$.

Proof by parameterized reduction from MAXIMUM INDEPENDENT SET:

MAXIMUM INDEPENDENT SET instance $(G, t)$ ⇒ CLOSEST SUBSTRING instance with $k = 2^{2^{O(t)}}$ and $d = 2^{O(t)}$

Corollary: No $f(k, d) \cdot n^c$ algorithm for CLOSEST SUBSTRING unless FPT=W[1].

Corollary: No $f(k, d) \cdot n^{o(\log d)}$ or $f(k, d) \cdot n^{o(\log\log k)}$ algorithm for CLOSEST SUBSTRING unless MAXIMUM INDEPENDENT SET has an $f(t) \cdot n^{o(t)}$ algorithm.
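The second corollary follows from the parameter bounds of the reduction by a short calculation. The sketch below is not taken from the slides: it assumes the constructed instance has size $N \le g(t) \cdot n^{c}$, where $n$ is the size of $G$, and the symbols $N$, $g$, $f'$ and $f''$ are introduced here only for illustration.

```latex
\begin{align*}
N &\le g(t) \cdot n^{c}
  && \text{(size of the constructed instance)} \\
f(k,d) \cdot N^{o(\log d)}
  &= f\bigl(2^{2^{O(t)}}, 2^{O(t)}\bigr) \cdot \bigl(g(t)\, n^{c}\bigr)^{o(t)}
  && \text{since } \log d = O(t) \\
  &= f'(t) \cdot n^{o(t)} \\
f(k,d) \cdot N^{o(\log\log k)}
  &= f''(t) \cdot n^{o(t)}
  && \text{since } \log\log k = O(t)
\end{align*}
```

In both cases, composing the reduction with the hypothetical algorithm would give an $f(t) \cdot n^{o(t)}$ algorithm for MAXIMUM INDEPENDENT SET.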
Hardness of Closest Substring

Corollary: No $f(k, d) \cdot n^{o(\log d)}$ or $f(k, d) \cdot n^{o(\log\log k)}$ algorithm for CLOSEST SUBSTRING unless MAXIMUM INDEPENDENT SET has an $f(t) \cdot n^{o(t)}$ algorithm.

MAXIMUM INDEPENDENT SET has an $f(t) \cdot n^{o(t)}$ algorithm
⇓
$n$-variable 3-SAT can be solved in $2^{o(n)}$ time
⇕
FPT=M[1]

The lower bound on the exponent of $n$ is best possible:

Theorem: [D.M.] CLOSEST SUBSTRING can be solved in $f_1(d, k) \cdot n^{O(\log d)}$ time.

Theorem: [D.M.] CLOSEST SUBSTRING can be solved in $f_2(d, k) \cdot n^{O(\log\log k)}$ time.
Relation to approximability

PTAS: algorithm that produces a $(1 + \epsilon)$-approximation in time $n^{f(\epsilon)}$.

EPTAS: (efficient PTAS) a PTAS with running time $f(\epsilon) \cdot n^{O(1)}$.

Observation: if $\epsilon = \frac{1}{d+1}$, then a $(1 + \epsilon)$-approximation algorithm can correctly decide whether the optimum is $d$ or $d + 1$
⇒ if an optimization problem has an EPTAS, then it is FPT.

Corollary: CLOSEST SUBSTRING has no EPTAS, unless FPT=W[1].

Corollary: CLOSEST SUBSTRING has no $f(\epsilon) \cdot n^{o(\log 1/\epsilon)}$ time PTAS, unless FPT=M[1].
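The observation rests on the following arithmetic, written out here as a sketch for a minimization problem with integer objective values. $\mathrm{OPT}$ and $\mathrm{ALG}$ are names introduced for the optimum and for the value returned by the $(1 + \epsilon)$-approximation.

```latex
\begin{align*}
\text{If } \mathrm{OPT} \le d:\quad
  & \mathrm{ALG} \le \Bigl(1 + \tfrac{1}{d+1}\Bigr)\,\mathrm{OPT}
    \le d + \tfrac{d}{d+1} < d + 1,
    \text{ so } \mathrm{ALG} \le d \text{ since } \mathrm{ALG} \text{ is an integer.} \\
\text{If } \mathrm{OPT} \ge d + 1:\quad
  & \mathrm{ALG} \ge \mathrm{OPT} \ge d + 1.
\end{align*}
```

Comparing $\mathrm{ALG}$ with $d$ therefore decides whether $\mathrm{OPT} \le d$, and with an EPTAS this takes $f\bigl(\tfrac{1}{d+1}\bigr) \cdot n^{O(1)}$ time, which is FPT in $d$.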
What's next?

- $f_1(d, k) \cdot n^{O(\log d)}$ time algorithm
- Some results on hypergraphs
- $f_2(d, k) \cdot n^{O(\log\log k)}$ time algorithm
- Sketch of the completeness proof
- Conclusions
- Lunch
The first algorithm

Definition: A solution is a minimal solution if $\sum_{i=1}^{k} d(s, s'_i)$ is as small as possible (and $d(s, s'_i) \le d$ for every $i$).

Definition: A set $G$ of length-$L$ strings generates a length-$L$ string $s$ if whenever the strings in $G$ agree at the $i$-th position, then $s$ has the same character at this position.

Example: $G_1$ generates $s$ but $G_2$ does not.

         $G_1$               $G_2$
      1 1 0 1 0 1         1 1 0 1 1 1
      0 1 0 1 1 1         0 1 0 1 1 1
      1 1 0 0 1 1         1 1 0 0 1 1

$s$   1 1 0 1 0 1   $s$   1 1 0 1 0 1
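The definition of "generates" can be checked mechanically. The sketch below uses the two sets from the example; the function names are illustrative, and positions are 0-indexed as usual in Python.

```python
def agreeing_positions(group):
    """Positions (0-indexed) where all strings in the group have the same character."""
    return [i for i, column in enumerate(zip(*group)) if len(set(column)) == 1]


def generates(group, s):
    """True if s copies the group's character at every position where the group agrees."""
    return all(s[i] == group[0][i] for i in agreeing_positions(group))


if __name__ == "__main__":
    s = "110101"
    g1 = ["110101", "010111", "110011"]
    g2 = ["110111", "010111", "110011"]
    print(generates(g1, s))  # G_1 agrees only where s has the same character
    print(generates(g2, s))  # G_2 agrees on a 1 in the fifth position, where s has 0
```

In $G_2$ all three strings have 1 in the fifth position while $s$ has 0 there, which is exactly why $G_2$ fails to generate $s$.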
First algorithm

Let $\mathcal{S}$ be the set of all length-$L$ substrings of $s_1, \dots, s_k$. Clearly, $|\mathcal{S}| \le n$.

Lemma: If $s$ is the center string of a minimal solution, then $\mathcal{S}$ has a subset $G$ of size $O(\log d)$ that generates $s$, and the strings in $G$ agree in all but at most $O(d \log d)$ positions.

Algorithm:
- Construct the set $\mathcal{S}$.
- Consider every subset $G \subseteq \mathcal{S}$ of size $O(\log d)$.
- If there are at most $O(d \log d)$ positions where the strings in $G$ disagree, then try every center string generated by $G$.

Running time: $|\Sigma|^{O(d \log d)} \cdot n^{O(\log d)}$.
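The steps above can be sketched in Python as follows. The slides give the subset size and the number of disagreeing positions only up to $O(\cdot)$, so the default bounds `max_group_size` and `max_disagreements` below are placeholder assumptions, not the constants from the lemma. With bounds that are too small, a `None` result does not prove that no solution exists.

```python
from itertools import combinations, product


def hamming(w1, w2):
    """Number of positions where two equal-length strings differ."""
    return sum(a != b for a, b in zip(w1, w2))


def all_windows(strings, length):
    """The set S of all length-L substrings of the input strings."""
    return sorted({s[j:j + length] for s in strings
                   for j in range(len(s) - length + 1)})


def is_center(center, strings, d):
    """True if every input string has a window within distance d of center."""
    length = len(center)
    return all(
        any(hamming(center, s[j:j + length]) <= d
            for j in range(len(s) - length + 1))
        for s in strings
    )


def first_algorithm(strings, length, d, alphabet="01",
                    max_group_size=None, max_disagreements=None):
    """Enumerate small subsets G of S and try every center generated by G."""
    # Placeholder bounds standing in for O(log d) and O(d log d) (assumptions).
    if max_group_size is None:
        max_group_size = max(1, d.bit_length() + 1)
    if max_disagreements is None:
        max_disagreements = d * max_group_size
    windows = all_windows(strings, length)
    for size in range(1, max_group_size + 1):
        for group in combinations(windows, size):
            # Positions where G disagrees; the center is unconstrained there.
            free = [i for i, col in enumerate(zip(*group)) if len(set(col)) > 1]
            if len(free) > max_disagreements:
                continue
            center = list(group[0])  # fixed wherever G agrees
            for fill in product(alphabet, repeat=len(free)):
                for i, ch in zip(free, fill):
                    center[i] = ch
                candidate = "".join(center)
                if is_center(candidate, strings, d):
                    return candidate
    return None


if __name__ == "__main__":
    print(first_algorithm(["0010100", "1110111", "0001010"], 3, 0))
```

The two nested enumerations mirror the running time on the slide: the choice of $G$ contributes the $n^{O(\log d)}$ factor, and filling the disagreeing positions contributes $|\Sigma|^{O(d \log d)}$.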
Proof of the lemma

Lemma: If $s$ is the center string of a minimal solution, then $\mathcal{S}$ has a subset $G$ of size $O(\log d)$ that generates $s$, and the strings in $G$ agree in all but at most $O(d \log d)$ positions.

Proof: Let $(s, s'_1, \dots, s'_k)$ be a minimal solution. We show that $\{s'_1, \dots, s'_k\}$ has a subset of size $O(\log d)$ that generates $s$.

The bad positions of a set of strings are the positions where they agree, but $s$ is different.

Clearly, $\{s'_1\}$ has at most $d$ bad positions.

We show that if a set of strings has $p$ bad positions, then we can decrease the number of bad positions to at most $p/2$ by adding a string $s'_i$
⇒ no bad position remains after adding $O(\log d)$ strings.
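The final implication is a halving argument. As a sketch, write $p_j$ (a symbol introduced here) for the number of bad positions after adding $j$ strings to $\{s'_1\}$, assume $d \ge 1$, and take the halving step as given:

```latex
\begin{align*}
p_0 &\le d, \qquad p_j \le \frac{p_{j-1}}{2}
  \;\Longrightarrow\; p_j \le \frac{d}{2^{j}}, \\
p_j &= 0 \quad \text{as soon as } 2^{j} > d,
  \text{ i.e. for } j = \lfloor \log_2 d \rfloor + 1
  \text{ (since } p_j \text{ is a nonnegative integer).}
\end{align*}
```

The resulting set has at most $\lfloor \log_2 d \rfloor + 2 = O(\log d)$ strings and no bad positions, which means it generates $s$.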