Streaming for Aibohphobes: Longest Near-Palindrome under Hamming Distance Elena Grigorescu, Purdue University Erfan Sadeqi Azer, Indiana University Samson Zhou, Purdue University
Structure of Talk ❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds
Finding Structure in Noisy Data FSTTCSIITKANPURPATTERNINDIAP ALPPATTERNSZXAITIIKIICU JBFQWA FSTTCSPATTERNIITKANPURINDIAO STREAMINGALGORITHMPATTERNU PERIODPERIODPERIODPERIODPER FSTTCSTHEORYCSASBRICBCAUON LONGPALINDROMEEMORDNILAPGN OLFSTTCSIITKANPURINDIAGENXAS
Palindrome ❖ A string that reads the same forwards and backwards ❖ $S = S^R$, where $S^R$ is the reverse of $S$ ❖ RACECAR ❖ AIBOHPHOBIA
$d$-Near-Palindrome ❖ A string that “almost” reads the same forwards and backwards ❖ Given a metric $\mathrm{dist}$, a $d$-near-palindrome $S$ has $\mathrm{dist}(S, S^R) \le d$. ❖ RACECAR ❖ FACECAR
Hamming Distance ❖ Given strings $X, Y$ of equal length, the Hamming distance between $X$ and $Y$ is the number of positions $i$ at which $X_i \ne Y_i$. ❖ $S$ = FACECAR ❖ $S^R$ = RACECAF ❖ $\mathrm{HAM}(S, S^R) = 2$
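The definition above can be sketched in a few lines of Python. The function names `hamming` and `is_near_palindrome` are illustrative and do not come from the talk; the sketch works offline on a whole string, not in the streaming model.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions i at which x[i] != y[i] (equal lengths required)."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def is_near_palindrome(s: str, d: int) -> bool:
    """True when HAM(s, s^R) <= d, i.e. s is a d-near-palindrome."""
    return hamming(s, s[::-1]) <= d


if __name__ == "__main__":
    print(hamming("FACECAR", "RACECAF"))      # the slide's example
    print(is_near_palindrome("FACECAR", 2))
```

Note that a single "wrong" symbol such as the F in FACECAR counts twice, once at each end of the mismatched pair.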
Streaming Model ❖ String of length $n$ arrives one symbol at a time ❖ Use $o(n)$ space, ideally $O(\mathrm{polylog}\, n)$ abaacabaccbabbbcbabbccababbccb
Longest $d$-Near-Palindrome Problem ❖ Given a string $S$ of length $n$, which arrives in a data stream, identify the longest $d$-near-palindrome in space $o(n)$. ❖ Given a string $S$ of length $n$, which arrives in a data stream, find a “long” $d$-near-palindrome in space $o(n)$.
Applications TGCTTAAGCGCTTGCAAGCGCTTAAGCA CAAGCGCTTAAGCA ACGAATTCGCGAAC
Related Work (Palindromes in Data Streams) ❖ $O(\log n)$ space to provide a $(1+\varepsilon)$ multiplicative approximation to the length of the longest palindrome (Berenbrink, Ergün, Mallmann-Trenn, Sadeqi Azer ‘14) ❖ $O(\sqrt{n})$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest palindrome (BEMS14) ❖ $O(\sqrt{n})$ space to find the longest palindrome in two passes (BEMS14) ❖ $\Omega\left(\frac{\log n}{\varepsilon \log(1+\varepsilon)}\right)$ space for $(1+\varepsilon)$ multiplicative approximation (Gawrychowski, Merkurev, Shur, Uznanski ’16) ❖ $\Omega\left(\frac{n}{E}\right)$ space for $E$ additive approximation (GMSU16)
Our Results ❖ $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ space to provide a $(1+\varepsilon)$ multiplicative approximation to the length of the longest $d$-near-palindrome ❖ $O(d \sqrt{n} \log^6 n)$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest $d$-near-palindrome ❖ $O(d^2 \sqrt{n} \log^6 n)$ space to find the longest $d$-near-palindrome in two passes ❖ $\Omega(d \log n)$ space LB for $(1+\varepsilon)$ multiplicative approximation ❖ $\Omega\left(\frac{dn}{E}\right)$ space LB for $E$ additive approximation
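Comparison (Longest Palindrome vs. Longest $d$-Near-Palindrome, Here) ❖ $(1+\varepsilon)$ multiplicative: $O(\log^2 n)$ (BEMS14) vs. $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ ❖ $\sqrt{n}$ additive: $O(\sqrt{n} \log n)$ (BEMS14) vs. $O(d \sqrt{n} \log^6 n)$ ❖ two pass exact: $O(\sqrt{n} \log n)$ (BEMS14) vs. $O(d^2 \sqrt{n} \log^6 n)$ ❖ $(1+\varepsilon)$ multiplicative LB: $\Omega\left(\frac{\log n}{\log(1+\varepsilon)}\right)$ (GMSU16) vs. $\Omega(d \log n)$ ❖ $E$ additive LB: $\Omega\left(\frac{n}{E}\right)$ (GMSU16) vs. $\Omega\left(\frac{dn}{E}\right)$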
Warm-Up ❖ Suppose we see string $S$, followed by string $T$. How can we determine if $S = T$, with high probability?
Karp-Rabin Fingerprints ❖ Given base $B$ and a prime $P$, define $\phi(S) = \sum_{i=1}^{n} B^i S[i] \bmod P$ ❖ If $S = T$, then $\phi(S) = \phi(T)$ ❖ If $S \ne T$, then $\phi(S) \ne \phi(T)$ w.h.p. (Schwartz-Zippel)
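The fingerprint above can be maintained one symbol at a time. The following is a minimal sketch, assuming symbols are mapped to integers with `ord`, that the prime is the Mersenne prime 2^61 - 1, and that the base is drawn at random; none of these choices come from the talk, and the class name `Fingerprint` is hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # assumed prime P; any sufficiently large prime works


class Fingerprint:
    """Streaming phi(S) = sum_{i>=1} B^i * S[i] mod P, in O(1) words."""

    def __init__(self, base: int, prime: int = PRIME):
        self.base = base
        self.prime = prime
        self.power = 1   # holds B^i after i symbols
        self.value = 0   # holds phi of the symbols seen so far

    def update(self, symbol: int) -> None:
        self.power = self.power * self.base % self.prime
        self.value = (self.value + self.power * symbol) % self.prime


def fingerprint(s: str, base: int, prime: int = PRIME) -> int:
    fp = Fingerprint(base, prime)
    for ch in s:
        fp.update(ord(ch))
    return fp.value


if __name__ == "__main__":
    base = random.randrange(2, PRIME - 1)  # random base B
    # Equal strings always agree; unequal strings disagree w.h.p. over B.
    print(fingerprint("abaacaba", base) == fingerprint("abaacaba", base))
    print(fingerprint("abaacaba", base) == fingerprint("abaacabb", base))
```

To compare $S$ followed by $T$ in a stream, keep one `Fingerprint` per string with the same base and compare the two `value` fields at the end.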
Properties of Karp-Rabin Fingerprints ❖ $\phi(S[1:y]) = \phi(S[1:x]) + B^{x}\, \phi(S[x+1:y])$ (concatenation) ❖ Define $\phi^R(S) = \sum_{i=1}^{n} B^{-i} S[i] \bmod P$ (reversal) ❖ $\phi\big((S[1:x])^R\big) = B^{x+1}\, \phi^R(S[1:x])$ ❖ $\phi^R(S[1:y]) = \phi^R(S[1:x]) + B^{-x}\, \phi^R(S[x+1:y])$ ❖ Can be computed on the fly
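As an illustration of the reversal property, the sketch below maintains the forward fingerprint and the reverse fingerprint (powers of the inverse base) together, and reports every prefix length at which the prefix looks like a palindrome. The prime, the fixed base, and the function name `palindromic_prefixes` are assumptions made for the example; a real instance would draw the base at random. The modular inverse uses Python's three-argument `pow` with a negative exponent (Python 3.8 or later).

```python
PRIME = (1 << 61) - 1              # assumed prime P
BASE = 911382323                   # assumed base B (fixed here for repeatability)
BASE_INV = pow(BASE, -1, PRIME)    # B^{-1} mod P


def palindromic_prefixes(stream):
    """Yield-free version: return all x such that the first x symbols pass the
    fingerprint test phi(prefix) == B^(x+1) * phi^R(prefix) mod P."""
    fwd = rev = 0
    power = inv_power = 1          # B^x and B^{-x}
    hits = []
    for x, ch in enumerate(stream, start=1):
        c = ord(ch)
        power = power * BASE % PRIME
        inv_power = inv_power * BASE_INV % PRIME
        fwd = (fwd + power * c) % PRIME        # phi(S[1:x])
        rev = (rev + inv_power * c) % PRIME    # phi^R(S[1:x])
        # Fingerprint of the reversed prefix, by the reversal property.
        reversed_fp = power * BASE % PRIME * rev % PRIME
        if fwd == reversed_fp:
            hits.append(x)
    return hits


if __name__ == "__main__":
    print(palindromic_prefixes("RACECAR"))
```

A true palindromic prefix always passes the test; a non-palindromic prefix passes only on a fingerprint collision, which is unlikely for a random base.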
Identifying Palindromes ❖ 111101011100001010010101001111101011100001010010101001
Identifying Near-Palindromes? ❖ 111101011100001010010101001111101011100001010010101001
Identifying Near-Palindromes? (CFP+16)
Karp-Rabin Fingerprints for Subpatterns ❖ $S_{a,b} = S[a]\, S[a+b]\, S[a+2b]\, S[a+3b] \ldots$ ❖ $\phi_{a,b}(S) = \phi(S_{a,b}) = B \cdot S[a] + B^2 \cdot S[a+b] + B^3 \cdot S[a+2b] + \ldots \bmod P$
Identifying Near-Palindromes? ❖ Let $\Delta = \#\{\, a \mid \phi_{a,b}(S) \ne B^k \phi^R_{a,b}(S) \,\}$ ❖ Then $\Delta \le \mathrm{HAM}(S, S^R)$
Identifying Near-Palindromes? ❖ Sample $\log n$ primes $p_1, p_2, \ldots$ from $[16 d \log^2 n,\ 544 d \log^2 n]$. ❖ Let $\Delta = \max_i \#\{\, a \mid \phi_{a,p_i}(S) \ne B^k \phi^R_{a,p_i}(S) \,\}$ ❖ $\Delta \le \mathrm{HAM}(S, S^R)$ ❖ If $\mathrm{HAM}(S, S^R) > 2d$, then $\Delta > \left(1 + \frac{1}{16}\right) d$ w.h.p. (CFP+16) ❖ What about $\mathrm{HAM}(S, S^R) \le 2d$?
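The quantity $\Delta$ can be illustrated offline. The sketch below is not the streaming data structure of CFP+16: it stores the whole string, compares the fingerprint of each subpattern of $S$ with the fingerprint of the matching subpattern of $S^R$ directly (instead of using $\phi^R$ and the factor $B^k$), and assumes base-2 logarithms, 0-based residues, a fixed base, and trial-division primality testing. All function names are hypothetical.

```python
import math
import random

PRIME = (1 << 61) - 1   # assumed fingerprint modulus P
BASE = 911382323        # assumed base B


def fp(symbols):
    """phi of an integer sequence: sum of B^i * symbols[i-1] mod P."""
    value, power = 0, 1
    for c in symbols:
        power = power * BASE % PRIME
        value = (value + power * c) % PRIME
    return value


def is_prime(m):
    return m >= 2 and all(m % q for q in range(2, math.isqrt(m) + 1))


def sample_primes(n, d, rng):
    """About log n primes from [16 d log^2 n, 544 d log^2 n] (log base 2 assumed)."""
    lg2 = math.log2(n) ** 2
    lo, hi = math.ceil(16 * d * lg2), math.floor(544 * d * lg2)
    primes = [m for m in range(lo, hi + 1) if is_prime(m)]
    return [rng.choice(primes) for _ in range(max(1, math.ceil(math.log2(n))))]


def delta(s, primes):
    """Max over primes p of the number of residues a whose subpatterns
    S_{a,p} and (S^R)_{a,p} have different fingerprints."""
    codes = [ord(ch) for ch in s]
    rev = codes[::-1]
    best = 0
    for p in primes:
        bad = sum(1 for a in range(min(p, len(codes)))
                  if fp(codes[a::p]) != fp(rev[a::p]))
        best = max(best, bad)
    return best


if __name__ == "__main__":
    primes = sample_primes(7, 1, random.Random(0))
    print(delta("FACECAR", primes), delta("RACECAR", primes))
```

Every mismatched position falls into exactly one residue class, so the count can never exceed $\mathrm{HAM}(S, S^R)$, which is the inequality stated on the slide.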
Karp-Rabin Fingerprints for Sub-Subpatterns
Second-Level Karp-Rabin Fingerprints ❖ Call a mismatch isolated under $p_i$ if it is the only mismatch under some subpattern $S_{a,p_i}$. Let $I$ be the number of isolated mismatches. ❖ If $\mathrm{HAM}(S, S^R) \le 2d$, then $I = \mathrm{HAM}(S, S^R)$ w.h.p. (CFP+16)
In Review ❖ There exists a data structure of size $O(d \log^6 n)$ bits that recognizes whether $\mathrm{HAM}(S, S^R) \le d$ w.h.p. ❖ Recently, this has been improved to $O(d \log n)$. (Clifford, Kociumaka, Porat ‘17) ❖ Through black-box reduction, this improves our results by an $O(\log^5 n)$ factor.
Additive Error Algorithm ❖ Initialize a data structure every $\frac{\sqrt{n}}{2}$ positions!
Additive Error Algorithm ❖ $2\sqrt{n}$ sketches, each of size $O(d \log^6 n)$ bits ❖ Total space: $O(d \sqrt{n} \log^6 n)$ bits
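Why checkpoints give an additive guarantee can be checked with a small offline experiment. Trimming $t$ symbols from both ends of a $d$-near-palindrome leaves a $d$-near-palindrome, so moving the start to the next checkpoint costs fewer than twice the checkpoint gap. In the sketch below an exact Hamming computation on the stored string stands in for the $O(d \log^6 n)$-bit sketch, so it shows the error bound only, not the space bound; the gap is taken as $\lfloor \sqrt{n} \rfloor / 2$ rounded down, and all names are hypothetical.

```python
import math


def ham_rev(s):
    """HAM(s, s^R): positions at which s differs from its reverse."""
    return sum(a != b for a, b in zip(s, reversed(s)))


def longest_from(s, d, starts):
    """Length of the longest d-near-palindrome beginning at one of `starts`."""
    n, best = len(s), 0
    for i in starts:
        # Only try end points that would beat the current best length.
        for j in range(n, i + best, -1):
            if ham_rev(s[i:j]) <= d:
                best = j - i
                break
    return best


def checkpoint_estimate(s, d):
    """Restrict start positions to checkpoints spaced about sqrt(n)/2 apart."""
    gap = max(1, math.isqrt(len(s)) // 2)
    return longest_from(s, d, range(0, len(s), gap))


def exact_longest(s, d):
    return longest_from(s, d, range(len(s)))


if __name__ == "__main__":
    stream = "abaacabaccbabbbcbabbccababbccb"
    for d in (0, 1, 2):
        print(d, exact_longest(stream, d), checkpoint_estimate(stream, d))
```

The estimate never exceeds the exact answer and, by the trimming argument, falls short of it by less than $\sqrt{n}$.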
2-Pass Exact Algorithm ❖ Can we modify the 1-pass additive algorithm into a 2-pass exact algorithm? ❖ Missing characters before checkpoint!
2-Pass Exact Algorithm ❖ Idea: keep all characters before each checkpoint in the second pass ❖ What if there are $\Omega(n)$ candidates? ❖ Structural result of palindromes (BEMS14)
Structural Result of Near-Palindromes ❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes ❖ Goal #2: Use sublinear space in compression
Structural Result of Near-Palindromes ❖ Not quite periodic (at most $2d - 1$ different words) ❖ Need to save at most $2d - 1$ fingerprints of words
2-Pass Exact Algorithm ❖ First pass: $O(d^2 \sqrt{n} \log^6 n)$ bits ❖ At most $2d - 1$ fingerprints, each of size $O(d \log^6 n)$ words ❖ Need to save at most $\sqrt{n}$ characters before $2d - 1$ checkpoints: $O(d \sqrt{n})$ ❖ Total space: $O(d^2 \sqrt{n} \log^6 n)$ bits
Multiplicative Lower Bounds ❖ Yao’s Principle: find a “hard” distribution for deterministic algorithms ❖ Let $\nu$ be the prefix of $10110011100011110000\ldots = 1^1 0^1 1^2 0^2 \ldots$ of length $\frac{n}{4}$ (GMSU16). ❖ Take $x \in X = \{$strings of length $\frac{n}{4}$ with weight $d\}$ ❖ Take $y \in Y = \{\, y \mid \mathrm{HAM}(x, y) = d \text{ or } \mathrm{HAM}(x, y) = d + 1 \,\}$ ❖ Define $s(x, y) = \nu^R\, x\, y^R\, \nu$.
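The construction of $s(x, y)$ can be written out directly. The sketch below builds $\nu$ and $s(x, y)$ for binary strings given as Python `str` values; it does not enforce the weight condition on $x$ or the distance condition on $y$, and the function names are hypothetical. Since $s^R = \nu^R\, y\, x^R\, \nu$, comparing $s$ with its reverse block by block gives exactly $2 \cdot \mathrm{HAM}(x, y)$ mismatches, which the example prints.

```python
def nu_prefix(length: int) -> str:
    """Prefix of 1^1 0^1 1^2 0^2 1^3 0^3 ... of the requested length."""
    out = []
    run = 1
    while len(out) < length:
        out.extend("1" * run)
        out.extend("0" * run)
        run += 1
    return "".join(out[:length])


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def build_instance(x: str, y: str) -> str:
    """s(x, y) = nu^R x y^R nu, where |nu| = |x| = |y| = n/4."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    nu = nu_prefix(len(x))
    return nu[::-1] + x + y[::-1] + nu


if __name__ == "__main__":
    x, y = "1100", "1010"
    s = build_instance(x, y)
    print(s, len(s))
    print(hamming(s, s[::-1]), 2 * hamming(x, y))
```

The two cases $\mathrm{HAM}(x, y) = d$ and $\mathrm{HAM}(x, y) = d + 1$ therefore differ in whether the whole string $s(x, y)$ stays within a given mismatch budget.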