Longest Near-Palindrome under Hamming Distance Elena Grigorescu, - PowerPoint PPT Presentation

Streaming for Aibohphobes: Longest Near-Palindrome under Hamming Distance Elena Grigorescu, Purdue University Erfan Sadeqi Azer, Indiana University Samson Zhou, Purdue University

Structure of Talk ❖ Background ❖ 1-Pass Additive Algorithm ❖ 2-Pass Exact Algorithm ❖ Lower Bounds

Finding Structure in Noisy Data ❖ [Figure: grid of noisy letters containing hidden structure: repeated occurrences of PATTERN, the periodic run PERIODPERIODPERIODPERIODPER, and the palindrome LONGPALINDROMEEMORDNILAPGNOL]

Palindrome ❖ A string that reads the same forwards and backwards ❖ $S = S^R$ ❖ RACECAR ❖ AIBOHPHOBIA

$d$-Near-Palindrome ❖ A string that “almost” reads the same forwards and backwards ❖ Given a metric $\mathrm{dist}$, a $d$-near-palindrome $S$ has $\mathrm{dist}(S, S^R) \le d$. ❖ RACECAR ❖ FACECAR

Hamming Distance ❖ Given strings $X, Y$ of equal length, the Hamming distance between $X$ and $Y$ is defined as the number of positions $i$ at which $X_i \ne Y_i$. ❖ $S$ = FACECAR ❖ $S^R$ = RACECAF ❖ $\mathrm{HAM}(S, S^R) = 2$
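The definition above can be checked directly on the slide's example. The following is a minimal sketch; the function names `hamming` and `is_near_palindrome` are illustrative and do not come from the talk.

```python
def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


def is_near_palindrome(s: str, d: int) -> bool:
    """True if s is within Hamming distance d of its own reversal."""
    return hamming(s, s[::-1]) <= d


print(hamming("FACECAR", "RACECAF"))     # the slide's example
print(is_near_palindrome("FACECAR", 2))
```

Note that mismatches between a string and its reversal come in mirrored pairs, so the distance to the reversal is always even.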

Streaming Model ❖ String of length $n$ arrives one symbol at a time ❖ Use $o(n)$ space, ideally $O(\mathrm{polylog}\, n)$ ❖ Example stream: abaacabaccbabbbcbabbccababbccb

Longest $d$-Near-Palindrome Problem ❖ Given a string $S$ of length $n$, which arrives in a data stream, identify the longest $d$-near-palindrome in space $o(n)$. ❖ Given a string $S$ of length $n$, which arrives in a data stream, find a “long” $d$-near-palindrome in space $o(n)$.
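For reference, the problem can be solved offline by expanding around every center while tracking a mismatch budget. The sketch below is a quadratic-time baseline that stores the whole string, so it is explicitly not a streaming algorithm and not the method of the talk; the name `longest_near_palindrome` is hypothetical.

```python
def longest_near_palindrome(s: str, d: int) -> tuple[int, int]:
    """Return (start, length) of a longest substring t with HAM(t, reverse(t)) <= d.

    Offline O(n^2) baseline; it keeps the entire string in memory.
    """
    n = len(s)
    best = (0, 0)
    # Centers 0..2n-2: even values sit on a character, odd values sit in a gap.
    for center in range(2 * n - 1):
        lo, hi = center // 2, (center + 1) // 2
        cost = 0  # Hamming distance between s[lo..hi] and its reversal
        while lo >= 0 and hi < n:
            if s[lo] != s[hi]:
                cost += 2  # a mismatched pair counts at both mirrored positions
                if cost > d:
                    break
            if hi - lo + 1 > best[1]:
                best = (lo, hi - lo + 1)
            lo -= 1
            hi += 1
    return best


print(longest_near_palindrome("FACECAR", 0))  # exact palindrome ACECA
print(longest_near_palindrome("FACECAR", 2))  # whole string, two mismatches allowed
```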

Applications TGCTTAAGCGCTTGCAAGCGCTTAAGCA CAAGCGCTTAAGCA ACGAATTCGCGAAC

Related Work (Palindromes in Data Streams) ❖ $O(\log n)$ space to provide a $1+\varepsilon$ multiplicative approximation to the length of the longest palindrome (Berenbrink, Ergün, Mallmann-Trenn, Sadeqi Azer ’14) ❖ $O(\sqrt{n})$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest palindrome (BEMS14) ❖ $O(\sqrt{n})$ space to find the longest palindrome in two passes (BEMS14) ❖ $\Omega\left(\frac{\log n}{\varepsilon \log(1+\varepsilon)}\right)$ space for $1+\varepsilon$ multiplicative approximation (Gawrychowski, Merkurev, Shur, Uznanski ’16) ❖ $\Omega\left(\frac{n}{E}\right)$ space for $E$ additive approximation (GMSU16)

Our Results ❖ $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ space to provide a $1+\varepsilon$ multiplicative approximation to the length of the longest $d$-near-palindrome ❖ $O(d \sqrt{n} \log^6 n)$ space to provide a $\sqrt{n}$ additive approximation to the length of the longest $d$-near-palindrome ❖ $O(d^2 \sqrt{n} \log^6 n)$ space to find the longest $d$-near-palindrome in two passes ❖ $\Omega(d \log n)$ space LB for $1+\varepsilon$ multiplicative approximation ❖ $\Omega\left(\frac{dn}{E}\right)$ space LB for $E$ additive approximation

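Comparison
| | Longest Palindrome | Longest $d$-Near-Palindrome (Here) |
| $1+\varepsilon$ multiplicative | $O(\log^2 n)$ (BEMS14) | $O\left(\frac{d \log^7 n}{\varepsilon \log(1+\varepsilon)}\right)$ |
| $\sqrt{n}$ additive | $O(\sqrt{n} \log n)$ (BEMS14) | $O(d \sqrt{n} \log^6 n)$ |
| two pass exact | $O(\sqrt{n} \log n)$ (BEMS14) | $O(d^2 \sqrt{n} \log^6 n)$ |
| $1+\varepsilon$ multiplicative LB | $\Omega\left(\frac{\log n}{\log(1+\varepsilon)}\right)$ (GMSU16) | $\Omega(d \log n)$ |
| $E$ additive LB | $\Omega\left(\frac{n}{E}\right)$ (GMSU16) | $\Omega\left(\frac{dn}{E}\right)$ |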

Warm-Up ❖ Suppose we see string $S$, followed by string $T$. How can we determine if $S = T$, with high probability?

Karp-Rabin Fingerprints ❖ Given base $B$ and a prime $P$, define $\phi(S) = \sum_{i=1}^{n} B^i \, S[i] \bmod P$ ❖ If $S = T$, then $\phi(S) = \phi(T)$ ❖ If $S \ne T$, then $\phi(S) \ne \phi(T)$ w.h.p. (Schwartz-Zippel)

Properties of Karp-Rabin Fingerprints ❖ $\phi(S[1:y]) = \phi(S[1:x]) + B^x \, \phi(S[x+1:y])$ (concatenation) ❖ Define $\phi^R(S) = \sum_{i=1}^{n} B^{-i} \, S[i] \bmod P$ (reversal) ❖ $\phi\big((S[1:x])^R\big) = B^{x+1} \, \phi^R(S[1:x])$ ❖ $\phi^R(S[1:y]) = \phi^R(S[1:x]) + B^{-x} \, \phi^R(S[x+1:y])$ ❖ Can be computed on the fly
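The properties above can be sketched as a small streaming structure that maintains both the forward fingerprint $\phi$ and the reverse fingerprint $\phi^R$ of the prefix read so far, then tests whether that prefix is a palindrome via $\phi(\text{reversed prefix}) = B^{x+1}\phi^R(\text{prefix})$. The class name `StreamFingerprint`, the prime `P`, and the fixed base `B` are assumptions made for illustration; the w.h.p. guarantee needs a randomly chosen base, and it is fixed here only so the example is reproducible.

```python
P = (1 << 61) - 1      # a Mersenne prime (assumed modulus; any large prime works)
B = 911382323          # fixed base for reproducibility; choose at random in practice
B_INV = pow(B, -1, P)  # modular inverse of B (requires Python 3.8+)


class StreamFingerprint:
    """Forward and reverse Karp-Rabin fingerprints of the stream read so far."""

    def __init__(self) -> None:
        self.length = 0
        self.fwd = 0        # sum of B^i * S[i] mod P
        self.rev = 0        # sum of B^(-i) * S[i] mod P
        self._pow = 1       # B^length mod P
        self._inv_pow = 1   # B^(-length) mod P

    def push(self, ch: str) -> None:
        """Consume one symbol and update both fingerprints in O(1) time."""
        self.length += 1
        self._pow = self._pow * B % P
        self._inv_pow = self._inv_pow * B_INV % P
        c = ord(ch)
        self.fwd = (self.fwd + self._pow * c) % P
        self.rev = (self.rev + self._inv_pow * c) % P

    def prefix_is_palindrome(self) -> bool:
        """Compare phi(prefix) with phi(reversed prefix) = B^(x+1) * phi_R(prefix)."""
        return self.fwd == self._pow * B % P * self.rev % P


fp = StreamFingerprint()
for ch in "RACECAR":
    fp.push(ch)
print(fp.prefix_is_palindrome())
```

Only a constant number of values modulo `P` are stored, which is what makes the fingerprints usable in the streaming model.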

Identifying Palindromes ❖ 111101011100001010010101001111101011100001010010101001

Identifying Near-Palindromes? ❖ 111101011100001010010101001111101011100001010010101001

Identifying Near-Palindromes? (CFP+16)

Karp-Rabin Fingerprints for Subpatterns ❖ $S_{a,b} = S[a]\, S[a+b]\, S[a+2b]\, S[a+3b] \dots$ ❖ $\phi_{a,b}(S) = \phi(S_{a,b}) = B \cdot S[a] + B^2 \cdot S[a+b] + B^3 \cdot S[a+2b] + \dots$

Identifying Near-Palindromes? ❖ Let $\Delta = \#\{a \mid \phi_{a,b}(S) \ne B^k \phi^R_{a,b}(S)\}$ ❖ Then $\Delta \le \mathrm{HAM}(S, S^R)$

Identifying Near-Palindromes? ❖ Sample $\log n$ primes $p_1, p_2, \dots$ from $[16\, d \log^2 n,\ 544\, d \log^2 n]$. ❖ Let $\Delta = \max_i \#\{a \mid \phi_{a,p_i}(S) \ne B^k \phi^R_{a,p_i}(S)\}$ ❖ $\Delta \le \mathrm{HAM}(S, S^R)$ ❖ If $\mathrm{HAM}(S, S^R) > 2d$, then $\Delta > \left(1 + \frac{1}{16}\right) d$ w.h.p. (CFP+16) ❖ What about $\mathrm{HAM}(S, S^R) \le 2d$?
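The bound $\Delta \le \mathrm{HAM}(S, S^R)$ holds because every mismatched position falls into exactly one subpattern, so the number of subpatterns that disagree cannot exceed the number of mismatches. The sketch below illustrates this offline: it materializes the reversed string and fingerprints each subpattern directly, instead of using the streaming identity with $B^k$ and $\phi^R$ from the slide. The names `phi` and `delta`, the prime `P`, and the base `B` are assumptions for illustration.

```python
P = (1 << 61) - 1  # assumed prime modulus
B = 911382323      # fixed base for reproducibility; random in practice


def phi(chars: str) -> int:
    """Karp-Rabin fingerprint: sum of B^i * chars[i] mod P, with i starting at 1."""
    acc, power = 0, 1
    for ch in chars:
        power = power * B % P
        acc = (acc + power * ord(ch)) % P
    return acc


def delta(s: str, b: int) -> int:
    """Count residues a (mod b) whose subpattern differs between s and reversed s."""
    r = s[::-1]
    # s[a::b] is the subpattern S[a], S[a+b], S[a+2b], ... (0-indexed here).
    return sum(phi(s[a::b]) != phi(r[a::b]) for a in range(b))


print(delta("FACECAR", 3))  # both mismatches share one subpattern
print(delta("FACECAR", 7))  # each position is its own subpattern
```

With `b = 3` the two mismatches of FACECAR collide in the same subpattern, which is why a single modulus can undercount and why the slides take a maximum over several primes.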

Karp-Rabin Fingerprints for Sub-Subpatterns

Second-Level Karp-Rabin Fingerprints ❖ Call a mismatch isolated under $p_i$ if it is the only mismatch under some subpattern $S_{a,p_i}$. Let $I$ be the number of isolated mismatches. ❖ If $\mathrm{HAM}(S, S^R) \le 2d$, then $I = \mathrm{HAM}(S, S^R)$ w.h.p. (CFP+16)

In Review ❖ There exists a data structure of size $O(d \log^6 n)$ bits that recognizes whether $\mathrm{HAM}(S, S^R) \le d$ w.h.p. ❖ Recently, this has been improved to $O(d \log n)$. (Clifford, Kociumaka, Porat ’17) ❖ Through black-box reduction, improves our results by $O(\log^5 n)$.

Additive Error Algorithm ❖ Initialize a data structure every $\frac{\sqrt{n}}{2}$ positions!

Additive Error Algorithm ❖ $2\sqrt{n}$ sketches, each of size $O(d \log^6 n)$ bits ❖ Total space: $O(d \sqrt{n} \log^6 n)$ bits
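The space bound is a direct count, assuming one sketch is started every $\sqrt{n}/2$ positions of a length-$n$ stream:

```latex
\[
\#\text{sketches} \;=\; \frac{n}{\sqrt{n}/2} \;=\; 2\sqrt{n},
\qquad
\text{total space} \;=\; 2\sqrt{n} \cdot O(d \log^6 n) \;=\; O(d \sqrt{n} \log^6 n)\ \text{bits}.
\]
```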

2-Pass Exact Algorithm ❖ Can we modify 1-pass additive algorithm to 2-pass exact? ❖ Missing characters before checkpoint!

2-Pass Exact Algorithm ❖ Idea: keep all characters before each checkpoint in the second pass ❖ What if there are $\Omega(\sqrt{n})$ candidates? ❖ Structural result of palindromes (BEMS14)

Structural Result of Near-Palindromes ❖ Goal #1: Recover fingerprints of all overlapping “long” near-palindromes ❖ Goal #2: Use sublinear space in compression

Structural Result of Near-Palindromes

Structural Result of Near-Palindromes ❖ Not quite periodic (at most $2d - 1$ different words) ❖ Need to save at most $2d - 1$ fingerprints of words

2-Pass Exact Algorithm ❖ First pass: $O(d^2 \sqrt{n} \log^6 n)$ bits ❖ At most $2d - 1$ fingerprints, each of size $O(d \log^6 n)$ words ❖ Need to save at most $\sqrt{n}$ characters before $2d - 1$ checkpoints: $O(d \sqrt{n})$ ❖ Total space: $O(d^2 \sqrt{n} \log^6 n)$ bits

Multiplicative Lower Bounds ❖ Yao’s Principle: find “hard” distribution for deterministic algorithms ❖ Let $\nu$ be the prefix of $10110011100011110000\dots = 1^1 0^1 1^2 0^2 \dots$ of length $\frac{n}{4}$ (GMSU16). ❖ Take $x \in X$ = strings of length $\frac{n}{4}$ with weight $d$ ❖ Take $y \in Y = \{ y \mid \mathrm{HAM}(x, y) = d \text{ or } \mathrm{HAM}(x, y) = d + 1 \}$ ❖ Define $s(x, y) = \nu^R\, x\, y^R\, \nu$.
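To make the construction concrete, the sketch below builds $s(x, y) = \nu^R x y^R \nu$ for small inputs. Reversing $s$ gives $\nu^R y x^R \nu$, so the two outer blocks line up exactly and the two inner blocks each contribute $\mathrm{HAM}(x, y)$ mismatches, giving $\mathrm{HAM}(s, s^R) = 2\,\mathrm{HAM}(x, y)$. The function names are hypothetical, and the code only illustrates the shape of the instance, not the lower-bound argument itself.

```python
def nu_prefix(length: int) -> str:
    """Prefix of 1^1 0^1 1^2 0^2 1^3 0^3 ... with the requested length."""
    blocks = []
    k, total = 1, 0
    while total < length:
        blocks.append("1" * k + "0" * k)
        total += 2 * k
        k += 1
    return "".join(blocks)[:length]


def hard_instance(x: str, y: str) -> str:
    """Build s(x, y) = reverse(nu) + x + reverse(y) + nu, with |nu| = |x| = |y|."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    nu = nu_prefix(len(x))
    return nu[::-1] + x + y[::-1] + nu


def hamming(u: str, v: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


x, y = "1100", "1010"
s = hard_instance(x, y)
print(s, hamming(s, s[::-1]), 2 * hamming(x, y))
```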
