CPM 2018: Computing longest common square subsequences. Takafumi Inoue 1, Shunsuke Inenaga 1, Heikki Hyyrö 2, Hideo Bannai 1, Masayuki Takeda 1. 1 Kyushu University, 2 University of Tampere
Longest Common Subsequence (LCS) LCS Problem Input: two strings A and B of length n each Output: (the length of) an LCS of A and B. LCS is a classical measure for string comparison. The standard DP solves it in O(n^2) time. E.g.) A = aacaabad vs B = cacbcbbd, with LCS acbd (length 4).
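The standard quadratic DP mentioned above can be sketched as follows (not part of the slides; a minimal Python illustration of the classical recurrence):

```python
def lcs_length(A: str, B: str) -> int:
    n, m = len(A), len(B)
    # D[i][j] = length of an LCS of the prefixes A[:i] and B[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if A[i - 1] == B[j - 1]:
                D[i][j] = D[i - 1][j - 1] + 1
            else:
                D[i][j] = max(D[i - 1][j], D[i][j - 1])
    return D[n][m]

print(lcs_length("aacaabad", "cacbcbbd"))  # 4 (e.g. "acbd")
```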
Constrained/Restricted LCS Variants of the LCS problem where the solution must satisfy pre-determined constraints, attempting to reflect the user's a-priori knowledge in the solutions. STR-IC-LCS, STR-EC-LCS, SEQ-IC-LCS, SEQ-EC-LCS: the LCS of A and B that includes (excludes) a given pattern P as a substring (subsequence). (See [Kuboi et al., CPM 2017] and references therein.) Longest common palindromic subsequence (LCPS) [Chowdhury et al. 2014, Inenaga & Hyyrö 2018, Bae & Lee 2018].
Longest Common Square Subseq. (LCSS) This work considers a new variant of LCS, called LCSS, where the solution has to be a square. A square (a.k.a. tandem repeat) is a string of the form xx. E.g.) aabaab, abababab, abcbbabcbb
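As a quick illustration (not from the slides), the square property from the definition above is a one-line check:

```python
def is_square(s: str) -> bool:
    # A square (tandem repeat) is the concatenation xx of some string x with itself.
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == s[half:]

# The three examples from the slide are all squares:
assert is_square("aabaab") and is_square("abababab") and is_square("abcbbabcbb")
assert not is_square("aba")
```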
Longest Common Square Subseq. (LCSS) LCSS Problem Input: two strings A and B of length n each Output: (the length of) an LCSS of A and B. E.g.) A = monsterstrike vs B = fourstringmasters, with the common square subsequence strstr (= str·str, length 6).
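One can verify the example above by hand or with a small sketch (not part of the slides): strstr is a square, and it is a subsequence of both input strings.

```python
def is_subsequence(p: str, s: str) -> bool:
    # Greedy left-to-right scan: membership tests on the iterator
    # consume it, so characters of p must appear in order in s.
    it = iter(s)
    return all(c in it for c in p)

w = "strstr"  # = "str" + "str", a square
assert is_subsequence(w, "monsterstrike")
assert is_subsequence(w, "fourstringmasters")
```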
Our Results Upper bounds (algorithms) for LCSS:

  algorithm              time                           space
  Naïve                  O(n^6)                         O(n^4)
  Simple                 O(Mn^4)                        O(n^4)
  Matching rectangle 1   O(σM^3 + n)                    O(M^2 + n)
  Matching rectangle 2   O(M^3 log^2(n) loglog(n) + n)  O(M^3 + n)

n is the length of the input strings. M is the number of matching points, i.e., M = |{(i, j) | A[i] = B[j], 1 ≤ i, j ≤ n}|. σ is the alphabet size.
Matching Points M is the number of matching points, i.e., M = |{(i, j) | A[i] = B[j], 1 ≤ i, j ≤ n}|. [Figure: grid with the positions of A on one axis and those of B on the other; each ● marks a matching point, e.g. A[3] = B[5]. M = # of ●'s, so M = O(n^2) in the worst case.]
Matching Points [Cont.] But M can be much smaller than O(n^2) in many cases. [Figure: matching-point grid for a pair of dissimilar strings (one of them biscuit), containing only a few ●'s.]
Our Results Upper bounds (algorithms) for LCSS:

  algorithm              time                           space
  Naïve                  O(n^6)                         O(n^4)
  Simple                 O(Mn^4)                        O(n^4)
  Matching rectangle 1   O(σM^3 + n)                    O(M^2 + n)
  Matching rectangle 2   O(M^3 log^2(n) loglog(n) + n)  O(M^3 + n)

n is the length of the input strings. M is the number of matching points, i.e., M = |{(i, j) | A[i] = B[j], 1 ≤ i, j ≤ n}|; M is at most O(n^2) and can be much smaller. σ is the alphabet size.
Matching Rectangles A tuple r = (i, j, k, l) is called a matching rectangle if A[i] = A[j] = B[k] = B[l]. [Figure: rectangle r spanning positions i..j of A and k..l of B, with the same character c at all four positions.]
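A direct way to enumerate matching points and matching rectangles (an illustrative sketch, not from the slides; it uses 0-based indices and assumes i < j and k < l, as suggested by the figure):

```python
from itertools import combinations

def matching_points(A: str, B: str):
    # All (i, j) with A[i] = B[j]; there are M of them.
    return [(i, j) for i, a in enumerate(A) for j, b in enumerate(B) if a == b]

def matching_rectangles(A: str, B: str):
    # All (i, j, k, l) with i < j, k < l and A[i] = A[j] = B[k] = B[l].
    rects = []
    for i, j in combinations(range(len(A)), 2):
        if A[i] != A[j]:
            continue
        c = A[i]
        for k, l in combinations(range(len(B)), 2):
            if B[k] == c and B[l] == c:
                rects.append((i, j, k, l))
    return rects
```

For A = B = "aa" this yields M = 4 matching points and the single rectangle (0, 1, 0, 1).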
Partial Order of Matching Rectangles For matching rectangles r = (i, j, k, l) and r' = (i', j', k', l'), r < r' iff i < i', j < j', k < k', and l < l'. Namely, r < r' iff r lies strictly more left-lower than r'. [Figure: two examples of rectangles r and r' with r < r'.]
Observation Each common square subsequence has a corresponding sequence of matching rectangles. [Figure: the common square subsequence abcabc of A and B, drawn as a chain of matching rectangles.]
CSS and Matching Rectangles A sequence r_1, …, r_s of s matching rectangles represents a CSS of length s iff (1) r_1 < r_2 < … < r_s and (2) i_s < j_1 and k_s < l_1, i.e., (i_s, k_s) is strictly more left-lower than (j_1, l_1), where r_1 = (i_1, j_1, k_1, l_1) and r_s = (i_s, j_s, k_s, l_s).
LCSS → Longest Sequence of DOMRs Computing the LCSS thus reduces to finding the longest sequence of diagonally overlapping matching rectangles (DOMRs).
Basic Algorithm For each matching rectangle r, maintain a DP table D_r of size M^2 such that D_r[r'] stores the length of the longest sequence of DOMRs that begins with r and ends with r'. For each character c, find the "closest" matching rectangle r_c w.r.t. c that can be added after r', and update D_r[r_c] if needed. [Figure: three animation frames extending the sequence ending at r' with the closest rectangle for each of the characters a, b, and c.]
Basic Algorithm [Cont.] Let R be the number of matching rectangles (R = O(M^2)). We compute D_r[r'] for R^2 = O(M^4) pairs of matching rectangles (r, r'). We test σ characters to extend the current sequence of DOMRs w.r.t. D_r[r']. Each extension can be obtained in O(1) time after suitable preprocessing. O(σR^2 + n) = O(σM^4 + n) time… Slow? This can be improved to O(σMR + n) = O(σM^3 + n) time.
On the Start Matching Rectangle It is always better to use the start matching rectangle with the "smallest" left-lower corner for each character. [Figure: rather than trying each matching point m for character a, we can always use one fixed point for a.]
Improved Algorithm We compute D_m[r'] for MR = O(M^3) pairs (m, r') of a matching point and a matching rectangle. We test σ characters to extend the current sequence of DOMRs. Each extension can be obtained in O(1) time after suitable preprocessing. O(σMR + n) = O(σM^3 + n) time!
Improved Algorithm [Cont.] Theorem: The LCSS problem can be solved in O(σMR + n) = O(σM^3 + n) time with O(M^2 + n) space. Corollary: The expected running time of this algorithm is O(n^6/σ^3), since for random text M ≈ n^2/σ and R ≈ M^2/σ ≈ n^4/σ^3.
Hardness of LCSS Lemma: Computing the LCSS of two strings is at least as hard as computing the LCS of four strings.
4-LCS → 2-LCSS Computing the LCS of A, B, C, D of length n each (|A| = |B| = |C| = |D| = n) reduces to computing the LCSS of A', B' of length 4n + 2 each: A' = A $^{n+1} C $^{n+1} and B' = B $^{n+1} D $^{n+1}, where $ is a fresh symbol not occurring in the input strings.
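The construction can be sketched as follows (an illustration assuming, as the slide's layout suggests, that A' combines A with C and B' combines B with D, each with $-blocks of length n+1; "$" stands for a symbol not occurring elsewhere):

```python
def reduce_4lcs_to_2lcss(A: str, C: str, B: str, D: str):
    # Build A' and B' so that a longest common square subsequence of
    # (A', B') corresponds to a longest common subsequence of A, B, C, D.
    n = len(A)
    assert len(B) == len(C) == len(D) == n
    sep = "$" * (n + 1)           # separator block, n+1 fresh symbols
    A2 = A + sep + C + sep        # |A'| = 4n + 2
    B2 = B + sep + D + sep        # |B'| = 4n + 2
    return A2, B2
```

Intuitively, any sufficiently long square ww in both A' and B' must have w of the form u$^{n+1}, forcing u to be a common subsequence of all four input strings.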
Conditional Lower Bound for LCSS Lemma [Abboud et al. 2015]: There is no algorithm that solves the LCS problem for k strings in O(n^{k-ε}) time for any constant ε > 0, unless the strong exponential time hypothesis (SETH) fails. Corollary: There is no algorithm that solves the LCSS problem for two strings in O(n^{4-ε}) time for any constant ε > 0, unless SETH fails.
Conclusions & Open Problem Upper bounds for LCSS (M = O(n^2)):

  algorithm              time                           space
  Naïve                  O(n^6)                         O(n^4)
  Simple                 O(Mn^4)                        O(n^4)
  Matching rectangle 1   O(σM^3 + n)                    O(M^2 + n)
  Matching rectangle 2   O(M^3 log^2(n) loglog(n) + n)  O(M^3 + n)

Conditional lower bound for LCSS: an O(n^{4-ε})-time solution (with constant ε > 0) is unlikely to exist. How can we close this (almost) quadratic gap?
Strong Exponential Time Hypothesis (SETH) Let s_k be the greatest lower bound (infimum) of the real numbers δ such that k-SAT can be solved in O(2^{δn}) time, where n = # of variables. The exponential time hypothesis (ETH) is the conjecture that s_k > 0 for every k ≥ 3. Clearly s_3 ≤ s_4 ≤ s_5 ≤ …. The strong ETH (SETH) is the conjecture that s_k tends to 1 as k approaches ∞.