Simpler and efficient LZW-compressed multiple pattern matching
Paweł Gawrychowski
July 4, 2012
We consider the standard pattern matching problem.
Pattern matching: Given a text $t$ and a pattern $p$, does $p$ occur in $t$? If it does, where is the leftmost occurrence?
Find kjfdkasl in
rokjfdkjncbvdkojsdlkjsldskjxlkalkjfslakjlkxxcv
epofikflskdjflskjvnlmnapodierpereporipojdpdaja
kjrtrgdkjfdkaslkdjoretieodflkgjsnlgkjdslgkjldf
riudkxdjwoisdoiwlkmssoiwoiosdkjwoixkcjksjdkjws
wjnswoislkcxlkqpodskjzlapoqlksdxcmdfepowepofde
zirpotdpoitgiouyoewpoiewlkjdklnkjfdkaslldkjgrp
oieorisdlkweoidssdlkweoidscxmnosdwoioweiwoiwoi
eopripowedkljskljwekljsdldkjsxmcnweioiewdlskjd
rotirlekdlsdfdwmcslkcsdpowkdwpodkwpoekwpoporer
eporjmkjfdkaslpwiowjsklncxmncsldkwpoeiwpoikwed
ojreoijdkmndkjnfekreopreojkslkdjsapowi2poqwiqp
And move to its natural generalization.
Compressed pattern matching: Given a compressed representation of a text $t$ and a pattern $p$, does $p$ occur in $t$? If it does, where is the leftmost occurrence?
As the title suggests, we will actually consider the multiple pattern version.
Compressed multiple pattern matching: Given a compressed representation of a text $t$ and a collection of patterns $p_1, p_2, \ldots, p_\ell$, does any $p_i$ occur in $t$?
Lempel-Ziv-Welch-like (or LZ78-like) compression methods
Text $t[1..N]$ is split into disjoint blocks $b_1 b_2 \ldots b_n$. Each block is either a single letter or a previously defined block concatenated with a single letter. Used in compress, GIF, TIFF, PDF!
ababbababababababababaabbbaa
You can see that $n \in \Omega(\sqrt{N})$, so the best possible compression ratio is limited. On the other hand, this method allows very fast (and simple) compression and decompression.
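The block definition above can be sketched as a greedy LZ78-style parse. This is only an illustration, not code from the talk: the function names `lz78_parse` and `expand`, and the encoding of each block as a `(parent, letter)` pair with parent 0 meaning "no earlier block", are assumptions made for this example. Since block $i$ extends an earlier block by one letter, its length is at most $i$, which gives $N \le n(n+1)/2$ and hence the $\Omega(\sqrt{N})$ bound; the last line checks that inequality.

```python
def lz78_parse(text):
    """Greedy LZ78-style parse (illustrative encoding, not from the talk).

    Returns a list of (parent, letter) pairs. parent is the 1-based index
    of an earlier block, or 0 when the block is a single letter.
    """
    trie = {}      # (parent index, letter) -> index of the block it defines
    blocks = []
    node = 0       # index of the block matched so far (0 = nothing yet)
    for ch in text:
        nxt = trie.get((node, ch))
        if nxt is not None:
            node = nxt                 # still inside a known block
            continue
        blocks.append((node, ch))      # new block = matched block + letter
        trie[(node, ch)] = len(blocks)
        node = 0
    if node != 0:
        # The input ended inside a known block; repeat that block's definition.
        blocks.append(blocks[node - 1])
    return blocks


def expand(blocks):
    """Return the string spelled by each block, in order."""
    strings = [""]                     # index 0 is the empty "no parent" block
    for parent, letter in blocks:
        strings.append(strings[parent] + letter)
    return strings[1:]


text = "ababbababababababababaabbbaa"
blocks = lz78_parse(text)
n, N = len(blocks), len(text)
assert "".join(expand(blocks)) == text
assert n * (n + 1) // 2 >= N           # block i has length at most i
```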
$t[1..N]$: text, which after compression consists of $n$ blocks
$p_1, p_2, \ldots, p_\ell$: patterns of total length $M$
LZW-compressed multiple pattern matching
Input: $p_1, p_2, \ldots, p_\ell$ and a sequence of $n$ blocks defining text $t$
Output: does any $p_i$ occur in $t$?
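To make the input and output concrete, here is a naive baseline that decompresses the blocks and searches the plain text. It is a sketch under assumptions: the `(parent, letter)` block encoding and the names `decompress` and `occurs_naive` are hypothetical, chosen for this example. It needs time and space at least proportional to $N$, which is exactly what the algorithms discussed in this talk avoid by working on the $n$ blocks directly.

```python
def decompress(blocks):
    """Expand (parent, letter) blocks into the text t.

    parent is the 1-based index of an earlier block, or 0 for none
    (an assumed encoding, used only for illustration).
    """
    strings = [""]
    for parent, letter in blocks:
        strings.append(strings[parent] + letter)
    return "".join(strings[1:])


def occurs_naive(patterns, blocks):
    """Answer the decision problem by brute force on the decompressed text."""
    text = decompress(blocks)
    return any(p in text for p in patterns)


# Blocks "a", "b", "ab" spell the text "abab".
example_blocks = [(0, "a"), (0, "b"), (1, "b")]
print(occurs_naive(["aa", "bab"], example_blocks))
print(occurs_naive(["bb"], example_blocks))
```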
The first solutions for the single pattern version were given in 1994 by Amir, Benson, and Farach. They developed two algorithms with time complexities $O(n \log M + M)$ and $O(n + M^2)$.
A year later the second algorithm was improved by Kosaraju, who developed an $O(n + M^{1+\epsilon})$ time solution.
Gawrychowski, SODA 2011: Single pattern version can be solved in $O(n + M)$ time.
If we consider more than one pattern, the situation seems significantly more challenging.
Kida, Takeda, Shinohara, Miyazaki, Arikawa, DCC 1998: Multiple pattern version can be solved in $O(n + M^2)$ time.
Is it possible to narrow the gap between single and multiple pattern versions?
This paper: Multiple pattern version can be solved in $O(n \log M + M)$ or $O(n + M^{1+\epsilon})$ time.
1. matches the bounds of Amir et al. and Kosaraju.
2. DOES NOT use any combinatorics on words, reduces the question to simple-to-state data structure problems.
3. the same high-level idea in both algorithms. So, in a certain sense, more uniform than the previously known solutions for single pattern.
Snippets
A snippet is simply a substring of (any) pattern. Assume that each block is a snippet. The simplest way to detect an occurrence is to process the blocks from left to right as follows, where the red segment is the current longest prefix of (any) pattern.
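The left-to-right scan can be illustrated with a textbook Aho-Corasick automaton, whose state is precisely the longest prefix of any pattern that is a suffix of the text read so far. This is a swapped-in baseline, not the method of the talk: it feeds every block letter by letter and so takes time proportional to $N$, whereas the point of the talk is to update this prefix once per block. The `(parent, letter)` block encoding, the function names, and the assumption that all patterns are non-empty are made up for this sketch.

```python
from collections import deque


def build_automaton(patterns):
    """Aho-Corasick automaton over non-empty patterns (assumed)."""
    goto, fail, accept = [{}], [0], [False]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                accept.append(False)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        accept[s] = True
    queue = deque(goto[0].values())    # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            # A state accepts if some pattern is a suffix of its prefix.
            accept[t] = accept[t] or accept[fail[t]]
            queue.append(t)
    return goto, fail, accept


def occurs_in_blocks(patterns, blocks):
    """Scan blocks left to right, keeping the current longest pattern prefix."""
    goto, fail, accept = build_automaton(patterns)
    strings = [""]                     # expanded blocks; index 0 = no parent
    state = 0                          # the "red segment" from the slide
    for parent, letter in blocks:
        block = strings[parent] + letter
        strings.append(block)
        for ch in block:
            while state and ch not in goto[state]:
                state = fail[state]
            state = goto[state].get(ch, 0)
            if accept[state]:
                return True
    return False
```

For example, with blocks `[(0, "a"), (0, "b"), (1, "b")]` spelling "abab", the pattern "bab" is found even though it crosses a block boundary, because the state carries the matched prefix from one block into the next.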