
Constructing Antidictionaries in Output-Sensitive Space



  1. Constructing Antidictionaries in Output-Sensitive Space
     Lorraine Ayad, Golnaz Badkobeh, Gabriele Fici, Alice Héliou, Solon Pissis
     LSD/LAW 2019, London, UK, 7-8 Feb. 2019


  4. Minimal Absent Words
     Definition: A word $v$ is an absent word of a word $w$ if $v$ does not occur as a factor of $w$. An absent word is minimal if all its proper factors occur in $w$.
     Example: Let $w = abaab$. The minimal absent words (MAWs) of $w$ are $\mathcal{M}_w = \{aaa, aaba, bab, bb\}$.
     Definition: The set $\mathcal{M}_w$ of MAWs of $w$ is called the antidictionary of $w$.
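The definition above can be checked directly on the slide's example. The following is a minimal brute-force sketch, not the authors' algorithm: it enumerates all factors explicitly, so it is only suitable for very short words. The function names `factors` and `minimal_absent_words` are chosen here for illustration. It relies on the standard observation that a word $a u b$ (with letters $a$, $b$) is a MAW exactly when $au$ and $ub$ occur but $aub$ does not.

```python
def factors(w):
    """Return the set of all factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def minimal_absent_words(w, alphabet=None):
    """Brute-force antidictionary of w (illustrative only, not output-sensitive)."""
    # Assumption: unless given, the alphabet is the set of letters occurring in w.
    alphabet = sorted(set(w)) if alphabet is None else sorted(alphabet)
    occ = factors(w)
    maws = set()
    # A letter of the alphabet that never occurs is a MAW of length 1.
    for a in alphabet:
        if a not in occ:
            maws.add(a)
    # Longer MAWs have the form a + u + b where a + u and u + b both occur
    # in w but a + u + b does not.
    for u in occ:
        for a in alphabet:
            if a + u not in occ:
                continue
            for b in alphabet:
                if u + b in occ and a + u + b not in occ:
                    maws.add(a + u + b)
    return maws


print(sorted(minimal_absent_words("abaab")))
```

For `abaab` this reproduces the four words listed in the example.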


  6. Applications of Minimal Absent Words
     Antidictionaries are used in many real-world applications:
     - Data compression (e.g., on-line lossless compression)
     - Sequence comparison (e.g., alignment-free sequence comparison)
     - Pattern matching (e.g., on-line string matching)
     - Bioinformatics (e.g., pathogen-specific signature)
     Most of the time, a reduced antidictionary $\mathcal{M}^{\ell}$ is considered, consisting of those MAWs whose length is bounded by some threshold $\ell$.

  7. Properties of Minimal Absent Words
     The theory of MAWs is well developed. For example, it is known that:
     Theorem
     1. A word of length $n$ has $O(n)$ different MAWs, which can be stored occupying $O(n)$ total space.
     2. One can compute the antidictionary of a word of length $n$ in $O(n)$ time and space.
     3. Any word of length $n$ can be reconstructed in $O(n)$ time and space from its (complete) antidictionary.
     4. The maximal length of a MAW equals 2 + the maximal length of a repeated factor. Thus, for a random word of length $n$ (generated by a Bernoulli i.i.d. source), the longest MAW has length $\Theta(\log_{|\Sigma|} n)$.
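Item 4 of the theorem can be sanity-checked on small inputs. The sketch below is a brute-force illustration under stated assumptions: the helper names are invented here, the alphabet is taken to be the letters occurring in the word, and a repeated factor is one with at least two (possibly overlapping) occurrences. It checks the identity on two short words; it is not a proof and not an efficient method.

```python
def factors(w):
    """All factors (substrings) of w, including the empty word."""
    return {w[i:j] for i in range(len(w) + 1) for j in range(i, len(w) + 1)}


def longest_maw_length(w):
    """Length of a longest MAW of a non-empty word w, by brute force."""
    occ = factors(w)
    sigma = set(w)  # assumption: alphabet = letters occurring in w
    return max(
        len(u) + 2
        for u in occ
        for a in sigma
        for b in sigma
        if a + u in occ and u + b in occ and a + u + b not in occ
    )


def longest_repeat_length(w):
    """Length of a longest factor occurring at least twice in w."""
    best = 0
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            # A second occurrence must start strictly after position i.
            if w.find(w[i:j], i + 1) != -1:
                best = max(best, j - i)
    return best


for word in ("abaab", "bbaaab"):
    print(word, longest_maw_length(word), longest_repeat_length(word))
```

On `abaab`, the longest repeated factor is `ab` (length 2) and the longest MAW is `aaba` (length 4), in line with item 4.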


  9. Algorithms for Computing Minimal Absent Words
     There exist several efficient algorithms for computing the (reduced) antidictionary of a word of length $n$, e.g.:
     - $O(n)$ time and space using a global data structure (e.g., SA) [Barton, Héliou, Mouchard, Pissis, 2014]; can be executed in external memory [Héliou, Pissis, Puglisi, 2017]
     - $O(n) + |\mathcal{M}^{\ell}|$ time using $O(\min\{n, \ell z\})$ space, where $z$ is the size of the LZ77 factorization, using the truncated DAWG [Fujishige, Takuya, Diptarama, 2018]
     However, all these algorithms require $\Omega(n)$ space due to the construction of a global data structure on the input word.


  12. Number and Distribution of Minimal Absent Words
      The total number and the distribution of lengths of MAWs have been studied for several sequences.
      Example: In the human genome ($n \approx 3 \times 10^9$) we have $\|\mathcal{M}^{12}\| \approx 10^6 = o(n)$ (while $\|\mathcal{M}^{10}\| = 0$).
      Problem: Compute the (reduced) antidictionary in output-sensitive space.


  15. Strategy
      Idea: Divide the input word $y$ into $k$ words, each of which, alone, fits in the internal memory, with a suitable overlap of length $\ell$ so as not to lose information:
      $y = y_1 \# y_2 \# \cdots \# y_k$, with $\# \notin \Sigma$.
      Then compute the MAWs of the input word $y$ incrementally, from the MAWs of the concatenation of these $k$ words.
      Formally, we state the following.
      Problem: Given $k$ words $y_1, y_2, \ldots, y_k$ over an alphabet $\Sigma$ and an integer $\ell > 0$, compute the set $\mathcal{M}^{\ell}_{y_1 \# \cdots \# y_k}$ of minimal absent words of length at most $\ell$ of $y = y_1 \# y_2 \# \cdots \# y_k$, with $\# \notin \Sigma$.


  17. Theoretical Results
      Here is an illustration of the theoretical setting: Let $y = y_1 \# y_2$. We are allowed to store $y_1$ and $y_2$ in internal memory, but not $y$. Our goal is to compute $\mathcal{M}^{\ell}_{y}$ from $\mathcal{M}^{\ell}_{y_1}$ and $\mathcal{M}^{\ell}_{y_2}$.
      Let $x \in \mathcal{M}^{\ell}_{y}$. We separate two cases:
      1. $x$ belongs to $\mathcal{M}^{\ell}_{y_1} \cup \mathcal{M}^{\ell}_{y_2}$ (Case 1)
      2. $x$ does not belong to $\mathcal{M}^{\ell}_{y_1} \cup \mathcal{M}^{\ell}_{y_2}$ (Case 2)


  19. Theoretical Results
      Lemma (Case 1): A word $x \in \mathcal{M}^{\ell}_{y_1}$ (resp. $x \in \mathcal{M}^{\ell}_{y_2}$) belongs to $\mathcal{M}^{\ell}_{y}$ if and only if $x$ is a superword of a word in $\mathcal{M}^{\ell}_{y_2}$ (resp. in $\mathcal{M}^{\ell}_{y_1}$).
      Example: Let $y_1 = abaab$, $y_2 = bbaaab$ and $\ell = 5$, so $y = abaab\#bbaaab$. We have $\mathcal{M}^{\ell}_{y_1} = \{bb, aaa, bab, aaba\}$ and $\mathcal{M}^{\ell}_{y_2} = \{bbb, aaaa, baab, aba, bab, abb\}$.
      The word $bab$ is contained in $\mathcal{M}^{\ell}_{y_1} \cap \mathcal{M}^{\ell}_{y_2}$, so it belongs to $\mathcal{M}^{\ell}_{y}$. The word $aaba \in \mathcal{M}^{\ell}_{y_1}$ is a superword of $aba \in \mathcal{M}^{\ell}_{y_2}$, hence $aaba \in \mathcal{M}^{\ell}_{y}$. Likewise, the words $bbb$, $aaaa$ and $abb$ are superwords of words in $\mathcal{M}^{\ell}_{y_1}$, hence they belong to $\mathcal{M}^{\ell}_{y}$. The remaining MAWs are not superwords of MAWs of the other word.
      $\mathcal{M}^{\ell}_{y} \cap (\mathcal{M}^{\ell}_{y_1} \cup \mathcal{M}^{\ell}_{y_2}) = \{aaaa, bab, aaba, abb, bbb\}$.
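The Case 1 lemma amounts to a filter on the two antidictionaries. As a minimal sketch (the function names are illustrative and the quadratic substring test is a simplification, not the data structure used by the authors), the slide's example can be replayed with the two sets entered as literals:

```python
def is_superword_of_any(x, words):
    """True if x contains some word of `words` as a factor (equality included)."""
    return any(m in x for m in words)


def case1_maws(maws_1, maws_2):
    """MAWs of y1 or y2 that remain MAWs of y = y1#y2, per the Case 1 lemma."""
    keep_1 = {x for x in maws_1 if is_superword_of_any(x, maws_2)}
    keep_2 = {x for x in maws_2 if is_superword_of_any(x, maws_1)}
    return keep_1 | keep_2


# Antidictionaries of y1 = abaab and y2 = bbaaab with l = 5, as on the slide.
M1 = {"bb", "aaa", "bab", "aaba"}
M2 = {"bbb", "aaaa", "baab", "aba", "bab", "abb"}

print(sorted(case1_maws(M1, M2)))
```

The words `bb` and `aaa` (from $y_1$) and `baab` and `aba` (from $y_2$) are filtered out because they occur in the other word.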

  20. Theoretical Results
      We define the reduced sets of MAWs, $\mathcal{R}^{\ell}_{y_i}$, as the sets obtained from $\mathcal{M}^{\ell}_{y_i}$ after removing those words that are superwords of a word in $\mathcal{M}^{\ell}_{y_j}$, where $\{i, j\} = \{1, 2\}$.
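Applied to the running example ($y_1 = abaab$, $y_2 = bbaaab$, $\ell = 5$), this definition can be sketched as a set difference. The name `reduced_set` is hypothetical and the substring test is a naive stand-in for whatever index an actual implementation would use.

```python
def reduced_set(maws_i, maws_j):
    """Words of maws_i that are not superwords of any word in maws_j."""
    return {x for x in maws_i if not any(m in x for m in maws_j)}


# Antidictionaries from the running example (entered as literals).
M1 = {"bb", "aaa", "bab", "aaba"}
M2 = {"bbb", "aaaa", "baab", "aba", "bab", "abb"}

R1 = reduced_set(M1, M2)
R2 = reduced_set(M2, M1)
print(sorted(R1), sorted(R2))
```

Here the reduced sets are exactly the MAWs that Case 1 discards: `bb` and `aaa` for $y_1$, and `baab` and `aba` for $y_2$.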
