Pattern Discovery in Colored Strings



  1. Pattern Discovery in Colored Strings
     Zsuzsanna Lipták (1), Simon J. Puglisi (2), and Massimiliano Rossi (3)
     SEA 2020, 16 Jun 2020, Catania (online)
     (1) University of Verona, Department of Computer Science.
     (2) University of Helsinki, Department of Computer Science.
     (3) University of Florida, Department of Computer & Information Science & Engineering.

  2. Motivations – Assertion mining
     • Embedded systems are everywhere.
     • Designing an embedded system requires evaluating the correctness of its functionality.
     • This is usually done with assertions (logic formulae), typically written by hand by the designers.
     • It can take months to find a small and effective set of assertions.
     • Goal: automatic extraction of assertions from simulation traces.
     LIPTÁK, PUGLISI, ROSSI - PATTERN DISCOVERY IN COLORED STRINGS

  3. Motivations – Assertion mining
     Simulation trace:

      T  i1 i2 i3  o1 o2
      1   0  1  0   0  0
      2   1  1  0   1  0
      3   0  1  0   0  0
      4   1  1  0   1  1
      5   0  1  0   0  0
      6   1  1  0   1  0
      7   1  0  1   1  1
      8   0  1  0   1  0
      9   1  1  0   0  0
     10   0  1  0   0  0
     11   1  0  1   1  1

  4. Motivations – Assertion mining
     Simulation trace: the same table as on slide 3.

     Input alphabet:
     i1 i2 i3   Σ
      0  1  0   A
      1  0  1   B
      1  1  0   C

     Output alphabet:
     o1 o2   Γ
      0  0   X
      1  0   Y
      1  1   Z

     Colored string:
     Colors    X  Y  X  Z  X  Y  Z  Y  X  X  Z
     String    A  C  A  C  A  C  B  A  C  A  B
     Position  1  2  3  4  5  6  7  8  9 10 11
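
The encoding on this slide can be reproduced with a few lines of Python. This is only a sketch: the names `TRACE`, `SIGMA`, `GAMMA` and `encode_trace` are illustrative, and the code assumes that the first three columns of each row are the inputs and the last two are the outputs, as the two alphabet tables suggest.

```python
# Rows of the simulation trace from the slide, one tuple (i1, i2, i3, o1, o2) per row.
TRACE = [
    (0, 1, 0, 0, 0),
    (1, 1, 0, 1, 0),
    (0, 1, 0, 0, 0),
    (1, 1, 0, 1, 1),
    (0, 1, 0, 0, 0),
    (1, 1, 0, 1, 0),
    (1, 0, 1, 1, 1),
    (0, 1, 0, 1, 0),
    (1, 1, 0, 0, 0),
    (0, 1, 0, 0, 0),
    (1, 0, 1, 1, 1),
]

# Input alphabet: each input assignment becomes one character of the string.
SIGMA = {(0, 1, 0): "A", (1, 0, 1): "B", (1, 1, 0): "C"}
# Output alphabet: each output assignment becomes one color.
GAMMA = {(0, 0): "X", (1, 0): "Y", (1, 1): "Z"}


def encode_trace(trace, n_inputs=3):
    """Split every row into inputs and outputs and map both through the alphabets."""
    chars = "".join(SIGMA[row[:n_inputs]] for row in trace)
    colors = "".join(GAMMA[row[n_inputs:]] for row in trace)
    return chars, colors


S, F = encode_trace(TRACE)
print(S)  # ACACACBACAB
print(F)  # XYXZXYZYXXZ
```

The two printed lines match the "String" and "Colors" rows of the colored string shown on the slide.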

  5. Colored Strings
     Definition: Colored strings are strings where each character is assigned one of a finite set of colors.
     Objective: We want to find patterns in the string that are always followed by the same color at the same distance.

     Colors    X  Y  X  Z  X  Y  Z  Y  X  X  Z
     String    A  C  A  C  A  C  B  A  C  A  B
     Position  1  2  3  4  5  6  7  8  9 10 11

     The occurrences of ACA ending at positions 3 and 5 are both followed, at distance 3, by color Y (positions 6 and 8).
     We say that ACA is (Y,3)-unique.
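
As a small illustration of the definition, the sketch below lists the colors found at a given distance after each occurrence of a pattern. The helper name `colors_at_distance` is hypothetical, and skipping occurrences whose target position falls beyond the end of the string is an assumption made here, not something the slide states.

```python
def colors_at_distance(s, colors, t, d):
    """Colors found d positions after the end of each occurrence of t in s.

    Positions are 1-based as on the slide. Occurrences whose target position
    falls beyond the end of the string are skipped (an assumption of this sketch).
    """
    seen = []
    start = s.find(t)
    while start != -1:
        end = start + len(t)          # 1-based position of the last character of t
        if end + d <= len(s):
            seen.append(colors[end + d - 1])
        start = s.find(t, start + 1)  # advance by one so overlapping occurrences are found
    return seen


S = "ACACACBACAB"
F = "XYXZXYZYXXZ"
print(colors_at_distance(S, F, "ACA", 3))  # ['Y', 'Y']
print(colors_at_distance(S, F, "A", 3))    # ['Z', 'Y', 'Y', 'Z']
```

ACA only ever sees Y at distance 3, while A sees both Z and Y, so A cannot be (Y,3)-unique.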

  6. Pattern Discovery

     Colors    X  Y  X  Z  X  Y  Z  Y  X  X  Z
     String    A  C  A  C  A  C  B  A  C  A  B
     Position  1  2  3  4  5  6  7  8  9 10 11

     We say that ACA is (Y,3)-unique.
     Problem: Given a colored string $S$ and a color Y, report all pairs (T,d) such that T is a (Y,d)-unique substring of $S$.
     Note: Although this problem is simpler than the assertion mining problem, its solution contains all the information needed, possibly after filtering, to recover the desired set of minimal assertions in a second stage.

  7. Discovering all $(y, d)$-unique substrings

     f         X  Y  X  Z  X  Y  Z  Y  X  X  Z
     S         A  C  A  C  A  C  B  A  C  A  B
     Position  1  2  3  4  5  6  7  8  9 10 11

     • We need to check all occurrences of a substring of $S$.
     • To keep the space contained, we use a dedicated string data structure, namely the suffix tree.
     • Since the delay is measured from the end of the substring, it is convenient to think in terms of prefixes, i.e. suffixes of the reverse string.

     f^rev     Z  X  X  Y  Z  Y  X  Z  X  Y  X
     S^rev     B  A  C  A  B  C  A  C  A  C  A  $
     Position  1  2  3  4  5  6  7  8  9 10 11 12

  8. Suffix tree

     S         B  A  C  A  B  C  A  C  A  C  A  $
     Position  1  2  3  4  5  6  7  8  9 10 11 12
     P         A  C

     [Figure: suffix tree of S with leaf numbers 1 to 12. Labels: locus of AC, node $u$, parent($u$), child($u$, $), suffix link, implicit suffix link, leaf number.]

  9. Discovering all $(y, d)$-unique substrings

     f         X  Y  X  Z  X  Y  Z  Y  X  X  Z
     S         A  C  A  C  A  C  B  A  C  A  B
     Position  1  2  3  4  5  6  7  8  9 10 11

     f^rev     Z  X  X  Y  Z  Y  X  Z  X  Y  X
     S^rev     B  A  C  A  B  C  A  C  A  C  A  $
     Position  1  2  3  4  5  6  7  8  9 10 11 12

     Step 1. Build the suffix tree of $S^{\mathrm{rev}}$.
     Step 2. Color a leaf with leaf number $ln$ if either $ln \le d$ or $f^{\mathrm{rev}}(ln - d) = y$.
     Step 3. Color an internal node if all its children are colored.
     Step 4. If a node $u$ is colored, output all strings represented along the incoming edge of $u$.

     Example with $d = 3$ and $y = $ Y. Output: …, CA, ACA, …
     Runs in $O(n^{[\ldots]})$ time.
     [Figure: suffix tree of S^rev with leaf numbers 1 to 12, node $u$ marked.]
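
The four steps can be tried out on the slide's example with the sketch below. It deliberately swaps the suffix tree for a plain suffix trie built from nested dictionaries (one node per substring, quadratic space), so it says nothing about the running time of the authors' algorithm. The function name `unique_substrings` is hypothetical, and reading the leaf rule as a lookup in the reversed coloring is an interpretation made for this sketch.

```python
def unique_substrings(s, colors, y, d):
    """Return the set of (y, d)-unique substrings of s (sketch for small inputs)."""
    n = len(s)
    s_rev = s[::-1] + "$"      # reverse string with terminator, positions 1..n+1
    f_rev = colors[::-1]       # reversed coloring

    # Step 1 (simplified): suffix trie of s_rev as nested dicts.
    # The node reached after "$" is a leaf and stores its leaf number under None.
    root = {}
    for ln in range(1, n + 2):
        node = root
        for ch in s_rev[ln - 1:]:
            node = node.setdefault(ch, {})
        node[None] = ln

    found = set()

    def visit(node, path):
        if None in node:
            # Step 2: leaf rule, ln <= d or f_rev(ln - d) == y.
            ln = node[None]
            return ln <= d or (ln - d <= n and f_rev[ln - d - 1] == y)
        # Step 3: an internal node is colored if all of its children are colored.
        colored = True
        for ch, child in node.items():
            if not visit(child, path + ch):
                colored = False
        # Step 4: report the string of a colored node, read back in s's direction.
        if colored and path:
            found.add(path[::-1])
        return colored

    visit(root, "")
    return found


result = unique_substrings("ACACACBACAB", "XYXZXYZYXXZ", "Y", 3)
print("CA" in result, "ACA" in result, "A" in result)  # True True False
```

In a trie every substring has its own node, so step 4 reduces to reporting the node's own string; in a suffix tree the same information is spread along the incoming edge.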

  10. Minimal $(y, d)$-unique substrings

      Colors    X  Y  X  Z  X  Y  Z  Y  X  X  Z
      String    A  C  A  C  A  C  B  A  C  A  B
      Position  1  2  3  4  5  6  7  8  9 10 11

      We say that CA is minimal (Y,3)-unique, because
      • A is not (Y,3)-unique (left-minimality), and
      • C is not (Y,4)-unique (right-minimality).
      Problem: Given a colored string $S$ and a color Y, report all pairs (T,d) such that T is a minimal (Y,d)-unique substring of $S$.
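
The two minimality conditions can be checked by brute force, following the CA example: drop the first character and keep the delay (left-minimality), or drop the last character and increase the delay by one (right-minimality). The function names are hypothetical, and the treatment of the empty string and of occurrences too close to the end of the string are assumptions of this sketch.

```python
def end_positions(s, t):
    """1-based end positions of all occurrences of t in s (overlaps included)."""
    return [i + len(t) for i in range(len(s) - len(t) + 1) if s.startswith(t, i)]


def is_unique(s, colors, t, y, d):
    # Assumptions of this sketch: the empty string and strings that never occur
    # are not unique; occurrences with end + d beyond the string impose no constraint.
    if not t:
        return False
    ends = end_positions(s, t)
    return bool(ends) and all(colors[e + d - 1] == y for e in ends if e + d <= len(s))


def is_minimal_unique(s, colors, t, y, d):
    return (
        is_unique(s, colors, t, y, d)
        and not is_unique(s, colors, t[1:], y, d)        # left-minimality
        and not is_unique(s, colors, t[:-1], y, d + 1)   # right-minimality
    )


S = "ACACACBACAB"
F = "XYXZXYZYXXZ"
print(is_minimal_unique(S, F, "CA", "Y", 3))   # True
print(is_minimal_unique(S, F, "ACA", "Y", 3))  # False, because CA is already (Y,3)-unique
```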

  11. Discovering all minimal $(y, d)$-unique substrings

      f         X  Y  X  Z  X  Y  Z  Y  X  X  Z
      S         A  C  A  C  A  C  B  A  C  A  B
      Position  1  2  3  4  5  6  7  8  9 10 11

      f^rev     Z  X  X  Y  Z  Y  X  Z  X  Y  X
      S^rev     B  A  C  A  B  C  A  C  A  C  A  $
      Position  1  2  3  4  5  6  7  8  9 10 11 12

      We say that CA is minimal (Y,3)-unique, because
      1. A is not (Y,3)-unique (left minimality): the parent of AC is not colored.
      2. C is not (Y,4)-unique (right minimality): the suffix link of AC is not colored for $d = 4$.

      Process $d$ from 11 down to 0.
      Example with $d = 3$ and $y = $ Y. Output: …, CA, …
      Runs in $O(n^2)$ time.
      [Figure: suffix tree of S^rev with leaf numbers 1 to 12, node $u$ marked.]

  12. Skipping Algorithm

  13. Skipping algorithm
      Given a node $u$ and an integer $\ell$, $h(u, \ell)$ is the largest delay $d < \ell$ such that the corresponding string can be $(y, d)$-unique.

      [Figure: two suffix-tree panels for $y = $ Y. Panel with $\ell = 4$: $u$ is (Y, 3)-unique. Panel with $\ell = 7$: $u$ can be (Y, 3)-unique.]

      f^rev     Z  X  X  Y  Z  Y  X  Z  X  Y  X
      S^rev     B  A  C  A  B  C  A  C  A  C  A  $
      Position  1  2  3  4  5  6  7  8  9 10 11 12

  14. Skipping algorithm
      • We discover the strings for $d$ from $n$ down to 0 (right minimality).
      • For each node $u$, we keep the value $h(u, d + 1)$ updated.
      • We find a node $u$ such that:
        1. $u$ has the largest value $h(u, d + 1)$;
        2. $u$ has priority over its children (left minimality).
      • We check if $u$ is right minimal, and if so, we report it.
      • We update:
        1. the value of all nodes $v$ in the subtree rooted at $u$ to $h(v, d)$;
        2. the value of all ancestors $v$ of $u$ to $h(v, d)$.
      Data structure: maximum-oriented indexed priority queue.
      Runs in $O(n^2 \log n)$ time.
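
The slide names a maximum-oriented indexed priority queue as the supporting data structure. The sketch below shows one common way to build such a queue (a binary heap plus a position table, so that a key can be changed in place); it is not the skipping algorithm itself, and the class and method names are assumptions rather than the authors' implementation.

```python
class IndexedMaxPQ:
    """Binary max-heap over items 0..capacity-1 whose keys can be changed in place."""

    def __init__(self, capacity):
        self.heap = []                  # heap[i] = item stored at heap index i
        self.pos = [-1] * capacity      # pos[item] = heap index, or -1 if absent
        self.key = [None] * capacity    # key[item] = current priority

    def __len__(self):
        return len(self.heap)

    def _swap(self, a, b):
        self.heap[a], self.heap[b] = self.heap[b], self.heap[a]
        self.pos[self.heap[a]] = a
        self.pos[self.heap[b]] = b

    def _sift_up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.key[self.heap[parent]] >= self.key[self.heap[i]]:
                break
            self._swap(i, parent)
            i = parent

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.key[self.heap[c]] > self.key[self.heap[largest]]:
                    largest = c
            if largest == i:
                return
            self._swap(i, largest)
            i = largest

    def set_key(self, item, key):
        """Insert item, or change its key if it is already in the queue."""
        self.key[item] = key
        if self.pos[item] == -1:
            self.heap.append(item)
            self.pos[item] = len(self.heap) - 1
        self._sift_up(self.pos[item])
        self._sift_down(self.pos[item])

    def pop_max(self):
        """Remove and return (item, key) for the item with the largest key."""
        top = self.heap[0]
        self._swap(0, len(self.heap) - 1)
        self.heap.pop()
        self.pos[top] = -1
        if self.heap:
            self._sift_down(0)
        return top, self.key[top]


pq = IndexedMaxPQ(4)
pq.set_key(0, 3)
pq.set_key(1, 7)
pq.set_key(2, 5)
pq.set_key(1, 2)            # lower the key of item 1 in place
print(pq.pop_max())         # (2, 5)
```

In the setting of the slide, the items would be suffix tree nodes and the keys their current $h$ values. One way to give a node priority over its children would be to use tuples such as (value, -depth) as keys, but that tie-breaking scheme is only a guess.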

  15. Output restrictions

  16. Output restrictions
      We restrict the output to $(y, d)$-unique substrings with at least two occurrences followed by $y$.

      f         X  Y  X  Z  X  Y  Z  Y  X  X  Z
      S         A  C  A  C  A  C  B  A  C  A  B
      Position  1  2  3  4  5  6  7  8  9 10 11

      [Figure: the same colored string shown twice. With $d = 3$: considered. With $d = 7$: not considered.]

      Including this consideration as part of the problem, we can modify the computation of $h(u, d)$ when all children of $u$ are leaves.
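
The restriction can be pictured as a filter that counts how many occurrences of a pattern are actually followed by the color at the given delay. The sketch below is a brute-force illustration with hypothetical names (`witness_count`, `passes_restriction`); it only counts such occurrences and does not itself check uniqueness.

```python
def witness_count(s, colors, t, y, d):
    """Number of occurrences of t in s that are followed by color y at delay d."""
    count = 0
    for start in range(len(s) - len(t) + 1):
        target = start + len(t) + d   # 1-based position d places after the end of t
        if s.startswith(t, start) and target <= len(s) and colors[target - 1] == y:
            count += 1
    return count


def passes_restriction(s, colors, t, y, d, min_witnesses=2):
    """True if at least min_witnesses occurrences of t are followed by y at delay d."""
    return witness_count(s, colors, t, y, d) >= min_witnesses


S = "ACACACBACAB"
F = "XYXZXYZYXXZ"
print(witness_count(S, F, "ACA", "Y", 3))      # 2
print(passes_restriction(S, F, "B", "Y", 1))   # False: only one occurrence of B is followed by Y
```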

  17. Experimental results
