
Methods of Analysis of Textual Data (MATD), Jiří Dvorský, October 11, 2019



  1. Methods of Analysis of Textual Data (MATD)
     Jiří Dvorský
     October 11, 2019
     Department of Computer Science, VŠB – TU Ostrava

  2. Lectures Outline
     1. Pattern Matching
        • Exact pattern matching
        • Searching for finite set of patterns
        • Searching for (Regular) Infinite Set of Patterns in Text
        • Approximate pattern matching

  3. Pattern Matching
     Jiří Dvorský, Department of Computer Science, VŠB – TU Ostrava

  4. Pattern Matching Exact pattern matching

  5. Searching for (Regular) Infinite Set of Patterns in Text
     1. How do we describe an infinite set of patterns, i.e. strings? Regular expressions.
     2. What shall we use to perform the matching? Finite automata.

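  6. Regular Expressions and Languages
     Regular expression $R$ and the value of the expression $h(R)$:
     Atomic expressions
     • $\emptyset$ has value $\emptyset$
     • $\varepsilon$ has value $\{\varepsilon\}$
     • $a$, $a \in \Sigma$, has value $\{a\}$
     Operations
     • $U \cdot V$ has value $\{uv \mid u \in h(U) \wedge v \in h(V)\}$
     • $U + V$ has value $h(U) \cup h(V)$
     • $V^k = \underbrace{V \cdot V \cdot \ldots \cdot V}_{k\ \text{times}}$
     • $V^+ = V^1 + V^2 + V^3 + \ldots$
     • $V^* = V^0 + V^1 + V^2 + \ldots$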

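  7. Regular Expression Features
     • $U + (V + W) = (U + V) + W$
     • $U \cdot (V \cdot W) = (U \cdot V) \cdot W$
     • $U + V = V + U$
     • $(U + V) \cdot W = (U \cdot W) + (V \cdot W)$
     • $U \cdot (V + W) = (U \cdot V) + (U \cdot W)$
     • $U + U = U$
     • $\varepsilon \cdot U = U$
     • $\emptyset \cdot U = \emptyset$
     • $U + \emptyset = U$
     • $U^* = \varepsilon + U^+$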

  8. Deterministic Finite Automaton
     Definition
     A Deterministic Finite Automaton (DFA) is a quintuple $A = (Q, \Sigma, q_0, \delta, F)$, where
     • $Q$ is a finite set of states
     • $\Sigma$ is an alphabet
     • $q_0 \in Q$ is an initial state
     • $\delta : Q \times \Sigma \to Q$ is a transition function
     • $F \subseteq Q$ is a set of final states

  9. Deterministic Finite Automaton (cont.)
     Configuration of a finite automaton: $(q, w) \in Q \times \Sigma^*$
     Transition of a finite automaton is a relation $\mapsto\ \subseteq (Q \times \Sigma^*) \times (Q \times \Sigma^*)$ such that
     $(q, aw) \mapsto (q', w) \iff \delta(q, a) = q'$
     The automaton accepts a word $w$ if
     $(q_0, w) \mapsto^* (q, \varepsilon),\ q \in F$
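The acceptance condition above can be sketched in a few lines of Python. This is only an illustration: the dictionary encoding of the transition function, the function name `dfa_accepts`, and the example automaton (binary words ending in 01) are assumptions made here, not part of the slides.

```python
def dfa_accepts(delta, q0, finals, word):
    """Follow one transition per symbol and test whether the last state is final."""
    q = q0
    for a in word:
        if (q, a) not in delta:
            # Assumption: a missing entry means the word is rejected.
            return False
        q = delta[(q, a)]
    return q in finals


# Hypothetical example automaton: accepts binary words that end in "01".
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q0",
}
FINALS = {"q2"}

print(dfa_accepts(DELTA, "q0", FINALS, "1101"))
print(dfa_accepts(DELTA, "q0", FINALS, "010"))
```

Each loop iteration corresponds to one step of the transition relation: the first symbol of the remaining word is consumed and the state is updated.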

  10. Nondeterministic Finite Automaton
     Definition
     A Nondeterministic Finite Automaton (NFA) is a quintuple $A = (Q, \Sigma, q_0, \delta, F)$, where
     • $Q$ is a finite set of states
     • $\Sigma$ is an alphabet
     • $q_0 \in Q$ is an initial state
     • $\delta : Q \times \Sigma \to P(Q)$ is a transition function
     • $F \subseteq Q$ is a set of final states
     • Alternatively, an NFA can be defined as $A = (Q, \Sigma, S, \delta, F)$, where $S \subseteq Q$ is a set of initial states.
     • For each NFA, there is a DFA that recognizes the same formal language.

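  11. Nondeterministic Finite Automaton – example
     Set of patterns $P = \{he, her, she\}$
     [Figure: two NFAs for $P$. The first has states $q_1$ to $q_7$ with a single initial state $q_1$ carrying a $\Sigma$ self-loop and edges labelled h, e, r and s, h, e. The second has states $q_1$ to $q_{11}$ with three initial states $q_1$, $q_4$, $q_8$, each carrying a $\Sigma$ self-loop, one chain of edges per pattern.]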
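An NFA can be run directly by tracking the set of states that are active after each symbol. The sketch below does this for the pattern set {he, her, she}. The state numbering (1 -h-> 2 -e-> 3 -r-> 4 and 1 -s-> 5 -h-> 6 -e-> 7, with a self-loop on state 1 for every symbol) is an assumption chosen to agree with the state sets listed on the conversion slides; the names `EDGES`, `FINALS` and `find_matches` are illustrative.

```python
# Assumed transitions of the single-initial-state NFA (self-loop on 1 handled below).
EDGES = {
    (1, "h"): {2}, (2, "e"): {3}, (3, "r"): {4},
    (1, "s"): {5}, (5, "h"): {6}, (6, "e"): {7},
}
# Final states and the pattern each one reports.
FINALS = {3: "he", 4: "her", 7: "she"}


def find_matches(text):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    active = {1}
    matches = []
    for i, ch in enumerate(text):
        nxt = {1}  # the self-loop on state 1 keeps it active for any symbol
        for q in active:
            nxt |= EDGES.get((q, ch), set())
        active = nxt
        for q in sorted(active):
            if q in FINALS:
                pattern = FINALS[q]
                matches.append((i - len(pattern) + 1, pattern))
    return matches


print(find_matches("ushers"))
```

Keeping a set of active states is the same idea the powerset construction makes explicit: each distinct set that can arise becomes one DFA state.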

  12. NFA ⟶ DFA Conversion
     The DFA can be constructed using the powerset construction.
     NFA $A = (Q, \Sigma, S, \delta, F)$ ⟶ DFA $A' = (Q', \Sigma', q'_0, \delta', F')$
     • $Q' \subseteq P(Q)$
     • $\Sigma' = \Sigma$
     • $q'_0 = S$
     • $\delta'(q', x) = \bigcup \delta(q, x)$ for all $q \in q'$
     • $F' = \{q' \in Q' \mid q' \cap F \neq \emptyset\}$
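The construction can be sketched as a breadth-first search that only creates reachable subsets. The function name, the dictionary encoding of the NFA, and the extra symbol "?" standing for every letter outside {e, h, r, s} are assumptions made for this illustration; the example NFA is the assumed 7-state automaton for {he, her, she} with a single initial state.

```python
from collections import deque


def powerset_construction(alphabet, delta, starts, finals):
    """Build the reachable part of the DFA; DFA states are frozensets of NFA states."""
    start = frozenset(starts)
    seen = {start}
    dfa_delta = {}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for a in alphabet:
            # Union of the NFA moves of every member of the current subset.
            target = frozenset(t for q in subset for t in delta.get((q, a), ()))
            dfa_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_finals = {s for s in seen if s & finals}
    return seen, dfa_delta, start, dfa_finals


ALPHABET = "ehrs?"  # "?" is a stand-in for any other symbol (assumption)
NFA_DELTA = {(1, a): {1} for a in ALPHABET}  # self-loop on the initial state
NFA_DELTA[(1, "h")] = {1, 2}
NFA_DELTA[(1, "s")] = {1, 5}
NFA_DELTA.update({(2, "e"): {3}, (3, "r"): {4}, (5, "h"): {6}, (6, "e"): {7}})

states, dfa_delta, start, dfa_finals = powerset_construction(
    ALPHABET, NFA_DELTA, {1}, {3, 4, 7}
)
print(len(states))
```

Exploring only reachable subsets keeps the result small in practice, although the number of subsets can be exponential in the number of NFA states in the worst case.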

  13. NFA ⟶ DFA Conversion I
     [Figure: the NFA with three initial states $q_1$, $q_4$, $q_8$ and the resulting DFA with states $q'_1$ to $q'_7$.]
     Label   State                e                    h                    r             s             other
     $q'_1$  {1, 4, 8}            {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
     $q'_2$  {1, 2, 4, 5, 8}      {1, 3, 4, 6, 8}      {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
     $q'_3$  {1, 4, 8, 9}         {1, 4, 8}            {1, 2, 4, 5, 8, 10}  {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
     $q'_4$  {1, 3, 4, 6, 8}      {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 7, 8}  {1, 4, 8, 9}  {1, 4, 8}
     $q'_5$  {1, 2, 4, 5, 8, 10}  {1, 3, 4, 6, 8, 11}  {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
     $q'_6$  {1, 4, 7, 8}         {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 8}     {1, 4, 8, 9}  {1, 4, 8}
     $q'_7$  {1, 3, 4, 6, 8, 11}  {1, 4, 8}            {1, 2, 4, 5, 8}      {1, 4, 7, 8}  {1, 4, 8, 9}  {1, 4, 8}
     Only reachable states are shown; transitions to state $q'_1$ are not shown in the diagram.

  14. NFA ⟶ DFA Conversion II
     [Figure: the NFA with a single initial state $q_1$ and the resulting DFA with states $q'_1$ to $q'_7$.]
     Label   State      e          h          r       s       other
     $q'_1$  {1}        {1}        {1, 2}     {1}     {1, 5}  {1}
     $q'_2$  {1, 2}     {1, 3}     {1, 2}     {1}     {1, 5}  {1}
     $q'_3$  {1, 5}     {1}        {1, 2, 6}  {1}     {1, 5}  {1}
     $q'_4$  {1, 3}     {1}        {1, 2}     {1, 4}  {1, 5}  {1}
     $q'_5$  {1, 2, 6}  {1, 3, 7}  {1, 2}     {1}     {1, 5}  {1}
     $q'_6$  {1, 4}     {1}        {1, 2}     {1}     {1, 5}  {1}
     $q'_7$  {1, 3, 7}  {1}        {1, 2}     {1, 4}  {1, 5}  {1}

  15. Derivation of Regular Expression
     For a given regular expression $R$, the derivation is defined as
     $h\left(\frac{dR}{dx}\right) = \{y \mid xy \in h(R)\}$
     Example
     For $R = a + shell + stop + plot$ and its value $h(R) = \{a, shell, stop, plot\}$ the derivations are
     • $h\left(\frac{dR}{da}\right) = \{\varepsilon\}$
     • $h\left(\frac{dR}{ds}\right) = \{hell, top\}$
     • $h\left(\frac{dR}{dt}\right) = \emptyset$
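For a finite language the definition can be checked directly on sets of strings. The following sketch (the function name `derive_language` is an assumption) keeps every word that starts with the given prefix and strips that prefix; the empty string plays the role of the empty word.

```python
def derive_language(language, x):
    """Return {y | x + y is in language} for a finite set of strings."""
    return {w[len(x):] for w in language if w.startswith(x)}


LANGUAGE = {"a", "shell", "stop", "plot"}

print(derive_language(LANGUAGE, "a"))
print(sorted(derive_language(LANGUAGE, "s")))
print(derive_language(LANGUAGE, "t"))
```

This only works when the language can be enumerated; for infinite regular languages the derivation has to be computed on the expression itself.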

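  16. Derivation of Regular Expression – properties
     • $\frac{da}{da} = \varepsilon,\ \forall a \in \Sigma$
     • $\frac{db}{da} = \emptyset,\ \forall b \neq a$
     • $\frac{d\varepsilon}{da} = \emptyset,\ \forall a \in \Sigma$
     • $\frac{d\emptyset}{da} = \emptyset,\ \forall a \in \Sigma$
     • $\frac{d(U + V)}{da} = \frac{dU}{da} + \frac{dV}{da}$
     • $\frac{d(U \cdot V)}{da} = \frac{dU}{da} \cdot V$, if $\varepsilon \notin h(U)$
     • $\frac{d(U \cdot V)}{da} = \frac{dU}{da} \cdot V + \frac{dV}{da}$, if $\varepsilon \in h(U)$
     • $\frac{dV^*}{da} = \frac{dV}{da} \cdot V^*$
     • $\frac{dV}{dx} = \frac{d}{da_n}\left(\frac{d}{da_{n-1}}\left(\cdots \frac{d}{da_2}\left(\frac{dV}{da_1}\right)\right)\right)$, for $x = a_1 a_2 \ldots a_n$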
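These rules translate almost one to one into a recursive function. The sketch below is a minimal illustration, not the lecture's implementation: the tuple encoding of expressions and the names `nullable`, `derive` and `matches` are assumptions. A word is matched by deriving with respect to each of its symbols in turn and then testing whether the empty word belongs to the result. (This operation is widely known as the Brzozowski derivative.)

```python
EMPTY = ("empty",)  # the expression for the empty set
EPS = ("eps",)      # the expression for the empty word


def sym(a):
    return ("sym", a)


def alt(u, v):
    return ("alt", u, v)


def cat(u, v):
    return ("cat", u, v)


def star(v):
    return ("star", v)


def nullable(r):
    """True if the empty word belongs to the value of r."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])  # concatenation


def derive(r, a):
    """Derivation of r with respect to the single symbol a."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        return alt(derive(r[1], a), derive(r[2], a))
    if tag == "star":
        return cat(derive(r[1], a), r)
    u, v = r[1], r[2]  # concatenation: two cases depending on nullable(u)
    left = cat(derive(u, a), v)
    return alt(left, derive(v, a)) if nullable(u) else left


def matches(r, word):
    for a in word:
        r = derive(r, a)
    return nullable(r)


# (0 + 1)* . 01
R = cat(star(alt(sym("0"), sym("1"))), cat(sym("0"), sym("1")))
print(matches(R, "1101"), matches(R, "10"))
```

No simplification is performed here, so the intermediate expressions grow with the length of the word; that is acceptable for matching but not for building an automaton.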

  17. Construction of DFA Derivations of RE
     • Derivation of regular expressions allows us to build a DFA for any regular expression directly and algorithmically.
     • Each state of the DFA defines a set of words that move the DFA from this state to any of the final states. So every state can be associated with a regular expression defining this set of words.
     • Let $V$ be a given regular expression over alphabet $\Sigma$. Then
       $q_0 = V$
       $\delta(q, x) = \frac{dq}{dx}$
       $F = \{q \in Q \mid \varepsilon \in h(q)\}$

  18. Construction of DFA Derivations of RE – example
     Let's have $V = (0 + 1)^* \cdot 01$ over alphabet $\Sigma = \{0, 1\}$. Then $q_0 = (0 + 1)^* \cdot 01$.
     Example of derivations:
     $\frac{d((0 + 1)^* \cdot 01)}{d0} = \frac{d((0 + 1)^*)}{d0} \cdot 01 + \frac{d\,01}{d0}$
     $= \frac{d(0 + 1)}{d0} \cdot (0 + 1)^* \cdot 01 + 1$
     $= \left(\frac{d0}{d0} + \frac{d1}{d0}\right) \cdot (0 + 1)^* \cdot 01 + 1$
     $= (\varepsilon + \emptyset) \cdot (0 + 1)^* \cdot 01 + 1$
     $= (0 + 1)^* \cdot 01 + 1$

  19. Construction of DFA Derivations of RE – example (cont.)
     $\frac{d((0 + 1)^* \cdot 01)}{d1} = \frac{d((0 + 1)^*)}{d1} \cdot 01 + \frac{d\,01}{d1}$
     $= \frac{d(0 + 1)}{d1} \cdot (0 + 1)^* \cdot 01 + \emptyset$
     $= \left(\frac{d0}{d1} + \frac{d1}{d1}\right) \cdot (0 + 1)^* \cdot 01$
     $= (\emptyset + \varepsilon) \cdot (0 + 1)^* \cdot 01$
     $= (0 + 1)^* \cdot 01$

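  20. Construction of DFA Derivations of RE – example (cont.)
     State  Regular Expression                  0                         1
     $q_0$  $(0 + 1)^* \cdot 01$                $(0 + 1)^* \cdot 01 + 1$  $(0 + 1)^* \cdot 01$
     $q_1$  $(0 + 1)^* \cdot 01 + 1$            $(0 + 1)^* \cdot 01 + 1$  $(0 + 1)^* \cdot 01 + \varepsilon$
     $q_2$  $(0 + 1)^* \cdot 01 + \varepsilon$  $(0 + 1)^* \cdot 01 + 1$  $(0 + 1)^* \cdot 01$
     [Figure: the resulting DFA with states $q_0$ (initial), $q_1$ and $q_2$ and transitions labelled 0 and 1.]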
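The table can be reproduced mechanically by treating each distinct derived expression as a state. The sketch below is illustrative only: the tuple encoding, the simplifying constructors `alt` and `cat`, and the function `build_dfa` are assumptions. The simplifications (dropping the empty set, removing the empty word from concatenations, flattening and deduplicating sums) are enough to make this example terminate with three states; no claim is made here about other expressions.

```python
EMPTY = ("empty",)
EPS = ("eps",)


def sym(a):
    return ("sym", a)


def alt(*parts):
    """Sum with nested sums flattened, empty set and duplicates removed."""
    items = set()
    for p in parts:
        if p[0] == "alt":
            items |= p[1]
        elif p != EMPTY:
            items.add(p)
    if not items:
        return EMPTY
    if len(items) == 1:
        return next(iter(items))
    return ("alt", frozenset(items))


def cat(u, v):
    if u == EMPTY or v == EMPTY:
        return EMPTY
    if u == EPS:
        return v
    if v == EPS:
        return u
    return ("cat", u, v)


def star(v):
    return ("star", v)


def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "alt":
        return any(nullable(p) for p in r[1])
    return nullable(r[1]) and nullable(r[2])


def derive(r, a):
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        return alt(*(derive(p, a) for p in r[1]))
    if tag == "star":
        return cat(derive(r[1], a), r)
    u, v = r[1], r[2]
    left = cat(derive(u, a), v)
    return alt(left, derive(v, a)) if nullable(u) else left


def build_dfa(r0, alphabet):
    """States are numbered in order of discovery; state 0 is r0 itself."""
    states, index, delta = [r0], {r0: 0}, {}
    i = 0
    while i < len(states):
        for a in alphabet:
            t = derive(states[i], a)
            if t not in index:
                index[t] = len(states)
                states.append(t)
            delta[(i, a)] = index[t]
        i += 1
    finals = {index[q] for q in states if nullable(q)}
    return states, delta, finals


R = cat(star(alt(sym("0"), sym("1"))), cat(sym("0"), sym("1")))
states, delta, finals = build_dfa(R, "01")
print(len(states), delta, finals)
```

State numbers 0, 1 and 2 correspond to the three expressions in the table, and the only final state is the one whose expression contains the empty word.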

  21. Pattern Matching Approximate pattern matching

  22. Approximate pattern matching
     • A string metric (string distance function) is a metric that measures the distance between two text strings for approximate string matching.
     • A string metric can be considered an "inverse similarity": it expresses how dissimilar two strings are.
     • Yes, string dissimilarity (distance) can be measured.
     • There are two classic metrics
       1. Hamming distance
       2. Levenshtein distance
     • Both distances are metrics from the mathematical point of view: they satisfy non-negativity, identity, symmetry, and the triangle inequality.

  23. Hamming distance
     Definition
     Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other.
     Example
     Hamming distance of "karolin" and "kathrin" is 3.
     k a r o l i n
     k a t h r i n
     0 0 1 1 1 0 0
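The definition amounts to counting mismatching positions. A minimal sketch follows; the function name and the choice to raise `ValueError` for strings of unequal length are assumptions of this illustration.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("karolin", "kathrin"))
```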

  24. Levenshtein distance
     Definition
     Levenshtein distance (1965) between two strings is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other.

  25. Levenshtein distance (cont.)
     Example
     Levenshtein distance between "kitten" and "sitting" is 3:
     1. kitten → sitten (substitution of "s" for "k")
     2. sitten → sittin (substitution of "i" for "e")
     3. sittin → sitting (insertion of "g" at the end).
     There is no way to do it with fewer than three edits.
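The slides define the distance but do not give an algorithm. One common way to compute it is the standard dynamic-programming scheme (often attributed to Wagner and Fischer), sketched below with two rows of the table kept in memory; the function name is an assumption.

```python
def levenshtein_distance(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a
                curr[j - 1] + 1,         # insert b
                prev[j - 1] + (a != b),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein_distance("kitten", "sitting"))
```

The running time is proportional to the product of the two string lengths, and only two rows are stored, so memory use is linear in the length of the second string.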
