parsing with regular expressions and extensions to kleene algebra

Niels Bjørn Bugge Grathwohl, DIKU, November 4th 2015. PhD Thesis defense, Kleene Meets Church.


  1. parsing with regular expressions and extensions to kleene algebra
     Niels Bjørn Bugge Grathwohl
     DIKU, November 4th 2015
     PhD Thesis defense
     Kleene Meets Church

  2. string rewriting
     Input:
       1,John,john@gmail.com,male,123456,DK
       2,Benny,benny@hotmail.com,male,98234,UK
     → Output:
       John 123456
       Benny 98234
     Want:
     • Streaming, i.e., output while reading input.
     • Fast: several Gbps throughput per core.
     • Linear running time in the size of the input.
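
As a rough illustration of the rewriting task itself (not of the approach presented in the talk), the following Python sketch handles the input one row at a time and yields each output row as soon as its input row has been read. The function name `rewrite` and the field layout are assumptions read off the two example rows.

```python
from typing import Iterable, Iterator

def rewrite(lines: Iterable[str]) -> Iterator[str]:
    """Yield 'name number' for each CSV row, one row at a time."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        # Assumed layout: id, name, email, gender, number, country.
        yield f"{fields[1]} {fields[4]}"

rows = [
    "1,John,john@gmail.com,male,123456,DK",
    "2,Benny,benny@hotmail.com,male,98234,UK",
]
for out in rewrite(rows):
    print(out)
```

Because `rewrite` is a generator, it never holds more than one row in memory, which is the "output while reading input" property in its simplest form.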

  3. regular expressions
     Program is essentially a regular expression with outputs.
     Regular expression syntax:
       $E ::= 0 \mid 1 \mid a \mid E_1 + E_2 \mid E_1 E_2 \mid E^\star \quad (a \in \Sigma)$
     Examples ($\Sigma = \{a, b\}$):
       $a$
       $(ab)^\star + (a+b)^\star$
       $(a+b)^\star$
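
One possible way to represent this syntax in code is sketched below in Python, using tagged tuples with one tag per production. The tag names (`"sym"`, `"alt"`, `"cat"`, `"star"`) and the helper names are choices made for this sketch, not notation from the talk.

```python
# Tagged tuples stand in for the six productions of the grammar.
ZERO = ("0",)   # matches nothing
ONE = ("1",)    # matches the empty string

def sym(a): return ("sym", a)
def alt(e1, e2): return ("alt", e1, e2)
def cat(e1, e2): return ("cat", e1, e2)
def star(e): return ("star", e)

def show(e):
    """Render an expression in the notation used on the slide."""
    tag = e[0]
    if tag in ("0", "1"):
        return tag
    if tag == "sym":
        return e[1]
    if tag == "alt":
        return f"({show(e[1])} + {show(e[2])})"
    if tag == "cat":
        return show(e[1]) + show(e[2])
    if tag == "star":
        inner = show(e[1])
        # Only a concatenation needs extra parentheses under the star.
        return (f"({inner})" if e[1][0] == "cat" else inner) + "⋆"
    raise ValueError(f"unknown tag {tag!r}")

a, b = sym("a"), sym("b")
E = alt(star(cat(a, b)), star(alt(a, b)))
print(show(E))
```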

  4. what is regular expression “matching”?
     Expression: $(ab)^\star + (a+b)^\star$
     Input: $s = ababab$
     • acceptance testing: is the input string a member of the language? Answer: “Yes!”
     • subgroup matching: substrings in the input for subterms in the expression. Answer: $[0, 6]$, $[4, 2]$
     • parsing: what is the parse tree of the input? Answer: [Figure: parse tree with leaves ab, ab, ab, ()]

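  5. acceptance testing
     Language interpretation: input $s$ matches $E$ iff $s \in \mathcal{L}[\![E]\!]$.
     $$\begin{aligned}
     \mathcal{L}[\![0]\!] &= \emptyset \\
     \mathcal{L}[\![1]\!] &= \{\epsilon\} \\
     \mathcal{L}[\![a]\!] &= \{a\} \\
     \mathcal{L}[\![E + F]\!] &= \{ s \mid s \in \mathcal{L}[\![E]\!] \} \cup \{ t \mid t \in \mathcal{L}[\![F]\!] \} \\
     \mathcal{L}[\![EF]\!] &= \{ s \cdot t \mid s \in \mathcal{L}[\![E]\!],\ t \in \mathcal{L}[\![F]\!] \} \\
     \mathcal{L}[\![E^\star]\!] &= \mathcal{L}[\![E]\!]^\star
     \end{aligned}$$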
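
The language interpretation can be turned into a direct, deliberately naive membership test. The Python sketch below follows the six equations case by case; it is meant to mirror the definition, not to be efficient. The tagged-tuple encoding of expressions and the names `ends` and `accepts` are assumptions of this sketch.

```python
def ends(e, s, i):
    """Return every j such that s[i:j] is in the language of e."""
    tag = e[0]
    if tag == "0":
        return set()
    if tag == "1":
        return {i}
    if tag == "sym":
        return {i + 1} if s[i:i + 1] == e[1] else set()
    if tag == "alt":                      # union of both languages
        return ends(e[1], s, i) | ends(e[2], s, i)
    if tag == "cat":                      # s . t with s in L[E], t in L[F]
        return {k for j in ends(e[1], s, i) for k in ends(e[2], s, j)}
    if tag == "star":                     # iterate until no new end positions appear
        reached, frontier = {i}, {i}
        while frontier:
            frontier = {k for j in frontier for k in ends(e[1], s, j)} - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown tag {tag!r}")

def accepts(e, s):
    """Acceptance testing: is s a member of the language of e?"""
    return len(s) in ends(e, s, 0)

a, b = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("cat", a, b)), ("star", ("alt", a, b)))
print(accepts(E, "ababab"), accepts(E, "abc"))
```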

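  6. acceptance testing
     Example:
     $$\begin{aligned}
     \mathcal{L}[\![(ab)^\star + (a+b)^\star]\!]
       &= \mathcal{L}[\![(ab)^\star]\!] \cup \mathcal{L}[\![(a+b)^\star]\!] \\
       &= \mathcal{L}[\![ab]\!]^\star \cup \mathcal{L}[\![a+b]\!]^\star \\
       &= \{ab\}^\star \cup \{a, b\}^\star \\
       &= \{\epsilon, ab, abab, \ldots\} \cup \{\epsilon, a, b, ab, ba, aba, \ldots\} \\
       &= \{\epsilon, a, b, aa, ab, aaa, aab, \ldots\}
     \end{aligned}$$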

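  7. parsing
     Construct parse tree from input $s$ such that flattening of parse tree is $s$.
     Type interpretation [FC’04; HN’11]:
     $$\begin{aligned}
     \mathcal{T}[\![0]\!] &= \emptyset \\
     \mathcal{T}[\![1]\!] &= \{()\} \\
     \mathcal{T}[\![a]\!] &= \{a\} \\
     \mathcal{T}[\![E + F]\!] &= \{ \mathrm{inl}\ v \mid v \in \mathcal{T}[\![E]\!] \} \cup \{ \mathrm{inr}\ w \mid w \in \mathcal{T}[\![F]\!] \} \\
     \mathcal{T}[\![EF]\!] &= \mathcal{T}[\![E]\!] \times \mathcal{T}[\![F]\!] \\
     \mathcal{T}[\![E^\star]\!] &= \{ [v_1, \ldots, v_n] \mid n \ge 0,\ v_i \in \mathcal{T}[\![E]\!] \}
     \end{aligned}$$
     Values in $\mathcal{T}[\![E]\!]$ are parse trees.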

  8. parsing
     Example: $\mathcal{T}[\![(ab)^\star + (a+b)^\star]\!]$ contains the parse trees
     • inl [(a,b), (a,b), (a,b)]
     • inr [inl a, inr b, inl a, inr b, inl a, inr b]
     which are not in $\mathcal{T}[\![(a+b)^\star]\!]$!
     So $\mathcal{T}[\![(ab)^\star + (a+b)^\star]\!] \neq \mathcal{T}[\![(a+b)^\star]\!]$,
     whereas $\mathcal{L}[\![(ab)^\star + (a+b)^\star]\!] = \mathcal{L}[\![(a+b)^\star]\!]$.
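
To make "flattening" concrete, here is a small Python sketch that flattens the two example parse trees back to the input string. The encoding of parse trees as Python values (`()` for the unit value, a one-character string for a symbol, `("inl", v)` and `("inr", w)` for choices, a 2-tuple for a pair, a list for a star) is an assumption of this sketch.

```python
def flatten(v):
    """Return the string that a parse tree spells out."""
    if isinstance(v, str):            # a single symbol
        return v
    if isinstance(v, list):           # star: concatenate the iterations
        return "".join(flatten(x) for x in v)
    if v == ():                       # unit value of the expression 1
        return ""
    if v[0] in ("inl", "inr"):        # choice: drop the tag
        return flatten(v[1])
    return flatten(v[0]) + flatten(v[1])   # pair from a concatenation

t_left = ("inl", [("a", "b"), ("a", "b"), ("a", "b")])
t_right = ("inr", [("inl", "a"), ("inr", "b")] * 3)
print(flatten(t_left), flatten(t_right))
```

Both trees flatten to the same string even though they are different values, which is the distinction the slide draws between the type interpretation and the language interpretation.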

  9. ambiguity
     One input string can be parsed in multiple ways: $ababab$ under $E = (ab)^\star + (a+b)^\star$ can be parsed both as
       inl [(a,b), (a,b), (a,b)]
     and
       inr [inl a, inr b, inl a, inr b, inl a, inr b]
     Disambiguation policy: the left-most option is always prioritized. “Greedy parsing.”
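
The ambiguity and the greedy policy can be observed with a small backtracking enumerator. The Python sketch below yields all parses of a string in priority order (left alternative first, another star iteration before stopping), so the first complete parse is the greedy one. This is a naive, possibly exponential illustration, not one of the algorithms from the talk; the encodings of expressions and trees and the name `parses` are assumptions, and empty star iterations are skipped so that the search terminates.

```python
def parses(e, s, i=0):
    """Yield (tree, j) for every parse of s[i:j] under e, highest priority first."""
    tag = e[0]
    if tag == "1":
        yield (), i
    elif tag == "sym":
        if s[i:i + 1] == e[1]:
            yield e[1], i + 1
    elif tag == "alt":
        for v, j in parses(e[1], s, i):      # left option first
            yield ("inl", v), j
        for w, j in parses(e[2], s, i):
            yield ("inr", w), j
    elif tag == "cat":
        for v, j in parses(e[1], s, i):
            for w, k in parses(e[2], s, j):
                yield (v, w), k
    elif tag == "star":
        for v, j in parses(e[1], s, i):      # prefer one more iteration
            if j > i:                        # skip empty iterations (assumption)
                for vs, k in parses(e, s, j):
                    yield [v] + vs, k
        yield [], i                          # then try stopping
    # tag "0" yields nothing

a, b = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("cat", a, b)), ("star", ("alt", a, b)))
trees = [t for t, j in parses(E, "ababab") if j == 6]
for t in trees:
    print(t)
```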

  11. bit-coding
      Bit-coded parse trees: only store choices.
      Parse tree as stream of bits; meaningless without expression!
      Example: $E = (ab)^\star + (a+b)^\star$, input $ababab$:
        inl [(a,b), (a,b), (a,b)]                          ↦ 00001
        inr [inl a, inr b, inl a, inr b, inl a, inr b]     ↦ 10001000100011
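
A minimal Python sketch of bit-coding follows: `encode` keeps only the choices (0 for inl or "one more iteration", 1 for inr or "stop"), and `decode` needs the expression to rebuild the tree. The function names and the encodings of expressions and trees are assumptions of this sketch; the bit conventions are read off the two example codes above.

```python
def encode(e, v):
    """Bit-code parse tree v of expression e: record only the choices."""
    tag = e[0]
    if tag in ("1", "sym"):
        return ""                                  # no choice was made
    if tag == "alt":
        side, w = v
        return "0" + encode(e[1], w) if side == "inl" else "1" + encode(e[2], w)
    if tag == "cat":
        return encode(e[1], v[0]) + encode(e[2], v[1])
    if tag == "star":
        return "".join("0" + encode(e[1], w) for w in v) + "1"
    raise ValueError(f"no parse trees for tag {tag!r}")

def decode(e, bits, i=0):
    """Rebuild a parse tree from bits; returns (tree, index of next unread bit)."""
    tag = e[0]
    if tag == "1":
        return (), i
    if tag == "sym":
        return e[1], i
    if tag == "alt":
        branch, name = (e[1], "inl") if bits[i] == "0" else (e[2], "inr")
        v, j = decode(branch, bits, i + 1)
        return (name, v), j
    if tag == "cat":
        v, j = decode(e[1], bits, i)
        w, k = decode(e[2], bits, j)
        return (v, w), k
    if tag == "star":
        items = []
        while bits[i] == "0":                      # 0 means one more iteration
            v, i = decode(e[1], bits, i + 1)
            items.append(v)
        return items, i + 1                        # consume the terminating 1
    raise ValueError(f"no parse trees for tag {tag!r}")

a, b = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("cat", a, b)), ("star", ("alt", a, b)))
t_left = ("inl", [("a", "b")] * 3)
t_right = ("inr", [("inl", "a"), ("inr", "b")] * 3)
print(encode(E, t_left), encode(E, t_right))
```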

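  12. finite state transducers
      • Thompson FSTs with input alphabet $\Sigma$, output alphabet $\{0, 1\}$.
      • Construction $N(E, q^s, q^f)$ [Figure: one diagram per case]:
        $E = 0$: start state $q^s$ and final state $q^f$, no transitions.
        $E = 1$: a single state ($q^f = q^s$).
        $E = a$: $q^s \xrightarrow{a/\epsilon} q^f$.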

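  13. finite state transducers
      Construction $N(E, q^s, q^f)$ [Figure: one diagram per case]:
        $E = E_1 E_2$: $N(E_1, q^s, q')$ followed by $N(E_2, q', q^f)$.
        $E = E_1 + E_2$: $q^s \xrightarrow{\epsilon/0} q^s_1$, then $N(E_1, q^s_1, q^f_1)$, then $q^f_1 \xrightarrow{\epsilon/\epsilon} q^f$; and $q^s \xrightarrow{\epsilon/1} q^s_2$, then $N(E_2, q^s_2, q^f_2)$, then $q^f_2 \xrightarrow{\epsilon/\epsilon} q^f$.
        $E = E_0^\star$: $q^s \xrightarrow{\epsilon/\epsilon} q'$; $q' \xrightarrow{\epsilon/0} q^s_0$, then $N(E_0, q^s_0, q^f_0)$, then $q^f_0 \xrightarrow{\epsilon/\epsilon} q'$; and $q' \xrightarrow{\epsilon/1} q^f$.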

  14. parse trees as paths
      Theorem (Brüggemann-Klein 1993, GHNR 2013): 1-to-1 correspondence between
      • parse trees for $E$,
      • paths in Thompson FST for $E$,
      • bit-coded parse trees.
      Constructing the parse tree corresponds to finding a path through the FST.
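
The correspondence can be tried out with a small Python sketch that builds a Thompson-style FST and then searches for the highest-priority accepting path, concatenating the output bits along the way. The builder takes only a start state and returns the final state, which is a simplification of $N(E, q^s, q^f)$; the names `thompson` and `greedy_bits`, the expression encoding, and the depth-first backtracking search are all assumptions of this sketch, and star bodies that match the empty string are not handled.

```python
from itertools import count

def thompson(expr):
    """Return (start, final, trans); trans[q] is a priority-ordered list of
    (input symbol or None for ϵ, output bits, target state)."""
    fresh, trans = count(), {}
    def new():
        q = next(fresh)
        trans[q] = []
        return q
    def build(e, qs):
        tag = e[0]
        if tag == "0":
            return new()                       # final state is unreachable
        if tag == "1":
            return qs                          # final state equals start state
        if tag == "sym":
            qf = new()
            trans[qs].append((e[1], "", qf))
            return qf
        if tag == "cat":
            return build(e[2], build(e[1], qs))
        if tag == "alt":
            q1 = new(); trans[qs].append((None, "0", q1))
            f1 = build(e[1], q1)
            q2 = new(); trans[qs].append((None, "1", q2))
            f2 = build(e[2], q2)
            qf = new()
            trans[f1].append((None, "", qf))
            trans[f2].append((None, "", qf))
            return qf
        if tag == "star":
            loop = new(); trans[qs].append((None, "", loop))
            q0 = new(); trans[loop].append((None, "0", q0))
            trans[build(e[1], q0)].append((None, "", loop))
            qf = new(); trans[loop].append((None, "1", qf))
            return qf
        raise ValueError(f"unknown tag {tag!r}")
    start = new()
    return start, build(expr, start), trans

def greedy_bits(expr, s):
    """Bits along the first accepting path in priority order, or None."""
    start, final, trans = thompson(expr)
    def search(q, i):
        if q == final and i == len(s):
            return ""
        for sym_in, out, target in trans[q]:
            if sym_in is None:
                rest = search(target, i)
            elif s[i:i + 1] == sym_in:
                rest = search(target, i + 1)
            else:
                continue
            if rest is not None:
                return out + rest
        return None
    return search(start, 0)

a, b = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("cat", a, b)), ("star", ("alt", a, b)))
print(greedy_bits(E, "ababab"))
```

The backtracking search is exponential in the worst case; it is only meant to show that a path through the FST and a bit-coded parse tree are the same data.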

  15. optimal streaming
      Optimally streaming parsing: output the longest common prefix of possible parse trees after reading each input symbol.
      Example: $E = (aaa + aa)^\star$
      Possible parse tree prefixes after $aaaa$: $\{01011, 000\ldots\}$
      Possible parse tree prefixes after $aaaaa$: $\{00011, 0000\ldots\}$
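
As a worked version of the example, the Python sketch below computes the longest common prefix of each of the two sets from the slide, using only the known part of the open-ended prefixes (`000` and `0000`). The function name is an assumption of this sketch.

```python
def longest_common_prefix(codes):
    """Longest string that is a prefix of every code in the collection."""
    prefix = []
    for column in zip(*codes):        # walk the codes position by position
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return "".join(prefix)

after_aaaa = ["01011", "000"]
after_aaaaa = ["00011", "0000"]
print(longest_common_prefix(after_aaaa))
print(longest_common_prefix(after_aaaaa))
```

With these sets, only the first bit is certain after $aaaa$, while three bits are certain after $aaaaa$.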

  16. greedy parsing
      ($n$ size of input, $m$ size of expression)

                         Time                      Space    Aux      Answer
      Parse (3-p) [1]    $O(mn)$                   $O(m)$   $O(n)$   greedy parse
      Parse (2-p) [2]    $O(mn)$                   $O(m)$   $O(n)$   greedy parse
      Parse (str.) [3]   $O(mn + 2^{m \log m})$    $O(m)$   $O(n)$   greedy parse

      [1] Frisch, Cardelli (2004)
      [2] Grathwohl, Henglein, Nielsen, Rasmussen (2013)
      [3] Grathwohl, Henglein, Rasmussen (2014)

  18. fst simulation
      Optimally streaming algorithm
      • Preprocessing step of FST: compute coverage of state sets.
      • Maintain a path tree during FST simulation, recording the path taken to each state in the FST.
      • Prune states that are covered by higher-prioritized states.
      • Output on the stem of the path tree is the longest common prefix of any succeeding parse.
      Theorem (GHR’14): the optimally streaming algorithm computes the optimally streaming parsing function in time $O(mn + 2^{m \log m})$.
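
The following Python sketch shows a simplified streaming simulation for $(aaa + aa)^\star$. It keeps one output string per reachable state (a flat stand-in for the path tree), drops a state when a higher-priority path already reached it, and emits the common prefix of the surviving outputs after each symbol. It omits the coverage preprocessing, so it is not optimally streaming and may emit bits later than the algorithm described above. The transition table, its state numbering and the function names are assumptions of this sketch.

```python
import os

# Hand-written Thompson-style FST for (aaa + aa)⋆. Each entry is
# (input symbol or None for ϵ, output bits, target); list order is priority.
TRANS = {
    0: [(None, "", 1)],
    1: [(None, "0", 2), (None, "1", 11)],
    2: [(None, "0", 3), (None, "1", 7)],
    3: [("a", "", 4)], 4: [("a", "", 5)], 5: [("a", "", 6)],
    6: [(None, "", 10)],
    7: [("a", "", 8)], 8: [("a", "", 9)],
    9: [(None, "", 10)],
    10: [(None, "", 1)],
    11: [],
}
START, FINAL = 0, 11

def closure(seeds):
    """Follow ϵ-transitions in priority order; the first path to a state wins."""
    seen, order = set(), []
    def visit(q, out):
        if q in seen:
            return
        seen.add(q)
        order.append((q, out))
        for sym_in, bits, target in TRANS[q]:
            if sym_in is None:
                visit(target, out + bits)
    for q, out in seeds:
        visit(q, out)
    return order

def stream(s):
    """Yield chunks of the greedy bit-code once all surviving paths agree on them."""
    paths, emitted = closure([(START, "")]), ""
    for c in s:
        paths = closure([(t, out + bits) for q, out in paths
                         for sym_in, bits, t in TRANS[q] if sym_in == c])
        # Only states that can read a symbol or accept can still contribute.
        live = [out for q, out in paths
                if q == FINAL or any(x is not None for x, _, _ in TRANS[q])]
        if not live:
            raise ValueError("input rejected")
        common = os.path.commonprefix(live)
        yield common[len(emitted):]
        emitted = common
    accepting = [out for q, out in paths if q == FINAL]
    if not accepting:
        raise ValueError("input rejected")
    yield accepting[0][len(emitted):]

print("".join(stream("aaaaa")))
```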

  19-25. path tree example: $(aaa + aa)^\star$
      [Figure, seven animation steps: the Thompson FST for $(aaa + aa)^\star$ with states 0 to 11 and transitions labelled a/ϵ, ϵ/0, ϵ/1 and ϵ/ϵ, shown together with the path tree as it grows during the simulation.]

  26. kleenex
      Observation: the approach is not limited to Thompson FSTs outputting bit-coded parse trees.
      Kleenex is a surface language for specifying FSTs and their output:
      • grammar with greedy disambiguation;
      • embedded output actions.
      • Essentially optimally streaming behaviour.
      • Linear running time in size of input string.
      • Fast. >1 Gbps common.

  27. kleenex
      “100000000000” → “100,000,000,000”
      • Problem: need to read entire number; no bounded lookahead!
      • But: each newline ends a number, so output.
      • Optimal streaming gives this for free!
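
To see why no bounded lookahead suffices, note that the position of the first separator depends on the total number of digits. The Python sketch below (plain Python, not Kleenex; the function name is an assumption) makes that dependence explicit.

```python
def add_separators(digits: str) -> str:
    """Insert a comma between groups of three digits, counted from the right."""
    # The length of the first group is only known once the whole number is read.
    head = len(digits) % 3 or 3
    groups = [digits[:head]]
    groups += [digits[i:i + 3] for i in range(head, len(digits), 3)]
    return ",".join(groups)

print(add_separators("100000000000"))
```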

  28. determinization
      Path tree algorithm is “NFA simulation with path trees as state sets.”
      Compilation of FSTs? Analogy to NFA-DFA determinization with subset construction?
      Problem: infinite number of path trees!
      Solution: contract unary paths in path trees and store output in registers.

  31-33. determinization
      [Figure, animation steps: path trees from the $(aaa + aa)^\star$ example with unary paths contracted, edge outputs held in registers such as x_0, x_00, x_01 and x_1, and register updates written with :=.]
