parsing with regular expressions and extensions to kleene algebra
Niels Bjørn Bugge Grathwohl
DIKU, November 4th 2015
PhD Thesis defense
Kleene Meets Church
string rewriting
Input:
1,John,john@gmail.com,male,123456,DK
2,Benny,benny@hotmail.com,male,98234,UK
→ Output:
John 123456
Benny 98234
Want:
• Streaming – i.e., output while reading input.
• Fast – several Gbps throughput per core.
• Linear running time in the size of the input.
regular expressions
Program is essentially a regular expression with outputs.
Regular expression syntax:
E ::= 0 | 1 | a | E₁ + E₂ | E₁ E₂ | E⋆   (a ∈ Σ)
Examples (Σ = {a, b}):
• a
• (ab)⋆ + (a + b)⋆
• (a + b)⋆
what is regular expression “matching”?
Expression (ab)⋆ + (a + b)⋆, input s = ababab
• acceptance testing: is input string member of language? Answer: “Yes!”
• subgroup matching: substrings in input for subterms in expression. Answer: [0, 6], [4, 2]
• parsing: what is the parse tree of the input? Answer: the parse tree (a figure on the slide; surviving leaf labels: ab, ab, ab, ())
acceptance testing
Language interpretation: input s matches E iff s ∈ L⟦E⟧.
L⟦0⟧ = ∅
L⟦1⟧ = {ϵ}
L⟦a⟧ = {a}
L⟦E + F⟧ = {s | s ∈ L⟦E⟧} ∪ {t | t ∈ L⟦F⟧}
L⟦EF⟧ = {s · t | s ∈ L⟦E⟧, t ∈ L⟦F⟧}
L⟦E⋆⟧ = L⟦E⟧⋆
acceptance testing
Example:
L⟦(ab)⋆ + (a + b)⋆⟧
= L⟦(ab)⋆⟧ ∪ L⟦(a + b)⋆⟧
= L⟦ab⟧⋆ ∪ L⟦a + b⟧⋆
= {ab}⋆ ∪ {a, b}⋆
= {ϵ, ab, abab, ...} ∪ {ϵ, a, b, ab, ba, aba, ...}
= {ϵ, a, b, aa, ab, aaa, aab, ...}
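The language interpretation can be read directly as a recursive membership test. The following is a minimal sketch under assumptions that are not on the slides: expressions are encoded as Python tuples, and the names `matches`, `A`, `B`, `E` and `ANY` are hypothetical. It takes exponential time in the worst case and is only meant to mirror the equations, not to be an efficient matcher.

```python
from itertools import product

# Hypothetical encoding: ("0",), ("1",), ("sym", a), ("alt", E, F), ("seq", E, F), ("star", E)
def matches(e, s):
    """Return True iff s is in L[[e]], following the language interpretation case by case."""
    tag = e[0]
    if tag == "0":
        return False
    if tag == "1":
        return s == ""
    if tag == "sym":
        return s == e[1]
    if tag == "alt":
        return matches(e[1], s) or matches(e[2], s)
    if tag == "seq":
        # s = u . v with u in L[[E]] and v in L[[F]]
        return any(matches(e[1], s[:i]) and matches(e[2], s[i:])
                   for i in range(len(s) + 1))
    if tag == "star":
        # s is empty, or a non-empty first iteration followed by more iterations
        return s == "" or any(matches(e[1], s[:i]) and matches(e, s[i:])
                              for i in range(1, len(s) + 1))
    raise ValueError(tag)

A, B = ("sym", "a"), ("sym", "b")
ANY = ("star", ("alt", A, B))                  # (a + b)*
E = ("alt", ("star", ("seq", A, B)), ANY)      # (ab)* + (a + b)*

# The two expressions accept the same strings (checked here up to length 4).
words = ["".join(w) for n in range(5) for w in product("ab", repeat=n)]
same_language = all(matches(E, w) == matches(ANY, w) for w in words)
```

Here `same_language` checks the last line of the example on short inputs: every string over {a, b} is accepted by both expressions.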
parsing
Construct parse tree from input s such that flattening of parse tree is s.
Type interpretation [FC’04; HN’11]:
T⟦0⟧ = ∅
T⟦1⟧ = {()}
T⟦a⟧ = {a}
T⟦E + F⟧ = {inl v | v ∈ T⟦E⟧} ∪ {inr w | w ∈ T⟦F⟧}
T⟦EF⟧ = T⟦E⟧ × T⟦F⟧
T⟦E⋆⟧ = {[v₁, ..., vₙ] | n ≥ 0, vᵢ ∈ T⟦E⟧}
Values in T⟦E⟧ are parse trees.
parsing
Example: T⟦(ab)⋆ + (a + b)⋆⟧ contains the parse trees:
• inl [(a, b), (a, b), (a, b)]
• inr [inl a, inr b, inl a, inr b, inl a, inr b]
which are not in T⟦(a + b)⋆⟧!
So T⟦(ab)⋆ + (a + b)⋆⟧ ≠ T⟦(a + b)⋆⟧, whereas L⟦(ab)⋆ + (a + b)⋆⟧ = L⟦(a + b)⋆⟧.
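To make the type interpretation concrete, here is a small sketch that enumerates every parse tree of an expression whose flattening equals a given string. The tuple encoding of expressions and the names `parses` and `flatten` are assumptions made for illustration. Star iterations are required to consume at least one symbol so that the enumeration terminates, which is a simplification.

```python
# Hypothetical encoding: ("0",), ("1",), ("sym", a), ("alt", E, F), ("seq", E, F), ("star", E)
def parses(e, s):
    """Yield every parse tree in T[[e]] whose flattening is exactly s."""
    tag = e[0]
    if tag == "1":
        if s == "":
            yield ()
    elif tag == "sym":
        if s == e[1]:
            yield e[1]
    elif tag == "alt":
        for v in parses(e[1], s):
            yield ("inl", v)
        for w in parses(e[2], s):
            yield ("inr", w)
    elif tag == "seq":
        for i in range(len(s) + 1):
            for v in parses(e[1], s[:i]):
                for w in parses(e[2], s[i:]):
                    yield (v, w)
    elif tag == "star":
        if s == "":
            yield []
        for i in range(1, len(s) + 1):      # non-empty first iteration only
            for v in parses(e[1], s[:i]):
                for rest in parses(e, s[i:]):
                    yield [v] + rest
    # tag "0" yields nothing: T[[0]] is empty

def flatten(v):
    """Concatenate the symbols at the leaves of a parse tree."""
    if v == ():
        return ""
    if isinstance(v, str):
        return v
    if isinstance(v, list):
        return "".join(flatten(x) for x in v)
    if v[0] in ("inl", "inr"):              # symbols are single characters, so no clash
        return flatten(v[1])
    return flatten(v[0]) + flatten(v[1])

A, B = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", A, B)), ("star", ("alt", A, B)))
trees = list(parses(E, "ababab"))
```

For the input ababab, `trees` holds exactly the two parse trees listed in the example, one under inl and one under inr.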
ambiguity
One input string can be parsed in multiple ways: ababab under E = (ab)⋆ + (a + b)⋆ can be parsed both as
• inl [(a, b), (a, b), (a, b)]
and
• inr [inl a, inr b, inl a, inr b, inl a, inr b]
Disambiguation policy: the left-most option is always prioritized. “Greedy parsing.”
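The greedy policy can be illustrated with a naive backtracking parser that always tries the left alternative first and prefers another star iteration over stopping. This is only a sketch of the disambiguation rule, under an assumed tuple encoding and with hypothetical names `run` and `greedy_parse`. It may take exponential time and is not one of the linear-time algorithms discussed in this talk.

```python
# Hypothetical encoding: ("0",), ("1",), ("sym", a), ("alt", E, F), ("seq", E, F), ("star", E)
def run(e, s, i):
    """Yield (tree, j) for every parse of e on s[i:j], highest priority first."""
    tag = e[0]
    if tag == "1":
        yield (), i
    elif tag == "sym":
        if s[i:i + 1] == e[1]:
            yield e[1], i + 1
    elif tag == "alt":
        for v, j in run(e[1], s, i):        # left option is tried first
            yield ("inl", v), j
        for w, j in run(e[2], s, i):
            yield ("inr", w), j
    elif tag == "seq":
        for v, j in run(e[1], s, i):
            for w, k in run(e[2], s, j):
                yield (v, w), k
    elif tag == "star":
        for v, j in run(e[1], s, i):        # iterating is preferred over stopping
            if j > i:                       # skip empty iterations so the search terminates
                for vs, k in run(e, s, j):
                    yield [v] + vs, k
        yield [], i
    # tag "0" yields nothing

def greedy_parse(e, s):
    """Return the first parse, in priority order, that consumes all of s, or None."""
    for v, j in run(e, s, 0):
        if j == len(s):
            return v
    return None

A, B = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", A, B)), ("star", ("alt", A, B)))
```

With this sketch, `greedy_parse(E, "ababab")` picks the inl tree, while an input such as aab, which (ab)⋆ cannot match, falls through to the inr alternative.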
bit-coding
Bit-coded parse trees: only store choices.
Example: E = (ab)⋆ + (a + b)⋆, input ababab:
• inl [(a, b), (a, b), (a, b)] is coded as 00001
• inr [inl a, inr b, inl a, inr b, inl a, inr b] is coded as 10001000100011
Parse tree as stream of bits; meaningless without expression!
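The two codes in the example are consistent with the following scheme: 0 for inl and 1 for inr at each alternative, and at each star a 0 before every iteration and a 1 at the end. Below is a sketch of an encoder and decoder for that scheme. The tuple encoding and the names `encode` and `decode` are assumptions. Note that `decode` needs the expression to interpret the bits.

```python
# Hypothetical encoding: ("0",), ("1",), ("sym", a), ("alt", E, F), ("seq", E, F), ("star", E)
def encode(e, v):
    """Bit code of parse tree v for expression e: only the choices are recorded."""
    tag = e[0]
    if tag in ("1", "sym"):
        return ""                           # no choice was made
    if tag == "alt":
        side, u = v
        return "0" + encode(e[1], u) if side == "inl" else "1" + encode(e[2], u)
    if tag == "seq":
        return encode(e[1], v[0]) + encode(e[2], v[1])
    if tag == "star":
        return "".join("0" + encode(e[1], u) for u in v) + "1"
    raise ValueError("0 has no parse trees")

def decode(e, bits, i=0):
    """Rebuild the parse tree from bits starting at index i; return (tree, next index)."""
    tag = e[0]
    if tag == "1":
        return (), i
    if tag == "sym":
        return e[1], i
    if tag == "alt":
        if bits[i] == "0":
            u, j = decode(e[1], bits, i + 1)
            return ("inl", u), j
        u, j = decode(e[2], bits, i + 1)
        return ("inr", u), j
    if tag == "seq":
        v, j = decode(e[1], bits, i)
        w, k = decode(e[2], bits, j)
        return (v, w), k
    if tag == "star":
        vs = []
        while bits[i] == "0":               # 0: one more iteration
            u, i = decode(e[1], bits, i + 1)
            vs.append(u)
        return vs, i + 1                    # 1: end of the list
    raise ValueError("0 has no parse trees")

A, B = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", A, B)), ("star", ("alt", A, B)))
LEFT = ("inl", [("a", "b")] * 3)
RIGHT = ("inr", [("inl", "a"), ("inr", "b")] * 3)
```

Encoding `LEFT` and `RIGHT` against `E` reproduces the two bit strings from the example, and decoding them returns the original trees.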
finite state transducers
• Thompson FSTs with input alphabet Σ, output alphabet {0, 1}.
• Construction N(E, q_s, q_f), by cases on E (transitions written input/output):
  0: states q_s and q_f with no transition between them.
  1: no transition (q_f = q_s).
  a: q_s --a/ϵ--> q_f.
  E₁E₂: N(E₁, q_s, q′) followed by N(E₂, q′, q_f).
  E₁ + E₂: q_s --ϵ/0--> q_s1, N(E₁, q_s1, q_f1), q_f1 --ϵ/ϵ--> q_f; q_s --ϵ/1--> q_s2, N(E₂, q_s2, q_f2), q_f2 --ϵ/ϵ--> q_f.
  E₀⋆: q_s --ϵ/ϵ--> q′, q′ --ϵ/0--> q_s0, N(E₀, q_s0, q_f0), q_f0 --ϵ/ϵ--> q′, q′ --ϵ/1--> q_f.
parse trees as paths
Theorem (Brüggemann-Klein 1993, GHNR 2013): 1-to-1 correspondence between
• parse trees for E,
• paths in Thompson FST for E,
• bit-coded parse trees.
Constructing the parse tree corresponds to finding a path through the FST.
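The correspondence can be tried out on a small scale. The sketch below builds a Thompson-style FST and then follows the path that a bit code selects, returning the input read along the way. The state numbering, the tuple encoding, and the names `thompson` and `read_path` are assumptions for illustration; the wiring of ϵ/0 and ϵ/1 edges follows the usual Thompson layout with output bits on choice edges.

```python
import itertools

# Hypothetical encoding: ("0",), ("1",), ("sym", a), ("alt", E, F), ("seq", E, F), ("star", E)
def thompson(e):
    """Return (start, final, trans); trans maps a state to its ordered (input, output, target) edges."""
    counter = itertools.count()
    trans = {}

    def new():
        q = next(counter)
        trans[q] = []
        return q

    def build(e, qs):
        """Attach the FST for e at state qs and return its final state."""
        tag = e[0]
        if tag == "0":
            return new()                    # final state that is never reached
        if tag == "1":
            return qs                       # q_f = q_s
        if tag == "sym":
            qf = new()
            trans[qs].append((e[1], "", qf))
            return qf
        if tag == "seq":
            return build(e[2], build(e[1], qs))
        if tag == "alt":
            q1 = new()
            trans[qs].append(("", "0", q1))
            f1 = build(e[1], q1)
            q2 = new()
            trans[qs].append(("", "1", q2))
            f2 = build(e[2], q2)
            qf = new()
            trans[f1].append(("", "", qf))
            trans[f2].append(("", "", qf))
            return qf
        if tag == "star":
            loop = new()
            trans[qs].append(("", "", loop))
            q0 = new()
            trans[loop].append(("", "0", q0))
            f0 = build(e[1], q0)
            trans[f0].append(("", "", loop))
            qf = new()
            trans[loop].append(("", "1", qf))
            return qf
        raise ValueError(tag)

    start = new()
    final = build(e, start)
    return start, final, trans

def read_path(fst, bits):
    """Follow the path selected by the bit code and return the input string it reads."""
    start, final, trans = fst
    q, i, word = start, 0, []
    while q != final:
        edges = trans[q]
        if len(edges) == 1:
            sym, _, q = edges[0]
        else:                               # choice state: edge 0 or edge 1
            sym, _, q = edges[int(bits[i])]
            i += 1
        word.append(sym)
    return "".join(word)

A, B = ("sym", "a"), ("sym", "b")
E = ("alt", ("star", ("seq", A, B)), ("star", ("alt", A, B)))
FST_E = thompson(E)
AAA_AA = thompson(("star", ("alt", ("seq", A, ("seq", A, A)), ("seq", A, A))))
```

Both bit codes for ababab select a path through `FST_E` that reads ababab. Under this numbering, `AAA_AA` for (aaa + aa)⋆ has the twelve states 0 to 11.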
optimal streaming
Optimally streaming parsing: output the longest common prefix of possible parse trees after reading each input symbol.
Example: E = (aaa + aa)⋆
• Possible parse tree prefixes after aaaa: {01011, 000...}
• Possible parse tree prefixes after aaaaa: {00011, 0000...}
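As a worked instance of the definition, the sketch below computes the longest common prefix of the two sets in the example. The function name is hypothetical, and the elided codes (000... and 0000...) are represented only by their known prefixes, which is sufficient here because the codes in each set diverge before the elision.

```python
def longest_common_prefix(codes):
    """Return the longest string that is a prefix of every code in the collection."""
    codes = list(codes)
    if not codes:
        return ""
    shortest = min(codes, key=len)
    for i, bit in enumerate(shortest):
        if any(code[i] != bit for code in codes):
            return shortest[:i]
    return shortest

after_aaaa = longest_common_prefix(["01011", "000"])     # known prefixes only
after_aaaaa = longest_common_prefix(["00011", "0000"])   # known prefixes only
```

So after aaaa only the bit 0 is determined, whereas after aaaaa the prefix 000 can be output: the first iteration must have been aaa.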
greedy parsing
(n size of input, m size of expression)

                  Time                  Space  Aux   Answer
Parse (3-p) [1]   O(mn)                 O(m)   O(n)  greedy parse
Parse (2-p) [2]   O(mn)                 O(m)   O(n)  greedy parse
Parse (str.) [3]  O(mn + 2^(m log m))   O(m)   O(n)  greedy parse

[1] Frisch, Cardelli (2004)
[2] Grathwohl, Henglein, Nielsen, Rasmussen (2013)
[3] Grathwohl, Henglein, Rasmussen (2014)
fst simulation
Optimally streaming algorithm:
• Preprocessing step of FST: compute coverage of state sets.
• Maintain a path tree during FST simulation, recording the path taken to each state in the FST.
• Prune states that are covered by higher-prioritized states.
• Output on the stem of the path tree is longest common prefix of any succeeding parse.
Theorem (GHR’14): Optimally streaming algorithm computes the optimally streaming parsing function in time O(mn + 2^(m log m)).
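A simplified version of the simulation can be sketched as follows. It keeps an ordered list of (state, output so far) pairs instead of an explicit path tree, lets the highest-priority path win when two paths reach the same state, and emits the longest common prefix of the surviving outputs after each symbol. It deliberately omits the coverage preprocessing and pruning, so it can emit bits later than the optimally streaming algorithm. The transition table `FST` for (aaa + aa)⋆ and all names are assumptions made for illustration.

```python
# Assumed Thompson FST for (aaa + aa)*: state -> ordered (input, output, target) edges.
FST = {
    0: [("", "", 1)],
    1: [("", "0", 2), ("", "1", 11)],
    2: [("", "0", 3), ("", "1", 7)],
    3: [("a", "", 4)], 4: [("a", "", 5)], 5: [("a", "", 6)],
    6: [("", "", 10)],
    7: [("a", "", 8)], 8: [("a", "", 9)],
    9: [("", "", 10)],
    10: [("", "", 1)],
    11: [],
}
START, FINAL = 0, 11

def closure(paths):
    """Follow epsilon edges depth-first in priority order; the first path to reach a state wins."""
    seen, result = set(), []

    def visit(q, bits):
        if q in seen:
            return                          # a higher-priority path already reached q
        seen.add(q)
        eps = [(out, tgt) for (sym, out, tgt) in FST[q] if sym == ""]
        if not eps:
            result.append((q, bits))        # symbol state or final state
        for out, tgt in eps:
            visit(tgt, bits + out)

    for q, bits in paths:
        visit(q, bits)
    return result

def common_prefix(codes):
    """Longest common prefix of a non-empty list of bit strings."""
    shortest = min(codes, key=len)
    for i, bit in enumerate(shortest):
        if any(code[i] != bit for code in codes):
            return shortest[:i]
    return shortest

def stream(s):
    """Yield the newly determined output bits after each symbol, then the remainder at the end."""
    paths = closure([(START, "")])
    emitted = 0
    for c in s:
        moved = [(tgt, bits + out) for q, bits in paths
                 for (sym, out, tgt) in FST[q] if sym == c]
        paths = closure(moved)
        if not paths:
            raise ValueError("input rejected")
        prefix = common_prefix([bits for _, bits in paths])
        yield prefix[emitted:]
        emitted = len(prefix)
    done = [bits for q, bits in paths if q == FINAL]
    if not done:
        raise ValueError("input rejected")
    yield done[0][emitted:]

chunks = list(stream("aaaaa"))
```

Joined together, `chunks` gives the greedy bit code 00011 for aaaaa. Without coverage pruning, lower-priority paths that can never win stay alive, so this sketch commits to fewer bits per step than the optimal algorithm would.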
path tree example: (aaa + aa)⋆
[Figure sequence: the Thompson FST for (aaa + aa)⋆ with states 0 to 11 and transitions labelled a/ϵ, ϵ/ϵ, ϵ/0 and ϵ/1, followed by the path tree after each successive input symbol a.]
kleenex
Observation: Approach is not limited to Thompson FSTs outputting bit-coded parse trees.
Kleenex is a surface language for specifying FSTs and their output:
• grammar with greedy disambiguation;
• embedded output actions.
• Essentially optimally streaming behaviour.
• Linear running time in size of input string.
• Fast. >1 Gbps common.
kleenex
Example: “100000000000” → “100,000,000,000”
• Problem: need to read entire number; no bounded lookahead!
• But: each newline ends a number, so the number can be output at that point.
• Optimal streaming gives this for free!
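The output timing described here can be imitated in plain Python. The sketch below is not Kleenex code and does not reflect Kleenex syntax; it is a hand-written analogue, with hypothetical names, that assumes the input consists of digit lines terminated by newlines. It buffers the digits of one number and emits the separated form as soon as the newline arrives, which is the earliest point at which the grouping is determined.

```python
def group(digits):
    """Insert a comma between groups of three digits, counted from the right."""
    first = len(digits) % 3 or 3
    parts = [digits[:first]] + [digits[i:i + 3] for i in range(first, len(digits), 3)]
    return ",".join(parts)

def add_separators(chars):
    """Consume characters one at a time; yield each finished line as soon as its newline is read."""
    buffer = []
    for ch in chars:
        if ch == "\n":
            yield group("".join(buffer)) + "\n"
            buffer = []
        else:
            buffer.append(ch)               # assumed to be a digit

lines = list(add_separators("100000000000\n1234\n"))
```

Each number is output before the next line is read, so memory use is bounded by the longest single number rather than by the whole input.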
determinization
Path tree algorithm is “NFA simulation with path trees as state sets.”
Compilation of FSTs? Analogy to NFA-DFA determinization with subset construction?
Problem: Infinite number of path trees!
Solution: contract unary paths in path trees and store output in registers.
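The contraction step can be sketched on a toy representation of path trees. Everything in this block is an assumption for illustration: a tree node is a pair (output bits on the incoming edge, body), where the body is either an FST state number at a leaf or a list of child nodes, and registers are keyed by the address of a node in the contracted tree (the empty string for the root, then "0", "1", "00", and so on). The example tree is a hand-built guess at a path tree after reading one a under (aaa + aa)⋆.

```python
def contract(tree, address="", registers=None):
    """Merge chains of single-child nodes; return (shape, registers).

    The shape keeps only branching structure and leaf states, and the registers
    hold the output bits that were stripped from the edges.
    """
    if registers is None:
        registers = {}
    bits, body = tree
    while isinstance(body, list) and len(body) == 1:
        child_bits, body = body[0]          # absorb the only child into this node
        bits += child_bits
    registers[address] = bits
    if isinstance(body, list):
        shape = [contract(child, address + str(i), registers)[0]
                 for i, child in enumerate(body)]
    else:
        shape = body                        # leaf: an FST state
    return shape, registers

# Hypothetical path tree: the root outputs 0, then branches towards states 4 and 8.
TREE = ("", [("0", [("0", [("", 4)]),
                    ("1", [("", 8)])])])
shape, registers = contract(TREE)
```

After contraction the shape mentions only branch points and FST states, so only finitely many shapes can occur, while the unbounded output strings live in the registers. That separation is what makes a subset-construction-style compilation conceivable.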
determinization
[Figure sequence: a path tree with unary paths contracted; edge outputs are held in registers x_ϵ, x_0, x_1, x_00 and x_01, and transitions carry register updates of the form x := ...]