Formalising Boost POSIX Regular Expression Matching


  1. Martin Berglund, Willem Bester & Brink van der Merwe
     Formalising Boost POSIX Regular Expression Matching
     15th International Colloquium on Theoretical Aspects of Computing
     18 October 2018, Stellenbosch, South Africa

  2. What we’ve been doing
     We’ve been thinking about
     ◮ regular expression matching semantics
     ◮ Perl-Compatible Regular Expression (PCRE) engines
     ◮ POSIX-compliant engines
     ◮ ambiguity: “more than one way to match”
     ◮ capture groups
     Why Boost?
     ◮ “very powerful” C++ library
     ◮ mature (1999– )
     ◮ online peer-reviewed QA process
     ◮ regular expression engine that has a POSIX mode
     Berglund, Bester & Van der Merwe: Formalising Boost POSIX Regular Expression Matching, ICTAC 2018 2 / 16

  3. Leftmost-greedy vs leftmost-longest matching
     Match “aba” with E₁ = (ab|ba|a)* and with E₂ = (a|ab|ba)*.
     Both expressions are ambiguous on this input; the two possible parses are [ab][a] and [a][ba].
     Policy             E₁ = (ab|ba|a)*    E₂ = (a|ab|ba)*
     Leftmost-greedy    [ab][a]            [a][ba]
     Leftmost-longest   [ab][a]            [ab][a]
     ◮ E₂ defines the same language as E₁, but subexpression order differs
     ◮ Compare E₁ = (ab|ba|a)* to E₂ = (a|ab|ba)*
     ◮ Leftmost-longest: matcher seemingly considers all possible matches for subexpressions [more on this later]
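
The difference between the two policies on this slide can be reproduced with a small sketch. It is not how any real engine works: it handles only a starred alternation of non-empty literal strings, the helper names (`parses`, `leftmost_greedy`, `leftmost_longest`, `python_re_last_capture`) are hypothetical, and "leftmost-longest" is approximated as "prefer the longest first piece, then the longest second piece, and so on". Python's `re` module is a backtracking engine with ordered alternation, so it serves here as a leftmost-greedy reference point.

```python
import re

def parses(word, alternatives):
    """Enumerate every way to split `word` into pieces drawn from `alternatives`.

    Alternatives are tried in the order given, so the first parse produced is the
    one a depth-first matcher with ordered alternation would find first.
    Assumes every alternative is a non-empty string.
    """
    if not word:
        return [[]]
    found = []
    for alt in alternatives:
        if word.startswith(alt):
            for rest in parses(word[len(alt):], alternatives):
                found.append([alt] + rest)
    return found

def leftmost_greedy(word, alternatives):
    # First parse in depth-first order (assumes at least one parse exists).
    return parses(word, alternatives)[0]

def leftmost_longest(word, alternatives):
    # Compare parses by their piece lengths, left to right, and keep the largest.
    return max(parses(word, alternatives), key=lambda p: [len(x) for x in p])

def python_re_last_capture(pattern, word):
    # Group 1 holds the last iteration of the starred group in Python's re.
    return re.fullmatch(pattern, word).group(1)

E1 = ["ab", "ba", "a"]   # (ab|ba|a)*
E2 = ["a", "ab", "ba"]   # (a|ab|ba)*

print(leftmost_greedy("aba", E1), leftmost_greedy("aba", E2))
print(leftmost_longest("aba", E1), leftmost_longest("aba", E2))
```

Under the greedy policy the result changes when the alternatives are reordered, while the longest-first choice is the same for both orderings. Calling `python_re_last_capture(r"(ab|ba|a)*", "aba")` and `python_re_last_capture(r"(a|ab|ba)*", "aba")` shows the same order dependence in a real backtracking engine.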

  4. The POSIX regular expression specification
     POSIX specifies leftmost-longest matching:
     “The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where ‘first’ is defined to mean ‘begins earliest in the string’. If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched.... Consistent with the whole match being the longest of the leftmost matches, each subpattern from left to right shall match the longest possible string.”
     Fowler’s complaint: “Subpattern” is only used here; elsewhere it’s “subexpression” (always in the context of grouping).
     Note: We only consider full matching in this work.

  5. An eccentric reading of the POSIX standard?
     POSIX
     ◮ Full matching with submatch addressing
     ◮ Position and extent of substrings matched by subexpressions must be available
     Boost POSIX Mode
     ◮ Maximises what is reported for marked subexpressions (those surrounded by parentheses)
     ◮ Essentially, reading POSIX with: s/subpattern/marked subexpression/
     Match “aba” with (ab|ba|a)*
       Regex-TDFA: [ab][a]
       Boost:      [a][ba]
     Match “aba” with (a|ab|ba)*
       Regex-TDFA: [ab][a]
       Boost:      [a][ba]
     Regex-TDFA is written in Haskell; Boost is written in C++.

  6. More examples
     Match “aa” with (₀(₁a*)₁(₂a*)₂)₀
       Captures             Boost   RTDFA
       [₀[₁aa]₁[₂]₂]₀       ✓       ✓
       [₀[₁a]₁[₂a]₂]₀
       [₀[₁]₁[₂aa]₂]₀
     Match “aa” with (₀a*(₁a*)₁)₀
       Captures             Boost   RTDFA
       [₀aa[₁]₁]₀                   ✓
       [₀a[₁a]₁]₀
       [₀[₁aa]₁]₀           ✓
     Note: All non-atomic subexpressions are parenthesised. RTDFA abbreviates Regex-TDFA.
     ◮ Regex-TDFA maximises lengths of all subexpressions in the order they occur in the regular expression
     ◮ Boost maximises lengths of (capture) groups in the order they occur in the regular expression

  7. Capturing regular expressions and forests
     Capturing Regular Expressions
     Over a finite alphabet Σ and an index set I:
       ∅            empty language (ATOM)
       ε            empty string (ATOM)
       a            symbols a ∈ Σ (ATOM)
       (r₀ · r₁)    concatenation of capturing regular expressions r₀, r₁
       (r₀ + r₁)    alternation of capturing regular expressions r₀, r₁
       (r*)         closure of capturing regular expression r
       (ᵢ r )ᵢ      capture group i ∈ I of capturing regular expression r
     Set of Forests
     Over a finite alphabet Σ and an index set I:
     • Each element of Σ ∪ {ε} is a forest
     • So is f₁f₂ for forests f₁ and f₂
     • And [ᵢ f ]ᵢ for forest f and i ∈ I
     Note
     If I is non-empty: the strings over Σ are properly contained in the set of forests.
     If I is empty: they are equal.

  8. Forest and String Languages
     Forest Language
     ℒ(r) for a capturing regular expression r:
       ℒ(∅) = ∅
       ℒ(ε) = {ε}
       ℒ(a) = {a}
       ℒ(r₀ · r₁) = ℒ(r₀) · ℒ(r₁)
       ℒ(r₀ + r₁) = ℒ(r₀) ∪ ℒ(r₁)
       ℒ(r*) = ℒ(r)*
       ℒ((ᵢ r )ᵢ) = {[ᵢ} · ℒ(r) · {]ᵢ}
     ◮ For string w over Σ, and Σ′ ⊆ Σ: π_Σ′(w) is the maximal subsequence of w that contains only symbols from Σ′.
     ◮ The string language described by the capturing regular expression r over Σ is the set π_Σ(ℒ(r)).
     Also: By extension, we also handle r⁺ = rr*, r? = (r + ε), and r^{m,n} = r···r (r + ε)···(r + ε), where r···r is taken m times and (r + ε)···(r + ε) is taken n times.

  9. From forest to captures
     Strategy to compute capture information
     1. collect the matching forests
     2. determine the capture history C(f) and final capture history C_fin(f) for each forest f
     3. order forests by Boost partial order ≺_B on C_fin values
     4. return the greatest C_fin value as determined by ≺_B
     Capture history
     • informally, a function C(f, i) for forest f and group i
     • returns a pair (s, ℓ) for each substring captured by group i
     • s ← substring start index, ℓ ← substring length
     Final capture history
     • C_last(f, i) is the pair (s, ℓ) in C(f, i) with the greatest s
     • C_fin(f) is the set {(j, C_last(f, j)) | j ∈ I}
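
A capture history can be read off a forest in one left-to-right pass. The following is a minimal sketch, assuming a forest is a tuple of tokens in which alphabet symbols are one-character strings and brackets are pairs such as `("[", 1)` and `("]", 1)`. The function names are hypothetical, `math.inf` and `-math.inf` stand in for ⊤ and ⊥ (a group that captured nothing), and when two captures share the greatest start index the first one recorded is kept, which is an arbitrary choice.

```python
import math

TOP, BOTTOM = math.inf, -math.inf   # stand-ins for ⊤ and ⊥

def capture_history(forest):
    """Map each group index to the (start, length) pairs it captured, in order."""
    history, open_at, pos = {}, {}, 0
    for tok in forest:
        if isinstance(tok, str):
            pos += 1                                   # a symbol advances the position
        elif tok[0] == "[":
            open_at.setdefault(tok[1], []).append(pos)  # remember where the group opened
        else:
            start = open_at[tok[1]].pop()
            history.setdefault(tok[1], []).append((start, pos - start))
    return history

def final_capture_history(forest, indices):
    """For each index j, keep the capture with the greatest start, or (⊤, ⊥) if none."""
    history = capture_history(forest)
    fin = {}
    for j in indices:
        pairs = history.get(j)
        fin[j] = max(pairs, key=lambda p: p[0]) if pairs else (TOP, BOTTOM)
    return fin

# [0 [1 ]1 [2 a b ]2 [3 ]3 ]0
f = (("[", 0), ("[", 1), ("]", 1), ("[", 2), "a", "b", ("]", 2),
     ("[", 3), ("]", 3), ("]", 0))
print(final_capture_history(f, [0, 1, 2, 3]))
```

The result is a dictionary from group index to (s, ℓ), which carries the same information as the set of triples (j, s, ℓ) used for C_fin.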

  10. Boost partial order and captures
      Boost partial order
      • denote as ≺_B
      • assume π_Σ(f₁) = π_Σ(f₂)
      Then C_fin(f₁) ≺_B C_fin(f₂) if for the smallest j ∈ I such that (j, s₁, ℓ₁) ≠ (j, s₂, ℓ₂), where (j, sᵢ, ℓᵢ) ∈ C_fin(fᵢ), we have
      1. s₁ > s₂, or
      2. s₁ = s₂ but ℓ₁ < ℓ₂
      Boost captures
      • capturing regular expression r
      • w ∈ π_Σ(ℒ(r))
      • the Boost captures of matching w with r: the largest element in {C_fin(f) | f ∈ ℒ(r), π_Σ(f) = w} determined by ≺_B

  11. Examples
      Match w = “ab” with a?(₁ab)₁?b?
      Forests: f₁ = [₀ab]₀ and f₂ = [₀[₁ab]₁]₀
      C(f₁, 0) = {(0, 2)}, C(f₁, 1) = ∅, C(f₂, 0) = {(0, 2)}, C(f₂, 1) = {(0, 2)}
      C_fin(f₁) = {(0, 0, 2), (1, ⊤, ⊥)}, C_fin(f₂) = {(0, 0, 2), (1, 0, 2)}
      At j = 1, we find s₁ = ⊤ and s₂ = 0, so that s₁ > s₂. Therefore, C_fin(f₁) ≺_B C_fin(f₂).
      Match w with (₁a?)₁(₂ab)₂?(₃b?)₃
      Forests: f₃ = [₀[₁a]₁[₃b]₃]₀ and f₄ = [₀[₁]₁[₂ab]₂[₃]₃]₀
      C_fin(f₃) = {(0, 0, 2), (1, 0, 1), (2, ⊤, ⊥), (3, 1, 1)}
      C_fin(f₄) = {(0, 0, 2), (1, 0, 0), (2, 0, 2), (3, 2, 0)}
      At j = 1, we find s₃ = s₄ = 0, ℓ₃ = 1, and ℓ₄ = 0, so that ℓ₄ < ℓ₃. Therefore, C_fin(f₄) ≺_B C_fin(f₃).
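
The two comparisons in these examples can be checked mechanically. The sketch below encodes a final capture history as a dictionary from group index to (s, ℓ), uses `math.inf` and `-math.inf` as stand-ins for ⊤ and ⊥, and implements the order as stated in the definition of ≺_B. The name `boost_precedes` and the encoding are assumptions made for illustration.

```python
import math

TOP, BOTTOM = math.inf, -math.inf   # stand-ins for ⊤ and ⊥

def boost_precedes(fin1, fin2):
    """Return True if fin1 ≺_B fin2, for histories over the same group indices."""
    for j in sorted(fin1):
        (s1, l1), (s2, l2) = fin1[j], fin2[j]
        if (s1, l1) != (s2, l2):
            # Decided at the smallest index where the histories differ:
            # a later start loses; for equal starts, a shorter capture loses.
            return s1 > s2 or (s1 == s2 and l1 < l2)
    return False                     # identical histories: neither precedes the other

# Final capture histories copied from the two worked examples.
fin_f1 = {0: (0, 2), 1: (TOP, BOTTOM)}
fin_f2 = {0: (0, 2), 1: (0, 2)}
fin_f3 = {0: (0, 2), 1: (0, 1), 2: (TOP, BOTTOM), 3: (1, 1)}
fin_f4 = {0: (0, 2), 1: (0, 0), 2: (0, 2), 3: (2, 0)}

print(boost_precedes(fin_f1, fin_f2))   # first example
print(boost_precedes(fin_f4, fin_f3))   # second example
```

Both calls agree with the conclusions drawn on the slide, and swapping the arguments gives `False` in each case.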

  12. POSIX matching algorithm in Boost
      Inside Boost:
      ◮ complete Perl-Compatible Regular Expression (PCRE) engine
      ◮ implemented by depth-first backtracking
      POSIX matching algorithm:
      1. • apply the PCRE-style matching engine to the input
         • record the resulting parse tree t
         • if engine rejects, then reject string
      2. • apply PCRE-style matching engine to the input
         • each time it would accept on parse tree t′
           • if C_fin(t) ≺_B C_fin(t′), then t ← t′
           • reject, causing engine to backtrack
      3. output t as POSIX-style result
      Theorem
      Boost captures can be computed in time O(k |w| |r| log |w|) when matching input string w with regular expression r, where k is the number of distinct capturing indices.
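
The exhaustive-backtracking idea behind this algorithm can be sketched with a toy matcher. This is not Boost's code and does not achieve the bound in the theorem; it can take exponential time. The assumptions are: expressions use a hypothetical nested-tuple encoding, stars must consume input on every iteration (so forests with empty iterations are skipped), each group keeps its most recent capture, the two passes of the slide are merged into one loop, and ≺_B is applied through the helper `rank`, whose larger values mean "greater under ≺_B".

```python
import math

TOP, BOTTOM = math.inf, -math.inf   # stand-ins for ⊤ and ⊥

def match(r, w, pos, caps):
    """Yield (end, captures) for every way r matches w from pos, depth first.
    Node kinds: ("eps",), ("sym", a), ("cat", r0, r1), ("alt", r0, r1),
    ("star", r), ("group", i, r); any other kind, e.g. ("empty",), never matches."""
    kind = r[0]
    if kind == "eps":
        yield pos, caps
    elif kind == "sym":
        if pos < len(w) and w[pos] == r[1]:
            yield pos + 1, caps
    elif kind == "cat":
        for mid, c1 in match(r[1], w, pos, caps):
            yield from match(r[2], w, mid, c1)
    elif kind == "alt":                      # ordered alternation, as in PCRE
        yield from match(r[1], w, pos, caps)
        yield from match(r[2], w, pos, caps)
    elif kind == "star":                     # greedy: try one more iteration first
        for mid, c1 in match(r[1], w, pos, caps):
            if mid > pos:                    # simplification: require progress
                yield from match(r, w, mid, c1)
        yield pos, caps
    elif kind == "group":
        for end, c1 in match(r[2], w, pos, caps):
            yield end, {**c1, r[1]: (pos, end - pos)}

def rank(fin):
    # Earlier start wins, then longer length, compared group by group in index order.
    return tuple((-s, l) for _, (s, l) in sorted(fin.items()))

def boost_posix_match(r, w, indices):
    best = None
    for end, caps in match(r, w, 0, {}):
        if end != len(w):
            continue                         # only full matches count as accepting
        fin = {j: caps.get(j, (TOP, BOTTOM)) for j in indices}
        if best is None or rank(best) < rank(fin):
            best = fin                       # keep the better parse, then keep backtracking
    return best

a_star = ("star", ("sym", "a"))
expr = ("group", 0, ("cat", a_star, ("group", 1, a_star)))   # (0 a* (1 a*)1 )0
print(boost_posix_match(expr, "aa", [0, 1]))
```

For the expression (₀a*(₁a*)₁)₀ on “aa”, the first accepting path is the greedy one with an empty group 1, and continued backtracking replaces it with the parse in which group 1 captures the whole input.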

  13. Experimental results
      Two testing frameworks in Python
      ◮ small one for existing matchers
      ◮ larger, extensible one for exploring different disambiguation policies
      Sanity check: almost 3 000 000 generated test cases
      ◮ over the atoms a, b, . and the operators |, *, +, ?
      ◮ input strings over Σ = {a, b, c}
      Fowler’s test cases
      ◮ 93 examples to test POSIX compliance
      ◮ 47 ERE; 37 without partial matching + 19 of our own
      ◮ use a Boost runner as oracle
      ◮ our formalism passed all but 2
