Constant delay algorithms for regular document spanners
Fernando Florenzano, Cristian Riveros, Domagoj Vrgoč (PUC Chile)
Martín Ugarte, Stijn Vansummeren (Université Libre de Bruxelles)
Rule-based information extraction by example
Document:
  18:30 ERROR 06
  19:10 OK 00
  20:00 ERROR 19
Task: "Extract all pairs (time, id) of ERROR events"
[Figure: the document laid out character by character with positions 1 to 41; ↱ marks a line break.]
Rule (RGX formula): Σ* ⋅ x{δδ:δδ} ⋅ ERROR ⋅ y{δδ} ⋅ Σ*, where δ = (0 + 1 + ... + 9)
Output (mappings):
  x           y
  [1, 6⟩      [13, 15⟩
  [28, 33⟩    [40, 42⟩
Evaluation of rules in information extraction.
Problem:
  Input: RGX formula R and document d.
  Output: Enumerate all mappings of d that satisfy R.
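To make the evaluation problem concrete, here is a minimal sketch that evaluates the example rule with Python's standard `re` module. It assumes that ERROR is surrounded by single spaces (as the output spans imply) and converts Python's 0-based offsets to the 1-based spans [i, j⟩ used on the slide. Note that `finditer` only reports non-overlapping matches, so this is an illustration for this particular rule and document, not a general spanner evaluator.

```python
import re

DOC = "18:30 ERROR 06\n19:10 OK 00\n20:00 ERROR 19"

# x{δδ:δδ} and y{δδ} become named groups; [0-9] plays the role of δ.
PATTERN = re.compile(r"(?P<x>[0-9][0-9]:[0-9][0-9]) ERROR (?P<y>[0-9][0-9])")


def extract(doc):
    """Return one mapping {variable: (start, end)} per match, 1-based, end exclusive."""
    mappings = []
    for m in PATTERN.finditer(doc):
        xs, xe = m.span("x")
        ys, ye = m.span("y")
        mappings.append({"x": (xs + 1, xe + 1), "y": (ys + 1, ye + 1)})
    return mappings


RESULT = extract(DOC)
# RESULT == [{'x': (1, 6), 'y': (13, 15)}, {'x': (28, 33), 'y': (40, 42)}]
```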
Unfortunately, the output can easily become exponential
[Figure: the same three-line document, positions 1 to 41.]
Rule (RGX formula): Σ* ⋅ x1{δδ} ⋅ Σ* ⋅ x2{δδ} ⋅ Σ*, where δ = (0 + 1 + ... + 9)
Output (mappings), Θ(|d|²) of them:
  x1         x2
  [1, 3⟩     [4, 6⟩
  [1, 3⟩     [13, 15⟩
  ⋮          ⋮
  [1, 3⟩     [40, 42⟩
  [4, 6⟩     [13, 15⟩
  [4, 6⟩     [16, 18⟩
  ⋮          ⋮
In general, a RGX formula with k variables can have an output of size Θ(|d|^k), that is, exponential in the number of variables k.
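The blow-up can be checked by brute force on the example document. The sketch below is only an illustration (the helper names are made up here): it lists every two-digit span and then every pair in which x1 ends no later than x2 starts, which is what the rule with two variables describes.

```python
import re

DOC = "18:30 ERROR 06\n19:10 OK 00\n20:00 ERROR 19"


def two_digit_spans(doc):
    """All 1-based spans [i, i+2) covering two consecutive digits (δδ)."""
    # The lookahead lets overlapping occurrences be reported as well.
    return [(m.start() + 1, m.start() + 3)
            for m in re.finditer(r"(?=[0-9][0-9])", doc)]


def span_pairs(doc):
    """All (x1, x2) pairs with x1 placed entirely before x2."""
    spans = two_digit_spans(doc)
    return [(s1, s2) for s1 in spans for s2 in spans if s1[1] <= s2[0]]


SPANS = two_digit_spans(DOC)
PAIRS = span_pairs(DOC)
# 9 two-digit spans give 36 output mappings for the two-variable rule.
```

With k variables the same nested enumeration has k levels, which is where the Θ(|d|^k) bound comes from.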
Constant delay algorithms to the rescue
Definition: Given a RGX rule R and a document d, a constant delay algorithm is a two-phase enumeration algorithm:
1. Preprocessing phase: linear in |d| and, hopefully, linear in |R|.
2. Enumeration phase: constant time between two consecutive outputs.
Can we have an efficient constant delay algorithm for RGX formulas?
In this paper, we propose a constant delay algorithm for variable-set automata
Specifically, our contributions are:
1. We study the class of extended and deterministic variable-set automata.
2. We give a simple constant delay algorithm for deterministic functional extended variable-set automata.
3. We extend this algorithm to the full class of variable-set automata and spanner algebra.
4. We study the complexity of counting the number of output mappings.
In this talk: only the main ideas of the constant delay algorithm.
Outline
1. Variable-set automata and their variants
2. The constant delay algorithm
Variable-set automata (VA)
[Figure: a VA with states 0 to 7, letter transitions labeled a and b, and variable transitions labeled x⊢, ⊣x, y⊢, ⊣y.]
Document: a a b (positions 1, 2, 3)
Two accepting runs over this document:
  Run through states 0, 1, 3, 3, 4, 5, 6, 7: output x = [1, 3⟩, y = [1, 4⟩
  Run through states 0, 2, 3, 4, 4, 5, 6, 7: output x = [1, 4⟩, y = [1, 3⟩
Theorem (Freydenberger17, MRV18): The evaluation problem of variable-set automata is NP-complete.
How do we restrict VA to have constant delay algorithms?
Problematic behaviors of VA and their classes
1. Functional VA
2. Extended VA
3. Deterministic VA
Problematic behaviors of VA and their classes: functional VA
Problem: A VA can have accepting runs that are NOT valid.
Example of an accepting run that is not valid (document a a b): the run through states 0, 1, 3, 3, 4, 5, 6, 7 that uses the markers x⊢, y⊢, ⊣x, ⊣x. It closes x twice and never closes y.
Definition (functional VA): A VA is functional if every accepting run is a valid run.
[Figure: a functional VA for the same example, with states 0 to 7 plus 5' and 6'.]
Theorem (FKRV15): Every VA is equivalent to a functional VA of at most exponential size.
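A small sketch of the validity check behind this definition, assuming that a run is valid when every variable is opened exactly once and closed exactly once after it was opened. The encoding of markers as ("open", var) and ("close", var) pairs is an assumption made for this illustration.

```python
def is_valid(markers, variables):
    """Check a run's marker sequence: each variable opened once, then closed once."""
    opened, closed = set(), set()
    for kind, var in markers:
        if kind == "open":
            if var in opened:
                return False  # opened twice
            opened.add(var)
        else:
            if var not in opened or var in closed:
                return False  # closed before opening, or closed twice
            closed.add(var)
    return closed == set(variables)  # every variable must be closed


# Marker sequence of the invalid accepting run: x⊢, y⊢, ⊣x, ⊣x.
INVALID_RUN = [("open", "x"), ("open", "y"), ("close", "x"), ("close", "x")]
# Marker sequence of a valid run over the same variables.
VALID_RUN = [("open", "x"), ("open", "y"), ("close", "x"), ("close", "y")]
```

A functional VA is one where this check never fails on an accepting run, so an evaluation algorithm does not need to perform it at run time.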
Problematic behaviors of VA and their classes: extended VA
Problem: A VA can use several paths of variables for the same extraction of spans.
Definition (extended VA): An extended VA uses transitions labeled with sets of variable markers, such that between each pair of consecutive letters at most one of these transitions is used.
[Figure: the extended version of the example, with states 0, 3, 4, 5, 6, 7, 5', 6' and set-labeled transitions {x⊢, y⊢}, {⊣x}, {⊣y}.]
Theorem: Every VA is equivalent to an extended VA of at most exponential size.
Problematic behaviors of VA and their classes: deterministic VA
Problem: A VA can have several runs that witness the same output.
Example of several runs with the same input/output (document a a b; both runs use the labels {x⊢, y⊢}, a, a, {⊣x}, b, {⊣y} in this order):
  Run through states 0, 3, 3, 4, 5, 6, 7
  Run through states 0, 3, 4, 4, 5, 6, 7
Definition (deterministic (Input/Output) VA): An extended VA is deterministic if the transition relation is a function, that is, each state has at most one successor for each letter and for each set of markers.
Theorem: Every extended VA is equivalent to a deterministic extended VA of at most exponential size.
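The determinism condition is easy to test mechanically. The sketch below assumes transitions are given as (state, label, successor) triples, where a label is either a letter or a frozenset of markers. The two transition lists are illustrative only, loosely modeled on the example in which state 3 can read a and either stay in 3 or move to 4.

```python
def is_deterministic(transitions):
    """True if no (state, label) pair leads to two different successors."""
    successor = {}
    for state, label, target in transitions:
        # setdefault records the first successor seen for (state, label).
        if successor.setdefault((state, label), target) != target:
            return False
    return True


OPEN_XY = frozenset({"x⊢", "y⊢"})

# Hypothetical fragment: state 3 has two a-successors, so it is not deterministic.
NONDETERMINISTIC = [(0, OPEN_XY, 3), (3, "a", 3), (3, "a", 4), (4, "a", 4)]
# Hypothetical fragment in which every (state, label) pair has one successor.
DETERMINISTIC = [(0, OPEN_XY, 3), (3, "a", 4), (4, "a", 4)]
```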
The constant delay algorithm for extended VA
Given a deterministic and functional extended VA A = (Q, q0, F, δ).

procedure Evaluate(A, a1 ... an)
    for all q ∈ Q ∖ {q0} do
        list_q ← ε
    list_q0 ← [⊥]
    for i := 1 to n do
        Capturing(i)
        Reading(i)
    Capturing(n + 1)
    Enumerate({list_q}_{q ∈ Q}, F)

procedure Capturing(i)
    for all q ∈ Q do
        list_q^old ← list_q.lazycopy
    for all q ∈ Q with list_q^old ≠ ε do
        for all S ∈ Markers_δ(q) do
            node ← Node((S, i), list_q^old)
            p ← δ(q, S)
            list_p.add(node)

procedure Reading(i)
    for all q ∈ Q do
        list_q^old ← list_q
        list_q ← ε
    for all q ∈ Q with list_q^old ≠ ε do
        p ← δ(q, a_i)
        list_p.append(list_q^old)
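The pseudocode above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it uses ordinary Python lists, so `lazycopy` and `append` are not constant-time here and the complexity guarantees are not preserved. The names `preprocess`, `enumerate_mappings`, `to_spans`, the dictionary encoding of δ, and the four-state example automaton (which captures every single letter a in a variable x) are all assumptions introduced for this sketch.

```python
BOTTOM = None  # stands for the sentinel ⊥ that ends every path


class Node:
    """DAG node: marker set taken before position pos, plus predecessor nodes."""

    def __init__(self, markers, pos, preds):
        self.markers, self.pos, self.preds = markers, pos, preds


def preprocess(states, q0, letter_delta, marker_delta, doc):
    """Build list_q for every state, following Evaluate/Capturing/Reading."""
    lists = {q: [] for q in states}
    lists[q0] = [BOTTOM]

    def capturing(i):
        # Copy first, so nodes created at position i are not extended again at i.
        old = {q: list(lists[q]) for q in states}
        for q in states:
            if old[q]:
                for markers, p in marker_delta.get(q, {}).items():
                    lists[p].append(Node(markers, i, old[q]))

    def reading(i):
        old = dict(lists)
        for q in states:
            lists[q] = []
        for q in states:
            p = letter_delta.get((q, doc[i - 1]))  # None if δ(q, a_i) is undefined
            if old[q] and p is not None:
                lists[p].extend(old[q])

    for i in range(1, len(doc) + 1):
        capturing(i)
        reading(i)
    capturing(len(doc) + 1)
    return lists


def to_spans(steps):
    """Turn a sequence of (marker set, position) into {variable: (start, end)}."""
    opened, spans = {}, {}
    for markers, pos in steps:
        for kind, var in markers:
            if kind == "open":
                opened[var] = pos
        for kind, var in markers:
            if kind == "close":
                spans[var] = (opened[var], pos)
    return spans


def enumerate_mappings(lists, finals):
    """Yield one mapping per path from a node in a final state's list back to ⊥."""

    def walk(node, suffix):
        if node is BOTTOM:
            yield to_spans(suffix)
            return
        step = ((node.markers, node.pos),) + suffix
        for pred in node.preds:
            yield from walk(pred, step)

    for q in finals:
        for node in lists[q]:
            yield from walk(node, ())


# Hypothetical deterministic functional extended VA for Σ* ⋅ x{a} ⋅ Σ* over {a, b}.
X_OPEN, X_CLOSE = frozenset({("open", "x")}), frozenset({("close", "x")})
STATES = ["q0", "q1", "q2", "q3"]
LETTERS = {("q0", "a"): "q0", ("q0", "b"): "q0", ("q1", "a"): "q2",
           ("q3", "a"): "q3", ("q3", "b"): "q3"}
MARKERS = {"q0": {X_OPEN: "q1"}, "q2": {X_CLOSE: "q3"}}

RESULTS = list(enumerate_mappings(preprocess(STATES, "q0", LETTERS, MARKERS, "aba"), ["q3"]))
# RESULTS == [{'x': (1, 2)}, {'x': (3, 4)}]
```

Because the automaton is deterministic and functional, every path in the DAG corresponds to a distinct valid mapping, so the enumeration needs no duplicate elimination or validity check.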
Sketch idea of the constant delay algorithm in 3 steps
Given a deterministic and functional extended VA A = (Q, q0, F, δ).
1. Convert the document d into a deterministic extended VA A_d.
   [Figure: document d = a a b (positions 1, 2, 3) and the VA A_d with states d1, d2, d3, d4, ..., letter transitions a, a, b between consecutive states, and variable-set transitions such as {x⊢}, {y⊢}, ..., {⊣x, y⊢}.]
2. Build the product between A and A_d, and annotate the variable transitions with the position of d where they take place.