Applications of WFA

WFA can model:
- Probability distributions: $f_A(x) = \mathbb{P}[x]$
- Binary classifiers: $g(x) = \mathrm{sign}(f_A(x) + \theta)$
- Real predictors: $f_A(x)$
- Sequence predictors: $g(x) = \mathrm{argmax}_y \, f_A(x, y)$ (with $\Sigma = \mathcal{X} \times \mathcal{Y}$)

Used in several applications:
- Speech recognition [Mohri et al., 2008]
- Machine translation [de Gispert et al., 2010]
- Image processing [Albert and Kari, 2009]
- OCR systems [Knight and May, 2009]
- System testing [Baier et al., 2009]
Useful Intuitions About $f_A$

$f_A(x) = f_A(x_1 \dots x_T) = \alpha_0^\top A_{x_1} \cdots A_{x_T} \alpha_\infty = \alpha_0^\top A_x \alpha_\infty$

- Sum-Product: $f_A(x)$ is a sum–product computation
  $\sum_{i_0, i_1, \dots, i_T \in [n]} \alpha_0(i_0) \left( \prod_{t=1}^{T} A_{x_t}(i_{t-1}, i_t) \right) \alpha_\infty(i_T)$
- Forward-Backward: $f_A(x)$ is a dot product between forward and backward vectors
  $f_A(ps) = (\alpha_0^\top A_p) \cdot (A_s \alpha_\infty) = \alpha_p \cdot \beta_s$
- Compositional Features: $f_A(x)$ is a linear model
  $f_A(x) = (\alpha_0^\top A_x) \cdot \alpha_\infty = \phi(x) \cdot \alpha_\infty$
  where $\phi : \Sigma^* \to \mathbb{R}^n$ gives compositional features (i.e. $\phi(x\sigma) = \phi(x) A_\sigma$)
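As a concrete illustration of the formula above, here is a minimal NumPy sketch that evaluates $f_A(x)$ as a left-to-right product of operator matrices; the 2-state automaton and its weights are hypothetical, not taken from the slides.

```python
import numpy as np

# Hypothetical 2-state WFA over Sigma = {a, b}; weights chosen only for illustration.
alpha0 = np.array([1.0, 0.0])          # initial weights
alpha_inf = np.array([0.0, 1.0])       # final weights
A = {"a": np.array([[0.5, 0.1],
                    [0.2, 0.3]]),
     "b": np.array([[0.1, 0.4],
                    [0.3, 0.2]])}      # one transition matrix per symbol

def f_A(x):
    """Evaluate f_A(x) = alpha0^T A_{x_1} ... A_{x_T} alpha_inf."""
    v = alpha0.copy()
    for sigma in x:                    # left-to-right product: v is the feature vector phi(x)
        v = v @ A[sigma]
    return v @ alpha_inf

print(f_A("abba"))
```

The running vector `v` inside the loop is exactly the compositional feature map $\phi(x)$ from the last bullet.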
Forward–Backward Equations for $A_\sigma$

Any WFA $A$ defines forward and backward maps $\alpha_A, \beta_A : \Sigma^* \to \mathbb{R}^n$ such that for any splitting $x = p \cdot s$ one has
$f_A(x) = (\alpha_0^\top A_{p_1} \cdots A_{p_T}) \cdot (A_{s_1} \cdots A_{s_{T'}} \alpha_\infty) = \alpha_A(p) \cdot \beta_A(s)$

Example
- In an HMM the coordinates of $\alpha_A$ and $\beta_A$ have a probabilistic interpretation:
  $[\alpha_A(p)]_i = \mathbb{P}[p, h = i]$ (generate prefix $p$ and land in state $i$)
  $[\beta_A(s)]_i = \mathbb{P}[s \mid h = i]$ (generate suffix $s$ starting from state $i$)

Key Observation
Comparing $f_A(ps)$ and $f_A(p\sigma s)$ reveals information about $A_\sigma$:
$f_A(ps) = \alpha_A(p) \cdot \beta_A(s)$
$f_A(p\sigma s) = \alpha_A(p) \cdot A_\sigma \cdot \beta_A(s)$

Hankel matrices help organize and solve these equations!
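A sketch of the forward and backward maps, reusing the hypothetical automaton (`alpha0`, `alpha_inf`, `A`, `f_A`) from the earlier sketch, and checking the two identities of the Key Observation numerically.

```python
import numpy as np

def forward(p):
    """alpha_A(p) = alpha0^T A_{p_1} ... A_{p_T} (a vector in R^n)."""
    v = alpha0.copy()
    for sigma in p:
        v = v @ A[sigma]
    return v

def backward(s):
    """beta_A(s) = A_{s_1} ... A_{s_T'} alpha_inf (a vector in R^n)."""
    v = alpha_inf.copy()
    for sigma in reversed(s):
        v = A[sigma] @ v
    return v

p, s, sigma = "ab", "ba", "a"
assert np.isclose(forward(p) @ backward(s), f_A(p + s))
# Inserting one extra symbol exposes the operator A_sigma:
assert np.isclose(forward(p) @ A[sigma] @ backward(s), f_A(p + sigma + s))
```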
The Hankel Matrix

Two Equivalent Representations
- Functional: $f : \Sigma^* \to \mathbb{R}$
- Matricial: $H_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$, the Hankel matrix of $f$

Definition: $p$ prefix, $s$ suffix $\Rightarrow$ $H_f(p, s) = f(p \cdot s)$

Example: $f(x) = |x|_a$ (number of $a$'s in $x$)

             λ   a   b   aa  ⋯
       λ   [ 0   1   0   2   ⋯
       a     1   2   1   3
H_f =  b     0   1   0   2
       aa    2   3   2   4
       ⋮                  ⋱  ]

$H_f(\lambda, aa) = H_f(a, a) = H_f(aa, \lambda) = 2$

Properties
- Each value $f(x)$ fills $|x| + 1$ entries of $H_f$
- Depends on a chosen ordering of $\Sigma^*$
- Captures the structure of $f$
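A small sketch, assuming Python with NumPy, that builds the finite block of $H_f$ shown above for $f(x) = |x|_a$ and checks its rank; it comes out as 2, anticipating the theorem on the next slide.

```python
import numpy as np

def f(x):                      # f(x) = number of a's in x
    return x.count("a")

basis = ["", "a", "b", "aa"]   # "" plays the role of the empty string lambda
H = np.array([[f(p + s) for s in basis] for p in basis])
print(H)
print(np.linalg.matrix_rank(H))   # 2: f is computed by a WFA with 2 states
```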
A Fundamental Theorem about WFA

Relates the rank of $H_f$ and the number of states of a WFA computing $f$

Theorem [Carlyle and Paz, 1971, Fliess, 1974]
Let $f : \Sigma^* \to \mathbb{R}$ be any function
1. If $f = f_A$ for some WFA $A$ with $n$ states $\Rightarrow$ $\mathrm{rank}(H_f) \leq n$
2. If $\mathrm{rank}(H_f) = n$ $\Rightarrow$ there exists a WFA $A$ with $n$ states s.t. $f = f_A$

Why Fundamental? Because the proof of (2) gives an algorithm for recovering $A$ from the Hankel matrix of $f_A$

Example: Can recover an HMM from the probabilities it assigns to sequences of observations
Structure of Low-rank Hankel Matrices

The Hankel matrix of a WFA factorizes into a forward matrix and a backward matrix:
$H_{f_A} \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$, $P \in \mathbb{R}^{\Sigma^* \times n}$, $S \in \mathbb{R}^{n \times \Sigma^*}$, with $H_{f_A} = P\,S$

$f_A(p_1 \cdots p_T \cdot s_1 \cdots s_{T'}) = \underbrace{\alpha_0^\top A_{p_1} \cdots A_{p_T}}_{\alpha_A(p)} \; \underbrace{A_{s_1} \cdots A_{s_{T'}} \alpha_\infty}_{\beta_A(s)}$

$\alpha_A(p) = P(p, \cdot)$ (row of $P$ indexed by prefix $p$)
$\beta_A(s) = S(\cdot, s)$ (column of $S$ indexed by suffix $s$)
Hankel Factorizations and Operators

For each symbol $\sigma$, define $H_\sigma \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ by $H_\sigma(p, s) = f_A(p \cdot \sigma \cdot s)$, and recall $P \in \mathbb{R}^{\Sigma^* \times n}$, $S \in \mathbb{R}^{n \times \Sigma^*}$, $A_\sigma \in \mathbb{R}^{n \times n}$

$f_A(p_1 \cdots p_T \cdot \sigma \cdot s_1 \cdots s_{T'}) = \underbrace{\alpha_0^\top A_{p_1} \cdots A_{p_T}}_{\alpha_A(p)} \cdot A_\sigma \cdot \underbrace{A_{s_1} \cdots A_{s_{T'}} \alpha_\infty}_{\beta_A(s)}$

$H = P\,S$   and   $H_\sigma = P\,A_\sigma\,S$   $\Rightarrow$   $A_\sigma = P^+ H_\sigma S^+$

Note: Works with finite sub-blocks as well (assuming $\mathrm{rank}(P) = \mathrm{rank}(S) = n$)
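A sketch that verifies these factorization equations on finite sub-blocks, again reusing the hypothetical 2-state WFA and the `forward`/`backward` helpers from the previous sketches; the prefix and suffix sets are arbitrary illustrative choices.

```python
import numpy as np
from numpy.linalg import pinv

prefixes = ["", "a", "b", "ab"]
suffixes = ["", "a", "b", "ba"]

P = np.array([forward(p) for p in prefixes])                # |prefixes| x n
S = np.array([backward(s) for s in suffixes]).T             # n x |suffixes|
H = np.array([[f_A(p + s) for s in suffixes] for p in prefixes])
H_a = np.array([[f_A(p + "a" + s) for s in suffixes] for p in prefixes])

assert np.allclose(H, P @ S)               # H = P S
assert np.allclose(H_a, P @ A["a"] @ S)    # H_sigma = P A_sigma S
A_a_recovered = pinv(P) @ H_a @ pinv(S)    # needs rank(P) = rank(S) = n
assert np.allclose(A_a_recovered, A["a"])
```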
General Learning Algorithm for WFA

Pipeline:  Data  →  Hankel matrix estimation  →  low-rank Hankel matrix  →  factorization and linear algebra  →  WFA

Key Idea: The Hankel Trick
1. Learn a low-rank Hankel matrix that implicitly induces "latent" states
2. Recover the states from a decomposition of the Hankel matrix
Limitations of WFA

Invariance Under Change of Basis
For any invertible matrix $Q$ the following WFA are equivalent:
- $A = \langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$
- $B = \langle Q^\top \alpha_0, Q^{-1} \alpha_\infty, \{Q^{-1} A_\sigma Q\} \rangle$

$f_A(x) = \alpha_0^\top A_{x_1} \cdots A_{x_T} \alpha_\infty = (\alpha_0^\top Q)(Q^{-1} A_{x_1} Q) \cdots (Q^{-1} A_{x_T} Q)(Q^{-1} \alpha_\infty) = f_B(x)$

Example
$A_a = \begin{bmatrix} 0.5 & 0.1 \\ 0.2 & 0.3 \end{bmatrix}$,  $Q = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}$,  $Q^{-1} A_a Q = \begin{bmatrix} 0.3 & -0.2 \\ -0.1 & 0.5 \end{bmatrix}$

Consequences
- There is no unique parametrization for WFA
- Given $A$ it is undecidable whether $\forall x \; f_A(x) \geq 0$
- Cannot expect to recover a probabilistic parametrization
Outline
1. Weighted Automata and Hankel Matrices
2. Spectral Learning of Probabilistic Automata
3. Spectral Methods for Transducers and Grammars
   - Sequence Tagging
   - Finite-State Transductions
   - Tree Automata
4. Hankel Matrices with Missing Entries
5. Conclusion
6. References
Spectral Learning of Probabilistic Automata

Pipeline:  Data  →  Hankel matrix estimation  →  low-rank Hankel matrix  →  factorization and linear algebra  →  WFA

Basic Setup:
- Data are strings sampled from a probability distribution on $\Sigma^*$
- The Hankel matrix is estimated by empirical probabilities
- Factorization and low-rank approximation are computed using SVD
The Empirical Hankel Matrix

Suppose $S = (x_1, \dots, x_N)$ is a sample of $N$ i.i.d. strings

Empirical distribution: $\hat{f}_S(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}[x_i = x]$
Empirical Hankel matrix: $\hat{H}_S(p, s) = \hat{f}_S(ps)$

Example:
S = { aa, b, bab, a, b, a, ab, aa, ba, b, aa, a, aa, bab, b, aa }   →   $\hat{f}_S(aa) = \frac{5}{16} \approx 0.31$

               a    b
        λ  [ .19  .25 ]
H_S  =  a  [ .31  .06 ]
        b  [ .06  .00 ]
        ba [ .00  .13 ]

(Hankel block with rows $\mathcal{P} = \{\lambda, a, b, ba\}$ and columns $\mathcal{S} = \{a, b\}$)
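A sketch, assuming Python with NumPy, of how such a block is filled in from a sample; the basis is the one used in the example above.

```python
from collections import Counter
import numpy as np

def empirical_hankel(sample, prefixes, suffixes):
    """Hankel block H(p, s) = empirical probability of the string p + s."""
    counts = Counter(sample)
    N = len(sample)
    return np.array([[counts[p + s] / N for s in suffixes] for p in prefixes])

sample = ["aa", "b", "bab", "a", "b", "a", "ab", "aa",
          "ba", "b", "aa", "a", "aa", "bab", "b", "aa"]
H = empirical_hankel(sample, prefixes=["", "a", "b", "ba"], suffixes=["a", "b"])
print(np.round(H, 2))   # matches the block on this slide (up to rounding)
```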
Finite Sub-blocks of Hankel Matrices

Parameters:
- Set of rows (prefixes) $\mathcal{P} \subset \Sigma^*$
- Set of columns (suffixes) $\mathcal{S} \subset \Sigma^*$

(The slide pictures one large sub-block of the Hankel matrix with the pieces $H$, $H_\sigma$, $h_{\lambda,\mathcal{S}}$ and $h_{\mathcal{P},\lambda}$ highlighted inside it.)

- $H \in \mathbb{R}^{\mathcal{P} \times \mathcal{S}}$ for finding $P$ and $S$
- $H_\sigma \in \mathbb{R}^{\mathcal{P} \times \mathcal{S}}$ for finding $A_\sigma$
- $h_{\lambda,\mathcal{S}} \in \mathbb{R}^{1 \times \mathcal{S}}$ for finding $\alpha_0$
- $h_{\mathcal{P},\lambda} \in \mathbb{R}^{\mathcal{P} \times 1}$ for finding $\alpha_\infty$
Low-rank Approximation and Factorization

Will use the singular value decomposition (SVD) as the main building block. Hence the name spectral!

Parameters:
- Desired number of states $n$
- Block $H \in \mathbb{R}^{\mathcal{P} \times \mathcal{S}}$ of the empirical Hankel matrix

Low-rank Approximation: compute the truncated SVD of rank $n$
$H \approx U_n \Lambda_n V_n^\top$   with shapes   $(\mathcal{P} \times \mathcal{S}) \approx (\mathcal{P} \times n)(n \times n)(n \times \mathcal{S})$

Factorization: $H \approx P S$ given by the SVD; pseudo-inverses are easy
$P = U_n \Lambda_n \Rightarrow P^+ = \Lambda_n^{-1} U_n^\top = (H V_n)^+$
$S = V_n^\top \Rightarrow S^+ = V_n$
Computing the WFA

Parameters:
- Factorization $H \approx (U \Lambda) \cdot V^\top = P \cdot S$
- Hankel blocks $H_\sigma$, $h_{\lambda,\mathcal{S}}$, $h_{\mathcal{P},\lambda}$

Equations:
$A_\sigma = P^+ H_\sigma S^+ = \Lambda^{-1} U^\top H_\sigma V = (H V)^+ H_\sigma V$
$\alpha_0^\top = h_{\lambda,\mathcal{S}} S^+ = h_{\lambda,\mathcal{S}} V$
$\alpha_\infty = P^+ h_{\mathcal{P},\lambda} = \Lambda^{-1} U^\top h_{\mathcal{P},\lambda} = (H V)^+ h_{\mathcal{P},\lambda}$

Full Algorithm
1. Estimate the empirical Hankel matrix and retrieve sub-blocks $H$, $H_\sigma$, $h_{\lambda,\mathcal{S}}$, $h_{\mathcal{P},\lambda}$
2. Perform SVD of $H$
3. Solve for $A_\sigma$, $\alpha_0$, $\alpha_\infty$ with pseudo-inverses
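A sketch of the full pipeline (steps 1 to 3), assuming NumPy and reusing the `empirical_hankel` helper and `sample` from the earlier sketch; the basis, alphabet and number of states below are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def spectral_wfa(sample, prefixes, suffixes, alphabet, n):
    """Spectral learning: empirical Hankel blocks -> truncated SVD -> WFA operators."""
    H = empirical_hankel(sample, prefixes, suffixes)
    H_sigma = {a: empirical_hankel(sample, [p + a for p in prefixes], suffixes)
               for a in alphabet}
    h_lS = empirical_hankel(sample, [""], suffixes)     # 1 x |suffixes|
    h_Pl = empirical_hankel(sample, prefixes, [""])     # |prefixes| x 1

    _, _, Vt = np.linalg.svd(H, full_matrices=False)
    V = Vt[:n].T                                        # |suffixes| x n
    HV_pinv = np.linalg.pinv(H @ V)                     # plays the role of Lambda_n^{-1} U_n^T

    A_ops = {a: HV_pinv @ H_sigma[a] @ V for a in alphabet}
    a0 = (h_lS @ V).ravel()
    ainf = (HV_pinv @ h_Pl).ravel()
    return a0, ainf, A_ops

a0_hat, ainf_hat, A_hat = spectral_wfa(sample, ["", "a", "b", "ba"],
                                       ["", "a", "b"], ["a", "b"], n=2)
```

The learned model is evaluated exactly as in the first sketch, e.g. `a0_hat @ A_hat["a"] @ A_hat["b"] @ ainf_hat` estimates the probability of the string ab.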
Computational and Statistical Complexity

Running Time:
- Empirical Hankel matrix: $O(|\mathcal{P}\mathcal{S}| \cdot N)$
- SVD and linear algebra: $O(|\mathcal{P}| \cdot |\mathcal{S}| \cdot n)$

Statistical Consistency:
- By the law of large numbers, $\hat{H}_S \to \mathbb{E}[H]$ when $N \to \infty$
- If $\mathbb{E}[H]$ is the Hankel matrix of some WFA $A$, then $\hat{A} \to A$
- Works for data coming from PFA and HMM

PAC Analysis: (assuming data from $A$ with $n$ states)
- With high probability, $\|\hat{H}_S - H\| \leq O(1/\sqrt{N})$
- When $N \geq O(n |\Sigma|^2 T^4 / (\varepsilon^2 \, s_n(H)^4))$, then $\sum_{|x| \leq T} |f_A(x) - f_{\hat{A}}(x)| \leq \varepsilon$

Proofs can be found in [Hsu et al., 2009, Bailly, 2011, Balle, 2013]
Practical Considerations

Pipeline:  Data  →  Hankel matrix estimation  →  low-rank Hankel matrix  →  factorization and linear algebra  →  WFA

Basic Setup:
- Data are strings sampled from a probability distribution on $\Sigma^*$
- The Hankel matrix is estimated by empirical probabilities
- Factorization and low-rank approximation are computed using SVD

Advanced Implementations:
- Choice of parameters $\mathcal{P}$ and $\mathcal{S}$
- Scalable estimation and factorization of Hankel matrices
- Smoothing and variance normalization
- Use of prefix and substring statistics
Choosing the Basis

Definition: The pair $(\mathcal{P}, \mathcal{S})$ defining the sub-block is called a basis

Intuitions:
- The basis should be chosen such that $\mathbb{E}[H]$ has full rank
- $\mathcal{P}$ must contain strings reaching each possible state of the WFA
- $\mathcal{S}$ must contain strings producing different outcomes for each pair of states in the WFA

Popular Approaches:
- Set $\mathcal{P} = \mathcal{S} = \Sigma^{\leq k}$ for some $k \geq 1$ [Hsu et al., 2009]
- Choose $\mathcal{P}$ and $\mathcal{S}$ to contain the $K$ most frequent prefixes and suffixes in the sample [Balle et al., 2012] (see the sketch below)
- Take all prefixes and suffixes appearing in the sample [Bailly et al., 2009]
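A sketch of the frequent-prefix/suffix heuristic, assuming Python; `top_k` is an illustrative parameter name and `sample` is reused from the earlier sketch.

```python
from collections import Counter

def frequent_basis(sample, top_k):
    """Pick the top_k most frequent prefixes and suffixes observed in the sample."""
    prefix_counts, suffix_counts = Counter(), Counter()
    for x in sample:
        for i in range(len(x) + 1):
            prefix_counts[x[:i]] += 1
            suffix_counts[x[i:]] += 1
    prefixes = [p for p, _ in prefix_counts.most_common(top_k)]
    suffixes = [s for s, _ in suffix_counts.most_common(top_k)]
    return prefixes, suffixes

P_basis, S_basis = frequent_basis(sample, top_k=10)
```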
Scalable Implementations

Problem: When $|\Sigma|$ is large, even the simplest bases become huge

Hankel Matrix Representation:
- Use hash functions to map $\mathcal{P}$ ($\mathcal{S}$) to row (column) indices
- Use sparse matrix data structures, because the statistics are usually sparse (see the sketch below)
- Never store the full Hankel matrix in memory

Efficient SVD Computation:
- SVD for sparse matrices [Berry, 1992]
- Approximate randomized SVD [Halko et al., 2011]
- On-line SVD with rank-1 updates [Brand, 2006]
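A minimal sketch of the sparse route, assuming SciPy is available; the matrix is a random placeholder standing in for a Hankel block filled from counts.

```python
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Placeholder sparse Hankel block (0.1% density); in practice filled from statistics.
H_sparse = sparse_random(5_000, 5_000, density=0.001, format="csr", random_state=0)

# Truncated SVD computing only the top n singular triplets of a sparse matrix.
n = 5
U, lam, Vt = svds(H_sparse, k=n)
```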
Refining the Statistics in the Hankel Matrix

Smoothing the Estimates
- Empirical probabilities $\hat{f}_S(x)$ tend to be sparse
- As in $n$-gram models, smoothing can help when $\Sigma$ is large
- Should take into account that strings in $\mathcal{P}\mathcal{S}$ have different lengths
- Open Problem: How to smooth empirical Hankel matrices properly

Row and Column Weighting
- More frequent prefixes (suffixes) have better-estimated rows (columns)
- Can scale rows and columns to reflect that
- Will lead to more reliable SVD decompositions
- See [Cohen et al., 2013] for details
Substring Statistics

Problem: If the sample contains strings with a wide range of lengths, a small basis will ignore most of the examples

String Statistics (occurrence probability):
S = { aa, b, bab, a, bbab, abb, babba, abbb, ab, a, aabba, baa, abbab, baba, bb, a }

              a    b
       λ  [ .19  .06 ]
H   =  a  [ .06  .06 ]
       b  [ .00  .06 ]
       ba [ .06  .06 ]

Substring Statistics (expected number of occurrences as a substring):
Empirical expectation: $\frac{1}{N} \sum_{i=1}^{N} [\text{number of occurrences of } x \text{ in } x_i]$

               a     b
       λ  [ 1.31  1.56 ]
H   =  a  [  .19   .62 ]
       b  [  .56   .50 ]
       ba [  .06   .31 ]
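A sketch of how the substring-statistics block above can be computed, assuming Python with NumPy; the sample is the one from this slide.

```python
import numpy as np

def substring_hankel(sample, prefixes, suffixes):
    """H(p, s) = average number of occurrences of the substring p + s in a sample string."""
    def occurrences(x, w):   # overlapping occurrences of the (nonempty) word w in x
        return sum(1 for i in range(len(x) - len(w) + 1) if x[i:i + len(w)] == w)
    N = len(sample)
    return np.array([[sum(occurrences(x, p + s) for x in sample) / N
                      for s in suffixes] for p in prefixes])

sample2 = ["aa", "b", "bab", "a", "bbab", "abb", "babba", "abbb",
           "ab", "a", "aabba", "baa", "abbab", "baba", "bb", "a"]
Hsub = substring_hankel(sample2, prefixes=["", "a", "b", "ba"], suffixes=["a", "b"])
print(np.round(Hsub, 2))   # matches the substring-statistics block above
```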
Substring Statistics

Theorem [Balle et al., 2014]
If a probability distribution $f$ is computed by a WFA with $n$ states, then the corresponding substring statistics are also computed by a WFA with $n$ states

Learning from Substring Statistics
- Can work with smaller Hankel matrices
- But estimating the matrix takes longer
Experiment: PoS-tag Sequence Models

(Plot: word error rate (%) versus number of states, for the spectral method with the Σ basis and with bases of the k most frequent substrings, k ∈ {25, 50, 100, 300, 500}, compared against unigram and bigram baselines.)

- PTB sequences of simplified PoS tags [Petrov et al., 2012]
- Configuration: expectations on frequent substrings
- Metric: error rate on predicting the next symbol in test sequences
Experiment: PoS-tag Sequence Models

(Plot: word error rate (%) versus number of states, comparing the spectral method (Σ basis and k = 500 basis) with EM and with unigram and bigram baselines.)

- Comparison with a bigram baseline and EM
- Metric: error rate on predicting the next symbol in test sequences
- At training, the spectral method is more than 100× faster than EM
Outline
1. Weighted Automata and Hankel Matrices
2. Spectral Learning of Probabilistic Automata
3. Spectral Methods for Transducers and Grammars
   - Sequence Tagging
   - Finite-State Transductions
   - Tree Automata
4. Hankel Matrices with Missing Entries
5. Conclusion
6. References
Sequence Tagging and Transduction

- Many applications involve pairs of input-output sequences:
  - Sequence tagging (one output tag per input token), e.g. part-of-speech tagging
      input:  Ms. Haag plays Elianti .
      output: NNP NNP VBZ NNP .
  - Transductions (sequence lengths might differ), e.g. spelling correction
      input:  a p l e
      output: a p p l e
- Finite-state automata are classic methods to model these relations. Spectral methods apply naturally to this setting.
Sequence Tagging

- Notation:
  - Input alphabet $\mathcal{X}$
  - Output alphabet $\mathcal{Y}$
  - Joint alphabet $\Sigma = \mathcal{X} \times \mathcal{Y}$
- Goal: map input sequences to output sequences of the same length
- Approach: learn a function $f : (\mathcal{X} \times \mathcal{Y})^* \to \mathbb{R}$
  Then, given an input $x \in \mathcal{X}^T$ return $\mathrm{argmax}_{y \in \mathcal{Y}^T} f(x, y)$
  (note: this maximization is not tractable in general)
Weighted Finite Tagger

- Notation:
  - $\mathcal{X} \times \mathcal{Y}$: joint alphabet – finite set
  - $n$: number of states – positive integer
  - $\alpha_0$: initial weights – vector in $\mathbb{R}^n$ (features of empty prefix)
  - $\alpha_\infty$: final weights – vector in $\mathbb{R}^n$ (features of empty suffix)
  - $A_a^b$: transition weights – matrix in $\mathbb{R}^{n \times n}$ ($\forall a \in \mathcal{X}, b \in \mathcal{Y}$)
- Definition: WFTagger with $n$ states over $\mathcal{X} \times \mathcal{Y}$
  $A = \langle \alpha_0, \alpha_\infty, \{A_a^b\} \rangle$
- Compositional Function: Every WFTagger defines a function $f_A : (\mathcal{X} \times \mathcal{Y})^* \to \mathbb{R}$
  $f_A(x_1 \dots x_T, y_1 \dots y_T) = \alpha_0^\top A_{x_1}^{y_1} \cdots A_{x_T}^{y_T} \alpha_\infty = \alpha_0^\top A_x^y \alpha_\infty$
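A sketch of how a WFTagger scores an aligned (input, output) pair, assuming NumPy; the alphabets, tag set and weights below are placeholders, not values from the slides.

```python
import numpy as np

n = 2
rng = np.random.default_rng(0)
X, Y = ["a", "b"], ["0", "1"]                  # input alphabet and tag set
a0, ainf = rng.random(n), rng.random(n)        # placeholder initial / final weights
Aop = {(a, b): rng.random((n, n)) for a in X for b in Y}   # one matrix per (input, tag) pair

def f_tagger(x, y):
    """f_A(x, y) = a0^T A_{x_1}^{y_1} ... A_{x_T}^{y_T} ainf."""
    v = a0
    for a, b in zip(x, y):
        v = v @ Aop[(a, b)]
    return v @ ainf

print(f_tagger("abba", "0110"))
```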
The Spectral Method for WFTaggers

Pipeline:  Data  →  Hankel matrix estimation  →  low-rank Hankel matrix  →  factorization and linear algebra  →  WFA

- Assume $f(x, y) = \mathbb{P}(x, y)$
- Same mechanics as for WFA, with $\Sigma = \mathcal{X} \times \mathcal{Y}$
- In a nutshell:
  1. Choose a set of prefixes and suffixes to define the Hankel matrix
     → in this case they are bistrings
  2. Estimate the Hankel matrix with prefix-suffix training statistics
  3. Factorize the Hankel matrix using SVD
  4. Compute the α and β projections, and compute the operators $\langle \alpha_0, \alpha_\infty, \{A_\sigma\} \rangle$
- Other cases:
  - $f_A(x, y) = \mathbb{P}(y \mid x)$ – see [Balle et al., 2011]
  - $f_A(x, y)$ non-probabilistic – see [Quattoni et al., 2014]
Prediction with WFTaggers

- Assume $f_A(x, y) = \mathbb{P}(x, y)$
- Given $x_{1:T}$, compute the most likely output tag at position $t$:
  $\mathrm{argmax}_{a \in \mathcal{Y}} \; \mu(t, a)$
  where
  $\mu(t, a) \triangleq \mathbb{P}(y_t = a \mid x) \propto \sum_{y = y_1 \dots a \dots y_T} \mathbb{P}(x, y)$

$\mu(t, a) \propto \sum_{y = y_1 \dots a \dots y_T} \alpha_0^\top A_x^y \alpha_\infty
  \propto \underbrace{\alpha_0^\top \Big( \sum_{y_1 \dots y_{t-1}} A_{x_{1:t-1}}^{y_{1:t-1}} \Big)}_{\alpha^*_A(x_{1:t-1})} \; A_{x_t}^{a} \; \underbrace{\Big( \sum_{y_{t+1} \dots y_T} A_{x_{t+1:T}}^{y_{t+1:T}} \Big) \alpha_\infty}_{\beta^*_A(x_{t+1:T})}$

with the recursions
$\alpha^*_A(x_{1:t}) = \alpha^*_A(x_{1:t-1}) \Big( \sum_{b \in \mathcal{Y}} A_{x_t}^{b} \Big)$
$\beta^*_A(x_{t:T}) = \Big( \sum_{b \in \mathcal{Y}} A_{x_t}^{b} \Big) \beta^*_A(x_{t+1:T})$
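A sketch of the marginal computation via these recursions, reusing `a0`, `ainf`, `Aop`, `X`, `Y` from the tagger sketch above; normalizing the scores over tags yields $\mathbb{P}(y_t = b \mid x)$.

```python
def tag_marginals(x):
    """mu[t][b] = P(y_t = b | x), computed with the alpha*/beta* recursions."""
    T = len(x)
    M = {a: sum(Aop[(a, b)] for b in Y) for a in X}    # operators summed over output tags
    alpha_star = [a0]
    for t in range(T):                                  # forward recursion
        alpha_star.append(alpha_star[-1] @ M[x[t]])
    beta_star = [ainf]
    for t in reversed(range(T)):                        # backward recursion
        beta_star.append(M[x[t]] @ beta_star[-1])
    beta_star.reverse()                                 # beta_star[t] covers the suffix x[t:]
    marginals = []
    for t in range(T):
        scores = {b: alpha_star[t] @ Aop[(x[t], b)] @ beta_star[t + 1] for b in Y}
        Z = sum(scores.values())
        marginals.append({b: s / Z for b, s in scores.items()})
    return marginals

mu = tag_marginals("abba")
print([max(m, key=m.get) for m in mu])   # most likely tag at each position
```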
Prediction with WFTaggers (II)

- Assume $f_A(x, y) = \mathbb{P}(x, y)$
- Given $x_{1:T}$, compute the most likely output bigram $ab$ at position $t$:
  $\mathrm{argmax}_{a, b \in \mathcal{Y}} \; \mu(t, a, b)$
  where
  $\mu(t, a, b) = \mathbb{P}(y_t = a, y_{t+1} = b \mid x) \propto \alpha^*_A(x_{1:t-1}) \, A_{x_t}^{a} \, A_{x_{t+1}}^{b} \, \beta^*_A(x_{t+2:T})$
- Computing the most likely full sequence $y$ is intractable
  In practice, use Minimum Bayes-Risk decoding:
  $\mathrm{argmax}_{y \in \mathcal{Y}^T} \sum_t \mu(t, y_t, y_{t+1})$
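Minimum Bayes-Risk decoding over these bigram marginals reduces to a max-sum dynamic program over a chain. A sketch, assuming a list `mu_pair` where `mu_pair[t][(a, b)]` holds $\mu(t, a, b)$ (a hypothetical helper analogous to `tag_marginals` would produce it):

```python
def mbr_decode(mu_pair, tags):
    """argmax_y sum_t mu_pair[t][(y_t, y_{t+1})], by dynamic programming."""
    score = {b: 0.0 for b in tags}        # best score of a prefix ending in tag b
    back = []                             # backpointers, one dict per bigram position
    for t in range(len(mu_pair)):
        new_score, pointers = {}, {}
        for b in tags:
            best_a = max(tags, key=lambda a: score[a] + mu_pair[t][(a, b)])
            new_score[b] = score[best_a] + mu_pair[t][(best_a, b)]
            pointers[b] = best_a
        back.append(pointers)
        score = new_score
    y = [max(score, key=score.get)]       # best final tag, then walk the backpointers
    for pointers in reversed(back):
        y.append(pointers[y[-1]])
    return list(reversed(y))
```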
Finite State Transducers

(Figure: an FST fragment with transitions a–c, ǫ–d and b–e generating the aligned pair (ab, cde).)

- A WFTransducer evaluates aligned strings, using the empty symbol ǫ to produce one-to-one alignments:
  $f\!\left(\begin{smallmatrix} a & \epsilon & b \\ c & d & e \end{smallmatrix}\right) = \alpha_0^\top A_a^c A_\epsilon^d A_b^e \alpha_\infty$
- Then, a function $g$ can be defined on unaligned strings by aggregating over alignments:
  $g(ab, cde) = \sum_{\pi \in \Pi(ab, cde)} f(\pi)$
Finite State Transducers: Main Problems

- Prediction: given an FST $A$, how to . . .
  - Compute $g(x, y)$ for unaligned strings? → using edit-distance recursions
  - Compute marginal quantities $\mu(\text{edge}) = \mathbb{P}(\text{edge} \mid x)$? → also using edit-distance recursions
  - Compute the most-likely $y$ for a given $x$? → use MBR decoding with marginal scores
- Unsupervised Learning: learn an FST from pairs of unaligned strings
  - Unlike with EM, the spectral method cannot recover latent structure such as alignments
    (recall: alignments are needed to estimate the Hankel entries)
  - See [Bailly et al., 2013b] for a solution based on Hankel matrix completion
Spectral Learning of Tree Automata and Grammars

(Figure: parse tree for "Mary plays the guitar": S → NP VP; NP → noun "Mary"; VP → verb "plays", NP; NP → det "the", noun "guitar".)

Some References:
- Tree series: [Bailly et al., 2010]
- Latent-annotated PCFG: [Cohen et al., 2012, Cohen et al., 2013]
- Dependency parsing: [Luque et al., 2012, Dhillon et al., 2012]
- Unsupervised learning of WCFG: [Bailly et al., 2013a, Parikh et al., 2014]
- Synchronous grammars: [Saluja et al., 2014]
Compositional Functions over Trees

A weighted tree automaton assigns every tree $t$ an inside vector $\beta_A(t) \in \mathbb{R}^n$, computed bottom-up, and a value
$f(t) = \alpha_A^\top \, \beta_A(t)$

Unfolding the root of a tree $t = a(t_1, t_2)$ exposes the operator of the root symbol:
$f(t) = \alpha_A^\top \, A_a \big( \beta_A(t_1) \otimes \beta_A(t_2) \big)$

and the same composition applies recursively to the subtrees, e.g.
$\beta_A\big(c(b, b)\big) = A_c \big( \beta_A(b) \otimes \beta_A(b) \big)$
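A sketch of the bottom-up evaluation this composition suggests, assuming NumPy; trees are nested tuples, each internal symbol gets an order-3 tensor $A_\sigma$, each leaf symbol a vector, and all weights are placeholders.

```python
import numpy as np

n = 3
rng = np.random.default_rng(1)
alpha = rng.random(n)                                    # root weights
leaf = {"b": rng.random(n), "c": rng.random(n)}          # beta_A for leaf symbols
T = {"a": rng.random((n, n, n)), "c": rng.random((n, n, n))}   # A_sigma as n x n x n tensors

def beta(t):
    """Bottom-up inside vector: beta(sigma(t1, t2)) = A_sigma (beta(t1) (x) beta(t2))."""
    if isinstance(t, str):
        return leaf[t]
    sigma, t1, t2 = t
    # Contract the tensor with the two child vectors (equivalent to A_sigma @ kron(b1, b2)).
    return np.einsum("ijk,j,k->i", T[sigma], beta(t1), beta(t2))

def f(t):
    return alpha @ beta(t)

tree = ("a", "b", ("c", "b", "b"))    # a(b, c(b, b))
print(f(tree))
```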
Inside-Outside Composition of Trees

(Figure: a tree $t$ split into an outside tree $t_o$ with a marked hole and an inside tree $t_i$ plugged into that hole: $t = t_o \odot t_i$.)

Note: inside-outside composition generalizes the notion of concatenation in strings, i.e., outside trees are prefixes, inside trees are suffixes