  1. Learning Automata with Hankel Matrices
  Borja Balle, Amazon Research
  Cambridge Highlights — London, September 2017

  2. Weighted Finite Automata (WFA) over $\mathbb{R}$
  Algebraic representation: a WFA is a tuple $A = \langle \alpha, \beta, \{A_a\}_{a \in \Sigma} \rangle$ with initial weights $\alpha \in \mathbb{R}^n$, final weights $\beta \in \mathbb{R}^n$, and one transition matrix $A_a \in \mathbb{R}^{n \times n}$ per symbol $a \in \Sigma$.
  Graphical representation: [figure: a two-state automaton with states $q_1$, $q_2$ and weighted transitions such as $(a, 1.2)$, $(a, 3.2)$, $(b, 2)$, $(b, 5)$, $(a, -2)$, shown next to the corresponding matrices $\alpha$, $\beta$, $A_a$, $A_b$; the exact example weights are not recoverable from the transcript.]
  Behavioral representation: each WFA $A$ computes a function $A : \Sigma^\star \to \mathbb{R}$ given by
  $$A(x_1 \cdots x_T) = \alpha^\top A_{x_1} \cdots A_{x_T} \beta$$
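As a minimal sketch of this evaluation in Python/numpy: the two-state weights below are illustrative placeholders (the slide's exact example matrices are not recoverable from the transcript), and `wfa_eval` implements the behavioral representation directly.

```python
import numpy as np

# Illustrative two-state WFA over {a, b}: initial weights alpha,
# final weights beta, one transition matrix per symbol.
alpha = np.array([-1.0, 1.2])
beta = np.array([2.0, -2.0])
A = {
    "a": np.array([[0.5, -2.0], [-1.0, 0.5]]),
    "b": np.array([[1.2, 0.0], [0.0, 5.0]]),
}

def wfa_eval(alpha, beta, A, x):
    """Behavioral representation: f(x_1 ... x_T) = alpha^T A_{x_1} ... A_{x_T} beta."""
    v = alpha
    for symbol in x:
        v = v @ A[symbol]
    return v @ beta

print(wfa_eval(alpha, beta, A, "abba"))  # a real number
```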

  3. In This Talk...
  - Describe a core algorithm common to many algorithms for learning weighted automata
  - Explain the role this core plays in three learning problems in different setups
  - Survey extensions to more complex models and some applications

  4. Outline
  1. From Hankel Matrices to Weighted Automata
  2. From Data to Hankel Matrices
  3. From Theory to Practice

  5. Outline (repeated to mark the start of Part 1: From Hankel Matrices to Weighted Automata)

  6. Hankel Matrices and Fliess' Theorem
  Given $f : \Sigma^\star \to \mathbb{R}$, define its Hankel matrix $H_f \in \mathbb{R}^{\Sigma^\star \times \Sigma^\star}$ by indexing rows with prefixes and columns with suffixes, so that the entry in row $p$ and column $s$ is $H_f(p, s) = f(p \cdot s)$:
  $$H_f = \begin{bmatrix} f(\epsilon) & f(a) & f(b) & \cdots \\ f(a) & f(aa) & f(ab) & \cdots \\ f(b) & f(ba) & f(bb) & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{bmatrix}$$
  Theorem [Fli74]:
  1. The rank of $H_f$ is finite if and only if $f$ is computed by a WFA.
  2. The rank $\operatorname{rank}(f) = \operatorname{rank}(H_f)$ equals the number of states of a minimal WFA computing $f$.
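To make the definition concrete, here is a hedged Python sketch that materializes a finite block of $H_f$ and checks the theorem on a toy function; $f(x) = 0.5^{|x|}$ is an illustrative choice (computed by a one-state WFA), not an example from the slides.

```python
import numpy as np

def hankel_block(f, prefixes, suffixes):
    """Finite sub-block of the Hankel matrix: entry (i, j) is f(prefixes[i] + suffixes[j])."""
    return np.array([[f(p + s) for s in suffixes] for p in prefixes])

# Toy rank-1 function: f(x) = 0.5^{|x|}, computed by a one-state WFA
# with alpha = beta = 1 and A_a = A_b = 0.5.
f = lambda x: 0.5 ** len(x)

P = ["", "a", "b"]  # prefix-closed, contains the empty string
S = ["", "a", "b"]  # suffix-closed, contains the empty string
H = hankel_block(f, P, S)
print(np.linalg.matrix_rank(H))  # 1, matching rank(f) as in Fliess' theorem
```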

  7. The Structure of Hankel Matrices
  Each entry of $H$ factors through the automaton. For a prefix $p = p_1 \cdots p_T$ and a suffix $s = s_1 \cdots s_{T'}$:
  $$H(p, s) = f(p s) = A(p_1 \cdots p_T \, s_1 \cdots s_{T'}) = \alpha^\top A_{p_1} \cdots A_{p_T} \, A_{s_1} \cdots A_{s_{T'}} \beta$$
  so $H = P S$, where the rows of $P$ are the forward vectors $\alpha^\top A_{p_1} \cdots A_{p_T}$ and the columns of $S$ are the backward vectors $A_{s_1} \cdots A_{s_{T'}} \beta$. The shifted blocks factor the same way:
  $$H_a(p, s) = f(p a s) = \alpha^\top A_{p_1} \cdots A_{p_T} \, A_a \, A_{s_1} \cdots A_{s_{T'}} \beta \quad \Longrightarrow \quad H_a = P A_a S$$
  Algebraically, factorizing $H$ lets us solve for $A_a$ (checked numerically in the sketch below):
  $$H = P S, \quad H_a = P A_a S \quad \Longrightarrow \quad A_a = P^{+} H_a S^{+}$$
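The sketch below builds the forward and backward factors of a small single-symbol WFA with made-up weights and recovers $A_a$ through pseudo-inverses, exactly as in the last display.

```python
import numpy as np

# A known 2-state WFA over a single symbol 'a' (illustrative weights).
alpha = np.array([1.0, 0.0])
beta = np.array([1.0, 1.0])
Aa = np.array([[0.5, 0.2], [0.0, 0.3]])

def prod(word):
    """Product of transition matrices along a word (here all symbols are 'a')."""
    M = np.eye(2)
    for _ in word:
        M = M @ Aa
    return M

prefixes = ["", "a", "aa"]
suffixes = ["", "a", "aa"]

# Forward factor: rows alpha^T A_p.  Backward factor: columns A_s beta.
P = np.stack([alpha @ prod(p) for p in prefixes])           # |P| x n
S = np.stack([prod(s) @ beta for s in suffixes], axis=1)    # n x |S|

H = P @ S         # H(p, s)   = f(p s)
Ha = P @ Aa @ S   # H_a(p, s) = f(p a s)

# Solving the factorization for the transition matrix: A_a = P^+ H_a S^+.
Aa_rec = np.linalg.pinv(P) @ Ha @ np.linalg.pinv(S)
print(np.allclose(Aa, Aa_rec))  # True (P and S have full rank here)
```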

  8. SVD-based Reconstruction [HKZ09; Bal+14]
  Inputs:
  - Desired number of states $r$
  - Basis $B = (P, S)$ with $P, S \subset \Sigma^\star$ and $\epsilon \in P \cap S$
  - Finite Hankel blocks indexed by the prefixes and suffixes in $B$: $H_B \in \mathbb{R}^{P \times S}$ and $H_B^\Sigma = \{ H_B^a \in \mathbb{R}^{P \times S} : a \in \Sigma \}$
  Algorithm Spectral($H_B$, $H_B^\Sigma$, $r$) (a numpy sketch follows this slide):
  1. Compute the rank-$r$ SVD $H_B \approx U D V^\top$
  2. Let $A_a = D^{-1} U^\top H_B^a V$
  3. Let $\alpha = V^\top H_B(\epsilon, -)$ and $\beta = D^{-1} U^\top H_B(-, \epsilon)$
  4. Return $A = \langle \alpha, \beta, \{A_a\} \rangle$
  Running time:
  1. The SVD takes $O(|P| |S| r)$
  2. The matrix multiplications take $O(|\Sigma| |P| |S| r)$
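A direct numpy rendering of Spectral. This is a sketch, assuming the convention that row 0 and column 0 of $H_B$ are the ones indexed by $\epsilon$; the trailing check reuses the toy rank-1 function from the earlier sketch.

```python
import numpy as np

def spectral(HB, HB_sigma, r):
    """Spectral(H_B, H_B^Sigma, r): SVD-based WFA reconstruction.

    HB       : |P| x |S| Hankel block; row 0 / column 0 indexed by epsilon.
    HB_sigma : dict {a: H_B^a} of shifted blocks, one per symbol.
    r        : number of states of the returned WFA.
    """
    # 1. Rank-r truncated SVD, HB ~ U diag(d) V^T.
    U, d, Vt = np.linalg.svd(HB, full_matrices=False)
    U, d, Vt = U[:, :r], d[:r], Vt[:r, :]
    # 2. A_a = D^{-1} U^T H_B^a V.
    A = {a: (U.T @ Ha @ Vt.T) / d[:, None] for a, Ha in HB_sigma.items()}
    # 3. alpha = V^T H_B(eps, -) and beta = D^{-1} U^T H_B(-, eps).
    alpha = Vt @ HB[0, :]
    beta = (U.T @ HB[:, 0]) / d
    return alpha, beta, A

# Quick check on the rank-1 toy function f(x) = 0.5^{|x|} over {a, b}:
f = lambda x: 0.5 ** len(x)
P = S = ["", "a", "b"]
HB = np.array([[f(p + s) for s in S] for p in P])
HB_sigma = {a: np.array([[f(p + a + s) for s in S] for p in P]) for a in "ab"}
alpha, beta, A = spectral(HB, HB_sigma, 1)
v = alpha
for c in "ab":
    v = v @ A[c]
print(v @ beta, f("ab"))  # both 0.25
```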

  9. Properties of Spectral [HKZ09; Bal13; BM15a]
  Consistency:
  - If $P$ is prefix-closed, $S$ is suffix-closed, and $r = \operatorname{rank}(H_B) = \operatorname{rank}([H_B \mid H_B^\Sigma])$,
  - then for all $p \in P$, $s \in S$, $a \in \Sigma$, the WFA $A = \mathrm{Spectral}(H_B, H_B^\Sigma, r)$ satisfies $A(p \cdot s) = H_B(p, s)$ and $A(p \cdot a \cdot s) = H_B^a(p, s)$.
  Recovery:
  - If $H_B$ and $H_B^\Sigma$ are sub-blocks of $H_f$ with $r = \operatorname{rank}(f) = \operatorname{rank}(H_B)$,
  - then the WFA $A = \mathrm{Spectral}(H_B, H_B^\Sigma, r)$ satisfies $A \equiv f$.
  Robustness:
  - If $r = \operatorname{rank}(H_B) = \operatorname{rank}([H_B \mid H_B^\Sigma])$, $\| H_B - \hat{H}_B \| \le \varepsilon$, and $\| H_B^a - \hat{H}_B^a \| \le \varepsilon$ for all $a \in \Sigma$,
  - then $\langle \alpha, \beta, \{A_a\} \rangle = \mathrm{Spectral}(H_B, H_B^\Sigma, r)$ and $\langle \hat\alpha, \hat\beta, \{\hat{A}_a\} \rangle = \mathrm{Spectral}(\hat{H}_B, \hat{H}_B^\Sigma, r)$ satisfy $\| \alpha - \hat\alpha \|, \| \beta - \hat\beta \|, \| A_a - \hat{A}_a \| = O(\varepsilon)$.

  10. Outline (repeated to mark the start of Part 2: From Data to Hankel Matrices)

  11. Learning Models
  1. Exact query learning: membership + equivalence queries [BV96; BBM06; BM15a]
  2. Distributional PAC learning: samples from a stochastic WFA [HKZ09; BDR09; Bal+14]
  3. Statistical learning: optimize output predictions with respect to a loss function [BM12; BM15b]

  12. Exact Learning of WFA with Queries
  Setup:
  - Unknown $f : \Sigma^\star \to \mathbb{R}$ with $\operatorname{rank}(f) = n$
  - Membership oracle: $\mathrm{MQ}_f(x)$ returns $f(x)$ for any $x \in \Sigma^\star$
  - Equivalence oracle: $\mathrm{EQ}_f(A)$ returns true if $f \equiv A$, and $(\mathrm{false}, z)$ with $f(z) \neq A(z)$ otherwise
  Algorithm (a code sketch follows this slide):
  1. Initialize $P = S = \{\epsilon\}$ and maintain $B = (P, S)$
  2. Let $A = \mathrm{Spectral}(H_B, H_B^\Sigma, \operatorname{rank}(H_B))$
  3. While $\mathrm{EQ}_f(A) = (\mathrm{false}, z)$:
     3.1 Write $z = p \cdot a \cdot s$ with $p$ the longest prefix of $z$ in $P$
     3.2 Let $S = S \cup \mathrm{suffixes}(s)$
     3.3 While there exist $p \in P$ and $a \in \Sigma$ such that $H_B^a(p, -) \notin \mathrm{rowspan}(H_B)$, add $p \cdot a$ to $P$
     3.4 Let $A = \mathrm{Spectral}(H_B, H_B^\Sigma, \operatorname{rank}(H_B))$
  Analysis:
  - At most $n + 1$ calls to $\mathrm{EQ}_f$ and $O(|\Sigma| n^2 L)$ calls to $\mathrm{MQ}_f$, where $L = \max |z|$ over counterexamples
  - The number of $\mathrm{MQ}_f$ calls can be improved to $O((|\Sigma| + \log L) n^2)$; calls to $\mathrm{EQ}_f$ can be reduced by increasing calls to $\mathrm{MQ}_f$
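A hedged Python sketch of this loop, with the oracles passed in as callables (`mq(x)` returns $f(x)$; `eq(wfa)` returns None on equivalence and a counterexample string otherwise) and reusing `spectral` from the slide-8 sketch. It simplifies steps 3.1 and 3.2 by adding every suffix of the counterexample.

```python
import numpy as np

def in_rowspan(row, M, tol=1e-9):
    """True if `row` lies (numerically) in the row span of M."""
    return (np.linalg.matrix_rank(np.vstack([M, row]), tol=tol)
            == np.linalg.matrix_rank(M, tol=tol))

def learn_exact(mq, eq, sigma):
    """Exact learning from membership (mq) and equivalence (eq) oracles."""
    P, S = [""], [""]
    while True:
        HB = np.array([[mq(p + s) for s in S] for p in P])
        # Step 3.3: close the basis while some shifted row leaves the row span.
        changed = True
        while changed:
            changed = False
            for p in list(P):
                for a in sigma:
                    row = np.array([mq(p + a + s) for s in S])
                    if not in_rowspan(row, HB):
                        P.append(p + a)
                        HB = np.vstack([HB, row])
                        changed = True
        HB_sigma = {a: np.array([[mq(p + a + s) for s in S] for p in P])
                    for a in sigma}
        # `spectral` is the routine sketched under slide 8.
        wfa = spectral(HB, HB_sigma, np.linalg.matrix_rank(HB))
        z = eq(wfa)
        if z is None:
            return wfa
        # Simplified steps 3.1-3.2: add every suffix of the counterexample.
        S += [z[i:] for i in range(len(z) + 1) if z[i:] not in S]
```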

  13. PAC Learning Stochastic WFA
  Setup:
  - Unknown $f : \Sigma^\star \to \mathbb{R}$ with $\operatorname{rank}(f) = n$ defining a probability distribution on $\Sigma^\star$
  - Data: $x^{(1)}, \ldots, x^{(m)}$ i.i.d. strings sampled from $f$
  - Parameters: $n$ and $B = (P, S)$ such that $\operatorname{rank}(H_B) = n$ and $\epsilon \in P \cap S$
  Algorithm (a sketch of the estimation step follows this slide):
  1. Estimate Hankel matrices $\hat{H}_B$ and $\hat{H}_B^a$ for all $a \in \Sigma$ using empirical probabilities
     $$\hat{f}(x) = \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}[x^{(i)} = x]$$
  2. Return $\hat{A} = \mathrm{Spectral}(\hat{H}_B, \hat{H}_B^\Sigma, n)$
  Analysis:
  - Running time is $O(|P \cdot S|\, m + |\Sigma| |P| |S|\, n)$
  - With high probability,
    $$\sum_{|x| \le L} | f(x) - \hat{A}(x) | = O\left( \frac{L^2 |\Sigma| \sqrt{n}}{\sigma_n(H_B^f)^2 \sqrt{m}} \right)$$
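One way to realize step 1, assuming the sample arrives as a list of strings and the basis is given; a sketch that feeds straight into `spectral` from the slide-8 sketch.

```python
import numpy as np
from collections import Counter

def empirical_hankel(samples, prefixes, suffixes, sigma):
    """Estimate H_B and the shifted blocks H_B^a from i.i.d. strings using
    empirical probabilities f_hat(x) = (1/m) * #{i : x^(i) = x}."""
    m = len(samples)
    counts = Counter(samples)
    f_hat = lambda x: counts[x] / m
    HB = np.array([[f_hat(p + s) for s in suffixes] for p in prefixes])
    HB_sigma = {a: np.array([[f_hat(p + a + s) for s in suffixes]
                             for p in prefixes]) for a in sigma}
    return HB, HB_sigma

# Step 2 then reads: A_hat = spectral(HB, HB_sigma, n).
```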

  14. Statistical Learning of WFA
  Setup:
  - Unknown distribution $D$ over $\Sigma^\star \times \mathbb{R}$
  - Data: $(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})$ i.i.d. string-label pairs sampled from $D$
  - Parameters: $n$, a convex loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, a convex regularizer $R$, a regularization parameter $\lambda > 0$, and $B = (P, S)$ with $\epsilon \in P \cap S$
  Algorithm (a sketch with a concrete loss and regularizer follows this slide):
  1. Build $B' = (P', S)$ with $P' = P \cup P \cdot \Sigma$
  2. Find the Hankel matrix $\hat{H}_{B'}$ solving
     $$\min_{H} \; \frac{1}{m} \sum_{i=1}^{m} \ell(H(x^{(i)}), y^{(i)}) + \lambda R(H)$$
  3. Return $\hat{A} = \mathrm{Spectral}(\hat{H}_B, \hat{H}_B^\Sigma, n)$, where $\hat{H}_B$ and $\hat{H}_B^\Sigma$ are submatrices of $\hat{H}_{B'}$
  Analysis:
  - Running time is polynomial in $n$, $m$, $|\Sigma|$, $|P|$, and $|S|$
  - With high probability,
    $$\mathbb{E}_{(x, y) \sim D}[\ell(\hat{A}(x), y)] \le \frac{1}{m} \sum_{i=1}^{m} \ell(\hat{A}(x^{(i)}), y^{(i)}) + O\left(\frac{1}{\sqrt{m}}\right)$$
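The slides leave $\ell$ and $R$ abstract. As a hedged illustration, the sketch below picks the squared loss for $\ell$ and the nuclear norm for $R$ (the usual convex surrogate for low rank) and solves the program by proximal gradient descent via singular value thresholding; mapping each training string to a single (prefix, suffix) cell of $H$ is a simplification of how $H(x)$ is read off.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: the prox operator of tau * nuclear norm."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(d - tau, 0.0)) @ Vt

def learn_hankel(cells, ys, shape, lam=0.1, step=0.1, iters=500):
    """min_H (1/m) sum_i (H[cells[i]] - ys[i])^2 + lam * ||H||_*

    cells : list of (row, col) indices, one per labelled string x^(i) = p.s
    ys    : the corresponding labels y^(i)
    shape : (|P'|, |S|) of the Hankel matrix being learned
    """
    m = len(ys)
    rows, cols = (np.array(v) for v in zip(*cells))
    ys = np.array(ys, dtype=float)
    H = np.zeros(shape)
    for _ in range(iters):
        # Gradient of the empirical squared loss, accumulated per cell.
        grad = np.zeros(shape)
        np.add.at(grad, (rows, cols), 2.0 * (H[rows, cols] - ys) / m)
        # Proximal gradient step: the prox of the nuclear norm is SVT.
        H = svt(H - step * grad, step * lam)
    return H

# Step 3 then extracts H_B and H_B^Sigma as sub-blocks of H and calls spectral.
```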

  15. Outline (repeated to mark the start of Part 3: From Theory to Practice)

  16. Extensions
  1. More complex models
     - Transducers and taggers [BQC11; Qua+14]
     - Grammars and tree automata [Luq+12; Bal+14; RBC16]
     - Reactive models [BBP15; LBP16; BM17a]
  2. More realistic setups
     - Multiple related tasks [RBP17]
     - Timing data [BBP15; LBP16]
     - Single trajectory [BM17a]
     - Probabilistic models [BHP14]
  3. Deeper theory
     - Convex relaxations [BQC12]
     - Generalization bounds [BM15b; BM17b]
     - Approximate minimisation [BPP15]
     - Bisimulation metrics [BGP17]

  17. And It Works Too!
  Spectral methods are competitive against traditional methods:
  - Expectation maximization
  - Conditional random fields
  - Tensor decompositions
  In a variety of problems:
  - Sequence tagging
  - Constituency and dependency parsing
  - Timing and geometry learning
  - POS-level language modelling
  [Figure residue: the slide shows benchmark plots comparing spectral variants (e.g. spectral IO-HMM, spectral max-margin, SVTA) against HMM/EM, CRF, averaged-perceptron, and tensor-decomposition baselines, reporting Hamming accuracy, L1 distance, word error rate, relative error, and runtime as functions of training-set size, basis size, Hankel rank, and number of states; the numeric data is not recoverable from the transcript.]

  18. Open Problems and Current Trends
  - Optimal selection of $P$ and $S$ from data
  - Scalable convex optimization over sets of Hankel matrices
  - Constraining the output WFA (e.g. probabilistic automata)
  - Relations between learning and approximate minimisation
  - How much of this can be extended to WFA over semi-rings?
  - Spectral methods for initializing non-convex gradient-based learning algorithms
