A Spectral Learning Algorithm for Finite State Transducers
Borja Balle, Ariadna Quattoni, Xavier Carreras
ECML PKDD — September 7, 2011
Overview: Probabilistic Transducers

◮ Model input-output relations with hidden states
◮ As a conditional distribution Pr[y | x] over strings
◮ With certain independence assumptions

[Diagram: graphical model with inputs X_1, X_2, X_3, X_4, …, hidden states H_1, H_2, H_3, H_4, …, and outputs Y_1, Y_2, Y_3, Y_4, …]

◮ Used in many applications: NLP, biology, …
◮ Hard to learn in general; in practice the EM algorithm is used
Overview: Spectral Learning of Probabilistic Transducers

Our contribution:
◮ A fast learning algorithm for probabilistic FSTs
◮ With PAC-style theoretical guarantees
◮ Based on an observable operator model for FSTs
◮ Using spectral methods (Chang '96, Mossel-Roch '05, Hsu et al. '09, Siddiqi et al. '10)
◮ Performing better than EM in experiments with real data
Outline

◮ Observable Operators for FST
◮ Learning Observable Operator Models
◮ Experimental Evaluation
◮ Conclusion
Observable Operators for FST: Deriving Observable Operator Models

Given aligned sequences (x, y) ∈ (X × Y)^t (i.e. |x| = |y|), the model computes the conditional probability

\begin{align*}
\Pr[y \mid x] &= \textstyle\sum_{h \in \mathcal{H}^t} \Pr[y, h \mid x] && \text{(marginalize states)} \\
&= \textstyle\sum_{h_{t+1} \in \mathcal{H}} \Pr[y, h_{t+1} \mid x] && \text{(independence assumptions)} \\
&= \mathbf{1}^\top \alpha_{t+1} && \text{(vector form, } \alpha_{t+1} \in \mathbb{R}^m \text{)} \\
&= \mathbf{1}^\top A_{x_t}^{y_t} \alpha_t && \text{(forward-backward equations)} \\
&= \mathbf{1}^\top A_{x_t}^{y_t} \cdots A_{x_1}^{y_1} \alpha && \text{(induction on } t \text{)}
\end{align*}

The choice of an operator A_a^b depends only on observable symbols.
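To make the operator view concrete, here is a minimal numpy sketch of this forward computation; the dictionary representation of the operators and the function name are our own illustration, not from the paper.

```python
# Minimal sketch of Pr[y | x] = 1^T A_{x_t}^{y_t} ... A_{x_1}^{y_1} alpha.
# `ops` maps an (input, output) symbol pair to its m x m operator A_a^b;
# this representation is a hypothetical choice for illustration.
import numpy as np

def fst_probability(ops, alpha, x, y):
    """Conditional probability of output string y given aligned input string x."""
    state = np.asarray(alpha, dtype=float)
    for a, b in zip(x, y):            # one operator per aligned symbol pair
        state = ops[(a, b)] @ state   # alpha_{s+1} = A_{x_s}^{y_s} alpha_s
    return state.sum()                # 1^T alpha_{t+1}
```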
Observable Operators for FST: Observable Operator Model Parameters

Given X = {a_1, …, a_k}, Y = {b_1, …, b_l}, H = {c_1, …, c_m}, then

\[ \Pr[y \mid x] = \mathbf{1}^\top A_{x_t}^{y_t} \cdots A_{x_1}^{y_1} \alpha \]

with parameters:

\begin{align*}
A_a^b &= T_a D_b \in \mathbb{R}^{m \times m} && \text{(factorized operator)} \\
T_a(i, j) &= \Pr[H_s = c_i \mid X_{s-1} = a,\; H_{s-1} = c_j] \in \mathbb{R}^{m \times m} && \text{(state transitions)} \\
D_b(i, j) &= \delta_{i,j} \Pr[Y_s = b \mid H_s = c_j] \in \mathbb{R}^{m \times m} && \text{(observation emissions)} \\
O(i, j) &= \Pr[Y_s = b_i \mid H_s = c_j] \in \mathbb{R}^{l \times m} && \text{(collected emissions)} \\
\alpha(i) &= \Pr[H_1 = c_i] \in \mathbb{R}^m && \text{(initial probabilities)}
\end{align*}

The choice of an operator A_a^b depends only on observable symbols…
…but the operator parameters are conditioned on hidden states.
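A hedged sketch of how these parameters assemble into operators; the alphabet sizes and the random Dirichlet instantiation below are hypothetical, for illustration only.

```python
# Assemble the factorized operators A_a^b = T_a D_b from randomly drawn parameters.
import numpy as np

rng = np.random.default_rng(0)
k, l, m = 3, 3, 2                                   # |X|, |Y|, |H|

# T[a]: column-stochastic m x m transition matrix for input symbol a
T = [rng.dirichlet(np.ones(m), size=m).T for _ in range(k)]
# O[i, j] = Pr[Y = b_i | H = c_j]; columns sum to 1
O = rng.dirichlet(np.ones(l), size=m).T
alpha = rng.dirichlet(np.ones(m))                   # initial state distribution

# D_b = diag(O[b, :]), so A_a^b = T_a D_b
ops = {(a, b): T[a] @ np.diag(O[b]) for a in range(k) for b in range(l)}
```

With `fst_probability` from the previous sketch, summing Pr[y | x] over all l^t output strings y for a fixed input x returns 1, a quick sanity check that the operators define a proper conditional distribution.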
Observable Operators for FST: A Learnable Set of Observable Operators

Note that for any invertible Q ∈ R^{m×m}

\[ \Pr[y \mid x] = \mathbf{1}^\top Q^{-1} \,(Q A_{x_t}^{y_t} Q^{-1}) \cdots (Q A_{x_1}^{y_1} Q^{-1})\, Q\alpha \]

Idea (subspace identification methods for linear systems, '80s):
find a basis for the state space such that the operators in the new basis are related to observable quantities.

Following multiplicity automata and spectral HMM learning…
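The invariance is easy to confirm numerically; this check reuses the hypothetical `ops`, `alpha`, `m`, and `fst_probability` from the sketches above.

```python
# Conjugating every operator by an invertible Q leaves Pr[y | x] unchanged,
# so only the equivalence class of the parameters is identifiable.
import numpy as np

rng = np.random.default_rng(1)
Q = rng.normal(size=(m, m))                    # generic Q is invertible almost surely
Qinv = np.linalg.inv(Q)

conj_ops = {ab: Q @ A @ Qinv for ab, A in ops.items()}
x, y = [0, 2, 1], [1, 1, 0]                    # an arbitrary aligned pair

state = Q @ alpha
for a, b in zip(x, y):
    state = conj_ops[(a, b)] @ state
assert np.isclose(fst_probability(ops, alpha, x, y),
                  (np.ones(m) @ Qinv) @ state)
```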
Observable Operators for FST: A Learnable Set of Observable Operators (continued)

Find a basis Q where the operators can be expressed in terms of unigram, bigram and trigram probabilities:

\begin{align*}
\rho(i) &= \Pr[Y_1 = b_i] \in \mathbb{R}^l \\
P(i, j) &= \Pr[Y_1 = b_j, Y_2 = b_i] \in \mathbb{R}^{l \times l} \\
P_a^b(i, j) &= \Pr[Y_1 = b_j, Y_2 = b, Y_3 = b_i \mid X_2 = a] \in \mathbb{R}^{l \times l}
\end{align*}

Theorem (ρ, P and P_a^b are sufficient statistics)
Let P = UΣV^* be a thin SVD; then Q = U^⊤ O yields (under certain assumptions)

\begin{align*}
Q\alpha &= U^\top \rho \\
\mathbf{1}^\top Q^{-1} &= \rho^\top (U^\top P)^+ \\
Q A_a^b Q^{-1} &= (U^\top P_a^b)(U^\top P)^+
\end{align*}
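The following numpy sketch checks the theorem end-to-end on a small random model. The closed-form expressions for ρ, P, and P_a^b in terms of T, O, and α, and the assumption of i.i.d. uniform inputs, are our own reconstruction from the slide's definitions, not given on the slide.

```python
# Exact observable statistics of a random model, followed by spectral recovery.
import numpy as np

rng = np.random.default_rng(2)
k, l, m = 3, 3, 2
T = [rng.dirichlet(np.ones(m), size=m).T for _ in range(k)]   # column-stochastic T_a
O = rng.dirichlet(np.ones(l), size=m).T                       # l x m emission matrix
alpha = rng.dirichlet(np.ones(m))
pi_x = np.full(k, 1.0 / k)                                    # assumed input distribution

Tbar = sum(p * Ta for p, Ta in zip(pi_x, T))                  # "mean" transition matrix
M = (O * alpha).T                                             # M[:, j] = D_{b_j} alpha

rho = O @ alpha                                               # Pr[Y1 = b_i]
P = O @ Tbar @ M                                              # Pr[Y1 = b_j, Y2 = b_i]
Pab = {(a, b): O @ T[a] @ np.diag(O[b]) @ Tbar @ M            # trigram given X2 = a
       for a in range(k) for b in range(l)}

U = np.linalg.svd(P)[0][:, :m]                                # top-m left singular vectors
pinvUP = np.linalg.pinv(U.T @ P)
b1, binf = U.T @ rho, rho @ pinvUP
B = {ab: U.T @ Pab[ab] @ pinvUP for ab in Pab}

# The recovered representation reproduces Pr[y | x] exactly
x, y = [0, 2, 1], [1, 1, 0]
direct, spec = alpha.copy(), b1.copy()
for a, b in zip(x, y):
    direct = T[a] @ np.diag(O[b]) @ direct                    # A_a^b = T_a D_b
    spec = B[(a, b)] @ spec                                   # Q A_a^b Q^{-1}
assert np.isclose(direct.sum(), binf @ spec)
```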
Learning Observable Operator Models: Spectral Learning Algorithm

Given:
◮ Input alphabet X and output alphabet Y
◮ Number of hidden states m
◮ Training sample S = {(x^1, y^1), …, (x^n, y^n)}

Do:
◮ Compute the unigram $\hat\rho$, bigram $\hat P$ and trigram $\hat P_a^b$ relative frequencies in S
◮ Perform an SVD of $\hat P$ and take $\hat U$ with the top m left singular vectors
◮ Return the operators computed from $\hat\rho$, $\hat P$, $\hat P_a^b$ and $\hat U$

In time:
◮ O(n) to compute the relative frequencies
◮ O(|Y|^3) to compute the SVD
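Putting the steps together, here is a hedged sketch of the whole algorithm on a sample of aligned, integer-coded string pairs; the function name and coding conventions are ours.

```python
# Spectral learning of an FST from aligned string pairs; assumes every pair
# has length >= 3 so that unigram, bigram and trigram counts are all defined.
import numpy as np

def spectral_fst(S, k, l, m):
    """Return (binf, B, b1) with Pr[y|x] ~ binf @ B[(x_t,y_t)] @ ... @ B[(x_1,y_1)] @ b1."""
    n = len(S)
    rho, P = np.zeros(l), np.zeros((l, l))
    Pab = np.zeros((k, l, l, l))                 # Pab[a, b] is one l x l trigram block
    na = np.zeros(k)                             # how often X_2 = a, for conditioning
    for x, y in S:
        rho[y[0]] += 1
        P[y[1], y[0]] += 1                       # P(i, j) = Pr[Y1 = b_j, Y2 = b_i]
        Pab[x[1], y[1], y[2], y[0]] += 1
        na[x[1]] += 1
    rho /= n
    P /= n
    Pab /= np.maximum(na, 1.0)[:, None, None, None]

    U = np.linalg.svd(P)[0][:, :m]               # O(l^3) SVD; counting above is O(n)
    pinvUP = np.linalg.pinv(U.T @ P)
    b1, binf = U.T @ rho, rho @ pinvUP
    B = {(a, b): U.T @ Pab[a, b] @ pinvUP for a in range(k) for b in range(l)}
    return binf, B, b1
```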
Learning Observable Operator Models: PAC-Style Result

Setting:
◮ Input distribution D_X over X^* with λ = E[|X|] and μ = min_a Pr[X_2 = a]
◮ Conditional distributions D_{Y|x} on Y^* given x ∈ X^*, modeled by an FST with m states (satisfying certain rank assumptions)
◮ Sampling i.i.d. from the joint distribution D_X ⊗ D_{Y|X}

Theorem
For any 0 < ε, δ < 1, if the algorithm receives a sample of size

\[ n \geq O\!\left( \frac{\lambda^2\, m\, |\mathcal{Y}| \log(|\mathcal{X}|/\delta)}{\varepsilon^4\, \mu\, \sigma_O^2\, \sigma_P^4} \right) \]

(where σ_O and σ_P are the m-th singular values of O and P in the target), then with probability at least 1 − δ the hypothesis $\hat D_{Y|x}$ satisfies

\[ \mathbb{E}_X \sum_{y \in \mathcal{Y}^*} \bigl| D_{Y|X}(y) - \hat D_{Y|X}(y) \bigr| \leq \varepsilon \]

i.e. the L1 distance between the joint distributions D_X ⊗ D_{Y|X} and D_X ⊗ $\hat D_{Y|X}$ is at most ε.
Experimental Evaluation: Synthetic Experiments

Goal: compare against baselines when the learning assumptions hold
Target: randomly generated with |X| = 3, |Y| = 3, |H| = 2

[Plot: L1 distance (0 to 0.7) vs. number of training samples in thousands (32 to 32768) for HMM, k-HMM and FST]

◮ HMM: models input-output jointly
◮ k-HMM: one model for each input symbol
◮ Results averaged over 5 runs
Experimental Evaluation: Transliteration Experiments

Goal: compare against EM on a real task (where the modeling assumptions fail)
Task: English-to-Russian transliteration (brooklyn → бруклин)

[Plot: normalized edit distance (20 to 80) vs. number of training sequences (75 to 6000) for Spectral and EM with m = 2, 3]

Training times:
  Spectral         26 s
  EM (iteration)   37 s
  EM (best)      1133 s

◮ Sequence alignment done in preprocessing
◮ Standard techniques used for inference
◮ Test size: 943; |X| = 82, |Y| = 34
Conclusion: Summary of Contributions

◮ A fast spectral method for learning input-output OOMs
◮ Strong theoretical guarantees with few assumptions on the input distribution
◮ Outperforms previous spectral algorithms on FSTs
◮ Faster and better than EM on some real tasks
Technical Assumptions

X = {a_1, …, a_k}, Y = {b_1, …, b_l}, H = {c_1, …, c_m}

Parameters:

\begin{align*}
T_a(i, j) &= \Pr[H_s = c_i \mid X_{s-1} = a,\; H_{s-1} = c_j] \in \mathbb{R}^{m \times m} && \text{(state transitions)} \\
T &= \textstyle\sum_a T_a \Pr[X_1 = a] \in \mathbb{R}^{m \times m} && \text{(“mean” transition matrix)} \\
O(i, j) &= \Pr[Y_s = b_i \mid H_s = c_j] \in \mathbb{R}^{l \times m} && \text{(collected emissions)} \\
\alpha(i) &= \Pr[H_1 = c_i] \in \mathbb{R}^m && \text{(initial probabilities)}
\end{align*}

Assumptions:
1. l ≥ m
2. α > 0
3. rank(T) = rank(O) = m
4. min_a Pr[X_2 = a] > 0
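For a known model, these conditions are straightforward to verify numerically. A minimal sketch follows; the helper name and the i.i.d.-input reading of assumption 4 are our assumptions.

```python
# Check assumptions 1-4 for a candidate model; for i.i.d. inputs,
# Pr[X_2 = a] reduces to the input symbol distribution pi_x[a].
import numpy as np

def satisfies_assumptions(T, O, alpha, pi_x):
    """T: list of m x m transitions, O: l x m emissions, alpha: (m,), pi_x: (k,)."""
    l, m = O.shape
    Tmean = sum(p * Ta for p, Ta in zip(pi_x, T))      # "mean" transition matrix
    return (l >= m                                     # 1. enough output symbols
            and np.all(alpha > 0)                      # 2. all states can start
            and np.linalg.matrix_rank(Tmean) == m      # 3. full-rank transitions
            and np.linalg.matrix_rank(O) == m          #    and emissions
            and np.all(pi_x > 0))                      # 4. every input can occur
```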