Local Loss Optimization in Operator Models: A New Insight into Spectral Learning
Borja Balle, Ariadna Quattoni, Xavier Carreras
ICML 2012, June 2012, Edinburgh
This work is partially supported by the PASCAL2 Network and a Google Research Award
A Simple Spectral Method [HKZ09]

- Discrete homogeneous Hidden Markov Model with hidden chain $Y_1, Y_2, Y_3, Y_4, \dots$ and observations $X_1, X_2, X_3, X_4, \dots$
- $n$ states: $Y_t \in \{1, \dots, n\}$
- $k$ symbols: $X_t \in \{\sigma_1, \dots, \sigma_k\}$ (for now assume $n \le k$)
- Forward-backward equations with $A_\sigma \in \mathbb{R}^{n \times n}$:
  $\Pr[X_{1:t} = w] = \alpha_1^\top A_{w_1} \cdots A_{w_t} \alpha_\infty$
- Probabilities arranged into matrices $H, H_{\sigma_1}, \dots, H_{\sigma_k} \in \mathbb{R}^{k \times k}$:
  $H(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma_j]$
  $H_\sigma(i, j) = \Pr[X_1 = \sigma_i, X_2 = \sigma, X_3 = \sigma_j]$
- Spectral learning algorithm for $B_\sigma = Q A_\sigma Q^{-1}$:
  1. Compute the SVD $H = U D V^\top$ and take the top $n$ right singular vectors $V_n$
  2. $B_\sigma = (H V_n)^+ H_\sigma V_n$

(For simplicity, in this talk we ignore learning of the initial and final vectors)
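The two steps above can be written as a few lines of linear algebra. Below is a minimal NumPy sketch, assuming the matrices $H$ and $H_\sigma$ (or empirical estimates of them) are given; the function name and argument layout are illustrative, not taken from the paper.

```python
import numpy as np

def spectral_operators(H, H_sigmas, n):
    """Spectral method of [HKZ09] as described above (sketch).

    H        : (k, k) matrix with H[i, j] ~ P[X1 = s_i, X2 = s_j]
    H_sigmas : dict mapping each symbol to its (k, k) matrix with
               H_sigma[i, j] ~ P[X1 = s_i, X2 = sigma, X3 = s_j]
    n        : number of hidden states
    Returns the observable operators B_sigma = Q A_sigma Q^{-1}.
    """
    # 1. SVD of H; V_n holds the top-n right singular vectors as columns.
    U, D, Vt = np.linalg.svd(H)
    V_n = Vt[:n].T                      # shape (k, n)

    # 2. B_sigma = (H V_n)^+ H_sigma V_n for every symbol.
    HVn_pinv = np.linalg.pinv(H @ V_n)  # shape (n, k)
    return {s: HVn_pinv @ (Hs @ V_n) for s, Hs in H_sigmas.items()}
```

In practice the exact probabilities are replaced by empirical bigram and trigram frequencies; the initial and final vectors, ignored here as in the talk, would be estimated from unigram and bigram statistics.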
A Local Approach to Learning?

- Maximum likelihood uses the whole sample $S = \{w^1, \dots, w^N\}$ and is always consistent in the realizable case:
  $\max_{\alpha_1, \{A_\sigma\}} \frac{1}{N} \sum_{i=1}^{N} \log\bigl(\alpha_1^\top A_{w^i_1} \cdots A_{w^i_{t_i}} \alpha_\infty\bigr)$
- The spectral method only uses local information from the sample, through $\hat{H}, \hat{H}_a, \hat{H}_b$, and its consistency depends on properties of $H$
  $S = \{abbabba, aabaa, baaabbbabab, bbaaba, bababbabbaaaba, abbb, \dots\}$

Questions
- Is the spectral method minimizing a "local" loss function?
- When does this minimization yield a consistent algorithm?
Outline
- Spectral Learning as Local Loss Optimization
- A Convex Relaxation of the Local Loss
- Choosing a Consistent Local Loss
Loss Function of the Spectral Method

- Both ingredients in the spectral method have optimization interpretations:
  - SVD: $\min_{V_n^\top V_n = I} \|H V_n V_n^\top - H\|_F$
  - Pseudo-inverse: $\min_{B_\sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F$
- Can formulate a joint optimization for the spectral method:
  $\min_{\{B_\sigma\},\ V_n^\top V_n = I} \ \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$
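For concreteness, the joint objective above can be evaluated for any candidate $(V, \{B_\sigma\})$ with a couple of lines of NumPy; this is only an evaluation helper with illustrative names, not part of the original method.

```python
import numpy as np

def local_loss(H, H_sigmas, V, B_sigmas):
    """Joint local loss  sum_sigma ||H V B_sigma - H_sigma V||_F^2  (sketch)."""
    return sum(np.linalg.norm(H @ V @ B_sigmas[s] - H_sigmas[s] @ V, 'fro') ** 2
               for s in H_sigmas)
```

The spectral method corresponds to first fixing $V_n$ via the SVD and then minimizing this loss over $\{B_\sigma\}$ by least squares, as the next slide makes precise.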
Properties of the Spectral Optimization

$\min_{\{B_\sigma\},\ V_n^\top V_n = I} \ \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$

- Theorem: the optimization is consistent under the same conditions as the spectral method
- The loss is non-convex due to the product $V_n B_\sigma$ and the constraint $V_n^\top V_n = I$
- The spectral method is equivalent to:
  1. Choosing $V_n$ using the SVD
  2. Optimizing $\{B_\sigma\}$ with $V_n$ fixed

Intuition about the Loss Function
- Minimize the $\ell_2$ norm of the unexplained (finite set of) futures when a symbol $\sigma$ is generated and the transition is explained using $B_\sigma$ (over a finite set of pasts)
- Strongly based on the Markovianity of the process, which generic maximum likelihood does not exploit
A Convex Relaxation of the Local Loss

- For algorithmic purposes a convex local loss function is more desirable
- A relaxation of
  $\min_{\{B_\sigma\},\ V_n^\top V_n = I} \ \sum_{\sigma \in \Sigma} \|H V_n B_\sigma - H_\sigma V_n\|_F^2$
  can be obtained by replacing the projection $V_n$ with a regularization term:
  1. Fix $n = |S|$ and take $V_n = I$
  2. Write $B_\Sigma = [B_{\sigma_1} | \cdots | B_{\sigma_k}]$ and $H_\Sigma = [H_{\sigma_1} | \cdots | H_{\sigma_k}]$
  3. Regularize via the nuclear norm to emulate $V_n$:
     $\min_{B_\Sigma} \|H B_\Sigma - H_\Sigma\|_F^2 + \tau \|B_\Sigma\|_*$
- This optimization is convex and has some interesting theoretical (see paper) and empirical properties
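One standard way to solve a nuclear-norm-regularized least-squares problem of this form is proximal gradient descent with singular value thresholding. The sketch below illustrates that approach; the step size, iteration count, and function names are assumptions for illustration, not the solver used in the paper.

```python
import numpy as np

def svt(M, thresh):
    """Soft-threshold the singular values of M (prox operator of the nuclear norm)."""
    U, d, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(d - thresh, 0.0)) @ Vt

def convex_local_loss(H, H_Sigma, tau, iters=500):
    """Minimize ||H B - H_Sigma||_F^2 + tau * ||B||_*  by proximal gradient (sketch).

    H       : empirical Hankel-type matrix (V_n = I, i.e. no projection)
    H_Sigma : horizontal stacking [H_sigma1 | ... | H_sigmak]
    """
    B = np.zeros((H.shape[1], H_Sigma.shape[1]))
    step = 1.0 / (2 * np.linalg.norm(H, 2) ** 2)    # 1 / Lipschitz constant of the gradient
    for _ in range(iters):
        grad = 2 * H.T @ (H @ B - H_Sigma)          # gradient of the smooth part
        B = svt(B - step * grad, step * tau)        # proximal (thresholding) step
    return B
```

The nuclear-norm penalty plays the role of the rank-$n$ projection $V_n$: larger $\tau$ drives more singular values of $B_\Sigma$ to zero, yielding lower-rank operator models.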
Experimental Results with the Convex Local Loss

Performing experiments with synthetic targets, the following facts are observed:
- By tuning the regularization parameter $\tau$, a better trade-off between generalization and model complexity can be achieved
- The largest gains when using the convex relaxation are attained for targets that are supposedly hard for the spectral method

[Figures: L1 error vs. regularization parameter $\tau$ for the convex optimization (CO) and the spectral method (SVD, $n = 1,\dots,5$); L1-error difference between SVD and CO vs. the minimum singular value of the target model]
The Hankel Matrix

For any function $f : \Sigma^* \to \mathbb{R}$, its Hankel matrix $H_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ is defined as $H_f(p, s) = f(p \cdot s)$

         λ      a      b      aa     ab    ...
   λ     1      0.3    0.7    0.05   0.25  ...
   a     0.3    0.05   0.25   0.02   0.03  ...
   b     0.7    0.6    0.1    0.03   0.2   ...
   aa    0.05   0.02   0.03   0.017  0.003 ...
   ab    0.25   0.23   0.02   0.11   0.12  ...
   ...

(sub-blocks such as $H$ and $H_a$ are obtained by selecting rows and columns of $H_f$)

- Blocks are defined by sets of rows (prefixes $P$) and columns (suffixes $S$)
- Can parametrize the spectral method by $P$ and $S$, taking $H \in \mathbb{R}^{P \times S}$
- Each pair $(P, S)$ defines a different local loss function
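A sketch of how the empirical sub-blocks $\hat{H}$ and $\hat{H}_\sigma$ over chosen prefixes $P$ and suffixes $S$ might be estimated from a sample of strings. Taking $\hat{f}$ to be the empirical string probability is an illustrative assumption; other statistics (prefix or substring probabilities) fit the same scheme.

```python
import numpy as np
from collections import Counter

def hankel_blocks(sample, prefixes, suffixes, alphabet):
    """Empirical Hankel blocks over prefixes P and suffixes S (sketch).

    sample : list of strings drawn from the target
    Returns H with H[p, s] = f_hat(p + s) and, for each symbol a, the block
    H_a with H_a[p, s] = f_hat(p + a + s), where f_hat is the empirical
    string probability in the sample.
    """
    counts = Counter(sample)
    N = len(sample)
    f_hat = lambda w: counts[w] / N

    H = np.array([[f_hat(p + s) for s in suffixes] for p in prefixes])
    H_sigmas = {a: np.array([[f_hat(p + a + s) for s in suffixes] for p in prefixes])
                for a in alphabet}
    return H, H_sigmas
```

These blocks can then be fed to the spectral method or to the convex relaxation sketched earlier.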
Consistency of the Local Loss

Theorem (Schützenberger '61): $\mathrm{rank}(H_f) = n$ iff $f$ can be computed with operators $A_\sigma \in \mathbb{R}^{n \times n}$

Consequences
- The spectral method is consistent iff $\mathrm{rank}(H) = \mathrm{rank}(H_f) = n$
- There always exist $P, S$ with $|P| = |S| = n$ and $\mathrm{rank}(H) = n$

Trade-off
- Larger $P$ and $S$ are more likely to give $\mathrm{rank}(H) = n$, but also require larger samples for a good estimate $\hat{H}$

Question
- Given a sample, how to choose good $P$ and $S$?

Answer
- Random sampling succeeds w.h.p. with $|P|$ and $|S|$ depending polynomially on the complexity of the target (see the sketch below)
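One natural reading of "random sampling" is to draw strings from the sample and split them at random positions into a prefix and a suffix. The sketch below implements that scheme; the exact sampling distribution analyzed in the paper may differ, and all names here are illustrative.

```python
import random

def sample_basis(sample, num_prefixes, num_suffixes, seed=0):
    """Pick prefix/suffix sets by splitting sampled strings at random positions (sketch)."""
    rng = random.Random(seed)
    prefixes, suffixes = set(), set()
    # Bounded number of draws; assumes the sample is rich enough to supply
    # the requested number of distinct prefixes and suffixes.
    for _ in range(1000 * (num_prefixes + num_suffixes)):
        w = rng.choice(sample)
        i = rng.randint(0, len(w))       # random split point (0..len(w))
        prefixes.add(w[:i])              # may include the empty prefix (lambda)
        suffixes.add(w[i:])
        if len(prefixes) >= num_prefixes and len(suffixes) >= num_suffixes:
            break
    return sorted(prefixes)[:num_prefixes], sorted(suffixes)[:num_suffixes]
```

The resulting $P$ and $S$ can be passed to a Hankel-block estimator such as the one sketched after the Hankel matrix slide.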
Visit us at poster 53