

  1. Satyen Kale (Yahoo! Research) Joint work with Elad Hazan (IBM Almaden) and Manfred Warmuth (UCSC)

  2. [Figure: the input pairs of unit vectors $(x_1, y_1), \ldots, (x_T, y_T)$.] Input: pairs of unit vectors in $\mathbb{R}^n$: $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$. Assumption: $y_t = R x_t + \text{noise}$, where $R$ is an unknown rotation matrix. Problem: find the "best-fit" rotation matrix for the data, i.e. $\arg\min_R \sum_t \|R x_t - y_t\|^2$.

  3. $\|R x_t - y_t\|^2 = \|R x_t\|^2 + \|y_t\|^2 - 2\,(y_t x_t^\top) \bullet R = 2 - 2\,(y_t x_t^\top) \bullet R$
     - Here $A \bullet B = \mathrm{Tr}(A^\top B) = \sum_{ij} A_{ij} B_{ij}$, which is linear in $R$.
     - Therefore $\arg\min_R \sum_t \|R x_t - y_t\|^2 = \arg\max_R \big(\sum_t y_t x_t^\top\big) \bullet R$.
     - Computing $\arg\max_R M \bullet R$ is "Wahba's problem"; it can be solved using the SVD of $M$.
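Since the later slides repeatedly invoke this step, here is a minimal sketch of the SVD solution to Wahba's problem in NumPy (my own illustration, not code from the talk; the function name wahba_svd and the determinant correction follow the standard Kabsch-style construction):

import numpy as np

def wahba_svd(M):
    # Return the rotation R (orthogonal, det = +1) maximizing M . R = Tr(M^T R),
    # via the SVD M = U diag(s) V^T.
    U, _, Vt = np.linalg.svd(M)
    # Flip the direction paired with the smallest singular value if needed,
    # so that det(R) = +1 rather than -1.
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0
    D = np.diag(np.concatenate([np.ones(M.shape[0] - 1), [d]]))
    return U @ D @ Vt

This helper is reused as the arg-max oracle in the sketches on the following slides.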

  4. Online setting (the open problem from COLT 2008 [Smith, Warmuth]): in each round $t = 1, \ldots, T$, choose a rotation matrix $R_t$ and predict $R_t x_t$; then $y_t$ is revealed and the loss is $L_t(R_t) = \|R_t x_t - y_t\|^2$. Goal: minimize the regret, $\text{Regret} = \sum_t L_t(R_t) - \min_R \sum_t L_t(R)$.
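As an illustration of this protocol (my own sketch; the learner interface with predict()/update() is hypothetical, and it reuses the wahba_svd helper above to compute the offline comparator):

import numpy as np

def run_protocol(learner, xs, ys):
    # xs, ys: lists of unit vectors in R^n; the learner must pick R_t before seeing y_t.
    n = xs[0].shape[0]
    total_loss = 0.0
    M = np.zeros((n, n))
    for x_t, y_t in zip(xs, ys):
        R_t = learner.predict()                       # choose rotation matrix R_t
        total_loss += np.sum((R_t @ x_t - y_t) ** 2)  # L_t(R_t) = ||R_t x_t - y_t||^2
        learner.update(x_t, y_t)                      # reveal (x_t, y_t)
        M += np.outer(y_t, x_t)                       # accumulate sum_t y_t x_t^T
    best_R = wahba_svd(M)                             # offline best-fit rotation
    best_loss = sum(np.sum((best_R @ x - y) ** 2) for x, y in zip(xs, ys))
    return total_loss - best_loss                     # regret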

  5. A rotation matrix ≡ an orthogonal matrix with determinant 1. The set of rotation matrices, SO(n), is:
     - Non-convex: online convex optimization techniques like gradient descent, exponentiated gradient, etc. don't apply directly.
     - A Lie group whose Lie algebra is the set of all skew-symmetric matrices.
     - The Lie group gives a universal representation for all Lie groups via a conformal embedding.

  6. [Arora, NIPS '09], using the Lie group / Lie algebra structure:
     - Based on matrix exponentiated gradient: the matrix exponential maps the Lie algebra to the Lie group.
     - Deterministic algorithm.
     - There is an $\Omega(T)$ lower bound on the regret of any such deterministic algorithm, so randomization is crucial.

  7. The adversary can compute $R_t$ since the algorithm is deterministic.
     - Assume for convenience that $n$ is even.
     - Bad example: $x_t = e_1$, $y_t = -R_t x_t$.
     - Then $L_t(R_t) = \|R_t x_t - y_t\|^2 = \|2 y_t\|^2 = 4$, so the algorithm's total loss is $4T$.
     - Since $n$ is even, both $I$ and $-I$ are rotation matrices, and $\sum_t L_t(I) + L_t(-I) = \sum_t 2\|y_t\|^2 + 2\|x_t\|^2 = 4T$.
     - Hence $\min_R \sum_t L_t(R) \le 2T$, so Regret $\ge 2T$.
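A concrete way to see this bound (my own sketch, not from the talk): simulate the adversary against a deterministic learner, here plain follow-the-leader built on the wahba_svd helper above. The algorithm pays 4 per round, while the better of I and -I pays at most 2T in total.

import numpy as np

def deterministic_lower_bound_demo(n=4, T=100):
    # Adversary: x_t = e_1 and y_t = -R_t x_t, where R_t is the deterministic
    # follow-the-leader choice R_t = wahba_svd(sum_{i<t} y_i x_i^T).
    e1 = np.zeros(n); e1[0] = 1.0
    M = np.zeros((n, n))
    alg_loss = 0.0
    xs, ys = [], []
    for _ in range(T):
        R_t = wahba_svd(M)
        x_t, y_t = e1, -(R_t @ e1)
        alg_loss += np.sum((R_t @ x_t - y_t) ** 2)    # equals 4 every round
        M += np.outer(y_t, x_t)
        xs.append(x_t); ys.append(y_t)
    loss_I = sum(np.sum((x - y) ** 2) for x, y in zip(xs, ys))
    loss_minus_I = sum(np.sum((-x - y) ** 2) for x, y in zip(xs, ys))
    return alg_loss, min(loss_I, loss_minus_I)        # gap is at least 2T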

  8. 
     - Randomized algorithm with expected regret $O(\sqrt{nL})$, where $L = \min_R \sum_t L_t(R)$.
     - Lower bound of $\Omega(\sqrt{nT})$ on the regret of any online learning algorithm for choosing rotation matrices.
     - Uses Hannan / Kalai-Vempala's Follow-The-Perturbed-Leader technique, based on the linearity of the loss function.

  9. 
     - Sample a noise matrix $N$ with i.i.d. entries distributed uniformly in $[-1/\epsilon, 1/\epsilon]$.
     - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$, computed using the SVD solution to Wahba's problem.
     - Thm [KV'05]: Regret $\le O(n^{5/4}\sqrt{T})$.
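A one-round sketch of this baseline (my own illustration, reusing wahba_svd; the exact scaling of the noise relative to the cumulative matrix is a simplification, with constant factors absorbed into $\epsilon$). Since $\sum_{i<t} L_i(R) = 2(t-1) - 2\big(\sum_{i<t} y_i x_i^\top\big) \bullet R$, the arg-min is a Wahba arg-max over the perturbed cumulative matrix:

import numpy as np

def fpl_round(M_past, eps, rng=np.random.default_rng()):
    # M_past = sum_{i<t} y_i x_i^T. Perturb with i.i.d. Uniform[-1/eps, 1/eps]
    # entries and play the resulting Wahba solution as R_t.
    n = M_past.shape[0]
    N = rng.uniform(-1.0 / eps, 1.0 / eps, size=(n, n))
    return wahba_svd(M_past + N)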

  10. 
     - Sample $n$ numbers $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. from the exponential distribution with density $\epsilon \exp(-\epsilon \sigma)$.
     - Sample two orthogonal matrices $U, V$ from the uniform Haar measure.
     - Set $N = U \Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$.
     - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$.

  11. 
     - Sample $n$ numbers $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. from the exponential distribution with density $\epsilon \exp(-\epsilon \sigma)$.
     - Sample two orthogonal matrices $U, V$ from the uniform Haar measure, e.g. using the QR decomposition of a matrix with i.i.d. standard Gaussian entries.
     - Set $N = U \Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$.
     - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$.

  12. 
     - Sample $n$ numbers $\sigma_1, \sigma_2, \ldots, \sigma_n$ i.i.d. from the exponential distribution with density $\epsilon \exp(-\epsilon \sigma)$.
     - Sample two orthogonal matrices $U, V$ from the uniform Haar measure.
     - Set $N = U \Sigma V^\top$, where $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_n)$.
     - Effectively, we choose $N$ with probability $\propto \exp(-\epsilon \|N\|_*)$, where $\|N\|_*$ is the trace norm, i.e. the sum of the singular values of $N$.
     - In round $t$, use $R_t = \arg\min_R \sum_{i=1}^{t-1} L_i(R) - N \bullet R$.
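A minimal sketch of this sampler (my own; the sign-fixed QR construction for Haar-distributed orthogonal matrices and the function names are assumptions):

import numpy as np

def haar_orthogonal(n, rng):
    # QR of a standard Gaussian matrix; fixing the signs of R's diagonal
    # makes Q Haar-distributed on the orthogonal group.
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))

def sample_perturbation(n, eps, rng=np.random.default_rng()):
    # N = U diag(sigma) V^T with sigma_i ~ Exp(eps) and U, V Haar orthogonal,
    # so the density of N is proportional to exp(-eps * ||N||_*).
    sigma = rng.exponential(scale=1.0 / eps, size=n)   # Exp(eps) has mean 1/eps
    U = haar_orthogonal(n, rng)
    V = haar_orthogonal(n, rng)
    return U @ np.diag(sigma) @ V.T

In round $t$ the algorithm then plays $R_t = $ wahba_svd$(\sum_{i<t} y_i x_i^\top + N)$, matching the arg-max form on the following slides.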

  13. Stability Lemma [KV'05]: $\mathbb{E}[\text{Regret}] \le \sum_t \big(\mathbb{E}[L_t(R_t)] - \mathbb{E}[L_t(R_{t+1})]\big) + 2\,\mathbb{E}[\|N\|_*]$, where the first sum is at most $2\epsilon L$ and $2\,\mathbb{E}[\|N\|_*] = 2n/\epsilon$. Choosing $\epsilon = \sqrt{n/L}$ gives $\mathbb{E}[\text{Regret}] \le O(\sqrt{nL})$.
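The choice of $\epsilon$ is the usual balancing of the two terms; spelled out (my arithmetic, not on the slide):

\[
2\epsilon L + \frac{2n}{\epsilon} \;\ge\; 2\sqrt{2\epsilon L \cdot \frac{2n}{\epsilon}} \;=\; 4\sqrt{nL},
\qquad \text{with equality at } \epsilon = \sqrt{n/L},
\]

so the bound becomes $\mathbb{E}[\text{Regret}] \le 4\sqrt{nL} = O(\sqrt{nL})$.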

  14. For the analysis, write $R_t = \arg\max_R \big(\sum_{i=1}^{t-1} y_i x_i^\top + N\big) \bullet R$ and $R_{t+1} = \arg\max_R \big(\sum_{i=1}^{t} y_i x_i^\top + N'\big) \bullet R$. Re-randomization (drawing a fresh perturbation $N'$ for $R_{t+1}$) doesn't change the expected regret.

  15. $R_t = \arg\max_R \big(\sum_{i=1}^{t-1} y_i x_i^\top + N\big) \bullet R$, $\quad R_{t+1} = \arg\max_R \big(\sum_{i=1}^{t} y_i x_i^\top + N'\big) \bullet R$.
     - First sample $N$, then set $N' = N - y_t x_t^\top$.
     - Then $R_t = R_{t+1}$, and so $\mathbb{E}_D[L_t(R_t)] - \mathbb{E}_{D'}[L_t(R_{t+1})] = 0$, where $D$ = distribution of $N$ and $D'$ = distribution of $N'$.

  16. $R_t = \arg\max_R \big(\sum_{i=1}^{t-1} y_i x_i^\top + N\big) \bullet R$, $\quad R_{t+1} = \arg\max_R \big(\sum_{i=1}^{t} y_i x_i^\top + N'\big) \bullet R$.
     - First sample $N$, then set $N' = N - y_t x_t^\top$.
     - Then $R_t = R_{t+1}$, and so $\mathbb{E}_D[L_t(R_t)] - \mathbb{E}_{D'}[L_t(R_{t+1})] = 0$.
     - However, $\|D' - D\|_1 \le \epsilon$, so $\mathbb{E}_{D'}[L_t(R_{t+1})] - \mathbb{E}_D[L_t(R_{t+1})] \le 2\epsilon$.

  17. $R_t = \arg\max_R \big(\sum_{i=1}^{t-1} y_i x_i^\top + N\big) \bullet R$, $\quad R_{t+1} = \arg\max_R \big(\sum_{i=1}^{t} y_i x_i^\top + N'\big) \bullet R$.
     - First sample $N$, then set $N' = N - y_t x_t^\top$.
     - Then $R_t = R_{t+1}$, and so $\mathbb{E}_D[L_t(R_t)] - \mathbb{E}_{D'}[L_t(R_{t+1})] = 0$.
     - However, $\|D' - D\|_1 \le \epsilon$, so $\mathbb{E}_{D'}[L_t(R_{t+1})] - \mathbb{E}_D[L_t(R_{t+1})] \le 2\epsilon$.
     - This is because $\Pr_{D'}[N] / \Pr_D[N] \approx \exp(\pm \epsilon \|y_t x_t^\top\|_*) \approx 1 \pm \epsilon$.
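Putting the three preceding slides together (my own one-line paraphrase): under the coupling $N' = N - y_t x_t^\top$ the two perturbed leaders coincide, so each stability term reduces to a change-of-measure term,

\[
\mathbb{E}_D[L_t(R_t)] - \mathbb{E}_D[L_t(R_{t+1})]
\;=\; \mathbb{E}_{D'}[L_t(R_{t+1})] - \mathbb{E}_D[L_t(R_{t+1})],
\]

which is controlled by how close $D$ and $D'$ are; summing over $t$ and adding $2\,\mathbb{E}[\|N\|_*]$ gives the bound on slide 13.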

  18. $\mathbb{E}[\|N\|_*] = \mathbb{E}\big[\sum_i \sigma_i\big] = \sum_i \mathbb{E}[\sigma_i] = n/\epsilon$, because each $\sigma_i$ is drawn from the exponential distribution with density $\epsilon \exp(-\epsilon \sigma)$.
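For completeness (my own one-line check, not on the slide), the mean of this exponential distribution is

\[
\mathbb{E}[\sigma_i] \;=\; \int_0^\infty \sigma \,\epsilon e^{-\epsilon\sigma}\, d\sigma \;=\; \frac{1}{\epsilon},
\]

so summing over the $n$ singular values gives $n/\epsilon$.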

  19. 
     - Bad example: $x_t = e_{t \bmod n}$, $y_t = \pm x_t$ with probability $1/2$ each.
     - Optimal rotation matrix $R^* = \mathrm{diag}(\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n))$*, where $X_i$ is the sum of the $\pm$ signs over all $t$ such that $(t \bmod n) = i$.
     - (* ignoring the $\det(R^*) = 1$ issue)

  20. 
     - Bad example: $x_t = e_{t \bmod n}$, $y_t = \pm x_t$ with probability $1/2$ each.
     - Optimal rotation matrix $R^* = \mathrm{diag}(\mathrm{sgn}(X_1), \ldots, \mathrm{sgn}(X_n))$*.
     - Its expected total loss is $2T - 2\sum_i \mathbb{E}[|X_i|] \le 2T - n \cdot \Omega(\sqrt{T/n}) = 2T - \Omega(\sqrt{nT})$.
     - But for any $R_t$, $\mathbb{E}[L_t(R_t)] = 2 - 2\,\mathbb{E}[(y_t x_t^\top) \bullet R_t] = 2$, and hence the total expected loss of the algorithm is $2T$.
     - So $\mathbb{E}[\text{Regret}] \ge \Omega(\sqrt{nT})$.
     - (* ignoring the $\det(R^*) = 1$ issue)
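A quick Monte Carlo sketch of this construction (my own illustration; the parameters and seed are arbitrary), estimating the comparator's savings $2\sum_i \mathbb{E}[|X_i|]$ and comparing them to the $\sqrt{nT}$ scale:

import numpy as np

def lower_bound_gap(n=10, T=10000, trials=20, seed=0):
    # For the +/- sign example, the best fixed rotation saves 2 * sum_i |X_i|
    # relative to the algorithm's expected loss of 2T.
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(trials):
        signs = rng.choice([-1.0, 1.0], size=T)   # sign of y_t relative to x_t
        X = np.zeros(n)
        for t in range(T):
            X[t % n] += signs[t]                  # X_i = sum of signs with t mod n = i
        gaps.append(2.0 * np.abs(X).sum())
    return np.mean(gaps), 2.0 * np.sqrt(n * T)    # observed gap vs. the sqrt(nT) scale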

  21. 
     - Optimal algorithm for online learning of rotations, with regret $O(\sqrt{nL})$.
     - Based on FSPL.
     - Open questions:
       - Other applications for FSPL? Matrix Hedge? Faster algorithms for SDPs? More details in Manfred's open problem talk.
       - Any other examples of natural problems where FPL is the only known technique that works?
     Thank you!
