RandomShuffle Beats SGD after Finite Epochs




  1. RandomShuffle Beats SGD after Finite Epochs
  Jeff HaoChen (Tsinghua University), Suvrit Sra (Massachusetts Institute of Technology)

  2. Introduction
  • Goal: minimize the finite-sum function F(x) = (1/n) Σ_{i=1}^n f_i(x)

  5. Introduction
  • SGD with replacement (often appears in algorithm analysis):
    x_t = x_{t−1} − γ ∇f_{σ(t)}(x_{t−1}), where σ(t) is drawn uniformly at random from [n], 1 ≤ t ≤ T. We call this SGD.
  • SGD without replacement (often appears in reality):
    x_t^k = x_{t−1}^k − γ ∇f_{σ_k(t)}(x_{t−1}^k), where σ_k is a uniformly random permutation of [n], 1 ≤ t ≤ n. We call this RandomShuffle.
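The two update rules above can be sketched directly in code. This is an illustrative toy implementation (function and variable names are my own, not from the talk), taking the component gradients as callables:

```python
import numpy as np

def sgd_with_replacement(grads, x0, lr, n_steps, rng):
    """SGD: each step picks one component uniformly at random, WITH replacement."""
    x = np.array(x0, dtype=float)
    n = len(grads)
    for _ in range(n_steps):
        i = rng.integers(n)               # sigma(t) uniform from [n]
        x -= lr * grads[i](x)
    return x

def random_shuffle(grads, x0, lr, n_epochs, rng):
    """RandomShuffle: each epoch visits every component once, in a fresh random order."""
    x = np.array(x0, dtype=float)
    n = len(grads)
    for _ in range(n_epochs):
        for i in rng.permutation(n):      # sigma_k: random permutation of [n]
            x -= lr * grads[i](x)
    return x
```

On a least-squares instance F(x) = (1/n) Σ ½(a_iᵀx − b_i)², both iterations move toward the least-squares solution for a small enough step size; the difference is only in how the index sequence is sampled.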

  6. Introduction
  • So a natural question: which one is better?
  • A numerical comparison (Bottou, 2009): [figure: convergence curves for SGD vs. RandomShuffle]
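Bottou's actual experiment is not reproduced here, but a toy least-squares comparison (my own setup, with illustrative constants) shows the same qualitative gap after a fixed number of epochs:

```python
import numpy as np

def compare(lr=0.02, n=20, d=2, epochs=80, trials=30, seed=1):
    """Average distance to the optimum after `epochs` passes over the data,
    for with-replacement sampling vs. fresh random permutations."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    b = rng.standard_normal(n)
    x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
    err = {"SGD": 0.0, "RandomShuffle": 0.0}
    for _ in range(trials):
        x_sgd, x_rs = np.zeros(d), np.zeros(d)
        for _ in range(epochs):
            for _ in range(n):                      # one "epoch" of SGD: n draws
                i = rng.integers(n)
                x_sgd -= lr * (A[i] @ x_sgd - b[i]) * A[i]
            for i in rng.permutation(n):            # one epoch of RandomShuffle
                x_rs -= lr * (A[i] @ x_rs - b[i]) * A[i]
        err["SGD"] += np.linalg.norm(x_sgd - x_star) / trials
        err["RandomShuffle"] += np.linalg.norm(x_rs - x_star) / trials
    return err
```

With a constant step size, SGD plateaus at a noise floor driven by its per-step sampling variance, while RandomShuffle typically ends up markedly closer to x*, matching the behavior in the figure.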

  8. Introduction
  • Why?
  • Intuitively, we should prefer RandomShuffle for two reasons:
    • It uses more "information" in one epoch (it visits every component)
    • It has smaller variance over one epoch
  • However, is there a rigorous proof?

  11. A Brief History
  • Under strong structure, we can convert this problem into a matrix inequality (Recht and Ré, 2012)
  • Assume each component is quadratic: f_i(x) = ½ (a_iᵀx − b_i)²
  • Then "RandomShuffle is better than SGD after one epoch" is true under their noncommutative arithmetic–geometric mean conjecture
  • ...which we still don't know how to prove ☹
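To see why quadratics turn the comparison into a matrix inequality: for f_i(x) = ½(a_iᵀx − b_i)², one gradient step is the affine map x ↦ (I − γ a_i a_iᵀ)x + γ b_i a_i, so a whole epoch collapses into a product of matrices, and comparing the two samplings means comparing expected matrix products. A small sketch with hypothetical a_i, b_i:

```python
import numpy as np

# Hypothetical least-squares components f_i(x) = 1/2 (a_i^T x - b_i)^2.
rng = np.random.default_rng(0)
n, d, lr = 5, 3, 0.1
a = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def epoch(x, order):
    """Run one epoch of gradient steps in the given component order."""
    for i in order:
        x = x - lr * (a[i] @ x - b[i]) * a[i]
    return x

def epoch_as_affine_map(order):
    """The same epoch, collapsed into a single affine map x -> P x + drift,
    where P is a product of the matrices (I - lr a_i a_i^T)."""
    P, drift = np.eye(d), np.zeros(d)
    for i in order:
        M = np.eye(d) - lr * np.outer(a[i], a[i])
        P, drift = M @ P, M @ drift + lr * b[i] * a[i]
    return P, drift
```

Averaging P over i.i.d. index draws (SGD) versus over random permutations (RandomShuffle) is exactly the kind of arithmetic-mean-of-products comparison the conjecture is about.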

  15. A Brief History
  • What about the more general situation?
  • We can try to show it with a better convergence bound!
  • The hope: prove a faster worst-case convergence rate for RandomShuffle
  • A well-known fact: SGD converges at rate O(1/T):
    E ∥x_T − x*∥² ≤ O(1/T)

  20. A Brief History
  • One recent breakthrough (Gürbüzbalaban, 2015):
    • Asymptotically, RandomShuffle has convergence rate O(1/T²)
    • But it is not clear what happens after finitely many epochs
  • In contrast, there is a non-asymptotic result (Shamir, 2016):
    • RandomShuffle is no worse than SGD, with a provable O(1/T) convergence rate
    • But it cannot show that RandomShuffle is really faster
  • What happens in between?

  21. Summary of results
  We analyze RandomShuffle in the following settings:
  • Strongly convex, Lipschitz Hessian
  • Sparse data
  • Vanishing variance
  • Nonconvex, under the PL condition
  • Smooth convex

  22. Summary of results
  We analyze RandomShuffle in the following settings:
  • Strongly convex, Lipschitz Hessian (Dheeraj Nagaraj et al. later got rid of this constraint)
  • Sparse data
  • Vanishing variance
  • Nonconvex, under the PL condition
  • Smooth convex

  23. Summary of results
  We analyze RandomShuffle in the following settings:
  • Strongly convex, Lipschitz Hessian (this talk)
  • Sparse data
  • Vanishing variance
  • Nonconvex, under the PL condition
  • Smooth convex

  25. First attempt: try to prove a tighter bound!
  • Can we show a non-asymptotic bound better than O(1/T)? E.g., O(1/T^{1+δ})?
  • If we can, then everything is solved ☺
  • ...unless we cannot ☹

  28. Proof of the theorem
  • We only consider the case T = n, i.e., we run one epoch of the algorithm
  • We prove the theorem with a counter-example:
  • Recall the function F(x) = (1/n) Σ_{i=1}^n f_i(x)
  • We set f_i(x) = ½ xᵀA x − bᵀx for odd i, and f_i(x) = ½ xᵀA x + bᵀx for even i
  • A and b to be determined later...
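An illustrative instantiation of this construction follows; the specific A and b below are my own placeholders, since the talk leaves them to be chosen later in the proof. The point is that the ±bᵀx terms cancel in the average, so F(x) = ½xᵀAx with minimizer x* = 0, yet within a single epoch the random ordering of the ±b gradients leaves a residual:

```python
import numpy as np

d, n, lr = 2, 10, 0.1
A = np.diag([1.0, 0.5])        # placeholder PSD matrix ("determined later")
b = np.ones(d)                 # placeholder vector ("determined later")

def sign(i):
    """s_i = +1 for odd i (component -b^T x), -1 for even i (component +b^T x)."""
    return 1.0 if i % 2 == 1 else -1.0

def grad_f(i, x):
    # f_i(x) = 1/2 x^T A x - s_i b^T x  =>  grad f_i(x) = A x - s_i b
    return A @ x - sign(i) * b
```

Since exactly half the components carry +b and half carry −b, the full gradient of F at any x is just Ax, while one pass in a random order generally does not return the iterate to x* = 0.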

  30. Proof of the theorem
  • Step 1: Calculate the error. After one epoch,
    x_n − x* = (I − γA)ⁿ (x_0 − x*) + γ Σ_{t=1}^n (−1)^{σ_t} (I − γA)^{n−t} b
  • Step 2: Simplify via an eigenvector-basis decomposition. With eigenvalues λ_j and eigenvectors v_j of A,
    E ∥x_n − x*∥² = Σ_j (1 − γλ_j)^{2n} ⟨x_0 − x*, v_j⟩² + γ² Σ_j b_j² E( Σ_{t=1}^n (−1)^{σ_t} (1 − γλ_j)^{n−t} )²
  • Step 3: Construct a contradiction. For contradiction, assume there is a γ dependent on n achieving the faster convergence rate
    ⟹ γn = (1 + o(1)) / ((2 − γλ_j) λ_j)
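The Step-1 error formula can be checked numerically: run the epoch on the ±b quadratics and compare against the closed form. The A, b, and sign sequence here are illustrative placeholders, with the iteration indexed t = 0, ..., n−1:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 3, 8, 0.05
A = np.diag(rng.uniform(0.5, 2.0, d))                       # placeholder PSD matrix
b = rng.standard_normal(d)
s = rng.permutation([1.0] * (n // 2) + [-1.0] * (n // 2))   # shuffled +-1 signs
x0 = rng.standard_normal(d)                                 # here x* = 0

# run one epoch: x_{t+1} = x_t - lr * (A x_t - s_t b)
x = x0.copy()
for t in range(n):
    x = x - lr * (A @ x - s[t] * b)

# closed form: x_n = (I - lr A)^n x_0 + lr * sum_t s_t (I - lr A)^{n-1-t} b
M = np.eye(d) - lr * A
x_closed = np.linalg.matrix_power(M, n) @ x0
for t in range(n):
    x_closed += lr * s[t] * (np.linalg.matrix_power(M, n - 1 - t) @ b)
```

Unrolling the affine recursion is all that is happening: the bias term contracts through (I − γA)ⁿ, and each ±b injection is damped by the remaining n−1−t steps of the epoch.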
