Random Shuffle Beats SGD after Finite Epochs
Jeff HaoChen (Tsinghua University), Suvrit Sra (Massachusetts Institute of Technology)
Introduction
• Goal: minimize the finite-sum objective F(x) = (1/n) Σ_{i=1}^n f_i(x)
Introduction
• SGD with replacement (often appears in algorithm analysis):
  • x_t = x_{t−1} − η ∇f_{σ(t)}(x_{t−1}) — we call this SGD
  • σ(t) drawn uniformly at random from [n], for 1 ≤ t ≤ T
• SGD without replacement (often appears in reality):
  • x_t = x_{t−1} − η ∇f_{σ(t)}(x_{t−1}) — we call this RandomShuffle
  • (σ(1), …, σ(n)) a uniformly random permutation of [n], for 1 ≤ t ≤ n within each epoch
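The two update rules above can be sketched in a few lines of NumPy. This is an illustration, not code from the talk; the least-squares components f_i(x) = ½(a_i^T x − b_i)², the data (A, b), and the step size eta are all placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))   # rows a_i (placeholder data)
b = rng.normal(size=n)
eta = 0.01                    # placeholder step size

def grad(i, x):
    # gradient of f_i(x) = 0.5 * (a_i @ x - b_i)**2
    return (A[i] @ x - b[i]) * A[i]

def sgd_with_replacement(x, T):
    for _ in range(T):
        i = rng.integers(n)           # index sampled i.i.d. uniformly from [n]
        x = x - eta * grad(i, x)
    return x

def random_shuffle(x, epochs):
    for _ in range(epochs):
        for i in rng.permutation(n):  # each component visited exactly once per epoch
            x = x - eta * grad(i, x)
    return x

x0 = np.zeros(d)
x_sgd = sgd_with_replacement(x0.copy(), n)  # n steps = one "epoch" worth of updates
x_rs = random_shuffle(x0.copy(), 1)         # one true epoch
```

The only difference between the two methods is the sampling scheme: i.i.d. indices versus a fresh random permutation each epoch.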
Introduction
• So a natural question: which one is better?
• A numerical comparison (Bottou, 2009): empirically, RandomShuffle converges faster than SGD
  (figure: training curves for SGD vs. RandomShuffle)
Introduction
• Why?
• Intuitively, we should prefer RandomShuffle for two reasons:
  • It uses more "information" per epoch (it visits every component exactly once)
  • It has smaller variance over one epoch
• However, is there a rigorous proof?
A Brief History
• Under strong structure, the problem can be converted into a matrix inequality (Recht and Ré, 2012)
• Assume the problem is quadratic: f_i(x) = ½ (a_i^T x − b_i)²
• Then "RandomShuffle is better than SGD after one epoch" holds under a conjectured noncommutative arithmetic-geometric mean inequality
• Which we still don't know how to prove yet :(
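Even though the conjecture is open, the one-epoch comparison on quadratics is easy to probe numerically. The following Monte Carlo sketch is our illustration (not from the slides): it averages the squared error after one epoch of each method on random least-squares components, with placeholder data, step size, and trial count.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta, trials = 20, 3, 0.05, 2000
A = rng.normal(size=(n, d))   # rows a_i of the quadratic components
b = rng.normal(size=n)
# minimizer of F(x) = (1/n) sum_i 0.5*(a_i @ x - b_i)**2 is the least-squares solution
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def one_epoch(order):
    # run n steps of f_{order[0]}, ..., f_{order[n-1]} and return squared error
    x = np.zeros(d)
    for i in order:
        x = x - eta * (A[i] @ x - b[i]) * A[i]
    return np.sum((x - x_star) ** 2)

# with replacement: i.i.d. indices; without replacement: a random permutation
sgd_err = np.mean([one_epoch(rng.integers(n, size=n)) for _ in range(trials)])
rs_err = np.mean([one_epoch(rng.permutation(n)) for _ in range(trials)])
print(f"SGD: {sgd_err:.4f}  RandomShuffle: {rs_err:.4f}")
```

On typical random instances the permutation-based average comes out smaller, matching the conjectured behavior, though this experiment of course proves nothing.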
A Brief History
• What about more general situations?
• We can try to prove a better convergence bound!
• The hope: prove a faster worst-case convergence rate for RandomShuffle
• A well-known fact: SGD converges at rate O(1/T):
  E‖x_T − x*‖² ≤ O(1/T)
A Brief History
• One of the recent breakthroughs (Gürbüzbalaban et al., 2015):
  • Asymptotically, RandomShuffle has convergence rate O(1/T²)
  • But it is not clear what happens after finitely many epochs
• In contrast, a non-asymptotic result (Shamir, 2016):
  • RandomShuffle is no worse than SGD, with a provable O(1/T) convergence rate
  • But this cannot show that RandomShuffle is strictly faster
• What happens in between?
Summary of Results
We analyze RandomShuffle in the following settings:
• Strongly convex, Lipschitz Hessian — this talk (Dheeraj Nagaraj et al. later removed the Lipschitz-Hessian constraint)
• Sparse data
• Vanishing variance
• Nonconvex, under the PL condition
• Smooth convex
First attempt: try to prove a tighter bound!
• Can we show a non-asymptotic bound better than O(1/T)? E.g., O(1/T^{1+δ})?
• If we can, then everything is solved :)
• ……unless we cannot :(
Proof of the Theorem
• We only consider the case T = n, i.e., we run one epoch of the algorithm
• We prove the theorem with a counter-example:
• Recall the objective F(x) = (1/n) Σ_{i=1}^n f_i(x)
• We set f_i(x) = ½ (x − b)^T A (x − b) for i odd, and f_i(x) = ½ (x + b)^T A (x + b) for i even
• A and b to be determined later…
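A useful sanity check on this construction (our own, with placeholder values for the still-unspecified A and b): for even n the ±b shifts cancel when averaging, so ∇F(x) = A x and the minimizer is x* = 0.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 8                                # n even so odd/even components pair up
A = np.diag(rng.uniform(0.5, 2.0, size=d)) # placeholder positive-definite A
b = rng.normal(size=d)                     # placeholder b

def grad_f(i, x):
    # f_i(x) = 0.5*(x-b)^T A (x-b) for odd i, 0.5*(x+b)^T A (x+b) for even i
    return A @ (x - b) if i % 2 == 1 else A @ (x + b)

x = rng.normal(size=d)
g = sum(grad_f(i, x) for i in range(1, n + 1)) / n
assert np.allclose(g, A @ x)               # gradient of the average is just A x
```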
Proof of the Theorem
• Step 1: Calculate the error. Unrolling one epoch gives
  x_T − x* = (I − ηA)^T (x_0 − x*) + η Σ_{i=1}^T (−1)^{σ(i)+1} (I − ηA)^{T−i} A b
• Step 2: Simplify via decomposition in the eigenvector basis of A. With eigenvalues λ_j, and c_j, b_j the coefficients of x_0 − x* and b in this basis,
  E‖x_T − x*‖² = Σ_j (1 − ηλ_j)^{2T} c_j² + η² Σ_j λ_j² b_j² · E[(Σ_{i=1}^T (−1)^{σ(i)} (1 − ηλ_j)^{T−i})²]
• Step 3: Construct a contradiction. Suppose for contradiction that some step size η (possibly depending on T) achieves a rate of o(1/T); this forces ηλ_j = 1 + o(1), which makes the variance term too large — a contradiction
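The Step-1 closed form can be verified mechanically. The sketch below (ours, with placeholder choices of A, b, η, and n) iterates one epoch of the counter-example and checks it against the unrolled expression; here s_t = +1 when σ(t) is odd and −1 when it is even (sign conventions can be absorbed into b), and x* = 0.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3                                    # T = n: one epoch
A = np.diag(rng.uniform(0.5, 1.5, size=d))     # placeholder positive-definite A
b = rng.normal(size=d)                         # placeholder b
eta = 0.1
I = np.eye(d)

def grad(i, x):
    # f_i(x) = 0.5*(x-b)^T A (x-b) for odd i, 0.5*(x+b)^T A (x+b) for even i
    return A @ (x - b) if i % 2 == 1 else A @ (x + b)

sigma = rng.permutation(np.arange(1, n + 1))   # random order of components 1..n

# iterate one epoch of RandomShuffle
x = rng.normal(size=d)
x0 = x.copy()
for i in sigma:
    x = x - eta * grad(i, x)

# closed form: x_T = M^T x0 + eta * sum_t s_t M^(T-t) A b, with M = I - eta*A
M = I - eta * A
closed = np.linalg.matrix_power(M, n) @ x0
for t, i in enumerate(sigma, start=1):
    s = 1.0 if i % 2 == 1 else -1.0            # +eta*A*b enters for odd components
    closed += eta * s * np.linalg.matrix_power(M, n - t) @ (A @ b)

assert np.allclose(x, closed)
```

The deterministic contraction term depends only on T, while the random sign pattern s_t is what Steps 2 and 3 analyze in expectation.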