Random Shuffle Beats SGD after Finite Epochs
Jeff HaoChen (Tsinghua University), Suvrit Sra (Massachusetts Institute of Technology)
Introduction
• Goal: minimize the finite-sum objective F(x) = (1/n) Σ_{i=1}^n f_i(x)
Introduction
• SGD with replacement (often appears in algorithm analysis):
  • x_t = x_{t−1} − η ∇f_{σ(t)}(x_{t−1}) — we call this SGD
  • σ(t) drawn uniformly at random from [n], for 1 ≤ t ≤ T
• SGD without replacement (often appears in reality):
  • x_t = x_{t−1} − η ∇f_{σ(t)}(x_{t−1}) — we call this RandomShuffle
  • (σ(1), …, σ(n)) a uniformly random permutation of [n], for 1 ≤ t ≤ n within each epoch
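The two update rules above can be sketched in a few lines of NumPy. This is an illustration, not code from the talk; the least-squares components f_i(x) = ½(a_i^T x − b_i)², the data (A, b), and the step size eta are all placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.normal(size=(n, d))   # rows a_i (placeholder data)
b = rng.normal(size=n)
eta = 0.01                    # placeholder step size

def grad(i, x):
    # gradient of f_i(x) = 0.5 * (a_i @ x - b_i)**2
    return (A[i] @ x - b[i]) * A[i]

def sgd_with_replacement(x, T):
    for _ in range(T):
        i = rng.integers(n)           # index sampled i.i.d. uniformly from [n]
        x = x - eta * grad(i, x)
    return x

def random_shuffle(x, epochs):
    for _ in range(epochs):
        for i in rng.permutation(n):  # each component visited exactly once per epoch
            x = x - eta * grad(i, x)
    return x

x0 = np.zeros(d)
x_sgd = sgd_with_replacement(x0.copy(), n)  # n steps = one "epoch" worth of updates
x_rs = random_shuffle(x0.copy(), 1)         # one true epoch
```

The only difference between the two methods is the sampling scheme: i.i.d. indices versus a fresh random permutation each epoch.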
Introduction
• So a natural question: which one is better?
• A numerical comparison (Bottou, 2009): empirically, RandomShuffle converges faster than SGD
  (figure: training curves for SGD vs. RandomShuffle)
Introduction
• Why?
• Intuitively, we should prefer RandomShuffle for two reasons:
  • It uses more "information" per epoch (it visits every component exactly once)
  • It has smaller variance over one epoch
• However, is there a rigorous proof?
A Brief History
• Under strong structure, the problem can be converted into a matrix inequality (Recht and Ré, 2012)
• Assume the problem is quadratic: f_i(x) = ½ (a_i^T x − b_i)²
• Then "RandomShuffle is better than SGD after one epoch" holds under a conjectured noncommutative arithmetic-geometric mean inequality
• Which we still don't know how to prove yet :(
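Even though the conjecture is open, the one-epoch comparison on quadratics is easy to probe numerically. The following Monte Carlo sketch is our illustration (not from the slides): it averages the squared error after one epoch of each method on random least-squares components, with placeholder data, step size, and trial count.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eta, trials = 20, 3, 0.05, 2000
A = rng.normal(size=(n, d))   # rows a_i of the quadratic components
b = rng.normal(size=n)
# minimizer of F(x) = (1/n) sum_i 0.5*(a_i @ x - b_i)**2 is the least-squares solution
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)

def one_epoch(order):
    # run n steps of f_{order[0]}, ..., f_{order[n-1]} and return squared error
    x = np.zeros(d)
    for i in order:
        x = x - eta * (A[i] @ x - b[i]) * A[i]
    return np.sum((x - x_star) ** 2)

# with replacement: i.i.d. indices; without replacement: a random permutation
sgd_err = np.mean([one_epoch(rng.integers(n, size=n)) for _ in range(trials)])
rs_err = np.mean([one_epoch(rng.permutation(n)) for _ in range(trials)])
print(f"SGD: {sgd_err:.4f}  RandomShuffle: {rs_err:.4f}")
```

On typical random instances the permutation-based average comes out smaller, matching the conjectured behavior, though this experiment of course proves nothing.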
A Brief History
• What about more general situations?
• We can try to prove a better convergence bound!
• The hope: prove a faster worst-case convergence rate for RandomShuffle
• A well-known fact: SGD converges at rate O(1/T):
  E‖x_T − x*‖² ≤ O(1/T)
A Brief History
• One of the recent breakthroughs (Gürbüzbalaban et al., 2015):
  • Asymptotically, RandomShuffle has convergence rate O(1/T²)
  • But it is not clear what happens after finitely many epochs
• In contrast, a non-asymptotic result (Shamir, 2016):
  • RandomShuffle is no worse than SGD, with a provable O(1/T) convergence rate
  • But this cannot show that RandomShuffle is strictly faster
• What happens in between?
Summary of Results
We analyze RandomShuffle in the following settings:
• Strongly convex, Lipschitz Hessian — this talk (Dheeraj Nagaraj et al. later removed the Lipschitz-Hessian constraint)
• Sparse data
• Vanishing variance
• Nonconvex, under the PL condition
• Smooth convex
First attempt: try to prove a tighter bound!
• Can we show a non-asymptotic bound better than O(1/T)? E.g., O(1/T^{1+δ})?
• If we can, then everything is solved :)
• ……unless we cannot :(
Proof of the Theorem
• We only consider the case T = n, i.e., we run one epoch of the algorithm
• We prove the theorem with a counter-example:
• Recall the objective F(x) = (1/n) Σ_{i=1}^n f_i(x)
• We set f_i(x) = ½ (x − b)^T A (x − b) for i odd, and f_i(x) = ½ (x + b)^T A (x + b) for i even
• A and b to be determined later…
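A useful sanity check on this construction (our own, with placeholder values for the still-unspecified A and b): for even n the ±b shifts cancel when averaging, so ∇F(x) = A x and the minimizer is x* = 0.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 3, 8                                # n even so odd/even components pair up
A = np.diag(rng.uniform(0.5, 2.0, size=d)) # placeholder positive-definite A
b = rng.normal(size=d)                     # placeholder b

def grad_f(i, x):
    # f_i(x) = 0.5*(x-b)^T A (x-b) for odd i, 0.5*(x+b)^T A (x+b) for even i
    return A @ (x - b) if i % 2 == 1 else A @ (x + b)

x = rng.normal(size=d)
g = sum(grad_f(i, x) for i in range(1, n + 1)) / n
assert np.allclose(g, A @ x)               # gradient of the average is just A x
```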
Proof of the Theorem
• Step 1: Calculate the error. Unrolling one epoch gives
  x_T − x* = (I − ηA)^T (x_0 − x*) + η Σ_{i=1}^T (−1)^{σ(i)+1} (I − ηA)^{T−i} A b
• Step 2: Simplify via decomposition in the eigenvector basis of A. With eigenvalues λ_j, and c_j, b_j the coefficients of x_0 − x* and b in this basis,
  E‖x_T − x*‖² = Σ_j (1 − ηλ_j)^{2T} c_j² + η² Σ_j λ_j² b_j² · E[(Σ_{i=1}^T (−1)^{σ(i)} (1 − ηλ_j)^{T−i})²]
• Step 3: Construct a contradiction. Suppose for contradiction that some step size η (possibly depending on T) achieves a rate of o(1/T); this forces ηλ_j = 1 + o(1), which makes the variance term too large — a contradiction
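The Step-1 closed form can be verified mechanically. The sketch below (ours, with placeholder choices of A, b, η, and n) iterates one epoch of the counter-example and checks it against the unrolled expression; here s_t = +1 when σ(t) is odd and −1 when it is even (sign conventions can be absorbed into b), and x* = 0.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 6, 3                                    # T = n: one epoch
A = np.diag(rng.uniform(0.5, 1.5, size=d))     # placeholder positive-definite A
b = rng.normal(size=d)                         # placeholder b
eta = 0.1
I = np.eye(d)

def grad(i, x):
    # f_i(x) = 0.5*(x-b)^T A (x-b) for odd i, 0.5*(x+b)^T A (x+b) for even i
    return A @ (x - b) if i % 2 == 1 else A @ (x + b)

sigma = rng.permutation(np.arange(1, n + 1))   # random order of components 1..n

# iterate one epoch of RandomShuffle
x = rng.normal(size=d)
x0 = x.copy()
for i in sigma:
    x = x - eta * grad(i, x)

# closed form: x_T = M^T x0 + eta * sum_t s_t M^(T-t) A b, with M = I - eta*A
M = I - eta * A
closed = np.linalg.matrix_power(M, n) @ x0
for t, i in enumerate(sigma, start=1):
    s = 1.0 if i % 2 == 1 else -1.0            # +eta*A*b enters for odd components
    closed += eta * s * np.linalg.matrix_power(M, n - t) @ (A @ b)

assert np.allclose(x, closed)
```

The deterministic contraction term depends only on T, while the random sign pattern s_t is what Steps 2 and 3 analyze in expectation.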