  1. Sub-sampled Newton Methods with Non-uniform Sampling. Jiyan Yang, ICME, Stanford University. IAS/PCMI Research Program, July 14, 2016. Joint work with Peng Xu, Fred Roosta, Chris Ré and Michael Mahoney. (Sub-sampled Newton Methods with Non-uniform Sampling, PCMI, July 14, 2016, 1 / 36)

  2. Problem formulation. Consider the optimization problem

      min_{w ∈ C} F(w) = Σ_{i=1}^n f_i(w) + R(w),                      (1)

  where the f_i(w) and R(w) are convex and twice differentiable (assume C = R^d in this talk). Example:

      f_i(w) = ℓ(x_i^T w),   R(w) = (λ/2) ‖w‖_2^2,                     (2)

  where ℓ(·) is a loss function and the x_i are data points.
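As a concrete instance of (1) and (2), the sketch below evaluates F(w) for ridge-regularized logistic loss, ℓ(m) = log(1 + e^(−m)), with labels y_i folded into the margins. The data and names are illustrative, not from the talk.

```python
import numpy as np

# Hypothetical instance of problem (1): ridge-regularized logistic loss,
# f_i(w) = log(1 + exp(-y_i * x_i^T w)), R(w) = (lam/2) * ||w||^2.
def objective(w, X, y, lam):
    margins = y * (X @ w)                     # y_i * x_i^T w
    losses = np.log1p(np.exp(-margins))       # per-term loss f_i(w)
    return losses.sum() + 0.5 * lam * w @ w   # F(w) = sum_i f_i(w) + R(w)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = rng.choice([-1.0, 1.0], size=100)
w0 = np.zeros(5)
print(objective(w0, X, y, lam=1e-2))          # equals 100 * log(2) at w = 0
```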


  4. Second-order methods. There is a plethora of first-order optimization algorithms for solving (1). However, for ill-conditioned problems, it is often the case that first-order methods return a solution far from the minimizer w*, albeit with a low objective value. Most second-order algorithms, on the other hand, prove more robust to such ill-conditioning: by using curvature information, second-order methods properly rescale the gradient so that it is a more appropriate direction to follow. Reference: [Nocedal and Wright, 2006].


  6. Newton's method. Newton's method enjoys fast local convergence and is good at recovering the minimizer w*. In the unconstrained case, it has updates of the form

      H(w_t) v = g(w_t),                                               (3)
      w_{t+1} = w_t − v.                                               (4)

  Issues when n and d are large. When n is large, forming the Hessian

      H(w) = Σ_{i=1}^n ∇²f_i(w) + ∇²R(w) =: Σ_{i=1}^n H_i(w) + Q(w)    (5)

  is expensive; the cost is O(nd²) in the above example. When d is large, solving (3) is also expensive: O(d³).
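For the ridge logistic example, one exact Newton iteration might look as follows (a sketch with my own names, not the talk's code); it makes the O(nd²) Hessian construction of (5) and the O(d³) direct solve of (3) explicit.

```python
import numpy as np

# Illustrative exact Newton step for ridge-regularized logistic loss.
def newton_step(w, X, y, lam):
    m = y * (X @ w)
    p = 1.0 / (1.0 + np.exp(m))            # p_i = sigmoid(-y_i x_i^T w)
    g = -X.T @ (y * p) + lam * w           # gradient g(w)
    d = p * (1.0 - p)                      # ell''(x_i^T w) for logistic loss
    H = X.T @ (d[:, None] * X) + lam * np.eye(X.shape[1])  # Hessian, O(n d^2)
    v = np.linalg.solve(H, g)              # direct solve of (3), O(d^3)
    return w - v                           # update (4)

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = rng.choice([-1.0, 1.0], size=200)
w = np.zeros(5)
for _ in range(10):
    w = newton_step(w, X, y, lam=1e-2)
```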


  8. Remedy. When n is large, forming the Hessian

      H(w) = Σ_{i=1}^n ∇²f_i(w) + ∇²R(w) =: Σ_{i=1}^n H_i(w) + Q(w)    (6)

  is expensive; the cost is O(nd²) in the above example. Idea: sub-sample only a few terms, say s, from {H_i(w)}_{i=1}^n, without forming them, to build H̃, so that the cost can be reduced to O(sd²). When d is large, solving (3) is also expensive: O(d³). Idea: use an iterative solver such as Conjugate Gradient to solve (3).
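The second remedy can be sketched as a standard conjugate-gradient loop; CG needs only Hessian-vector products, so the system matrix never has to be factored. This is a generic CG sketch, not the authors' code.

```python
import numpy as np

# Minimal conjugate gradient for H v = g with H symmetric positive definite.
def cg_solve(matvec, g, tol=1e-10, max_iter=200):
    v = np.zeros_like(g)
    r = g.copy()                        # residual r = g - H v
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = matvec(p)
        alpha = rs / (p @ Hp)
        v = v + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return v

# Usage on a random positive-definite system:
rng = np.random.default_rng(2)
B = rng.standard_normal((50, 8))
H = B.T @ B + np.eye(8)                 # positive definite
g = rng.standard_normal(8)
v = cg_solve(lambda x: H @ x, g)
```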

  9. Main contributions.
  - We propose randomized Newton-type algorithms that exploit non-uniform sub-sampling of {∇²f_i(w)}_{i=1}^n, as well as inexact updates, as means to reduce the computational complexity.
  - Two non-uniform sampling distributions, based on squared row norms and on leverage scores, are considered in order to capture important terms among {∇²f_i(w)}_{i=1}^n.
  - We show that at each iteration, non-uniformly sampling at most O(d log d) terms from {∇²f_i(w)}_{i=1}^n is sufficient to achieve a linear-quadratic convergence rate in w when a suitable initial point is provided.
  - We show that to achieve a locally problem-independent linear convergence rate, the per-iteration complexities of our algorithm have lower dependence on condition numbers than [Agarwal et al., 2016, Pilanci and Wainwright, 2015, Roosta-Khorasani and Mahoney, 2016b].
  - We empirically demonstrate that our methods are at least twice as fast as Newton's method on ridge logistic regression on several real datasets.

  10. Related work.
  - Newton sketch [Pilanci and Wainwright, 2015] considers a similar class of problems and proposes sketching the Hessian using random sub-Gaussian matrices or randomized orthonormal systems.
  - Algorithms that employ uniform sub-sampling constitute a popular line of work [Byrd et al., 2011, Erdogdu and Montanari, 2015, Martens, 2010, Vinyals and Povey, 2011].
  - Roosta-Khorasani and Mahoney [2016a,b] consider a more general class of problems and, under a variety of conditions, thoroughly study the local and global convergence properties of sub-sampled Newton methods where the gradient and/or the Hessian are uniformly sub-sampled.
  - Agarwal et al. [2016] propose a stochastic algorithm (LiSSA) that, for solving the sub-problems, employs unbiased estimators of the inverse of the Hessian.

  11. Roadmap. 1. Algorithm description: overview of the algorithm; non-uniformly sub-sampled Hessian (sampling scheme); inexact updates (solver). 2. Convergence results. 3. Empirical results.

  12. 1. Algorithm description: overview of the algorithm; non-uniformly sub-sampled Hessian (sampling scheme); inexact updates (solver). 2. Convergence results. 3. Empirical results.

  13. Sub-sampled Newton methods (SSN). Algorithm:
  1. Construct an approximate Hessian H̃(w) by non-uniformly sub-sampling terms from {H_i(w)}_{i=1}^n, without forming the H_i(w)'s, based on a sampling scheme. The update formula becomes

      H̃(w_t) v = g(w_t).                                              (7)

  2. Solve the subproblem (7) using an iterative solver such as CG to return an approximate v, denoted ṽ, and set

      w_{t+1} = w_t − ṽ.                                              (8)
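The two steps might be sketched as follows for the case H(w) = A^T A treated in the simple example below, using row-norm-squared sampling probabilities (one of the two schemes considered). A direct solve stands in for CG here, and all names are illustrative rather than the authors' implementation.

```python
import numpy as np

# Step 1: sub-sample s rows of A with p_i proportional to ||A_i||^2 and
# rescale so that the sampled Hessian ~H = A^T S^T S A is unbiased for A^T A.
def subsample_hessian(A, s, rng):
    p = (A * A).sum(axis=1)
    p = p / p.sum()                               # row-norm-squared probabilities
    idx = rng.choice(A.shape[0], size=s, p=p)
    SA = A[idx] / np.sqrt(s * p[idx])[:, None]    # rescaled sampled rows
    return SA.T @ SA                              # costs O(s d^2), not O(n d^2)

# Step 2: solve the sub-sampled system (7) and take the update (8).
def ssn_step(w, A, g, s, rng):
    H_tilde = subsample_hessian(A, s, rng)        # approximate Hessian
    v = np.linalg.solve(H_tilde, g)               # stand-in for a CG solve
    return w - v
```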

  14. Complexity. The total complexity can be expressed as

      T · (t_grad + t_const + t_solve),                                (9)

  where: the number of total iterations T is determined by the convergence rate (sampling scheme and solver); t_grad is the time it takes to compute the full gradient ∇F(w_t) (not discussed here); t_const is the per-iteration time needed to construct {p_i}_{i=1}^n and sample s terms (sampling scheme); and t_solve is the per-iteration time needed to (implicitly) form H̃ (sampling scheme) and to (inexactly) solve the linear system (solver).

  15. 1. Algorithm description: overview of the algorithm; non-uniformly sub-sampled Hessian (sampling scheme); inexact updates (solver). 2. Convergence results. 3. Empirical results.


  19. A simple example. When f_i(w) = ℓ(x_i^T w) and R(w) = 0,

      H_i(w) = ∇²f_i(w) = ℓ''(x_i^T w) · x_i x_i^T.                    (10)

  Let A ∈ R^{n×d} be the matrix whose i-th row is

      A_i = (ℓ''(x_i^T w))^{1/2} x_i, so that A_i A_i^T = H_i(w).      (11)

  Forming A takes O(nd) time, and A^T A = Σ_i H_i(w) = H (which needs O(nd²) to compute). Consider sub-sampling rows from A such that

      H(w) = A^T A ≈ A^T S^T S A = H̃(w).                              (12)

  The running time is reduced from O(nd²) to O(sd²).
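The other sampling distribution mentioned in the contributions, leverage scores, can be computed exactly from a thin QR factorization of A: they are the squared row norms of an orthonormal basis for A's column space. This is a plain sketch (exact computation; fast approximation is what makes the scheme practical at scale).

```python
import numpy as np

def leverage_scores(A):
    Q, _ = np.linalg.qr(A, mode='reduced')   # thin QR: A = Q R, Q in R^{n x d}
    return (Q * Q).sum(axis=1)               # l_i = ||Q_{i,:}||^2, each in [0, 1]

rng = np.random.default_rng(4)
A = rng.standard_normal((60, 4))
scores = leverage_scores(A)
p = scores / scores.sum()                    # sampling probabilities p_i
```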

  20. General case. Assume each H_i(w) has a readily accessible low-rank decomposition H_i(w) = A_i A_i^T, where A_i ∈ R^{d×k_i}. Further assume that k_i = k = O(1) (k_i = 1 in the above example). Denote Q = ∇²R(w).
