Second order machine learning
Michael W. Mahoney
ICSI and Department of Statistics, UC Berkeley

Outline
Machine Learning's "Inverse" Problem
Your choice:
- 1st Order Methods: FLAG n' FLARE, or disentangle geometry from the sequence of iterates
- 2nd Order Methods: Stochastic Newton-Type Methods
  - "simple" methods for convex
  - "more subtle" methods for non-convex

Introduction
Big Data ... Massive Data ...

Introduction
Humongous Data ...

Introduction
Big Data: How do we view BIG data?

Introduction
Algorithmic & Statistical Perspectives ...

Computer Scientists
- Data: are a record of everything that happened.
- Goal: process the data to find interesting patterns and associations.
- Methodology: develop approximation algorithms under different models of data access, since the goal is typically computationally hard.

Statisticians (and Natural Scientists, etc.)
- Data: are a particular random instantiation of an underlying process describing unobserved patterns in the world.
- Goal: extract information about the world from noisy data.
- Methodology: make inferences (perhaps about unseen events) by positing a model that describes the random variability of the data around the deterministic model.

Introduction
... are VERY different paradigms

Statistics, natural sciences, scientific computing, etc.:
- Problems often involve computation, but the study of computation per se is secondary
- Only makes sense to develop algorithms for well-posed problems (a solution exists, is unique, and varies continuously with the input data)
- First, write down a model, and think about computation later

Computer science:
- Easier to study computation per se in discrete settings, e.g., Turing machines, logic, complexity classes
- Theory of algorithms divorces computation from data
- First, run a fast algorithm, and ask what it means later

Introduction
Context: My first stab at deep learning

Introduction
A blog about my first stab at deep learning

Efficient and Effective Optimization Methods
Problem Statement

Problem 1: Composite Optimization Problem
    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)
f: convex and smooth; h: convex and (non-)smooth

Problem 2: Finite-Sum Minimization Problem
    min_{x ∈ X ⊆ R^d} F(x) = (1/n) Σ_{i=1}^n f_i(x)
f_i: (non-)convex and smooth; n ≫ 1

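To make the two problem templates concrete, here is a minimal Python sketch (not from the slides): it instantiates Problem 1 with a hypothetical least-squares loss f and an l1 penalty h, and Problem 2 as an average of per-example squared losses f_i. The specific choices of f, h, and f_i are illustrative assumptions only.

```python
import numpy as np

def f_smooth(x, A, b):
    # Convex, smooth piece: f(x) = (1/2) * ||A x - b||_2^2
    r = A @ x - b
    return 0.5 * r @ r

def h_nonsmooth(x, lam=0.1):
    # Convex, non-smooth piece: h(x) = lam * ||x||_1
    return lam * np.abs(x).sum()

def F_composite(x, A, b, lam=0.1):
    # Problem 1: F(x) = f(x) + h(x)
    return f_smooth(x, A, b) + h_nonsmooth(x, lam)

def F_finite_sum(x, A, b):
    # Problem 2: F(x) = (1/n) * sum_i f_i(x), with f_i(x) = (1/2) * (<a_i, x> - b_i)^2
    return np.mean(0.5 * (A @ x - b) ** 2)
```
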
Efficient and Effective Optimization Methods
Modern "Big Data" vs. Classical Optimization Algorithms

Classical optimization algorithms are effective but inefficient on modern big data. We need to design variants that are:
1. Efficient, i.e., low per-iteration cost
2. Effective, i.e., fast convergence rate

Efficient and Effective Optimization Methods

Scientific computing and machine learning share the same challenges, and use the same means, but to get to different ends!

Machine learning has been, and continues to be, very busy designing efficient and effective optimization methods.

Efficient and Effective Optimization Methods
First Order Methods

Variants of Gradient Descent (GD):
    x^(k+1) = x^(k) − α_k ∇F(x^(k))
- Reduce the per-iteration cost of GD ⇒ Efficiency
- Achieve the convergence rate of GD ⇒ Effectiveness

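As a reference point, a minimal sketch of plain (full-gradient) descent implementing the update above, with a constant step-size; the toy quadratic objective at the end is a hypothetical example, not from the slides.

```python
import numpy as np

def gradient_descent(grad_F, x0, step=0.1, T=100):
    # Plain GD: x^(k+1) = x^(k) - alpha_k * grad F(x^(k)), here with alpha_k = step.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        x = x - step * grad_F(x)
    return x

# Toy example: F(x) = (1/2) * ||x||^2, so grad F(x) = x and the minimizer is 0.
x_min = gradient_descent(lambda x: x, x0=np.array([3.0, -2.0]), step=0.5, T=50)
```
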
Efficient and Effective Optimization Methods
First Order Methods

E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...

Efficient and Effective Optimization Methods
But why?

Q: Why do we use (stochastic) 1st order methods?
- Cheaper iterations? (i.e., n ≫ 1 and/or d ≫ 1)
- Avoids over-fitting?

Efficient and Effective Optimization Methods
1st order methods and "over-fitting"

Challenges with "simple" 1st order methods for "over-fitting":
- Highly sensitive to ill-conditioning
- Very difficult to tune (many) hyper-parameters

"Over-fitting" is difficult with "simple" 1st order methods!

Efficient and Effective Optimization Methods
Remedy?

1. "Not-So-Simple" 1st order methods, e.g., accelerated and adaptive methods
2. 2nd order methods, e.g.,
    x^(k+1) = x^(k) − [∇²F(x^(k))]^(−1) ∇F(x^(k))

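For contrast with the first-order update, a minimal sketch of the classical Newton iteration shown above, solving the Newton linear system rather than forming the inverse Hessian; the quadratic test problem is an illustrative assumption.

```python
import numpy as np

def newton_method(grad_F, hess_F, x0, T=20):
    # Classical Newton: x^(k+1) = x^(k) - [grad^2 F(x^(k))]^{-1} grad F(x^(k))
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(T):
        p = np.linalg.solve(hess_F(x), grad_F(x))  # solve the Newton system; avoid explicit inversion
        x = x - p
    return x

# Toy strongly convex quadratic: F(x) = (1/2) x^T A x - b^T x, minimized at A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_star = newton_method(lambda x: A @ x - b, lambda x: A, np.zeros(2))
```
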
Efficient and Effective Optimization Methods
Your Choice Of...

Efficient and Effective Optimization Methods
Which Problem?

1. "Not-So-Simple" 1st order methods: FLAG n' FLARE
   Problem 1: Composite Optimization Problem
       min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)
   f: convex and smooth; h: convex and (non-)smooth

2. 2nd order methods: Stochastic Newton-Type Methods (Stochastic Newton, Trust Region, Cubic Regularization)
   Problem 2: Finite-Sum Minimization Problem
       min_{x ∈ X ⊆ R^d} F(x) = (1/n) Σ_{i=1}^n f_i(x)
   f_i: (non-)convex and smooth; n ≫ 1

Efficient and Effective Optimization Methods
Collaborators

FLAG n' FLARE:
- Fred Roosta (UC Berkeley)
- Xiang Cheng (UC Berkeley)
- Stefan Palombo (UC Berkeley)
- Peter L. Bartlett (UC Berkeley & QUT)

Sub-Sampled Newton-Type Methods for Convex:
- Fred Roosta (UC Berkeley)
- Peng Xu (Stanford)
- Jiyan Yang (Stanford)
- Christopher Ré (Stanford)

Sub-Sampled Newton-Type Methods for Non-convex:
- Fred Roosta (UC Berkeley)
- Peng Xu (Stanford)

Implementations on GPU, etc.:
- Fred Roosta (UC Berkeley)
- Sudhir Kylasa (Purdue)
- Ananth Grama (Purdue)

First-order methods: FLAG n' FLARE
Subgradient Method

Composite Optimization Problem
    min_{x ∈ X ⊆ R^d} F(x) = f(x) + h(x)
f: convex and (non-)smooth; h: convex and (non-)smooth

First-order methods: FLAG n' FLARE
Subgradient Method

Algorithm 1: Subgradient Method
1: Input: x_1 and T
2: for k = 1, 2, ..., T−1 do
3:   g_k ∈ ∂(f(x_k) + h(x_k))
4:   x_{k+1} = argmin_{x ∈ X} ⟨g_k, x⟩ + (1/(2α_k)) ‖x − x_k‖²
5: end for
6: Output: x̄ = (1/T) Σ_{t=1}^T x_t

α_k: step-size
- Constant step-size: α_k = α
- Diminishing step-size: Σ_{k=1}^∞ α_k = ∞, lim_{k→∞} α_k = 0

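A minimal Python sketch of Algorithm 1, specialized to the unconstrained case X = R^d, where the proximal step on line 4 reduces to x_{k+1} = x_k − α_k g_k; the diminishing step-size schedule and the l1 test objective are illustrative assumptions, not from the slides.

```python
import numpy as np

def subgradient_method(subgrad_F, x1, T, step=lambda k: 1.0 / np.sqrt(k)):
    # Subgradient method with iterate averaging (Algorithm 1), for X = R^d.
    x = np.asarray(x1, dtype=float).copy()
    iterates = [x.copy()]
    for k in range(1, T):
        g = subgrad_F(x)           # g_k in the subdifferential of f + h at x_k
        x = x - step(k) * g        # the prox step reduces to a subgradient step when X = R^d
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)   # x_bar = (1/T) * sum_{t=1}^T x_t

# Example: F(x) = ||x||_1, for which sign(x) is a valid subgradient.
x_bar = subgradient_method(np.sign, x1=np.array([2.0, -3.0]), T=500)
```
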
First-order methods: FLAG n' FLARE
Example: Logistic Regression

{a_i, b_i}: features and labels, a_i ∈ {0,1}^d, b_i ∈ {0,1}

    F(x) = Σ_{i=1}^n [ log(1 + e^⟨a_i,x⟩) − b_i ⟨a_i,x⟩ ]

    ∇F(x) = Σ_{i=1}^n ( 1/(1 + e^(−⟨a_i,x⟩)) − b_i ) a_i

Infrequent features ⇒ small partial derivatives

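A minimal sketch of this objective and its gradient, with the rows of A stacking the feature vectors a_i and b holding the 0/1 labels; the use of np.logaddexp for numerical stability is an implementation choice, not part of the slides.

```python
import numpy as np

def logistic_loss(x, A, b):
    # F(x) = sum_i [ log(1 + exp(<a_i, x>)) - b_i * <a_i, x> ], with labels b_i in {0, 1}
    z = A @ x
    return np.sum(np.logaddexp(0.0, z) - b * z)

def logistic_grad(x, A, b):
    # grad F(x) = sum_i [ sigmoid(<a_i, x>) - b_i ] * a_i
    z = A @ x
    return A.T @ (1.0 / (1.0 + np.exp(-z)) - b)

# Infrequent (mostly-zero) feature columns contribute small entries to grad F(x).
```
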
First-order methods: FLAG n' FLARE
Predictive vs. irrelevant features

- Very infrequent features ⇒ highly predictive (e.g., "CANON" in document classification)
- Very frequent features ⇒ highly irrelevant (e.g., "and" in document classification)

First-order methods: FLAG n' FLARE
AdaGrad [Duchi et al., 2011]

- Frequent features ⇒ large partial derivatives ⇒ learning rate ↓
- Infrequent features ⇒ small partial derivatives ⇒ learning rate ↑

Replace α_k with a scaling matrix, chosen adaptively...
Many follow-up works: RMSProp, Adam, Adadelta, etc.

First-order methods: FLAG n' FLARE
AdaGrad [Duchi et al., 2011]

Algorithm 2: AdaGrad
1: Input: x_1, η, and T
2: for k = 1, 2, ..., T−1 do
3:   g_k ∈ ∂f(x_k)
4:   Form scaling matrix S_k based on {g_t; t = 1, ..., k}
5:   x_{k+1} = argmin_{x ∈ X} ⟨g_k, x⟩ + h(x) + (1/2)(x − x_k)^T S_k (x − x_k)
6: end for
7: Output: x̄ = (1/T) Σ_{t=1}^T x_t

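A minimal sketch of the diagonal variant of AdaGrad, specialized to the unconstrained case with h = 0: the scaling matrix S_k is taken (as an assumption) to be the diagonal of root-sum-of-squared past gradients divided by η, so line 5 becomes a per-coordinate adaptive gradient step, and the averaged output mirrors line 7 of Algorithm 2. This is an illustrative specialization, not the general projected version analyzed in the paper.

```python
import numpy as np

def adagrad(grad_f, x1, eta=0.1, T=500, eps=1e-8):
    # Diagonal AdaGrad, specialized to X = R^d and h = 0.
    x = np.asarray(x1, dtype=float).copy()
    G = np.zeros_like(x)                       # running sum of squared gradients
    iterates = [x.copy()]
    for k in range(1, T):
        g = grad_f(x)
        G += g * g                             # accumulate per-coordinate gradient history
        x = x - eta * g / (np.sqrt(G) + eps)   # large past gradients => smaller learning rate
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)           # averaged iterate, as in Algorithm 2's output

# Example: minimize F(x) = (1/2) * ||x||^2, whose gradient is x.
x_bar = adagrad(lambda x: x, x1=np.array([5.0, -4.0]))
```
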
First-order methods: FLAG n' FLARE
Convergence

Let x* be an optimum point. We have:

AdaGrad [Duchi et al., 2011]:
    F(x̄) − F(x*) ≤ O( √d D_∞ α / √T ),
where α ∈ [1/√d, 1] and D_∞ = max_{x,y ∈ X} ‖y − x‖_∞, and

Subgradient Descent:
    F(x̄) − F(x*) ≤ O( D_2 / √T ),
where D_2 = max_{x,y ∈ X} ‖y − x‖_2.