

  1. Optimization in the “Big Data” Regime
  Sham M. Kakade
  Machine Learning for Big Data, CSE547/STAT548
  University of Washington

  2. Announcements...
  - HW2 due Mon.
  - Work on your project milestones: read/related-work summary, some empirical work.
  Today:
  - Review: discuss classical optimization.
  - New: How do we optimize in the “big data” regime, with large sample sizes and large dimension? Bridge classical to modern optimization.

  3. Machine Learning and the Big Data Regime...
  Goal: find a d-dimensional parameter vector which minimizes the loss on n training examples.
  - We have n training examples (x_1, y_1), ..., (x_n, y_n).
  - We have a parametric classifier h(x, w), where w is a d-dimensional vector.
  - Objective: min_w L(w), where L(w) = Σ_i loss(h(x_i, w), y_i).
  “Big Data Regime”: How do you optimize this when n and d are large? Memory? Parallelization?
  Can we obtain linear-time algorithms to find an ε-accurate solution, i.e. find ŵ so that L(ŵ) − min_w L(w) ≤ ε?

  4. Plan:
  Goal: algorithms that achieve a fixed target accuracy ε.
  - Review: classical optimization viewpoints.
  - A modern view: can we bridge classical optimization to modern problems?
    - Dual Coordinate Descent Methods
    - Stochastic Variance Reduced Gradient method (SVRG)

  5. Abstraction: Least Squares
  min_w L(w), where L(w) = Σ_{i=1}^n (w · x_i − y_i)² + λ‖w‖²
  How much computation time is required to get ε accuracy? n points, d dimensions.
  “Big Data Regime”: How do you optimize this when n and d are large?
  More general case: optimize sums of convex (or non-convex?) functions; some guarantees will still hold.
  Aside: think of x as a large feature representation.
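  A minimal numpy sketch (not from the slides) of this abstraction: build a synthetic dataset and evaluate the regularized least-squares objective L(w) = Σ_i (w · x_i − y_i)² + λ‖w‖². The sample size n, dimension d, and λ below are assumed values for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 1000, 50, 1.0            # assumed sample size, dimension, regularization
    X = rng.normal(size=(n, d))          # rows are the x_i
    w_true = rng.normal(size=d)
    Y = X @ w_true + 0.1 * rng.normal(size=n)

    def L(w, X, Y, lam):
        """Regularized least-squares objective: squared residuals plus lam * ||w||^2."""
        r = X @ w - Y
        return r @ r + lam * w @ w

    print(L(np.zeros(d), X, Y, lam))     # objective value at w = 0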

  6. Review: Direct Solution
  min_w L(w), where L(w) = Σ_{i=1}^n (w · x_i − y_i)² + λ‖w‖²
  - Solution: w = (X^⊤X + λI)^{-1} X^⊤Y, where X is the n × d matrix whose rows are the x_i, and Y is an n-dim vector.
  - Numerical solution: the “backslash” implementation.
  - Time complexity: O(nd²); memory: O(d²).
  Not feasible due to both time and memory.
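  A sketch of this closed-form solve in numpy, under the same kind of assumed synthetic setup. Forming X^⊤X is the O(nd²)-time, O(d²)-memory step that makes this infeasible when n and d are large.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 1000, 50, 1.0            # assumed sizes
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    A = X.T @ X + lam * np.eye(d)        # O(n d^2) time, O(d^2) memory
    b = X.T @ Y
    w_direct = np.linalg.solve(A, b)     # "backslash"-style solve, preferred over an explicit inverse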

  7. Review: Gradient Descent (and Conjugate GD)
  min_w L(w), where L(w) = Σ_{i=1}^n (w · x_i − y_i)² + λ‖w‖²
  n points, d dimensions; λ_max, λ_min are the max and min eigenvalues of the “design matrix” (1/n) Σ_i x_i x_i^⊤.
  # iterations and computation time to get ε accuracy:
  - Gradient Descent (GD): (λ_max/λ_min) log(1/ε) iterations, nd (λ_max/λ_min) log(1/ε) time.
  - Conjugate Gradient Descent: √(λ_max/λ_min) log(1/ε) iterations, nd √(λ_max/λ_min) log(1/ε) time.
  Memory: O(d).
  Better runtime and memory, but still costly.
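  A plain gradient-descent sketch for the same objective (each iteration costs O(nd)). The step size below is an assumption chosen from the smoothness of this loss; it is not specified on the slide.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 1000, 50, 1.0
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def grad(w):
        # gradient of sum_i (w . x_i - y_i)^2 + lam ||w||^2
        return 2 * X.T @ (X @ w - Y) + 2 * lam * w

    lam_max = np.linalg.eigvalsh(X.T @ X / n).max()   # largest eigenvalue of the design matrix
    eta = 1.0 / (2 * n * lam_max + 2 * lam)           # conservative 1/smoothness step size (assumed)
    w = np.zeros(d)
    for _ in range(500):                              # each iteration costs O(nd)
        w -= eta * grad(w)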

  8. Review: Stochastic Gradient Descent (SGD)
  SGD update rule: at each time t, sample a point (x_i, y_i) and update
  w ← w − η (w · x_i − y_i) x_i
  - Problem: even if w = w*, the update changes w.
  - Rate: convergence rate is O(1/ε), with decaying η.
  A simple algorithm, light on memory, but with a poor convergence rate.
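  A sketch of this SGD loop. The decaying schedule η_t = η_0 / (1 + t/n) and the number of passes are assumptions; the slide only states that η decays.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 50
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    w = np.zeros(d)
    eta0 = 0.01                                  # assumed initial step size
    for t in range(20 * n):                      # each step touches a single data point
        i = rng.integers(n)
        eta = eta0 / (1.0 + t / n)               # decaying learning rate
        w -= eta * (w @ X[i] - Y[i]) * X[i]      # the slide's update rule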

  9. Review: Stochastic Gradient Descent
  λ_min is the min eigenvalue of (1/n) Σ_i x_i x_i^⊤. Suppose the gradients are bounded by B.
  - # iterations to get ε-accuracy: B²/(λ_min ε)
  - Computation time to get ε-accuracy: dB²/(λ_min ε)

  10. Regression in the big data regime?
  min_w L(w): How much computation time is required to get ε accuracy?
  “Big Data Regime”: How do you optimize this when n and d are large? Can we ‘fix’ the instabilities of SGD?
  Let’s look at (regularized) linear regression.
  - Convex optimization: all results can be generalized to smooth + strongly convex loss functions.
  - Non-convex optimization: some ideas generalize.

  11. Duality (without Duality)
  w = (X^⊤X + λI)^{-1} X^⊤Y = X^⊤ (XX^⊤ + λI)^{-1} Y =: (1/λ) X^⊤ α,  where α = (I + XX^⊤/λ)^{-1} Y.
  Idea: let’s compute the n-dim vector α, and let’s do this with coordinate ascent.
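  A small numpy check (not on the slides) of this identity: the d×d primal solve and the n×n solve for α recover the same w. Sizes are assumed.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 200, 50, 1.0
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)   # d x d system
    alpha = np.linalg.solve(np.eye(n) + X @ X.T / lam, Y)            # n x n system
    w_dual = X.T @ alpha / lam

    print(np.allclose(w_primal, w_dual))                             # True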

  12. SDCA: stochastic dual coordinate ascent
  G(α_1, α_2, ..., α_n) = (1/2) α^⊤ (I + XX^⊤/λ) α − Y^⊤ α
  The minimizer of G(α) is α = (I + XX^⊤/λ)^{-1} Y.
  SDCA:
  - Start with α = 0.
  - Choose a coordinate i randomly, and update: α_i = argmin_z G(α_1, ..., α_{i−1}, z, α_{i+1}, ..., α_n). This is easy to do, as we touch just one datapoint.
  - Return w = (1/λ) X^⊤ α.

  13. SDCA: the algorithm
  G(α_1, α_2, ..., α_n) = (1/2) α^⊤ (I + XX^⊤/λ) α − Y^⊤ α
  1. Start with α = 0, w = (1/λ) X^⊤ α.
  2. Choose a coordinate i randomly, compute the difference
     Δα_i = ((y_i − w · x_i) − α_i) / (1 + ‖x_i‖²/λ)
     and update:
     α_i ← α_i + Δα_i,   w ← w + (1/λ) Δα_i x_i
  3. Return w = (1/λ) X^⊤ α.
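  A sketch of these SDCA updates for the ridge objective, keeping w = (1/λ) X^⊤α in sync after every coordinate step. The number of passes over the data is an arbitrary assumed choice.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 1000, 50, 1.0
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    alpha = np.zeros(n)
    w = X.T @ alpha / lam                  # = 0 since alpha = 0
    for _ in range(20 * n):                # each step touches a single data point
        i = rng.integers(n)
        # exact coordinate minimization of G in coordinate i
        delta = ((Y[i] - w @ X[i]) - alpha[i]) / (1.0 + X[i] @ X[i] / lam)
        alpha[i] += delta
        w += delta * X[i] / lam            # maintain w = (1/lam) X^T alpha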

  14. Guarantees: speedups for the big data regime
  n points, d dimensions, λ_av the average eigenvalue. Computation time to get ε accuracy (Shalev-Shwartz & Zhang ’12).
  GD vs SDCA:
  nd (λ_max/λ_min) log(1/ε)  →  (n + λ_av/λ_min) d log(1/ε)
  Conjugate GD vs acceleration + SDCA: one can accelerate SDCA as well (Frostig, Ge, K., Sidford, 2015).

  15. Comparisons to SGD
  - Both algorithms touch one data point at a time, with the same computational cost per iteration.
  - SDCA has a “learning rate” which is adaptive to the data point.
  - SGD has a 1/ε convergence rate, while SDCA has a log(1/ε) convergence rate.
  - Memory: SDCA O(n + d), SGD O(d).
  - SDCA: can touch points in any order.

  16. SDCA advantages/disadvantages
  What about more general convex problems? e.g. min_w L(w), where L(w) = Σ_i loss(h(x_i, w), y_i)
  - The basic idea (formalized with duality) is pretty general for convex loss(·).
  - Works very well in practice.
  - Memory: SDCA needs O(n + d) memory, while SGD needs only O(d).
  What about an algorithm for non-convex problems?
  - SDCA seems heavily tied to the convex case.
  - We would like an algorithm that is highly accurate in the convex case and still sensible in the non-convex case.

  17. (another idea) Stochastic Variance Reduced Gradient (SVRG)
  1. Exact gradient computation: at stage s, using w̃_s, compute
     ∇L(w̃_s) = (1/n) Σ_{i=1}^n ∇loss(w̃_s, (x_i, y_i))
  2. Corrected SGD: initialize w ← w̃_s. For m steps:
     sample a point (x, y) and update
     w ← w − η (∇loss(w, (x, y)) − ∇loss(w̃_s, (x, y)) + ∇L(w̃_s))
  3. Update and repeat: w̃_{s+1} ← w.
  Two ideas:
  - If w̃ = w*, then there is no update.
  - Unbiased updates: the correction term −∇loss(w̃_s, (x, y)) + ∇L(w̃_s) has mean 0.
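  A sketch of the SVRG outer/inner loop on the ridge objective. The stage count, inner-loop length m, step size η, and the split of the regularizer across the n per-example losses are all assumptions, not taken from the slide.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 1000, 50, 1.0
    X = rng.normal(size=(n, d))
    Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

    def grad_i(w, i):
        # gradient of the i-th term (w . x_i - y_i)^2 + (lam/n) ||w||^2
        return 2 * (w @ X[i] - Y[i]) * X[i] + 2 * (lam / n) * w

    eta, m = 1e-3, 2 * n                   # assumed step size and inner-loop length
    w_tilde = np.zeros(d)
    for s in range(10):                    # outer stages
        # exact (full) gradient at the snapshot w_tilde
        full_grad = np.mean([grad_i(w_tilde, i) for i in range(n)], axis=0)
        w = w_tilde.copy()
        for _ in range(m):                 # corrected SGD steps
            i = rng.integers(n)
            w -= eta * (grad_i(w, i) - grad_i(w_tilde, i) + full_grad)
        w_tilde = w                        # w_tilde_{s+1} <- w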

  18. Guarantees of SVRG
  n points, d dimensions, λ_av the average eigenvalue. Computation time to get ε accuracy (Johnson & Zhang ’13).
  GD vs SVRG:
  nd (λ_max/λ_min) log(1/ε)  →  (n + λ_av/λ_min) d log(1/ε)
  Conjugate GD vs ??:
  nd √(λ_max/λ_min) log(1/ε)  →  ??
  Memory: O(d)
