(Recent advancements in) Optimization in the "Big Data" Regime

Sham M. Kakade
Computer Science & Engineering, Statistics
University of Washington
Machine Learning, Optimization, and more...

ML is having a profound impact: speech recognition (Siri, Echo), computer vision (ImageNet), game playing (AlphaGo), robotics (self-driving cars?), personalized health care, music recommendation (Spotify), ...

Optimization underlies machine learning. How can we optimize faster?
Machine Learning and the Big Data Regime...

Goal: find a d-dimensional parameter vector which minimizes the loss on n training examples.
- We have n training examples $(x_1, y_1), \ldots, (x_n, y_n)$.
- We have a parametric classifier $h(x, w)$, where $w$ is d-dimensional.

$$\min_w \sum_i \mathrm{loss}(h(x_i, w), y_i)$$

"Big Data Regime": How do you optimize this when n and d are large? Memory? Parallelization? Can we obtain linear-time algorithms?
This tutorial:

- Part 1: Convexity: regression (and more...) (optimization for prediction)
- Part 2: Non-convexity: PCA (and more...) (optimization for representation)
- Part 3: Statistics (what we care about)
- Part 4: Thoughts and open problems: parallelization, second-order methods, non-convexity, ...
Part 1: Least Squares

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

How much computation time is required to get $\epsilon$ accuracy? n points, d dimensions.

"Big Data Regime": How do you optimize this when n and d are large?

Aside: think of x as a large feature representation.
Review: Direct Solution

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

Solution: $w = (X^\top X + \lambda I)^{-1} X^\top Y$, where $X$ is the $n \times d$ matrix whose rows are the $x_i$, and $Y$ is an n-dimensional vector.

Time complexity: $O(nd^2)$; memory: $O(d^2)$.

Not feasible due to both time and memory.
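As a point of reference, here is a minimal NumPy sketch of the closed-form solve; the function name and the use of np.linalg.solve (rather than an explicit inverse) are illustrative choices, not from the slides.

```python
import numpy as np

def ridge_direct(X, Y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T Y.
    Forming X^T X costs O(n d^2) time and O(d^2) memory."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)   # d x d matrix
    b = X.T @ Y                     # d-dimensional vector
    return np.linalg.solve(A, b)    # solve the linear system instead of inverting
```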
Review: Gradient Descent (and Conjugate GD)

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

n points, d dimensions; $\lambda_{\max}$, $\lambda_{\min}$ are the extreme eigenvalues of the "design/data matrix".

Computation time to get $\epsilon$ accuracy:
- Gradient Descent (GD): $\frac{\lambda_{\max}}{\lambda_{\min}}\, nd \log(1/\epsilon)$
- Conjugate Gradient Descent: $\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\, nd \log(1/\epsilon)$

Memory: $O(d)$.

Better runtime and memory, but still costly.
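For concreteness, a minimal sketch of batch gradient descent on this objective; the function name, step size eta, and iteration count are our illustrative choices.

```python
import numpy as np

def ridge_gd(X, Y, lam, eta, num_iters):
    """Gradient descent on sum_i (w.x_i - y_i)^2 + lam*||w||^2.
    Each iteration costs O(n d); memory is O(d)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = 2 * X.T @ (X @ w - Y) + 2 * lam * w   # full gradient over all n points
        w = w - eta * grad
    return w
```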
Review: Stochastic Gradient Descent (SGD)

SGD update rule: at each time t, sample a point $(x_i, y_i)$ and update

$$w \leftarrow w - \eta (w \cdot x_i - y_i)\, x_i$$

Problem: even if $w = w^*$, the update changes $w$.

Rate: the convergence rate is $O(1/\epsilon)$, with a decaying step size $\eta$.

A simple algorithm, light on memory, but with a poor convergence rate.
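A minimal sketch of this update, assuming a 1/t step-size decay and omitting the regularizer exactly as the slide does; the function name and decay schedule are illustrative assumptions.

```python
import numpy as np

def lsq_sgd(X, Y, eta0, num_steps, seed=0):
    """SGD with the slide's update: w <- w - eta*(w.x_i - y_i)*x_i.
    A decaying step size (here eta0/t) is what gives the O(1/eps) rate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, num_steps + 1):
        i = rng.integers(n)                    # sample one data point
        eta = eta0 / t                         # decaying step size
        w -= eta * (w @ X[i] - Y[i]) * X[i]    # single-point gradient step
    return w
```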
Regression in the big data regime?

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

How much computation time is required to get $\epsilon$ accuracy?

"Big Data Regime": How do you optimize this when n and d are large?

Convex optimization: all results can be generalized to smooth + strongly convex loss functions.
Duality (without Duality)

$$w = (X^\top X + \lambda I)^{-1} X^\top Y = X^\top (X X^\top + \lambda I)^{-1} Y =: \frac{1}{\lambda} X^\top \alpha,$$

where $\alpha = (I + X X^\top / \lambda)^{-1} Y$.

Idea: let's compute the n-dimensional vector $\alpha$, and let's do this with coordinate ascent.
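A quick numerical sanity check of this identity; the problem sizes and the use of np.linalg.solve are arbitrary choices for illustration.

```python
import numpy as np

# Check that the primal and "dual" forms agree:
# (X^T X + lam*I)^{-1} X^T Y  ==  (1/lam) X^T alpha,  alpha = (I + X X^T/lam)^{-1} Y
rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.5
X, Y = rng.standard_normal((n, d)), rng.standard_normal(n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
alpha = np.linalg.solve(np.eye(n) + X @ X.T / lam, Y)
w_dual = X.T @ alpha / lam

assert np.allclose(w_primal, w_dual)
```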
SDCA: Stochastic Dual Coordinate Ascent

$$G(\alpha_1, \alpha_2, \ldots, \alpha_n) = \|\alpha - Y\|^2 + \frac{1}{\lambda}\, \alpha^\top X X^\top \alpha$$

The minimizer of $G(\alpha)$ is $\alpha = (I + X X^\top / \lambda)^{-1} Y$.

SDCA:
- Start with $\alpha = 0$.
- Choose a coordinate $i$ randomly, and update:
  $$\alpha_i = \arg\min_z G(\alpha_1, \ldots, \alpha_{i-1}, z, \alpha_{i+1}, \ldots, \alpha_n)$$
  This is easy to do, as we touch just one data point.
- Return $w = \frac{1}{\lambda} X^\top \alpha$.
SDCA: the algorithm

$$G(\alpha_1, \alpha_2, \ldots, \alpha_n) = \|\alpha - Y\|^2 + \frac{1}{\lambda}\, \alpha^\top X X^\top \alpha$$

Start with $\alpha = 0$ and $w = \frac{1}{\lambda} X^\top \alpha$. Then repeat:
1. Choose a coordinate $i$ randomly, and compute the difference:
   $$\Delta \alpha_i = \frac{(y_i - w \cdot x_i) - \alpha_i}{1 + \|x_i\|^2 / \lambda}$$
2. Update:
   $$\alpha_i \leftarrow \alpha_i + \Delta \alpha_i, \qquad w \leftarrow w + \frac{1}{\lambda}\, \Delta \alpha_i\, x_i$$
Return $w = \frac{1}{\lambda} X^\top \alpha$.
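A sketch of the SDCA loop for this ridge objective; the epoch-based loop structure and function name are our assumptions, but the coordinate update is the one on the slide.

```python
import numpy as np

def sdca(X, Y, lam, num_epochs, seed=0):
    """SDCA for ridge regression with the update
    delta_i = ((y_i - w.x_i) - alpha_i) / (1 + ||x_i||^2 / lam)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                          # w = X^T alpha / lam, kept in sync
    for _ in range(num_epochs * n):
        i = rng.integers(n)
        delta = (Y[i] - w @ X[i] - alpha[i]) / (1 + X[i] @ X[i] / lam)
        alpha[i] += delta
        w += (delta / lam) * X[i]            # O(d) update, touches one data point
    return w
```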
Guarantees: speedups for the big data regime

n points, d dimensions, $\lambda_{\mathrm{av}}$ the average eigenvalue. Computation time to get $\epsilon$ accuracy. (Shalev-Shwartz & Zhang '12), (Frostig, Ge, K., Sidford, 2015)

GD vs SDCA:
$$\frac{\lambda_{\max}}{\lambda_{\min}}\, nd \log(1/\epsilon) \;\rightarrow\; \left(n + \frac{\lambda_{\mathrm{av}}}{\lambda_{\min}}\right) d \log(1/\epsilon)$$

Conjugate GD vs acceleration + SDCA:
$$\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\, nd \log(1/\epsilon) \;\rightarrow\; \left(n + \sqrt{\frac{n\, \lambda_{\mathrm{av}}}{\lambda_{\min}}}\right) d \log(1/\epsilon)$$

Memory: $O(n + d)$.
(Another idea) Stochastic Variance Reduced Gradient (SVRG)

1. Exact gradient computation: at stage $s$, using $\tilde{w}_s$, compute
   $$\nabla L(\tilde{w}_s) = \frac{1}{n} \sum_{i=1}^n \nabla \mathrm{loss}(\tilde{w}_s, (x_i, y_i))$$
2. Corrected SGD: initialize $w \leftarrow \tilde{w}_s$. For m steps, sample a point $(x, y)$ and update
   $$w \leftarrow w - \eta \left( \nabla \mathrm{loss}(w, (x, y)) - \nabla \mathrm{loss}(\tilde{w}_s, (x, y)) + \nabla L(\tilde{w}_s) \right)$$
3. Update and repeat: $\tilde{w}_{s+1} \leftarrow w$.

Two ideas:
- If $\tilde{w} = w^*$, then there is no update.
- Unbiased updates: the correction term $-\nabla \mathrm{loss}(\tilde{w}_s, (x, y)) + \nabla L(\tilde{w}_s)$ has mean 0.
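A sketch of SVRG for least squares, assuming the averaged objective $L(w) = \frac{1}{n}\sum_i (w \cdot x_i - y_i)^2 + \lambda\|w\|^2$ with the regularizer folded into each per-example gradient; the stage count, inner-loop length m, step size, and function names are our choices.

```python
import numpy as np

def svrg_least_squares(X, Y, lam, eta, num_stages, m, seed=0):
    """SVRG: full gradient at an anchor point w_tilde, then m corrected SGD steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)

    def grad_i(w, i):
        # gradient of the i-th term (w.x_i - y_i)^2 + lam*||w||^2
        return 2 * (w @ X[i] - Y[i]) * X[i] + 2 * lam * w

    for _ in range(num_stages):
        full_grad = 2 * X.T @ (X @ w_tilde - Y) / n + 2 * lam * w_tilde
        w = w_tilde.copy()
        for _ in range(m):                   # corrected SGD inner loop
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_tilde, i) + full_grad
            w -= eta * g
        w_tilde = w                          # update the anchor point and repeat
    return w_tilde
```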
Guarantees of SVRG

n points, d dimensions, $\lambda_{\mathrm{av}}$ the average eigenvalue. Computation time to get $\epsilon$ accuracy. (Johnson & Zhang '13)

GD vs SVRG:
$$\frac{\lambda_{\max}}{\lambda_{\min}}\, nd \log(1/\epsilon) \;\rightarrow\; \left(n + \frac{\lambda_{\mathrm{av}}}{\lambda_{\min}}\right) d \log(1/\epsilon)$$

Conjugate GD vs ??:
$$\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\, nd \log(1/\epsilon) \;\rightarrow\; ??$$

Memory: $O(d)$.
Part 1: Summary

These methods extend to sums of convex functions,
$$\min_w L(w) = \sum_i \mathrm{loss}(h(x_i, w), y_i),$$
for smooth $\mathrm{loss}(\cdot)$ and strongly convex $L(\cdot)$.

Take home: natural stochastic algorithms, similar to SGD, which enjoy "numerical accuracy" guarantees.

Other good ideas: sketching is good in the large-n but "medium-sized"-d regime (Rokhlin and Tygert, 2008); it improves upon conjugate gradient in the big data regime.
Part 2: PCA

We have n vectors $x_1, \ldots, x_n$ in d dimensions and the matrix
$$A = \sum_{i=1}^n x_i x_i^\top.$$

How much computation time do you need to get an $\epsilon$-approximation of the top eigenvector? Constructing and storing the matrix may be costly.

Computation: how do you accurately estimate the leading eigenvector of A, in terms of n, d, the "gap", etc.?

Aside: (with modifications/CCA) this is the simplest way to learn embeddings.
Part 2: outline

Similar to least squares, we provide speedups for eigenvector computations:
- Power method → improvements for large n.
- Lanczos method → improvements for large n.

Key idea: "shift-and-invert" preconditioning, which utilizes the faster least squares algorithms above.
Review: Algebraic Methods

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

n points, d dimensions. Time complexity: $O(nd^2)$; memory: $O(d^2)$. No "gap" dependence.

"Big data regime": What about the large-n regime?
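For reference, a sketch of the direct approach in NumPy: it materializes $A = X^\top X$ and calls a dense symmetric eigensolver, which is exactly what becomes infeasible when d is large. The function name is ours.

```python
import numpy as np

def top_eigvec_direct(X):
    """Direct eigendecomposition of A = sum_i x_i x_i^T = X^T X.
    Forming A costs O(n d^2) time and O(d^2) memory."""
    A = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(A)    # symmetric eigensolver, ascending order
    return eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
```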
Review: The Power Method and Lanczos

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

n points, d dimensions, $\mathrm{gap} = \frac{\lambda_1 - \lambda_2}{\lambda_1}$, and $\mathrm{nnz}(A)$ is the number of nonzeros in $A$.

Computation time to get $\epsilon$ accuracy:
- Power method: $\frac{\mathrm{nnz}(A)}{\mathrm{gap}} \log(1/\epsilon) \approx \frac{nd}{\mathrm{gap}} \log(1/\epsilon)$
- Lanczos method: $\frac{\mathrm{nnz}(A)}{\sqrt{\mathrm{gap}}} \log(1/\epsilon) \approx \frac{nd}{\sqrt{\mathrm{gap}}} \log(1/\epsilon)$

"Big data regime": What about the large-n regime?
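A sketch of the power method that never forms A explicitly: each iteration applies X and then X^T, so the per-iteration cost is O(nd) (or O(nnz(X)) for sparse data). The random initialization and function name are our choices.

```python
import numpy as np

def power_method(X, num_iters, seed=0):
    """Power method for the top eigenvector of A = X^T X, without materializing A."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(num_iters):
        w = X.T @ (X @ w)            # A @ w computed as two matrix-vector products
        w /= np.linalg.norm(w)
    return w
```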
Review: Oja's algorithm and SGD

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

Initialize $w$ randomly and then repeat:
1. For a datapoint $i$ sampled randomly:
   $$w \leftarrow (I + \eta\, x_i x_i^\top)\, w$$
2. Normalize:
   $$w \leftarrow w / \|w\|$$

Computation time to get $\epsilon$ accuracy: $O(1/\epsilon)$. Memory: $O(d)$.
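A sketch of Oja's update as written above, with a random unit-norm initialization and a fixed step size eta (a decaying schedule is also common); the function name and these choices are ours.

```python
import numpy as np

def oja(X, eta, num_steps, seed=0):
    """Oja's algorithm (stochastic power iteration): w <- (I + eta*x_i x_i^T) w, then normalize."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d)            # start from a random direction
    w /= np.linalg.norm(w)
    for _ in range(num_steps):
        i = rng.integers(n)
        w += eta * (X[i] @ w) * X[i]      # (I + eta * x_i x_i^T) w, using one data point
        w /= np.linalg.norm(w)            # keep w on the unit sphere
    return w
```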
PCA in the "Big Data" Regime?

How do you find the top eigenvector of
$$A = \sum_{i=1}^n x_i x_i^\top\,?$$

"Big Data Regime": How do you optimize this when n and d are large?