(Recent advancements in) Optimization in the "Big Data" Regime

Sham M. Kakade
Computer Science & Engineering, Statistics
University of Washington
Machine Learning, Optimization, and more...

ML is having a profound impact: speech recognition (Siri, Echo), computer vision (ImageNet), game playing (AlphaGo), robotics (self-driving cars?), personalized health care, music recommendation (Spotify), ...

Optimization underlies machine learning. How can we optimize faster?
Machine Learning and the Big Data Regime...

Goal: find a d-dimensional parameter vector which minimizes the loss on n training examples.
- We have n training examples $(x_1, y_1), \ldots, (x_n, y_n)$.
- We have a parametric classifier $h(x, w)$, where $w$ is d-dimensional.

$$\min_w \sum_i \mathrm{loss}(h(x_i, w), y_i)$$

"Big Data Regime": How do you optimize this when n and d are large? Memory? Parallelization? Can we obtain linear-time algorithms?
This tutorial:

- Part 1: Convexity: regression (and more...) (optimization for prediction)
- Part 2: Non-convexity: PCA (and more...) (optimization for representation)
- Part 3: Statistics (what we care about)
- Part 4: Thoughts and open problems: parallelization, second-order methods, non-convexity, ...
Part 1: Least Squares

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

How much computation time is required to get $\epsilon$ accuracy? n points, d dimensions.

"Big Data Regime": How do you optimize this when n and d are large?

Aside: think of x as a large feature representation.
Review: Direct Solution

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

Solution: $w = (X^\top X + \lambda I)^{-1} X^\top Y$, where $X$ is the $n \times d$ matrix whose rows are the $x_i$, and $Y$ is an n-dimensional vector.

Time complexity: $O(nd^2)$; memory: $O(d^2)$.

Not feasible due to both time and memory.
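As a point of reference, here is a minimal NumPy sketch of the closed-form solve; the function name and the use of np.linalg.solve (rather than an explicit inverse) are illustrative choices, not from the slides.

```python
import numpy as np

def ridge_direct(X, Y, lam):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T Y.
    Forming X^T X costs O(n d^2) time and O(d^2) memory."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)   # d x d matrix
    b = X.T @ Y                     # d-dimensional vector
    return np.linalg.solve(A, b)    # solve the linear system instead of inverting
```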
Review: Gradient Descent (and Conjugate GD)

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

n points, d dimensions; $\lambda_{\max}$, $\lambda_{\min}$ are the extreme eigenvalues of the "design/data matrix".

Computation time to get $\epsilon$ accuracy:
- Gradient Descent (GD): $\frac{\lambda_{\max}}{\lambda_{\min}}\, nd \log(1/\epsilon)$
- Conjugate Gradient Descent: $\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\, nd \log(1/\epsilon)$

Memory: $O(d)$.

Better runtime and memory, but still costly.
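For concreteness, a minimal sketch of batch gradient descent on this objective; the function name, step size eta, and iteration count are our illustrative choices.

```python
import numpy as np

def ridge_gd(X, Y, lam, eta, num_iters):
    """Gradient descent on sum_i (w.x_i - y_i)^2 + lam*||w||^2.
    Each iteration costs O(n d); memory is O(d)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        grad = 2 * X.T @ (X @ w - Y) + 2 * lam * w   # full gradient over all n points
        w = w - eta * grad
    return w
```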
Review: Stochastic Gradient Descent (SGD)

SGD update rule: at each time t, sample a point $(x_i, y_i)$ and update

$$w \leftarrow w - \eta (w \cdot x_i - y_i)\, x_i$$

Problem: even if $w = w^*$, the update changes $w$.

Rate: the convergence rate is $O(1/\epsilon)$, with a decaying step size $\eta$.

A simple algorithm, light on memory, but with a poor convergence rate.
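A minimal sketch of this update, assuming a 1/t step-size decay and omitting the regularizer exactly as the slide does; the function name and decay schedule are illustrative assumptions.

```python
import numpy as np

def lsq_sgd(X, Y, eta0, num_steps, seed=0):
    """SGD with the slide's update: w <- w - eta*(w.x_i - y_i)*x_i.
    A decaying step size (here eta0/t) is what gives the O(1/eps) rate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, num_steps + 1):
        i = rng.integers(n)                    # sample one data point
        eta = eta0 / t                         # decaying step size
        w -= eta * (w @ X[i] - Y[i]) * X[i]    # single-point gradient step
    return w
```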
Regression in the big data regime?

$$\min_w \sum_{i=1}^n (w \cdot x_i - y_i)^2 + \lambda \|w\|^2$$

How much computation time is required to get $\epsilon$ accuracy?

"Big Data Regime": How do you optimize this when n and d are large?

Convex optimization: all results can be generalized to smooth + strongly convex loss functions.
Duality (without Duality)

$$w = (X^\top X + \lambda I)^{-1} X^\top Y = X^\top (X X^\top + \lambda I)^{-1} Y =: \frac{1}{\lambda} X^\top \alpha,$$

where $\alpha = (I + X X^\top / \lambda)^{-1} Y$.

Idea: let's compute the n-dimensional vector $\alpha$, and let's do this with coordinate ascent.
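A quick numerical sanity check of this identity; the problem sizes and the use of np.linalg.solve are arbitrary choices for illustration.

```python
import numpy as np

# Check that the primal and "dual" forms agree:
# (X^T X + lam*I)^{-1} X^T Y  ==  (1/lam) X^T alpha,  alpha = (I + X X^T/lam)^{-1} Y
rng = np.random.default_rng(0)
n, d, lam = 50, 10, 0.5
X, Y = rng.standard_normal((n, d)), rng.standard_normal(n)

w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
alpha = np.linalg.solve(np.eye(n) + X @ X.T / lam, Y)
w_dual = X.T @ alpha / lam

assert np.allclose(w_primal, w_dual)
```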
SDCA: Stochastic Dual Coordinate Ascent

$$G(\alpha_1, \alpha_2, \ldots, \alpha_n) = \|\alpha - Y\|^2 + \frac{1}{\lambda}\, \alpha^\top X X^\top \alpha$$

The minimizer of $G(\alpha)$ is $\alpha = (I + X X^\top / \lambda)^{-1} Y$.

SDCA:
- Start with $\alpha = 0$.
- Choose a coordinate $i$ randomly, and update:
  $$\alpha_i = \arg\min_z G(\alpha_1, \ldots, \alpha_{i-1}, z, \alpha_{i+1}, \ldots, \alpha_n)$$
  This is easy to do, as we touch just one data point.
- Return $w = \frac{1}{\lambda} X^\top \alpha$.
SDCA: the algorithm

$$G(\alpha_1, \alpha_2, \ldots, \alpha_n) = \|\alpha - Y\|^2 + \frac{1}{\lambda}\, \alpha^\top X X^\top \alpha$$

Start with $\alpha = 0$ and $w = \frac{1}{\lambda} X^\top \alpha$. Then repeat:
1. Choose a coordinate $i$ randomly, and compute the difference:
   $$\Delta \alpha_i = \frac{(y_i - w \cdot x_i) - \alpha_i}{1 + \|x_i\|^2 / \lambda}$$
2. Update:
   $$\alpha_i \leftarrow \alpha_i + \Delta \alpha_i, \qquad w \leftarrow w + \frac{1}{\lambda}\, \Delta \alpha_i\, x_i$$
Return $w = \frac{1}{\lambda} X^\top \alpha$.
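A sketch of the SDCA loop for this ridge objective; the epoch-based loop structure and function name are our assumptions, but the coordinate update is the one on the slide.

```python
import numpy as np

def sdca(X, Y, lam, num_epochs, seed=0):
    """SDCA for ridge regression with the update
    delta_i = ((y_i - w.x_i) - alpha_i) / (1 + ||x_i||^2 / lam)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                          # w = X^T alpha / lam, kept in sync
    for _ in range(num_epochs * n):
        i = rng.integers(n)
        delta = (Y[i] - w @ X[i] - alpha[i]) / (1 + X[i] @ X[i] / lam)
        alpha[i] += delta
        w += (delta / lam) * X[i]            # O(d) update, touches one data point
    return w
```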
Guarantees: speedups for the big data regime

n points, d dimensions, $\lambda_{\mathrm{av}}$ the average eigenvalue. Computation time to get $\epsilon$ accuracy. (Shalev-Shwartz & Zhang '12), (Frostig, Ge, K., Sidford, 2015)

GD vs SDCA:
$$\frac{\lambda_{\max}}{\lambda_{\min}}\, nd \log(1/\epsilon) \;\rightarrow\; \left(n + \frac{\lambda_{\mathrm{av}}}{\lambda_{\min}}\right) d \log(1/\epsilon)$$

Conjugate GD vs acceleration + SDCA:
$$\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\, nd \log(1/\epsilon) \;\rightarrow\; \left(n + \sqrt{\frac{n\, \lambda_{\mathrm{av}}}{\lambda_{\min}}}\right) d \log(1/\epsilon)$$

Memory: $O(n + d)$.
(Another idea) Stochastic Variance Reduced Gradient (SVRG)

1. Exact gradient computation: at stage $s$, using $\tilde{w}_s$, compute
   $$\nabla L(\tilde{w}_s) = \frac{1}{n} \sum_{i=1}^n \nabla \mathrm{loss}(\tilde{w}_s, (x_i, y_i))$$
2. Corrected SGD: initialize $w \leftarrow \tilde{w}_s$. For m steps, sample a point $(x, y)$ and update
   $$w \leftarrow w - \eta \left( \nabla \mathrm{loss}(w, (x, y)) - \nabla \mathrm{loss}(\tilde{w}_s, (x, y)) + \nabla L(\tilde{w}_s) \right)$$
3. Update and repeat: $\tilde{w}_{s+1} \leftarrow w$.

Two ideas:
- If $\tilde{w} = w^*$, then there is no update.
- Unbiased updates: the correction term $-\nabla \mathrm{loss}(\tilde{w}_s, (x, y)) + \nabla L(\tilde{w}_s)$ has mean 0.
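A sketch of SVRG for least squares, assuming the averaged objective $L(w) = \frac{1}{n}\sum_i (w \cdot x_i - y_i)^2 + \lambda\|w\|^2$ with the regularizer folded into each per-example gradient; the stage count, inner-loop length m, step size, and function names are our choices.

```python
import numpy as np

def svrg_least_squares(X, Y, lam, eta, num_stages, m, seed=0):
    """SVRG: full gradient at an anchor point w_tilde, then m corrected SGD steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w_tilde = np.zeros(d)

    def grad_i(w, i):
        # gradient of the i-th term (w.x_i - y_i)^2 + lam*||w||^2
        return 2 * (w @ X[i] - Y[i]) * X[i] + 2 * lam * w

    for _ in range(num_stages):
        full_grad = 2 * X.T @ (X @ w_tilde - Y) / n + 2 * lam * w_tilde
        w = w_tilde.copy()
        for _ in range(m):                   # corrected SGD inner loop
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(w_tilde, i) + full_grad
            w -= eta * g
        w_tilde = w                          # update the anchor point and repeat
    return w_tilde
```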
Guarantees of SVRG

n points, d dimensions, $\lambda_{\mathrm{av}}$ the average eigenvalue. Computation time to get $\epsilon$ accuracy. (Johnson & Zhang '13)

GD vs SVRG:
$$\frac{\lambda_{\max}}{\lambda_{\min}}\, nd \log(1/\epsilon) \;\rightarrow\; \left(n + \frac{\lambda_{\mathrm{av}}}{\lambda_{\min}}\right) d \log(1/\epsilon)$$

Conjugate GD vs ??:
$$\sqrt{\frac{\lambda_{\max}}{\lambda_{\min}}}\, nd \log(1/\epsilon) \;\rightarrow\; ??$$

Memory: $O(d)$.
Part 1: Summary

These methods extend to sums of convex functions,
$$\min_w L(w) = \sum_i \mathrm{loss}(h(x_i, w), y_i),$$
for smooth $\mathrm{loss}(\cdot)$ and strongly convex $L(\cdot)$.

Take home: natural stochastic algorithms, similar to SGD, which enjoy "numerical accuracy" guarantees.

Other good ideas: sketching is good in the large-n but "medium-sized"-d regime (Rokhlin and Tygert, 2008); it improves upon conjugate gradient in the big data regime.
Part 2: PCA

We have n vectors $x_1, \ldots, x_n$ in d dimensions and the matrix
$$A = \sum_{i=1}^n x_i x_i^\top.$$

How much computation time do you need to get an $\epsilon$-approximation of the top eigenvector? Constructing and storing the matrix may be costly.

Computation: how do you accurately estimate the leading eigenvector of A, in terms of n, d, the "gap", etc.?

Aside: (with modifications/CCA) this is the simplest way to learn embeddings.
Part 2: outline

Similar to least squares, we provide speedups for eigenvector computations:
- Power method → improvements for large n.
- Lanczos method → improvements for large n.

Key idea: "shift-and-invert" preconditioning, which utilizes the faster least squares algorithms above.
Review: Algebraic Methods

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

n points, d dimensions. Time complexity: $O(nd^2)$; memory: $O(d^2)$. No "gap" dependence.

"Big data regime": What about the large-n regime?
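For reference, a sketch of the direct approach in NumPy: it materializes $A = X^\top X$ and calls a dense symmetric eigensolver, which is exactly what becomes infeasible when d is large. The function name is ours.

```python
import numpy as np

def top_eigvec_direct(X):
    """Direct eigendecomposition of A = sum_i x_i x_i^T = X^T X.
    Forming A costs O(n d^2) time and O(d^2) memory."""
    A = X.T @ X
    eigvals, eigvecs = np.linalg.eigh(A)    # symmetric eigensolver, ascending order
    return eigvecs[:, -1]                   # eigenvector of the largest eigenvalue
```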
Review: The Power Method and Lanczos

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

n points, d dimensions, $\mathrm{gap} = \frac{\lambda_1 - \lambda_2}{\lambda_1}$, and $\mathrm{nnz}(A)$ is the number of nonzeros in $A$.

Computation time to get $\epsilon$ accuracy:
- Power method: $\frac{\mathrm{nnz}(A)}{\mathrm{gap}} \log(1/\epsilon) \approx \frac{nd}{\mathrm{gap}} \log(1/\epsilon)$
- Lanczos method: $\frac{\mathrm{nnz}(A)}{\sqrt{\mathrm{gap}}} \log(1/\epsilon) \approx \frac{nd}{\sqrt{\mathrm{gap}}} \log(1/\epsilon)$

"Big data regime": What about the large-n regime?
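A sketch of the power method that never forms A explicitly: each iteration applies X and then X^T, so the per-iteration cost is O(nd) (or O(nnz(X)) for sparse data). The random initialization and function name are our choices.

```python
import numpy as np

def power_method(X, num_iters, seed=0):
    """Power method for the top eigenvector of A = X^T X, without materializing A."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(num_iters):
        w = X.T @ (X @ w)            # A @ w computed as two matrix-vector products
        w /= np.linalg.norm(w)
    return w
```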
Review: Oja's algorithm and SGD

$$\max_w \frac{w^\top A w}{\|w\|^2}, \qquad A = \sum_{i=1}^n x_i x_i^\top$$

Initialize $w$ randomly and then repeat:
1. For a datapoint $i$ sampled randomly:
   $$w \leftarrow (I + \eta\, x_i x_i^\top)\, w$$
2. Normalize:
   $$w \leftarrow w / \|w\|$$

Computation time to get $\epsilon$ accuracy: $O(1/\epsilon)$. Memory: $O(d)$.
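A sketch of Oja's update as written above, with a random unit-norm initialization and a fixed step size eta (a decaying schedule is also common); the function name and these choices are ours.

```python
import numpy as np

def oja(X, eta, num_steps, seed=0):
    """Oja's algorithm (stochastic power iteration): w <- (I + eta*x_i x_i^T) w, then normalize."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d)            # start from a random direction
    w /= np.linalg.norm(w)
    for _ in range(num_steps):
        i = rng.integers(n)
        w += eta * (X[i] @ w) * X[i]      # (I + eta * x_i x_i^T) w, using one data point
        w /= np.linalg.norm(w)            # keep w on the unit sphere
    return w
```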
PCA in the "Big Data" Regime?

How do you find the top eigenvector of
$$A = \sum_{i=1}^n x_i x_i^\top\,?$$

"Big Data Regime": How do you optimize this when n and d are large?