Stochastic optimization in Hilbert spaces

Stochastic optimization in Hilbert spaces - Aymeric Dieuleveut - PowerPoint PPT presentation

Stochastic optimization in Hilbert spaces. Aymeric Dieuleveut. Outline: Learning vs Statistics.


  1. Tradeoffs of Large scale learning - Learning: Adding an optimization term. When we face large datasets, it may be impractical and unnecessary to optimize the estimator to high accuracy. We therefore question the choice of algorithm from a fixed time-budget point of view. This raises the following questions: up to which precision is it necessary to optimize? What is the limiting factor (time or data points)? A problem is said to be large scale when time is the limiting factor. For a large-scale problem: which algorithm? Does more data mean less work (when time is limiting)? Ref: [Shalev-Shwartz and Srebro, 2008; Shalev-Shwartz and K., 2011; Bottou and Bousquet, 2008]

  7. Tradeoffs of Large scale learning - Learning: Tradeoffs - Large scale learning. Effect of increasing the size of the function class F, the number of samples n, and the optimization tolerance ε on the approximation, estimation and optimization errors (ε_app, ε_est, ε_opt) and on the computation time T:

                 F ր      n ր      ε ր
     ε_app       ց
     ε_est       ր        ց
     ε_opt                         ր
     T           ր        ր        ց

  9. Tradeoffs of Large scale learning - Learning: Different algorithms. To minimize the empirical risk (ERM), several algorithms may be considered: gradient descent; second-order gradient descent; stochastic gradient descent; fast stochastic algorithms (requiring high memory storage). Let's first compare the first-order methods, SGD and GD.

  13. Tradeoffs of Large scale learning - Learning: Stochastic gradient algorithms. Aim: min_f R(f), where we only have access to unbiased estimates of R(f) and ∇R(f).
   1. Start at some f_0.
   2. Iterate: get an unbiased gradient estimate g_k, s.t. E[g_k] = ∇R(f_k), and update f_{k+1} ← f_k − γ_k g_k.
   3. Output f_m, or the average f̄_m := (1/m) Σ_{k=1}^m f_k (averaged SGD).
   Gradient descent: same, but with the "true" gradient.
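
To make the recursion above concrete, here is a minimal sketch of (averaged) SGD given a user-supplied stochastic gradient oracle; the oracle `stochastic_grad`, the 1/√k step size and the quadratic test objective are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def averaged_sgd(stochastic_grad, f0, n_iter, step_size):
    """Run SGD and return both the last iterate f_m and the average f_bar_m.

    stochastic_grad(f, k) must return an unbiased estimate of grad R(f)
    (assumption: the oracle is supplied by the caller)."""
    f = np.array(f0, dtype=float)
    f_bar = f.copy()
    for k in range(1, n_iter + 1):
        g = stochastic_grad(f, k)        # E[g_k] = grad R(f_k)
        f = f - step_size(k) * g         # f_{k+1} <- f_k - gamma_k * g_k
        f_bar += (f - f_bar) / k         # running average of f_1, ..., f_k
    return f, f_bar

# Illustrative use: noisy gradients of the quadratic risk R(f) = 0.5 * ||f||^2.
rng = np.random.default_rng(0)
oracle = lambda f, k: f + rng.normal(scale=0.1, size=f.shape)
f_last, f_avg = averaged_sgd(oracle, np.ones(5), n_iter=1000,
                             step_size=lambda k: 1.0 / np.sqrt(k))
```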

  16. Tradeoffs of Large scale learning - Learning: ERM.
   SGD in ERM: min_{f ∈ F} R_n(f). Pick any (x_i, y_i) from the empirical sample, g_k = ∇_f ℓ(f_k, (x_i, y_i)), f_{k+1} ← f_k − γ_k g_k, output f̄_m. Then R_n(f̄_m) − R_n(f*_n) ≲ O(1/√m) and sup_{f ∈ F} |R − R_n|(f) ≲ O(1/√n). Cost of one iteration: O(d).
   GD in ERM: min_{f ∈ F} R_n(f). g_k = ∇_f Σ_{i=1}^n ℓ(f_k, (x_i, y_i)), i.e. the full empirical-risk gradient, f_{k+1} ← f_k − γ_k g_k, output f_m. Then R_n(f_m) − R_n(f*_n) ≲ O((1 − κ)^m) and sup_{f ∈ F} |R − R_n|(f) ≲ O(1/√n). Cost of one iteration: O(nd).
   Altogether, for SGD: R(f̄_m) − R(f*) ≲ O(1/√m) + O(1/√n), with step size γ_k proportional to 1/√k.

  19. Tradeoffs of Large scale learning - Learning: Conclusion. In the large-scale setting, it is beneficial to use SGD! Does more data help? With the global estimation error fixed, the previous bound gives 1/√m ≃ R(f_m) − R(f*) − 1/√n, so it seems that the required number of iterations, and hence the time T, is decreasing with n (more data, less work). Upper-bounding R_n − R uniformly is dangerous: indeed, we also have to compare with one-pass SGD, which minimizes the true risk R.

  23. Tradeoffs of Large scale learning - Learning: Expectation minimization. Stochastic gradient descent may also be used to minimize R(f) directly:
   SGD in ERM: min_{f ∈ F} R_n(f). Pick any (x_i, y_i) from the empirical sample, g_k = ∇_f ℓ(f_k, (x_i, y_i)), f_{k+1} ← f_k − γ_k g_k, output f̄_m. Then R_n(f̄_m) − R_n(f*_n) ≲ O(1/√m) and sup_{f ∈ F} |R − R_n|(f) ≲ O(1/√n). Cost of one iteration: O(d).
   SGD, one pass: min_{f ∈ F} R(f). Pick an independent fresh (x, y), g_k = ∇_f ℓ(f_k, (x, y)), f_{k+1} ← f_k − γ_k g_k, output f̄_k, k ≤ n. Then R(f̄_k) − R(f*) ≲ O(1/√k). Cost of one iteration: O(d).
   SGD with one pass (early stopping as a regularization) achieves a nearly optimal bias-variance tradeoff with low complexity.
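
A short sketch of the only difference between the two columns, namely how data is accessed: the ERM variant resamples with replacement from a fixed training set, while the one-pass variant uses each fresh observation exactly once. The helper names (`grad`, `gamma`, `stream`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_erm(data, grad, f0, gamma, n_iter):
    """SGD on the empirical risk R_n: resample (x_i, y_i) with replacement."""
    f = np.array(f0, dtype=float)
    for k in range(1, n_iter + 1):
        x, y = data[rng.integers(len(data))]
        f -= gamma(k) * grad(f, x, y)
    return f

def sgd_one_pass(stream, grad, f0, gamma):
    """One-pass SGD on the true risk R: each fresh sample is used once,
    so every gradient is an unbiased estimate of grad R(f_k)."""
    f = np.array(f0, dtype=float)
    for k, (x, y) in enumerate(stream, start=1):
        f -= gamma(k) * grad(f, x, y)
    return f
```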

  24. Tradeoffs of Large scale learning - Learning: Rate of convergence. We are interested in prediction. Strongly convex objective: 1/(μn). Non-strongly convex: 1/√n.

  25. A case study - Finite-dimension linear least-mean-squares: LMS [Bach and Moulines, 2013]. We now consider the simple case where X = R^d and the loss ℓ is quadratic. We are interested in linear predictors: min_{θ ∈ R^d} E_P[(θ^T x − y)^2]. We assume that the data points are generated according to y_i = θ_*^T x_i + ε_i, and consider the stochastic gradient algorithm θ_0 = 0, θ_{n+1} = θ_n − γ_n (⟨x_n, θ_n⟩ x_n − y_n x_n). This system may be rewritten: θ_{n+1} − θ_* = (I − γ_n x_n x_n^T)(θ_n − θ_*) − γ_n ξ_n.  (1)
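
A minimal sketch of this recursion with Polyak-Ruppert averaging on synthetic data; the constant step size proportional to 1/E‖x‖² is a typical choice in this line of work, and both it and the synthetic model below are assumptions for illustration.

```python
import numpy as np

def averaged_lms(X, Y, gamma):
    """Averaged SGD for least squares (LMS recursion from the slide):
    theta_{n+1} = theta_n - gamma * (<x_n, theta_n> - y_n) * x_n,
    returning the Polyak-Ruppert average theta_bar as the predictor."""
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for i in range(n):
        x, y = X[i], Y[i]
        theta -= gamma * (x @ theta - y) * x
        theta_bar += (theta - theta_bar) / (i + 1)
    return theta_bar

# Synthetic check (illustrative): y = theta_*^T x + noise.
rng = np.random.default_rng(0)
d, n = 5, 10_000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_star + 0.1 * rng.normal(size=n)
R2 = np.mean(np.sum(X ** 2, axis=1))                   # estimate of E||x||^2
theta_hat = averaged_lms(X, Y, gamma=1.0 / (4 * R2))   # constant step ~ 1/R^2 (assumption)
```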

  29. A case study - Finite-dimension linear least-mean-squares: Rate of convergence, back again! We are interested in prediction. Strongly convex objective: 1/(μn). Non-strongly convex: 1/√n. We define H = E[xx^T]; we have μ = min Sp(H). For least squares, the statistical rate of the ordinary least-squares estimator is σ²d/n: there is still a gap to be bridged!

  32. A case study - Finite-dimension linear least-mean-squares: A few assumptions. We define H = E[xx^T] and C = E[ξξ^T]. Bounded noise variance: we assume C ≼ σ²H. Covariance operator: no assumption on the minimal eigenvalue; E[‖x‖²] ≤ R².

  33. A case study - Finite-dimension linear least-mean-squares: Result. Theorem: E[R(θ̄_n) − R(θ_*)] ≤ (4/n)(σ²d + R²‖θ_0 − θ_*‖²). Optimal statistical rate 1/n, without strong convexity.

  37. Non parametric learning - Outline. What if d ≫ n? Non-parametric regression in RKHS, an interesting problem in itself; carry the analysis over to a Hilbert space using reproducing kernel Hilbert spaces. Then: behaviour in finite dimension; optimal statistical rates in RKHS; adaptivity, tradeoffs; choice of γ.

  40. Non parametric learning - Reproducing kernel Hilbert space [Dieuleveut and Bach, 2014]. We denote by H_K a Hilbert space of functions, H_K ⊂ R^X, characterized by the kernel function K : X × X → R: for any x, the function K_x : X → R defined by K_x(x') = K(x, x') is in H_K. Reproducing property: for all g ∈ H_K and x ∈ X, g(x) = ⟨g, K_x⟩_K. Two usages: (α) a hypothesis space for regression; (β) mapping data points into a linear space.

  42. Non parametric learning - (α) A hypothesis space for regression. Classical regression setting: (X_i, Y_i) ∼ ρ i.i.d., (X_i, Y_i) ∈ X × R. Goal: minimizing the prediction error min_{g ∈ L²} E[(g(X) − Y)²]. We look for an estimator ĝ_n of g_ρ(X) = E[Y | X], with g_ρ ∈ L²_{ρ_X}, where L²_{ρ_X} = { f : X → R such that ∫ f²(t) dρ_X(t) < ∞ }.

  43. Non parametric learning - (β) Mapping data points into a linear space. Linear regression on data mapped into some RKHS: arg min_{θ ∈ H} ‖Y − Xθ‖².

  48. Non parametric learning - Two approaches to the regression problem. Link: in general H_K ⊂ L²_{ρ_X}, and in some cases the closure of the RKHS for the ‖·‖_{L²_{ρ_X}} norm is all of L²_{ρ_X}. We then look for an estimator of the regression function in the RKHS: the general regression problem (g_ρ ∈ L²) on one side, the linear regression problem in the RKHS on the other, looking for an estimator for the first problem using natural algorithms for the second one.

  49. Non parametric learning - Outline (recap). What if d ≫ n? Non-parametric regression in RKHS, an interesting problem in itself; carry the analysis over to a Hilbert space using reproducing kernel Hilbert spaces.

  50. Non parametric learning - SGD algorithm in the RKHS. Start from g_0 ∈ H_K (we often consider g_0 = 0) and write g_n = Σ_{i=1}^n a_i K_{x_i}  (2), with (a_n)_n such that a_n := −γ_n (g_{n−1}(x_n) − y_n) = −γ_n (Σ_{i=1}^{n−1} a_i K(x_n, x_i) − y_n). Indeed, g_n = g_{n−1} − γ_n (g_{n−1}(x_n) − y_n) K_{x_n} = Σ_{i=1}^n a_i K_{x_i} with a_n defined as above, and (g_{n−1}(x_n) − y_n) K_{x_n} is an unbiased estimate of the gradient of E[(⟨K_x, g_{n−1}⟩ − y)²]. The SGD algorithm in the RKHS thus takes a very simple form.
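
A minimal sketch of this recursion with a Gaussian kernel; the kernel choice, the constant step size and the synthetic data are assumptions for illustration, not details given on the slide. The coefficients a_i are stored so that g_n(x) = Σ_i a_i K(x_i, x).

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 * bandwidth^2)) -- illustrative choice."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * bandwidth ** 2))

def kernel_sgd(X, Y, gamma, kernel=gaussian_kernel):
    """One-pass SGD in the RKHS: g_n = sum_i a_i K_{x_i}, with
    a_n = -gamma_n * (g_{n-1}(x_n) - y_n). Note the O(n^2) total cost:
    iteration t needs t kernel evaluations."""
    n = len(X)
    a = np.zeros(n)
    for t in range(n):
        # Evaluate g_{t-1}(x_t) = sum_{i<t} a_i K(x_t, x_i).
        g_prev = sum(a[i] * kernel(X[t], X[i]) for i in range(t))
        a[t] = -gamma(t + 1) * (g_prev - Y[t])
    # Return the predictor x -> g_n(x); an averaged predictor could be built similarly.
    return lambda x: sum(a[i] * kernel(x, X[i]) for i in range(n))

# Illustrative usage on a 1-D nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)
g_hat = kernel_sgd(X, Y, gamma=lambda t: 0.5)   # constant step (assumption)
print(g_hat(np.array([0.3])))
```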

  51. Non parametric learning - Assumptions. Two important points characterize the difficulty of the problem: the regularity of the objective function, and the spectrum of the covariance operator.

  53. Non parametric learning - Covariance operator. We have Σ = E[K_x ⊗ K_x], where K_x ⊗ K_x : g ↦ ⟨K_x, g⟩ K_x = g(x) K_x. The covariance operator is a self-adjoint operator which contains information on the distribution of K_x. Assumptions: tr(Σ^α) < ∞ for some α ∈ [0, 1]; and on g_ρ: g_ρ ∈ Σ^r(L²_{ρ_X}) with r ≥ 0.

  54. Non parametric learning - Interpretation. The first assumption controls the decrease of the eigenvalues; the second places g_ρ in an ellipsoid class of functions (we do not assume g_ρ ∈ H_K).

  57. Non parametric learning - Result. Theorem: under a few hidden assumptions,
   E[R(ḡ_n) − R(g_ρ)] ≲ O(σ² tr(Σ^α) γ^α / n^{1−α}) + O(‖Σ^{−r} g_ρ‖² / (nγ)^{2(r∧1)}).
   This is a bias-variance decomposition; O(·) hides a known constant (4 or 8); this is a finite-horizon result, but it extends to the online setting; note the saturation of the bias exponent at r = 1.

  58. Non parametric learning - Corollary. Assume A1-8. If (1 − α)/2 < r < (2 − α)/2, then with γ = n^{−(2r+α−1)/(2r+α)} we get the optimal rate E[R(ḡ_n) − R(g_ρ)] = O(n^{−2r/(2r+α)}).  (3)
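
As a quick sanity check (not on the slide), the chosen step size indeed balances the two terms of the theorem; the particular values α = 1/2, r = 1/2 below are only an illustration.

```latex
\[
\gamma = n^{-\frac{2r+\alpha-1}{2r+\alpha}}
\quad\Longrightarrow\quad
\frac{\gamma^{\alpha}}{n^{1-\alpha}} = n^{-\frac{2r}{2r+\alpha}}
\quad\text{and}\quad
\frac{1}{(n\gamma)^{2r}} = n^{-\frac{2r}{2r+\alpha}} .
\]
% Example: $\alpha = 1/2$, $r = 1/2$ gives $\gamma = n^{-1/3}$ and the rate $n^{-2/3}$.
```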

  62. Non parametric learning - Conclusion 1: optimal statistical rates in RKHS; choice of γ. We get the statistically optimal rate of convergence for learning in an RKHS with one-pass SGD. We get insights on how to choose the kernel and the step size. We compare favourably to [Ying and Pontil, 2008; Caponnetto and De Vito, 2007; Tarrès and Yao, 2011].

  65. Non parametric learning - Conclusion 2: behaviour in finite dimension; adaptivity, tradeoffs. The theorem can be rewritten
   E[R(θ̄_n) − R(θ_*)] ≲ O(σ² tr(Σ^α) γ^α / n^{1−α}) + O(θ_*^T Σ^{1−2r} θ_* / (nγ)^{2(r∧1)}),  (4)
   where the ellipsoid condition appears more clearly. Thus SGD is adaptive to the regularity of the problem; it bridges the gap between the different regimes and explains the behaviour when d ≫ n.

  66. The complexity challenge, approximation of the kernel - Outline: 1. Tradeoffs of large-scale learning; 2. A case study: finite-dimension linear least-mean-squares; 3. Non-parametric learning; 4. The complexity challenge, approximation of the kernel.

  69. The complexity challenge, approximation of the kernel - Reducing complexity: sampling methods. However, the complexity of such a method remains quadratic in the number of examples: iteration number n costs n kernel evaluations.

                            Rate       Complexity
     Finite dimension       d/n        O(dn)
     Infinite dimension     d_n/n      O(n^2)

  70. The complexity challenge, approximation of the kernel - Two related methods: approximate the kernel matrix, or approximate the kernel. Results from [Bach, 2012]; such results have been extended by [Alaoui and Mahoney, 2014; Rudi et al., 2015]. There also exist results for the second situation [Rahimi and Recht, 2008; Dai et al., 2014].

  72. The complexity challenge, approximation of the kernel - Sharp analysis. We only consider a fixed-design setting. We then have to approximate the kernel matrix: instead of computing the whole matrix, we randomly pick a number d_n of columns. We then still get the same estimation errors, leading to:

                            Rate       Complexity
     Finite dimension       d/n        O(dn)
     Infinite dimension     d_n/n      O(n d_n^2)
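
A minimal sketch of the column-sampling idea (a Nyström-type approximation) for a fixed kernel matrix; the uniform sampling and the pseudo-inverse-square-root reconstruction are standard choices assumed here, not details given on the slide.

```python
import numpy as np

def nystrom_features(K, d_n, rng=np.random.default_rng(0)):
    """Column sampling: pick d_n columns of the n x n kernel matrix and build
    features Phi (n x d_n) such that Phi @ Phi.T approximates K.
    Downstream solvers then work in dimension d_n, at cost O(n * d_n^2)."""
    n = K.shape[0]
    idx = rng.choice(n, size=d_n, replace=False)   # uniform column sampling (assumption)
    C = K[:, idx]                                  # n x d_n block of sampled columns
    W = K[np.ix_(idx, idx)]                        # d_n x d_n block
    # W^{-1/2} via eigendecomposition (W is symmetric PSD up to numerical error).
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-12, None)
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return C @ W_inv_sqrt                          # Phi, with Phi @ Phi.T ~ K
```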

  74. The complexity challenge, approximation of the kernel - Random feature selection. Many kernels may be represented, thanks to Bochner's theorem, as K(x, y) = ∫_W φ(w, x) φ(w, y) dμ(w) (think of translation-invariant kernels and the Fourier transform). We thus consider the low-rank approximation K̃(x, y) = (1/d) Σ_{i=1}^d φ(x, w_i) φ(y, w_i), where w_i ∼ μ. We use this approximation of the kernel in SGD.
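
For concreteness, here is the standard random Fourier feature construction of [Rahimi and Recht, 2008] for the Gaussian kernel; the cosine feature form, bandwidth and sanity check below are illustrative assumptions, not taken from the slide.

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth=1.0, rng=np.random.default_rng(0)):
    """Random Fourier features for the Gaussian kernel
    K(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)):
    phi(w, x) = sqrt(2) * cos(w . x + b), with w ~ N(0, I / bandwidth^2), b ~ U[0, 2*pi].
    Then (1 / n_features) * Phi(x) . Phi(y) approximates K(x, y)."""
    n, d = X.shape
    W = rng.normal(scale=1.0 / bandwidth, size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0) * np.cos(X @ W + b)

# Sanity check (illustrative): compare the approximation with the exact kernel value.
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 3))
Phi = random_fourier_features(X, n_features=5000)
approx = Phi[0] @ Phi[1] / Phi.shape[1]
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / 2.0)
print(approx, exact)
```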

  75. The complexity challenge, approximation of the kernel - Directions. What I am working on at the moment: random feature selection; tuning the sampling to improve the accuracy of the approximation; acceleration + stochasticity (with Nicolas Flammarion).

  76. The complexity challenge, approximation of the kernel - Some references I.
   Alaoui, A. E. and Mahoney, M. W. (2014). Fast randomized kernel methods with statistical guarantees. CoRR, abs/1411.0306.
   Bach, F. (2012). Sharp analysis of low-rank kernel matrix approximations. ArXiv e-prints.
   Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv e-prints.
   Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20.
   Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331-368.
   Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27.
   Dieuleveut, A. and Bach, F. (2014). Non-parametric stochastic approximation with large step sizes. ArXiv e-prints.
