Conditional gradient algorithms for machine learning



1. Conditional gradient algorithms for machine learning. Zaid Harchaoui, LEAR and LJK, INRIA. Joint work with A. Juditsky (Grenoble U., France) and A. Nemirovski (Georgia Tech), and with Matthijs Douze, Miro Dudik, Jerome Malick, and Mattis Paulin. Gargantua day, Grenoble, Nov. 26th, 2013.

2. The advent of large-scale datasets and "big learning". From "The Promise and Perils of Benchmark Datasets and Challenges", D. Forsyth, A. Efros, F.-F. Li, A. Torralba and A. Zisserman, talk at "Frontiers of Computer Vision".

3. Large-scale supervised learning. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathcal{Y}$ be i.i.d. labelled training data, and let $R_{\mathrm{emp}}(\cdot)$ denote the empirical risk of any $W \in \mathbb{R}^{d \times k}$.
Constrained formulation: minimize $R_{\mathrm{emp}}(W)$ subject to $\Omega(W) \le \rho$.
Penalized formulation: minimize $\lambda\,\Omega(W) + R_{\mathrm{emp}}(W)$.
Problem: minimize such objectives in the large-scale setting, where #examples $\gg 1$, #features $\gg 1$, #classes $\gg 1$.

4. Large-scale supervised learning. Same setting and formulations as the previous slide, now writing the large-scale regime in matrix notation: $n \gg 1$ (examples), $d \gg 1$ (features), $k \gg 1$ (classes).

5. Machine learning cuboid. [Figure: the data viewed as a cuboid with axes $n$ (examples), $d$ (features), and $k$ (classes).]

6. Motivating example: multi-class classification with trace-norm penalty.
Motivating the trace-norm penalty. Embedding assumption: classes may be embedded in a low-dimensional subspace of the feature space. Computational efficiency: training-time and test-time efficiency require sparse matrix regularizers.
Trace-norm. The trace-norm, a.k.a. nuclear norm, is defined as
$$\|\sigma(W)\|_1 = \sum_{p=1}^{\min(d,k)} \sigma_p(W),$$
where $\sigma_1(W), \ldots, \sigma_{\min(d,k)}(W)$ denote the singular values of $W$.
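As a quick numerical illustration (not part of the slides), the trace-norm can be computed by summing the singular values of $W$; a minimal NumPy sketch, with an arbitrary toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))          # a small d x k weight matrix

# Trace-norm (nuclear norm) of W = sum of its singular values.
sigma = np.linalg.svd(W, compute_uv=False)
trace_norm = sigma.sum()

# Sanity check against NumPy's built-in nuclear norm.
assert np.isclose(trace_norm, np.linalg.norm(W, ord="nuc"))
print(trace_norm)
```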

7. Large-scale supervised learning: multi-class classification with trace-norm regularization. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathcal{Y}$ be i.i.d. labelled training data, and let $R_{\mathrm{emp}}(\cdot)$ denote the empirical risk of any $W \in \mathbb{R}^{d \times k}$.
Constrained formulation: minimize $R_{\mathrm{emp}}(W)$ subject to $\|\sigma(W)\|_1 \le \rho$.
Penalized formulation: minimize $\lambda\,\|\sigma(W)\|_1 + R_{\mathrm{emp}}(W)$.
Trace-norm regularization penalty (Amit et al., 2007; Argyriou et al., 2007): enforces a low-rank structure on $W$ (sparsity of the spectrum $\sigma(W)$); both formulations are convex problems.

8. About the different formulations.
"Alleged" equivalence: for a particular set of examples and for any value $\rho$ of the constraint in the constrained formulation, there exists a value of $\lambda$ in the penalized formulation such that the solutions of the constrained and the penalized formulations coincide.
Statistical learning theory: theoretical results on penalized estimators and constrained estimators are of a different nature, so no rigorous comparison is possible; the equivalence is frequently called to the rescue, depending on the theoretical tools available, to jump from one formulation to the other.

9. Summary. In practice, recall that the "hyperparameters" ($\lambda, \rho, \varepsilon, \ldots$) will eventually have to be tuned; choose the formulation into which you can most easily incorporate prior knowledge.
Constrained formulation I: $\displaystyle\min_{W \in \mathbb{R}^{d \times k}} \Bigl\{ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_i \;:\; \|\sigma(W)\|_1 \le \rho \Bigr\}$.
Penalized formulation: $\displaystyle\min_{W \in \mathbb{R}^{d \times k}} \Bigl\{ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_i + \lambda \|\sigma(W)\|_1 \Bigr\}$.
Constrained formulation II: $\displaystyle\min_{W \in \mathbb{R}^{d \times k}} \Bigl\{ \lambda \|\sigma(W)\|_1 \;:\; \Bigl|\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_i - R^{\mathrm{target}}_{\mathrm{emp}}\Bigr| \le \varepsilon \Bigr\}$.

10. Learning with trace-norm penalty: a convex problem.
Supervised learning with trace-norm regularization penalty. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathcal{Y}$ be i.i.d. labelled training data, with $\mathcal{Y} = \{0,1\}^k$ for multi-class classification.
Penalized formulation (convex): $\displaystyle\min_{W \in \mathbb{R}^{d \times k}} \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_i + \lambda \|\sigma(W)\|_1$.
Trace-norm regularization penalty (Amit et al., 2007; Argyriou et al., 2007): enforces a low-rank structure on $W$ (sparsity of the spectrum $\sigma(W)$); convex, but non-differentiable.

11. Possible approaches.
Generic approaches: a "blind" approach (subgradient, bundle methods) suffers from a slow convergence rate; other approaches (alternating optimization, iteratively reweighted least-squares, etc.) come with no finite-time convergence guarantees.

12. Learning with trace-norm penalty: convex but non-smooth.
Supervised learning with trace-norm regularization penalty. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathcal{Y}$ be i.i.d. labelled training data, with $\mathcal{Y} = \{0,1\}^k$ for multi-class classification.
$$\min_{W \in \mathbb{R}^{d \times k}} \ \underbrace{\lambda \|\sigma(W)\|_1}_{\text{nonsmooth}} + \underbrace{\frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_i}_{\text{smooth}},$$
where $\mathrm{Loss}_i$ is e.g. the multinomial logistic loss of the $i$-th example:
$$\mathrm{Loss}_i = \log\Bigl(1 + \sum_{\ell \in \mathcal{Y} \setminus \{y_i\}} \exp\bigl(w_\ell^\top x_i - w_{y_i}^\top x_i\bigr)\Bigr).$$
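A minimal sketch (not from the slides) of how this loss can be evaluated for one example; the function name and the toy data are illustrative, and the computation uses a log-sum-exp shift for numerical stability:

```python
import numpy as np

def multinomial_logistic_loss(W, x, y):
    """Multinomial logistic loss of one example (x, y), with W of shape (d, k).

    Loss = log(1 + sum_{l != y} exp(w_l^T x - w_y^T x)),
    computed stably via a log-sum-exp over all class scores.
    """
    scores = W.T @ x                      # shape (k,): w_l^T x for each class l
    shifted = scores - scores.max()       # stabilize the exponentials
    # log sum_l exp(s_l) - s_y  ==  log(1 + sum_{l != y} exp(s_l - s_y))
    return np.log(np.exp(shifted).sum()) - shifted[y]

# Hypothetical toy example
rng = np.random.default_rng(0)
d, k = 4, 3
W = rng.standard_normal((d, k))
x, y = rng.standard_normal(d), 1
print(multinomial_logistic_loss(W, x, y))
```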

13. Learning with trace-norm penalty: a convex problem.
Supervised learning with trace-norm regularization penalty. Let $(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^d \times \mathcal{Y}$ be i.i.d. labelled training data, with $\mathcal{Y} = \{0,1\}^k$ for multi-class classification.
Penalized formulation: $\displaystyle\min_{W \in \mathbb{R}^{d \times k}} \lambda \|\sigma(W)\|_1 + \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_i$.

14. Composite minimization for the penalized formulation.
Strengths of composite minimization (a.k.a. proximal gradient): attractive algorithms when the proximal operator is cheap, as e.g. for the vector $\ell_1$-norm; medium-accuracy solutions with finite-time accuracy guarantees.

15. Proximal gradient.
Algorithm. Initialize $W_0 = 0$. Iterate:
$$W_{t+1} = \mathrm{Prox}_{(\lambda/L)\,\Omega(\cdot)}\Bigl(W_t - \frac{1}{L}\nabla R_{\mathrm{emp}}(W_t)\Bigr),$$
with
$$\mathrm{Prox}_{(\lambda/L)\,\Omega(\cdot)}(U) := \arg\min_{W} \ \frac{1}{2}\|U - W\|^2 + \frac{\lambda}{L}\,\Omega(W).$$
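A minimal proximal-gradient sketch (not from the slides), assuming a function `grad_Remp` returning the gradient of the smooth risk and a Lipschitz constant `L` are available; for the trace-norm the prox step is the singular value thresholding discussed on the next slide, and the helper names are illustrative:

```python
import numpy as np

def svt(U, tau):
    """Prox of tau * trace-norm: soft-threshold the singular values of U."""
    Uu, s, Vt = np.linalg.svd(U, full_matrices=False)
    return (Uu * np.maximum(s - tau, 0.0)) @ Vt

def proximal_gradient(grad_Remp, lam, L, shape, n_iters=100):
    """Minimal proximal-gradient loop for  lam * ||sigma(W)||_1 + R_emp(W)."""
    W = np.zeros(shape)
    for _ in range(n_iters):
        # Gradient step on the smooth part, then prox step on the trace-norm.
        W = svt(W - grad_Remp(W) / L, lam / L)
    return W
```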

16. Composite minimization for the penalized formulation.
Strengths of composite minimization (a.k.a. proximal gradient): attractive algorithms when the proximal operator is cheap, as e.g. for the vector $\ell_1$-norm; medium-accuracy solutions with finite-time accuracy guarantees.
Weaknesses of composite minimization: inappropriate when the proximal operator is expensive to compute; too sensitive to the conditioning of the design matrix (correlated features).
Situation with the trace-norm, i.e. $\mathrm{Prox}_{\mu\,\Omega(\cdot)}(\cdot)$ with $\Omega(\cdot) = \|\cdot\|_{\sigma,1}$: the proximal operator corresponds to singular value thresholding, requiring an SVD that runs in $O(k\,\mathrm{rk}(W)^2)$ time, which is impractical for large-scale problems.

17. Alternative approach: conditional gradient.
We want an algorithm with no SVD, i.e. without any projection or proximal step. Let us get some inspiration from the constrained setting.
Problem: $\displaystyle\min_{W \in \mathbb{R}^{d \times k}} \Bigl\{ \frac{1}{n}\sum_{i=1}^{n} \mathrm{Loss}_i \;:\; W \in \rho \cdot \mathrm{convex\ hull}(\{M_t\}_{t \ge 1}) \Bigr\}$.
Gauge/atomic decomposition of the trace-norm:
$$\|\sigma(W)\|_1 = \inf_{\theta} \Bigl\{ \sum_{i=1}^{N} \theta_i \;\Big|\; \exists N,\ \theta_i > 0,\ M_i \in \mathcal{M} \text{ with } W = \sum_{i=1}^{N} \theta_i M_i \Bigr\},$$
$$\mathcal{M} = \bigl\{\, u v^\top \mid u \in \mathbb{R}^d,\ v \in \mathbb{R}^{\mathcal{Y}},\ \|u\|_2 = \|v\|_2 = 1 \,\bigr\}.$$
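A small numerical illustration (not from the slides): the SVD $W = \sum_p \sigma_p\, u_p v_p^\top$ is one feasible atomic decomposition, with rank-one atoms in $\mathcal{M}$ and weights summing exactly to the trace-norm. The toy matrix below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))

# SVD gives an atomic decomposition W = sum_p sigma_p * u_p v_p^T,
# whose weights sigma_p sum to the trace-norm of W.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_rebuilt = sum(s[p] * np.outer(U[:, p], Vt[p]) for p in range(len(s)))

assert np.allclose(W, W_rebuilt)
assert np.isclose(s.sum(), np.linalg.norm(W, ord="nuc"))
```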

18. Conditional gradient descent.
Algorithm. Initialize $W_0 = 0$. Iterate:
Find $M_t \in \rho \cdot \mathrm{convex\ hull}(\mathcal{M})$ such that
$$M_t = \operatorname*{Arg\,max}_{M_\ell \in \mathcal{M}} \ \langle M_\ell,\, -\nabla R_{\mathrm{emp}}(W_t) \rangle \qquad \text{(linear minimization oracle)};$$
then perform a line-search between $W_t$ and $M_t$:
$$W_{t+1} = (1 - \delta)\,W_t + \delta\,M_t.$$
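A minimal sketch (not from the slides) of the generic loop, assuming a gradient function `grad_Remp` and a linear minimization oracle `lmo` are supplied; for simplicity it uses the classical step size $\delta_t = 2/(t+2)$ instead of the line search on the slide:

```python
import numpy as np

def conditional_gradient(grad_Remp, lmo, W0, n_iters=100):
    """Minimal conditional-gradient (Frank-Wolfe) loop.

    grad_Remp: gradient of the smooth empirical risk,
    lmo: returns an element of rho * conv(M) maximizing <M, -grad>.
    """
    W = W0.copy()
    for t in range(n_iters):
        M = lmo(-grad_Remp(W))            # linear minimization oracle
        delta = 2.0 / (t + 2.0)           # fixed schedule in place of line search
        W = (1.0 - delta) * W + delta * M
    return W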

19. Conditional gradient descent: example with trace-norm constraint.
Algorithm. Initialize $W_0 = 0$. Iterate:
Find $M_t \in \rho \cdot \mathrm{convex\ hull}(\mathcal{M})$ such that
$$M_t = \operatorname*{Arg\,max}_{\|u\|_2 = \|v\|_2 = 1} \ \langle u v^\top,\, -\nabla R_{\mathrm{emp}}(W_t) \rangle = \operatorname*{Arg\,max}_{\|u\|_2 = \|v\|_2 = 1} \ u^\top \bigl(-\nabla R_{\mathrm{emp}}(W_t)\bigr) v,$$
i.e. compute the top pair of singular vectors of $-\nabla R_{\mathrm{emp}}(W_t)$;
then perform a line-search between $W_t$ and $M_t$:
$$W_{t+1} = (1 - \delta)\,W_t + \delta\,M_t.$$
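A possible implementation sketch of this linear minimization oracle (not from the slides): only the leading singular pair of the negative gradient is needed, which a sparse SVD routine can compute without a full decomposition. The function name and toy gradient are illustrative:

```python
import numpy as np
from scipy.sparse.linalg import svds

def trace_norm_lmo(G, rho):
    """Linear minimization oracle over the trace-norm ball of radius rho.

    Returns rho * u1 v1^T, where (u1, v1) is the top singular-vector pair of
    G = -grad R_emp(W_t); only the leading pair is computed, not a full SVD.
    """
    u, s, vt = svds(G, k=1)                   # leading singular triplet of G
    return rho * np.outer(u[:, 0], vt[0])

# Hypothetical usage: one atom for a toy gradient matrix
rng = np.random.default_rng(0)
G = rng.standard_normal((100, 20))            # stands in for -grad R_emp(W_t)
M_t = trace_norm_lmo(G, rho=10.0)
```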
