Complexity results in convex optimization for ML

• Assumption: f convex on ℝ^d
• Classical generic algorithms
  – (sub)gradient method/descent
  – Accelerated gradient descent
  – Newton method
• Key additional properties of f
  – Lipschitz continuity, smoothness or strong convexity
• Key insight from Bottou and Bousquet (2008)
  – In machine learning, no need to optimize below estimation error
• Key reference: Nesterov (2004)
Lipschitz continuity

• Bounded gradients of f (Lipschitz continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:
  ∀ θ ∈ ℝ^d, ‖θ‖₂ ≤ D ⇒ ‖f′(θ)‖₂ ≤ B
• Machine learning
  – with f(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, θ⊤Φ(x_i))
  – G-Lipschitz loss and R-bounded data: B = GR
Smoothness and strong convexity

• A function f: ℝ^d → ℝ is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous:
  ∀ θ₁, θ₂ ∈ ℝ^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂
• If f is twice differentiable: ∀ θ ∈ ℝ^d, f′′(θ) ≼ L · Id

[Figure: a smooth vs. a non-smooth function]
Smoothness and strong convexity

• A function f: ℝ^d → ℝ is L-smooth if and only if it is differentiable and its gradient is L-Lipschitz-continuous:
  ∀ θ₁, θ₂ ∈ ℝ^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂
• If f is twice differentiable: ∀ θ ∈ ℝ^d, f′′(θ) ≼ L · Id
• Machine learning
  – with f(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, θ⊤Φ(x_i))
  – Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤
  – ℓ-smooth loss and R-bounded data: L = ℓ R²
Smoothness and strong convexity

• A function f: ℝ^d → ℝ is µ-strongly convex if and only if
  ∀ θ₁, θ₂ ∈ ℝ^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2) ‖θ₁ − θ₂‖₂²
• If f is twice differentiable: ∀ θ ∈ ℝ^d, f′′(θ) ≽ µ · Id

[Figure: a strongly convex vs. a merely convex function]
Smoothness and strong convexity

• A function f: ℝ^d → ℝ is µ-strongly convex if and only if
  ∀ θ₁, θ₂ ∈ ℝ^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2) ‖θ₁ − θ₂‖₂²
• If f is twice differentiable: ∀ θ ∈ ℝ^d, f′′(θ) ≽ µ · Id
• Machine learning
  – with f(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, θ⊤Φ(x_i))
  – Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤
  – Data with invertible covariance matrix (low correlation/dimension)
Smoothness and strong convexity

• A function f: ℝ^d → ℝ is µ-strongly convex if and only if
  ∀ θ₁, θ₂ ∈ ℝ^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2) ‖θ₁ − θ₂‖₂²
• If f is twice differentiable: ∀ θ ∈ ℝ^d, f′′(θ) ≽ µ · Id
• Machine learning
  – with f(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, θ⊤Φ(x_i))
  – Hessian ≈ covariance matrix (1/n) Σ_{i=1}^n Φ(x_i)Φ(x_i)⊤
  – Data with invertible covariance matrix (low correlation/dimension)
• Adding regularization by (µ/2) ‖θ‖₂²
  – creates additional bias unless µ is small
Summary of smoothness/convexity assumptions

• Bounded gradients of f (Lipschitz continuity): the function f is convex, differentiable and has (sub)gradients uniformly bounded by B on the ball of center 0 and radius D:
  ∀ θ ∈ ℝ^d, ‖θ‖₂ ≤ D ⇒ ‖f′(θ)‖₂ ≤ B
• Smoothness of f: the function f is convex, differentiable with L-Lipschitz-continuous gradient f′:
  ∀ θ₁, θ₂ ∈ ℝ^d, ‖f′(θ₁) − f′(θ₂)‖₂ ≤ L ‖θ₁ − θ₂‖₂
• Strong convexity of f: the function f is strongly convex with respect to the norm ‖·‖, with convexity constant µ > 0:
  ∀ θ₁, θ₂ ∈ ℝ^d, f(θ₁) ≥ f(θ₂) + f′(θ₂)⊤(θ₁ − θ₂) + (µ/2) ‖θ₁ − θ₂‖₂²
Subgradient method/descent

• Assumptions
  – f convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D}
• Algorithm: θ_t = Π_D( θ_{t−1} − (2D/(B√t)) f′(θ_{t−1}) )
  – Π_D: orthogonal projection onto {‖θ‖₂ ≤ D}
• Bound:
  f( (1/t) Σ_{k=0}^{t−1} θ_k ) − f(θ∗) ≤ 2DB/√t
• Three-line proof
• Best possible convergence rate after O(d) iterations
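As an illustration, here is a minimal NumPy sketch of this projected subgradient recursion with its averaged output; the hinge-loss example, the helper names, and the choice D = 10 are assumptions made for the example, not from the slides.

```python
import numpy as np

def projected_subgradient(subgrad, D, B, d, t_max):
    """Projected subgradient method with step size gamma_t = 2D/(B*sqrt(t)).
    Returns the averaged iterate (1/t_max) * sum_{k=0}^{t_max-1} theta_k."""
    theta = np.zeros(d)
    avg = np.zeros(d)
    for t in range(1, t_max + 1):
        avg += theta                                  # accumulate theta_{t-1}
        theta = theta - (2 * D / (B * np.sqrt(t))) * subgrad(theta)
        norm = np.linalg.norm(theta)
        if norm > D:                                  # Pi_D: projection onto the ball
            theta *= D / norm
    return avg / t_max

# Illustrative use: average hinge loss on fixed data (a B-Lipschitz convex f)
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))
B = np.linalg.norm(X, axis=1).max()                   # B = GR with G = 1 for the hinge

def hinge_subgrad(theta):
    active = (y * (X @ theta) < 1).astype(float)      # where max(0, 1 - y u) is active
    return -(X.T @ (active * y)) / n

theta_bar = projected_subgradient(hinge_subgrad, D=10.0, B=B, d=d, t_max=5000)
```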
Subgradient method/descent - proof - I

• Iteration: θ_t = Π_D( θ_{t−1} − γ_t f′(θ_{t−1}) ) with γ_t = 2D/(B√t)
• Assumption: ‖f′(θ)‖₂ ≤ B and ‖θ‖₂ ≤ D

  ‖θ_t − θ∗‖₂² ≤ ‖θ_{t−1} − θ∗ − γ_t f′(θ_{t−1})‖₂²   by contractivity of projections
             ≤ ‖θ_{t−1} − θ∗‖₂² + B²γ_t² − 2γ_t (θ_{t−1} − θ∗)⊤ f′(θ_{t−1})   because ‖f′(θ_{t−1})‖₂ ≤ B
             ≤ ‖θ_{t−1} − θ∗‖₂² + B²γ_t² − 2γ_t [ f(θ_{t−1}) − f(θ∗) ]   (property of subgradients)

• leading to
  f(θ_{t−1}) − f(θ∗) ≤ (B²γ_t)/2 + (1/(2γ_t)) [ ‖θ_{t−1} − θ∗‖₂² − ‖θ_t − θ∗‖₂² ]
Subgradient method/descent - proof - II

• Starting from f(θ_{t−1}) − f(θ∗) ≤ (B²γ_t)/2 + (1/(2γ_t)) [ ‖θ_{t−1} − θ∗‖₂² − ‖θ_t − θ∗‖₂² ]

  Σ_{u=1}^t [ f(θ_{u−1}) − f(θ∗) ]
    ≤ Σ_{u=1}^t (B²γ_u)/2 + Σ_{u=1}^t (1/(2γ_u)) [ ‖θ_{u−1} − θ∗‖₂² − ‖θ_u − θ∗‖₂² ]
    = Σ_{u=1}^t (B²γ_u)/2 + Σ_{u=1}^{t−1} ( 1/(2γ_{u+1}) − 1/(2γ_u) ) ‖θ_u − θ∗‖₂² + ‖θ_0 − θ∗‖₂²/(2γ_1) − ‖θ_t − θ∗‖₂²/(2γ_t)
    ≤ Σ_{u=1}^t (B²γ_u)/2 + Σ_{u=1}^{t−1} ( 1/(2γ_{u+1}) − 1/(2γ_u) ) 4D² + 4D²/(2γ_1)
    = Σ_{u=1}^t (B²γ_u)/2 + 4D²/(2γ_t) ≤ 2DB√t   with γ_t = 2D/(B√t)

• Using convexity: f( (1/t) Σ_{k=0}^{t−1} θ_k ) − f(θ∗) ≤ 2DB/√t
Subgradient descent for machine learning

• Assumptions (f is the expected risk, f̂ the empirical risk)
  – “Linear” predictors: θ(x) = θ⊤Φ(x), with ‖Φ(x)‖₂ ≤ R a.s.
  – f̂(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, Φ(x_i)⊤θ)
  – G-Lipschitz loss: f and f̂ are GR-Lipschitz on C = {‖θ‖₂ ≤ D}
• Statistics: with probability greater than 1 − δ,
  sup_{θ∈C} |f̂(θ) − f(θ)| ≤ (GRD/√n) [ 2 + √(2 log(2/δ)) ]
• Optimization: after t iterations of the subgradient method,
  f̂(θ̂) − min_{η∈C} f̂(η) ≤ GRD/√t
• t = n iterations, with total running-time complexity of O(n²d)
Subgradient descent - strong convexity

• Assumptions
  – f convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D}
  – f µ-strongly convex
• Algorithm: θ_t = Π_D( θ_{t−1} − (2/(µ(t+1))) f′(θ_{t−1}) )
• Bound:
  f( (2/(t(t+1))) Σ_{k=1}^t k θ_{k−1} ) − f(θ∗) ≤ 2B²/(µ(t+1))
• Three-line proof
• Best possible convergence rate after O(d) iterations
Subgradient method - strong convexity - proof - I

• Iteration: θ_t = Π_D( θ_{t−1} − γ_t f′(θ_{t−1}) ) with γ_t = 2/(µ(t+1))
• Assumption: ‖f′(θ)‖₂ ≤ B and ‖θ‖₂ ≤ D and µ-strong convexity of f

  ‖θ_t − θ∗‖₂² ≤ ‖θ_{t−1} − θ∗ − γ_t f′(θ_{t−1})‖₂²   by contractivity of projections
             ≤ ‖θ_{t−1} − θ∗‖₂² + B²γ_t² − 2γ_t (θ_{t−1} − θ∗)⊤ f′(θ_{t−1})   because ‖f′(θ_{t−1})‖₂ ≤ B
             ≤ ‖θ_{t−1} − θ∗‖₂² + B²γ_t² − 2γ_t [ f(θ_{t−1}) − f(θ∗) + (µ/2) ‖θ_{t−1} − θ∗‖₂² ]
               (property of subgradients and strong convexity)

• leading to
  f(θ_{t−1}) − f(θ∗) ≤ (B²γ_t)/2 + ( 1/(2γ_t) − µ/2 ) ‖θ_{t−1} − θ∗‖₂² − (1/(2γ_t)) ‖θ_t − θ∗‖₂²
                     ≤ B²/(µ(t+1)) + (µ/4) [ (t−1) ‖θ_{t−1} − θ∗‖₂² − (t+1) ‖θ_t − θ∗‖₂² ]
Subgradient method - strong convexity - proof - II

• From f(θ_{t−1}) − f(θ∗) ≤ B²/(µ(t+1)) + (µ/4) [ (t−1) ‖θ_{t−1} − θ∗‖₂² − (t+1) ‖θ_t − θ∗‖₂² ]

  Σ_{u=1}^t u [ f(θ_{u−1}) − f(θ∗) ]
    ≤ Σ_{u=1}^t uB²/(µ(u+1)) + (µ/4) Σ_{u=1}^t [ u(u−1) ‖θ_{u−1} − θ∗‖₂² − u(u+1) ‖θ_u − θ∗‖₂² ]
    ≤ B²t/µ + (µ/4) [ 0 − t(t+1) ‖θ_t − θ∗‖₂² ] ≤ B²t/µ

• Using convexity: f( (2/(t(t+1))) Σ_{u=1}^t u θ_{u−1} ) − f(θ∗) ≤ 2B²/(µ(t+1))
(smooth) gradient descent

• Assumptions
  – f convex with L-Lipschitz-continuous gradient
  – Minimum attained at θ∗
• Algorithm: θ_t = θ_{t−1} − (1/L) f′(θ_{t−1})
• Bound: f(θ_t) − f(θ∗) ≤ 2L ‖θ_0 − θ∗‖² / (t+4)
• Three-line proof
• Not best possible convergence rate after O(d) iterations
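For reference, here is a minimal sketch of this constant-step gradient descent on a least-squares objective; the quadratic setup and all names are illustrative assumptions, and L is computed as the largest eigenvalue of the Hessian X⊤X/n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# f(theta) = (1/2n) ||y - X theta||_2^2 is L-smooth with L = lambda_max(X^T X / n)
L = np.linalg.eigvalsh(X.T @ X / n).max()

theta = np.zeros(d)
for t in range(1000):
    grad = X.T @ (X @ theta - y) / n
    theta -= grad / L                 # constant step size 1/L, as in the slide
```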
(smooth) gradient descent - strong convexity

• Assumptions
  – f convex with L-Lipschitz-continuous gradient
  – f µ-strongly convex
• Algorithm: θ_t = θ_{t−1} − (1/L) f′(θ_{t−1})
• Bound: f(θ_t) − f(θ∗) ≤ (1 − µ/L)^t [ f(θ_0) − f(θ∗) ]
• Three-line proof
• Adaptivity of gradient descent to problem difficulty
• Line search
Accelerated gradient methods (Nesterov, 1983)

• Assumptions
  – f convex with L-Lipschitz-cont. gradient, min. attained at θ∗
• Algorithm:
  θ_t = η_{t−1} − (1/L) f′(η_{t−1})
  η_t = θ_t + ((t−1)/(t+2)) (θ_t − θ_{t−1})
• Bound: f(θ_t) − f(θ∗) ≤ 2L ‖θ_0 − θ∗‖² / (t+1)²
• Ten-line proof
• Not improvable
• Extension to strongly convex functions
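A sketch of the two-sequence recursion above, on the same kind of illustrative quadratic; the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
L = np.linalg.eigvalsh(X.T @ X / n).max()

def grad(theta):
    return X.T @ (X @ theta - y) / n

theta = np.zeros(d)   # theta_0
eta = theta.copy()    # eta_0 = theta_0
for t in range(1, 1001):
    theta_new = eta - grad(eta) / L                                # gradient step at eta_{t-1}
    eta = theta_new + (t - 1.0) / (t + 2.0) * (theta_new - theta)  # extrapolation/momentum
    theta = theta_new
```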
Optimization for sparsity-inducing norms
(see Bach, Jenatton, Mairal, and Obozinski, 2011)

• Gradient descent as a proximal method (differentiable functions)
  – θ_{t+1} = arg min_{θ∈ℝ^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + (L/2) ‖θ − θ_t‖₂²
  – θ_{t+1} = θ_t − (1/L) ∇f(θ_t)
Optimization for sparsity-inducing norms
(see Bach, Jenatton, Mairal, and Obozinski, 2011)

• Gradient descent as a proximal method (differentiable functions)
  – θ_{t+1} = arg min_{θ∈ℝ^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + (L/2) ‖θ − θ_t‖₂²
  – θ_{t+1} = θ_t − (1/L) ∇f(θ_t)
• Problems of the form: min_{θ∈ℝ^d} f(θ) + µ Ω(θ)
  – θ_{t+1} = arg min_{θ∈ℝ^d} f(θ_t) + (θ − θ_t)⊤∇f(θ_t) + µ Ω(θ) + (L/2) ‖θ − θ_t‖₂²
  – Ω(θ) = ‖θ‖₁ ⇒ Thresholded gradient descent
• Similar convergence rates to smooth optimization
  – Acceleration methods (Nesterov, 2007; Beck and Teboulle, 2009)
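For Ω(θ) = ‖θ‖₁, the proximal step has a closed form (soft-thresholding), which gives the thresholded gradient descent mentioned above; here is a minimal sketch on an illustrative lasso problem, where all names and constants are assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1: componentwise shrinkage toward zero
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(0)
n, d, mu = 100, 20, 0.1
X = rng.normal(size=(n, d))
y = X @ (rng.normal(size=d) * (rng.random(d) < 0.3))  # sparse ground truth
L = np.linalg.eigvalsh(X.T @ X / n).max()

theta = np.zeros(d)
for t in range(500):
    grad = X.T @ (X @ theta - y) / n
    theta = soft_threshold(theta - grad / L, mu / L)   # thresholded gradient step
```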
Summary: minimizing convex functions

• Assumption: f convex
• Gradient descent: θ_t = θ_{t−1} − γ_t f′(θ_{t−1})
  – O(1/√t) convergence rate for non-smooth convex functions
  – O(1/t) convergence rate for smooth convex functions
  – O(e^{−ρt}) convergence rate for smooth strongly convex functions
• Newton method: θ_t = θ_{t−1} − f′′(θ_{t−1})^{−1} f′(θ_{t−1})
  – O(e^{−ρ2^t}) convergence rate
Summary: minimizing convex functions

• Assumption: f convex
• Gradient descent: θ_t = θ_{t−1} − γ_t f′(θ_{t−1})
  – O(1/√t) convergence rate for non-smooth convex functions
  – O(1/t) convergence rate for smooth convex functions
  – O(e^{−ρt}) convergence rate for smooth strongly convex functions
• Newton method: θ_t = θ_{t−1} − f′′(θ_{t−1})^{−1} f′(θ_{t−1})
  – O(e^{−ρ2^t}) convergence rate
• Key insights from Bottou and Bousquet (2008)
  1. In machine learning, no need to optimize below statistical error
  2. In machine learning, cost functions are averages
  ⇒ Stochastic approximation
Outline

1. Large-scale machine learning and optimization
   • Traditional statistical analysis
   • Classical methods for convex optimization
2. Non-smooth stochastic approximation
   • Stochastic (sub)gradient and averaging
   • Non-asymptotic results and lower bounds
   • Strongly convex vs. non-strongly convex
3. Smooth stochastic approximation algorithms
   • Asymptotic and non-asymptotic results
   • Beyond decaying step-sizes
4. Finite data sets
Stochastic approximation

• Goal: minimizing a function f defined on ℝ^d
  – given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ ℝ^d
• Stochastic approximation
  – (much) broader applicability beyond convex optimization
    θ_n = θ_{n−1} − γ_n h_n(θ_{n−1}) with E[ h_n(θ_{n−1}) | θ_{n−1} ] = h(θ_{n−1})
  – Beyond convex problems, i.i.d. assumption, finite dimension, etc.
  – Typically asymptotic results
  – See, e.g., Kushner and Yin (2003); Benveniste et al. (2012)
Stochastic approximation

• Goal: minimizing a function f defined on ℝ^d
  – given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ ℝ^d
• Machine learning - statistics
  – loss for a single pair of observations: f_n(θ) = ℓ(y_n, θ⊤Φ(x_n))
  – f(θ) = E f_n(θ) = E ℓ(y_n, θ⊤Φ(x_n)) = generalization error
  – Expected gradient: f′(θ) = E f′_n(θ) = E[ ℓ′(y_n, θ⊤Φ(x_n)) Φ(x_n) ]
  – Non-asymptotic results
• Number of iterations = number of observations
Relationship to online learning

• Stochastic approximation
  – Minimize f(θ) = E_z ℓ(θ, z) = generalization error of θ
  – Using the gradients of single i.i.d. observations
Relationship to online learning

• Stochastic approximation
  – Minimize f(θ) = E_z ℓ(θ, z) = generalization error of θ
  – Using the gradients of single i.i.d. observations
• Batch learning
  – Finite set of observations: z_1, …, z_n
  – Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n ℓ(θ, z_i)
  – Estimator θ̂ = Minimizer of f̂(θ) over a certain class Θ
  – Generalization bound using uniform concentration results
Relationship to online learning

• Stochastic approximation
  – Minimize f(θ) = E_z ℓ(θ, z) = generalization error of θ
  – Using the gradients of single i.i.d. observations
• Batch learning
  – Finite set of observations: z_1, …, z_n
  – Empirical risk: f̂(θ) = (1/n) Σ_{i=1}^n ℓ(θ, z_i)
  – Estimator θ̂ = Minimizer of f̂(θ) over a certain class Θ
  – Generalization bound using uniform concentration results
• Online learning
  – Update θ̂_n after each new (potentially adversarial) observation z_n
  – Cumulative loss: (1/n) Σ_{k=1}^n ℓ(θ̂_{k−1}, z_k)
  – Online to batch through averaging (Cesa-Bianchi et al., 2004)
Convex stochastic approximation

• Key properties of f and/or f_n
  – Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous
  – Strong convexity: f µ-strongly convex
Convex stochastic approximation

• Key properties of f and/or f_n
  – Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous
  – Strong convexity: f µ-strongly convex
• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)
  θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})
  – Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k
  – Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}
Convex stochastic approximation

• Key properties of f and/or f_n
  – Smoothness: f B-Lipschitz continuous, f′ L-Lipschitz continuous
  – Strong convexity: f µ-strongly convex
• Key algorithm: Stochastic gradient descent (a.k.a. Robbins-Monro)
  θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})
  – Polyak-Ruppert averaging: θ̄_n = (1/n) Σ_{k=0}^{n−1} θ_k
  – Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}
• Desirable practical behavior
  – Applicable (at least) to classical supervised learning problems
  – Robustness to (potentially unknown) constants (L, B, µ)
  – Adaptivity to difficulty of the problem (e.g., strong convexity)
Stochastic subgradient descent/method

• Assumptions
  – f_n convex and B-Lipschitz-continuous on {‖θ‖₂ ≤ D}
  – (f_n) i.i.d. functions such that E f_n = f
  – θ∗ global optimum of f on {‖θ‖₂ ≤ D}
• Algorithm: θ_n = Π_D( θ_{n−1} − (2D/(B√n)) f′_n(θ_{n−1}) )
• Bound:
  E f( (1/n) Σ_{k=0}^{n−1} θ_k ) − f(θ∗) ≤ 2DB/√n
• “Same” three-line proof as in the deterministic case
• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
• Running-time complexity: O(dn) after n iterations
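A minimal sketch of this averaged stochastic subgradient recursion on a streaming hinge-loss problem; the data model, the bound B, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D = 10, 5.0
theta_star = rng.normal(size=d)

def sample():
    # one i.i.d. observation (Phi(x_n), y_n); illustrative data model
    x = rng.normal(size=d)
    return x, np.sign(x @ theta_star)

theta = np.zeros(d)
avg = np.zeros(d)
n_max = 10_000
B = np.sqrt(d)                        # rough bound on ||f'_n||; an assumption
for n in range(1, n_max + 1):
    avg += (theta - avg) / n          # running average of theta_0..theta_{n-1}
    x, y = sample()
    g = -y * x if y * (x @ theta) < 1 else np.zeros(d)   # hinge subgradient
    theta -= (2 * D / (B * np.sqrt(n))) * g
    norm = np.linalg.norm(theta)
    if norm > D:                      # projection onto {||theta||_2 <= D}
        theta *= D / norm
```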
Stochastic subgradient method - proof - I

• Iteration: θ_n = Π_D( θ_{n−1} − γ_n f′_n(θ_{n−1}) ) with γ_n = 2D/(B√n)
• F_n: information up to time n
• ‖f′_n(θ)‖₂ ≤ B and ‖θ‖₂ ≤ D, unbiased gradients/functions: E(f_n | F_{n−1}) = f

  ‖θ_n − θ∗‖₂² ≤ ‖θ_{n−1} − θ∗ − γ_n f′_n(θ_{n−1})‖₂²   by contractivity of projections
             ≤ ‖θ_{n−1} − θ∗‖₂² + B²γ_n² − 2γ_n (θ_{n−1} − θ∗)⊤ f′_n(θ_{n−1})   because ‖f′_n(θ_{n−1})‖₂ ≤ B
  E[ ‖θ_n − θ∗‖₂² | F_{n−1} ] ≤ ‖θ_{n−1} − θ∗‖₂² + B²γ_n² − 2γ_n (θ_{n−1} − θ∗)⊤ f′(θ_{n−1})
             ≤ ‖θ_{n−1} − θ∗‖₂² + B²γ_n² − 2γ_n [ f(θ_{n−1}) − f(θ∗) ]   (subgradient property)
  E ‖θ_n − θ∗‖₂² ≤ E ‖θ_{n−1} − θ∗‖₂² + B²γ_n² − 2γ_n [ E f(θ_{n−1}) − f(θ∗) ]

• leading to
  E f(θ_{n−1}) − f(θ∗) ≤ (B²γ_n)/2 + (1/(2γ_n)) [ E ‖θ_{n−1} − θ∗‖₂² − E ‖θ_n − θ∗‖₂² ]
Stochastic subgradient method - proof - II

• Starting from
  E f(θ_{n−1}) − f(θ∗) ≤ (B²γ_n)/2 + (1/(2γ_n)) [ E ‖θ_{n−1} − θ∗‖₂² − E ‖θ_n − θ∗‖₂² ]

  Σ_{u=1}^n [ E f(θ_{u−1}) − f(θ∗) ] ≤ Σ_{u=1}^n (B²γ_u)/2 + Σ_{u=1}^n (1/(2γ_u)) [ E ‖θ_{u−1} − θ∗‖₂² − E ‖θ_u − θ∗‖₂² ]
                                     ≤ Σ_{u=1}^n (B²γ_u)/2 + 4D²/(2γ_n) ≤ 2DB√n   with γ_n = 2D/(B√n)

• Using convexity: E f( (1/n) Σ_{k=0}^{n−1} θ_k ) − f(θ∗) ≤ 2DB/√n
Stochastic subgradient descent - strong convexity - I

• Assumptions
  – f_n convex and B-Lipschitz-continuous
  – (f_n) i.i.d. functions such that E f_n = f
  – f µ-strongly convex on {‖θ‖₂ ≤ D}
  – θ∗ global optimum of f over {‖θ‖₂ ≤ D}
• Algorithm: θ_n = Π_D( θ_{n−1} − (2/(µ(n+1))) f′_n(θ_{n−1}) )
• Bound:
  E f( (2/(n(n+1))) Σ_{k=1}^n k θ_{k−1} ) − f(θ∗) ≤ 2B²/(µ(n+1))
• “Same” three-line proof as in the deterministic case
• Minimax rate (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
Stochastic subgradient descent - strong convexity - II

• Assumptions
  – f_n convex and B-Lipschitz-continuous
  – (f_n) i.i.d. functions such that E f_n = f
  – θ∗ global optimum of g = f + (µ/2) ‖·‖₂²
  – No compactness assumption - no projections
• Algorithm:
  θ_n = θ_{n−1} − (2/(µ(n+1))) g′_n(θ_{n−1}) = θ_{n−1} − (2/(µ(n+1))) [ f′_n(θ_{n−1}) + µθ_{n−1} ]
• Bound:
  E g( (2/(n(n+1))) Σ_{k=1}^n k θ_{k−1} ) − g(θ∗) ≤ 2B²/(µ(n+1))
Outline

1. Large-scale machine learning and optimization
   • Traditional statistical analysis
   • Classical methods for convex optimization
2. Non-smooth stochastic approximation
   • Stochastic (sub)gradient and averaging
   • Non-asymptotic results and lower bounds
   • Strongly convex vs. non-strongly convex
3. Smooth stochastic approximation algorithms
   • Asymptotic and non-asymptotic results
   • Beyond decaying step-sizes
4. Finite data sets
Stochastic approximation

• Goal: minimizing a function f defined on ℝ^p
  – given only unbiased estimates f′_n(θ_n) of its gradients f′(θ_n) at certain points θ_n ∈ ℝ^p
• Machine learning - statistics
  – loss for a single pair of observations: f_n(θ) = ℓ(y_n, ⟨θ, Φ(x_n)⟩)
  – f(θ) = E f_n(θ) = E ℓ(y_n, ⟨θ, Φ(x_n)⟩) = generalization error
  – Expected gradient: f′(θ) = E f′_n(θ) = E[ ℓ′(y_n, ⟨θ, Φ(x_n)⟩) Φ(x_n) ]
  – Non-asymptotic results
Convex stochastic approximation

• Key assumption: smoothness and/or strong convexity
• Key algorithm: stochastic gradient descent (a.k.a. Robbins-Monro)
  θ_n = θ_{n−1} − γ_n f′_n(θ_{n−1})
  – Polyak-Ruppert averaging: θ̄_n = (1/(n+1)) Σ_{k=0}^n θ_k
  – Which learning rate sequence γ_n? Classical setting: γ_n = C n^{−α}
Convex stochastic approximation
Existing work

• Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
  – Strongly convex: O((µn)^{−1})
    Attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}
  – Non-strongly convex: O(n^{−1/2})
    Attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}
  – Bottou and Le Cun (2005); Bottou and Bousquet (2008); Hazan et al. (2007); Shalev-Shwartz and Srebro (2008); Shalev-Shwartz et al. (2007, 2009); Xiao (2010); Duchi and Singer (2009); Nesterov and Vial (2008); Nemirovski et al. (2009)
Convex stochastic approximation
Existing work

• Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
  – Strongly convex: O((µn)^{−1})
    Attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}
  – Non-strongly convex: O(n^{−1/2})
    Attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}
• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)
  – All step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems

A single algorithm with a global adaptive convergence rate for smooth problems?
Convex stochastic approximation
Existing work

• Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
  – Strongly convex: O((µn)^{−1})
    Attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}
  – Non-strongly convex: O(n^{−1/2})
    Attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}
• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)
  – All step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems
• Non-asymptotic analysis for smooth problems? → see Bach and Moulines (2011)
Convex stochastic approximation
Existing work

• Known global minimax rates of convergence for non-smooth problems (Nemirovsky and Yudin, 1983; Agarwal et al., 2012)
  – Strongly convex: O((µn)^{−1})
    Attained by averaged stochastic gradient descent with γ_n ∝ (µn)^{−1}
  – Non-strongly convex: O(n^{−1/2})
    Attained by averaged stochastic gradient descent with γ_n ∝ n^{−1/2}
• Asymptotic analysis of averaging (Polyak and Juditsky, 1992; Ruppert, 1988)
  – All step sizes γ_n = C n^{−α} with α ∈ (1/2, 1) lead to O(n^{−1}) for smooth strongly convex problems
• A single adaptive algorithm for smooth problems with convergence rate O(min{1/(µn), 1/√n}) in all situations?
Adaptive algorithm for logistic regression

• Logistic regression: (Φ(x_n), y_n) ∈ ℝ^d × {−1, 1}
  – Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n)))
  – Generalization error: f(θ) = E f_n(θ)
Adaptive algorithm for logistic regression

• Logistic regression: (Φ(x_n), y_n) ∈ ℝ^d × {−1, 1}
  – Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n)))
  – Generalization error: f(θ) = E f_n(θ)
• Cannot be strongly convex ⇒ local strong convexity
  – unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^M)
  – µ = lowest eigenvalue of the Hessian at the optimum, f′′(θ∗)

[Figure: the logistic loss]
Adaptive algorithm for logistic regression

• Logistic regression: (Φ(x_n), y_n) ∈ ℝ^d × {−1, 1}
  – Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n)))
  – Generalization error: f(θ) = E f_n(θ)
• Cannot be strongly convex ⇒ local strong convexity
  – unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^M)
  – µ = lowest eigenvalue of the Hessian at the optimum, f′′(θ∗)
• n steps of averaged SGD with constant step-size 1/(2R²√n)
  – with R = radius of data (Bach, 2013):
    E f(θ̄_n) − f(θ∗) ≤ min{ 1/√n, R²/(nµ) } (15 + 5R ‖θ_0 − θ∗‖)⁴
  – Proof based on self-concordance (Nesterov and Nemirovski, 1994)
Adaptive algorithm for logistic regression

• Logistic regression: (Φ(x_n), y_n) ∈ ℝ^d × {−1, 1}
  – Single data point: f_n(θ) = log(1 + exp(−y_n θ⊤Φ(x_n)))
  – Generalization error: f(θ) = E f_n(θ)
• Cannot be strongly convex ⇒ local strong convexity
  – unless restricted to |θ⊤Φ(x_n)| ≤ M (and with constants e^M)
  – µ = lowest eigenvalue of the Hessian at the optimum, f′′(θ∗)
• n steps of averaged SGD with constant step-size 1/(2R²√n)
  – with R = radius of data (Bach, 2013):
    E f(θ̄_n) − f(θ∗) ≤ min{ 1/√n, R²/(nµ) } (15 + 5R ‖θ_0 − θ∗‖)⁴
  – A single adaptive algorithm for smooth problems with convergence rate O(1/n) in all situations?
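A minimal sketch of the averaged SGD with constant step size 1/(2R²√n) analyzed above, for logistic regression on illustrative synthetic data (all names and the data model are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_max = 10, 100_000
theta_star = rng.normal(size=d)

# Pre-draw the stream; R bounds the feature norms ||Phi(x_n)||
X = rng.normal(size=(n_max, d)) / np.sqrt(d)
R = np.linalg.norm(X, axis=1).max()
y = np.where(rng.random(n_max) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)

gamma = 1.0 / (2 * R**2 * np.sqrt(n_max))   # constant step size from the slide
theta = np.zeros(d)
avg = np.zeros(d)
for n in range(n_max):
    s = 1.0 / (1.0 + np.exp(y[n] * (X[n] @ theta)))  # sigmoid(-y theta.x)
    theta += gamma * y[n] * s * X[n]        # SGD step on log(1 + exp(-y theta.x))
    avg += (theta - avg) / (n + 1)          # online Polyak-Ruppert averaging
```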
Least-mean-square (LMS) algorithm

• Least-squares: f(θ) = (1/2) E[ (y_n − Φ(x_n)⊤θ)² ] with θ ∈ ℝ^d
  – SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
  – usually studied without averaging and with decreasing step-sizes
  – with strong convexity assumption E[ Φ(x_n)Φ(x_n)⊤ ] = H ≽ µ · Id
Least-mean-square (LMS) algorithm

• Least-squares: f(θ) = (1/2) E[ (y_n − Φ(x_n)⊤θ)² ] with θ ∈ ℝ^d
  – SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
  – usually studied without averaging and with decreasing step-sizes
  – with strong convexity assumption E[ Φ(x_n)Φ(x_n)⊤ ] = H ≽ µ · Id
• New analysis for averaging and constant step-size γ = 1/(4R²)
  – Assume ‖Φ(x_n)‖ ≤ R and |y_n − Φ(x_n)⊤θ∗| ≤ σ almost surely
  – No assumption regarding lowest eigenvalues of H
  – Main result: E f(θ̄_n) − f(θ∗) ≤ 4σ²d/n + 2R² ‖θ_0 − θ∗‖²/n
• Matches statistical lower bound (Tsybakov, 2003)
  – Non-asymptotic robust version of Györfi and Walk (1996)
Least-mean-square (LMS) algorithm

• Least-squares: f(θ) = (1/2) E[ (y_n − Φ(x_n)⊤θ)² ] with θ ∈ ℝ^d
  – SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
  – usually studied without averaging and with decreasing step-sizes
  – with strong convexity assumption E[ Φ(x_n)Φ(x_n)⊤ ] = H ≽ µ · Id
• New analysis for averaging and constant step-size γ = 1/(4R²)
  – Assume ‖Φ(x_n)‖ ≤ R and |y_n − Φ(x_n)⊤θ∗| ≤ σ almost surely
  – No assumption regarding lowest eigenvalues of H
  – Main result: E f(θ̄_n) − f(θ∗) ≤ 4σ²d/n + 2R² ‖θ_0 − θ∗‖²/n
• Improvement of bias term (Flammarion and Bach, 2014):
  min{ R² ‖θ_0 − θ∗‖²/n, R⁴ (θ_0 − θ∗)⊤H^{−1}(θ_0 − θ∗)/n² }
Least-mean-square (LMS) algorithm

• Least-squares: f(θ) = (1/2) E[ (y_n − Φ(x_n)⊤θ)² ] with θ ∈ ℝ^d
  – SGD = least-mean-square algorithm (see, e.g., Macchi, 1995)
  – usually studied without averaging and with decreasing step-sizes
  – with strong convexity assumption E[ Φ(x_n)Φ(x_n)⊤ ] = H ≽ µ · Id
• New analysis for averaging and constant step-size γ = 1/(4R²)
  – Assume ‖Φ(x_n)‖ ≤ R and |y_n − Φ(x_n)⊤θ∗| ≤ σ almost surely
  – No assumption regarding lowest eigenvalues of H
  – Main result: E f(θ̄_n) − f(θ∗) ≤ 4σ²d/n + 2R² ‖θ_0 − θ∗‖²/n
• Extension to Hilbert spaces (Dieuleveut and Bach, 2014):
  – Achieves minimax statistical rates given decay of spectrum of H
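A minimal sketch of averaged constant-step-size LMS with γ = 1/(4R²) as in the analysis above; the synthetic data and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_max, sigma = 10, 100_000, 0.1
theta_star = rng.normal(size=d)

X = rng.normal(size=(n_max, d)) / np.sqrt(d)
R = np.linalg.norm(X, axis=1).max()
y = X @ theta_star + sigma * rng.uniform(-1, 1, size=n_max)  # bounded noise

gamma = 1.0 / (4 * R**2)             # constant step size from the analysis
theta = np.zeros(d)
avg = np.zeros(d)
for n in range(n_max):
    theta -= gamma * (X[n] @ theta - y[n]) * X[n]   # LMS step
    avg += (theta - avg) / (n + 1)                  # averaged iterate theta_bar_n
```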
Least-squares - Proof technique

• LMS recursion with ε_n = y_n − Φ(x_n)⊤θ∗:
  θ_n − θ∗ = [ I − γ Φ(x_n)Φ(x_n)⊤ ] (θ_{n−1} − θ∗) + γ ε_n Φ(x_n)
• Simplified LMS recursion: with H = E[ Φ(x_n)Φ(x_n)⊤ ],
  θ_n − θ∗ = [ I − γH ] (θ_{n−1} − θ∗) + γ ε_n Φ(x_n)
  – Direct proof technique of Polyak and Juditsky (1992), e.g.,
    θ_n − θ∗ = [ I − γH ]^n (θ_0 − θ∗) + γ Σ_{k=1}^n [ I − γH ]^{n−k} ε_k Φ(x_k)
  – Exact computations
• Infinite expansion of Aguech, Moulines, and Priouret (2000) in powers of γ
Markov chain interpretation of constant step sizes

• LMS recursion for f_n(θ) = (1/2) (y_n − Φ(x_n)⊤θ)²:
  θ_n = θ_{n−1} − γ [ Φ(x_n)⊤θ_{n−1} − y_n ] Φ(x_n)
• The sequence (θ_n)_n is a homogeneous Markov chain
  – convergence to a stationary distribution π_γ
  – with expectation θ̄_γ := ∫ θ π_γ(dθ)
Markov chain interpretation of constant step sizes

• LMS recursion for f_n(θ) = (1/2) (y_n − Φ(x_n)⊤θ)²:
  θ_n = θ_{n−1} − γ [ Φ(x_n)⊤θ_{n−1} − y_n ] Φ(x_n)
• The sequence (θ_n)_n is a homogeneous Markov chain
  – convergence to a stationary distribution π_γ
  – with expectation θ̄_γ := ∫ θ π_γ(dθ)
• For least-squares, θ̄_γ = θ∗
  – θ_n does not converge to θ∗ but oscillates around it
  – oscillations of order √γ
  – cf. Kaczmarz method (Strohmer and Vershynin, 2009)
• Ergodic theorem:
  – Averaged iterates converge to θ̄_γ = θ∗ at rate O(1/n)
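This behavior is easy to check numerically: with a constant step size the final iterate keeps oscillating at a level governed by γ, while the averaged iterate converges to θ∗; a sketch under the same kind of illustrative least-squares setup (all names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_max = 5, 200_000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n_max, d)) / np.sqrt(d)
y = X @ theta_star + 0.1 * rng.normal(size=n_max)

for gamma in [0.5, 0.05]:
    theta, avg = np.zeros(d), np.zeros(d)
    for n in range(n_max):
        theta -= gamma * (X[n] @ theta - y[n]) * X[n]
        avg += (theta - avg) / (n + 1)
    # ||theta - theta*|| stalls at a gamma-dependent level; ||avg - theta*|| is much smaller
    print(gamma, np.linalg.norm(theta - theta_star), np.linalg.norm(avg - theta_star))
```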
Simulations - synthetic examples

• Gaussian distributions - d = 20

[Figure: synthetic square - log₁₀[f(θ)−f(θ∗)] vs. log₁₀(n) for averaged SGD with step sizes 1/(2R²), 1/(8R²), 1/(32R²), and 1/(2R²n^{1/2})]
Simulations - benchmarks

• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)

[Figure: test performance log₁₀[f(θ)−f(θ∗)] vs. log₁₀(n) on alpha and news (square loss, C = 1 and C = opt), comparing step sizes 1/R², 1/(R²n^{1/2}), C/R², C/(R²n^{1/2}), and SAG]
Beyond least-squares - Markov chain interpretation

• Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain
  – Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0
  – When f′ is not linear, f′( ∫ θ π_γ(dθ) ) ≠ ∫ f′(θ) π_γ(dθ) = 0
Beyond least-squares - Markov chain interpretation

• Recursion θ_n = θ_{n−1} − γ f′_n(θ_{n−1}) also defines a Markov chain
  – Stationary distribution π_γ such that ∫ f′(θ) π_γ(dθ) = 0
  – When f′ is not linear, f′( ∫ θ π_γ(dθ) ) ≠ ∫ f′(θ) π_γ(dθ) = 0
• θ_n oscillates around the wrong value θ̄_γ ≠ θ∗
  – moreover, ‖θ∗ − θ_n‖ = O_p(√γ)
• Ergodic theorem
  – Averaged iterates converge to θ̄_γ ≠ θ∗ at rate O(1/n)
  – moreover, ‖θ∗ − θ̄_γ‖ = O(γ) (Bach, 2013)
• NB: coherent with earlier results by Nedic and Bertsekas (2000)
Simulations - synthetic examples

• Gaussian distributions - d = 20

[Figure: synthetic logistic, panel 1 - log₁₀[f(θ)−f(θ∗)] vs. log₁₀(n) for averaged SGD with step sizes 1/(2R²), 1/(8R²), 1/(32R²), and 1/(2R²n^{1/2})]
Restoring convergence through online Newton steps

• Known facts
  1. Averaged SGD with γ_n ∝ n^{−1/2} leads to robust rate O(n^{−1/2}) for all convex functions
  2. Averaged SGD with γ_n constant leads to robust rate O(n^{−1}) for all convex quadratic functions
  3. Newton’s method squares the error at each iteration for smooth functions
  4. A single step of Newton’s method is equivalent to minimizing the quadratic Taylor expansion
• Online Newton step
  – Rate: O((n^{−1/2})² + n^{−1}) = O(n^{−1})
  – Complexity: O(d) per iteration for linear predictions
Restoring convergence through online Newton steps

• The Newton step for f = E f_n(θ) = E[ ℓ(y_n, θ⊤Φ(x_n)) ] at θ̃ is equivalent to minimizing the quadratic approximation
  g(θ) = f(θ̃) + f′(θ̃)⊤(θ − θ̃) + (1/2)(θ − θ̃)⊤ f′′(θ̃)(θ − θ̃)
       = f(θ̃) + E f′_n(θ̃)⊤(θ − θ̃) + (1/2)(θ − θ̃)⊤ [ E f′′_n(θ̃) ](θ − θ̃)
       = E[ f(θ̃) + f′_n(θ̃)⊤(θ − θ̃) + (1/2)(θ − θ̃)⊤ f′′_n(θ̃)(θ − θ̃) ]
Restoring convergence through online Newton steps

• The Newton step for f = E f_n(θ) = E[ ℓ(y_n, θ⊤Φ(x_n)) ] at θ̃ is equivalent to minimizing the quadratic approximation
  g(θ) = f(θ̃) + f′(θ̃)⊤(θ − θ̃) + (1/2)(θ − θ̃)⊤ f′′(θ̃)(θ − θ̃)
       = f(θ̃) + E f′_n(θ̃)⊤(θ − θ̃) + (1/2)(θ − θ̃)⊤ [ E f′′_n(θ̃) ](θ − θ̃)
       = E[ f(θ̃) + f′_n(θ̃)⊤(θ − θ̃) + (1/2)(θ − θ̃)⊤ f′′_n(θ̃)(θ − θ̃) ]
• Complexity of least-mean-square recursion for g is O(d)
  θ_n = θ_{n−1} − γ [ f′_n(θ̃) + f′′_n(θ̃)(θ_{n−1} − θ̃) ]
  – f′′_n(θ̃) = ℓ′′(y_n, Φ(x_n)⊤θ̃) Φ(x_n)Φ(x_n)⊤ has rank one
  – New online Newton step without computing/inverting Hessians
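A sketch of this O(d) recursion for logistic regression: the Hessian is never formed, since f′′_n(θ̃)(θ_{n−1} − θ̃) is a scalar multiple of Φ(x_n). The function and variable names are assumptions, and θ̃ would come from a first averaged-SGD phase, as described on the next slide.

```python
import numpy as np

def online_newton_lms(X, y, theta_tilde, gamma):
    """Averaged LMS recursion on the quadratic approximation g of the
    logistic risk around theta_tilde; O(d) cost per iteration."""
    theta = theta_tilde.copy()
    avg = np.zeros_like(theta)
    for n in range(X.shape[0]):
        s = 1.0 / (1.0 + np.exp(y[n] * (X[n] @ theta_tilde)))  # sigmoid(-y u)
        g = -y[n] * s * X[n]                  # f'_n(theta_tilde)
        h = s * (1.0 - s)                     # loss''(y_n, Phi(x_n)^T theta_tilde)
        # rank-one Hessian-vector product f''_n(theta_tilde)(theta - theta_tilde)
        g = g + h * (X[n] @ (theta - theta_tilde)) * X[n]
        theta -= gamma * g
        avg += (theta - avg) / (n + 1)
    return avg
```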
Choice of support point for online Newton step

• Two-stage procedure
  (1) Run n/2 iterations of averaged SGD to obtain θ̃
  (2) Run n/2 iterations of averaged constant step-size LMS
  – Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)
  – Provable convergence rate of O(d/n) for logistic regression
  – Additional assumptions but no strong convexity
Choice of support point for online Newton step

• Two-stage procedure
  (1) Run n/2 iterations of averaged SGD to obtain θ̃
  (2) Run n/2 iterations of averaged constant step-size LMS
  – Reminiscent of one-step estimators (see, e.g., Van der Vaart, 2000)
  – Provable convergence rate of O(d/n) for logistic regression
  – Additional assumptions but no strong convexity
• Update at each iteration using the current averaged iterate
  – Recursion: θ_n = θ_{n−1} − γ [ f′_n(θ̄_{n−1}) + f′′_n(θ̄_{n−1})(θ_{n−1} − θ̄_{n−1}) ]
  – No provable convergence rate (yet) but best practical behavior
  – Note (dis)similarity with regular SGD: θ_n = θ_{n−1} − γ f′_n(θ_{n−1})
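A sketch of this single-pass variant, where the support point is the running average itself; the logistic-regression setup, the step size, and all names are assumptions, and, as stated above, no convergence rate is proven for this recursion.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_max = 10, 100_000
theta_star = rng.normal(size=d)
X = rng.normal(size=(n_max, d)) / np.sqrt(d)
y = np.where(rng.random(n_max) < 1.0 / (1.0 + np.exp(-X @ theta_star)), 1.0, -1.0)

gamma = 1.0 / (4 * np.linalg.norm(X, axis=1).max() ** 2)   # illustrative choice
theta = np.zeros(d)
avg = np.zeros(d)
for n in range(n_max):
    s = 1.0 / (1.0 + np.exp(y[n] * (X[n] @ avg)))
    g = -y[n] * s * X[n]                                  # f'_n at the averaged iterate
    g += s * (1.0 - s) * (X[n] @ (theta - avg)) * X[n]    # rank-one Newton correction
    theta -= gamma * g
    avg += (theta - avg) / (n + 1)
```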
Simulations - synthetic examples

• Gaussian distributions - d = 20

[Figure: synthetic logistic, panels 1 and 2 - log₁₀[f(θ)−f(θ∗)] vs. log₁₀(n); panel 1 compares step sizes 1/(2R²), 1/(8R²), 1/(32R²), 1/(2R²n^{1/2}); panel 2 compares online Newton variants “every 2^p”, “every iter.”, “2-step”, “2-step-dbl.”]
Simulations - benchmarks

• alpha (d = 500, n = 500 000), news (d = 1 300 000, n = 20 000)

[Figure: test performance log₁₀[f(θ)−f(θ∗)] vs. log₁₀(n) on alpha and news (logistic loss, C = 1 and C = opt), comparing 1/R², 1/(R²n^{1/2}), C/R², C/(R²n^{1/2}), SAG, Adagrad, and Newton]
Outline

1. Large-scale machine learning and optimization
   • Traditional statistical analysis
   • Classical methods for convex optimization
2. Non-smooth stochastic approximation
   • Stochastic (sub)gradient and averaging
   • Non-asymptotic results and lower bounds
   • Strongly convex vs. non-strongly convex
3. Smooth stochastic approximation algorithms
   • Asymptotic and non-asymptotic results
   • Beyond decaying step-sizes
4. Finite data sets
Going beyond a single pass over the data

• Stochastic approximation
  – Assumes infinite data stream
  – Observations are used only once
  – Directly minimizes testing cost E_{(x,y)} ℓ(y, θ⊤Φ(x))
Going beyond a single pass over the data

• Stochastic approximation
  – Assumes infinite data stream
  – Observations are used only once
  – Directly minimizes testing cost E_{(x,y)} ℓ(y, θ⊤Φ(x))
• Machine learning practice
  – Finite data set (x_1, y_1, …, x_n, y_n)
  – Multiple passes
  – Minimizes training cost (1/n) Σ_{i=1}^n ℓ(y_i, θ⊤Φ(x_i))
  – Need to regularize (e.g., by the ℓ₂-norm) to avoid overfitting
• Goal: minimize g(θ) = (1/n) Σ_{i=1}^n f_i(θ)
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)
• Batch gradient descent:
  θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f′_i(θ_{t−1})
  – Linear (e.g., exponential) convergence rate in O(e^{−αt})
  – Iteration complexity is linear in n (with line search)
Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) Σ_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)
• Batch gradient descent:
  θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) Σ_{i=1}^n f′_i(θ_{t−1})
  – Linear (e.g., exponential) convergence rate in O(e^{−αt})
  – Iteration complexity is linear in n (with line search)
• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
  – Sampling with replacement: i(t) random element of {1, …, n}
  – Convergence rate in O(1/t)
  – Iteration complexity is independent of n (step size selection?)
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(1) iteration cost; robustness to step size

[Figure: log(excess cost) vs. time - stochastic vs. deterministic]
Stochastic vs. deterministic methods

• Goal = best of both worlds: linear rate with O(1) iteration cost; robustness to step size

[Figure: log(excess cost) vs. time - stochastic, deterministic, and hybrid]
Accelerating gradient methods - Related work

• Nesterov acceleration
  – Nesterov (1983, 2004)
  – Better linear rate but still O(n) iteration cost
• Hybrid methods, incremental average gradient, increasing batch size
  – Bertsekas (1997); Blatt et al. (2008); Friedlander and Schmidt (2011)
  – Linear rate, but iterations make full passes through the data
Accelerating gradient methods - Related work

• Momentum, gradient/iterate averaging, stochastic versions of accelerated batch gradient methods
  – Polyak and Juditsky (1992); Tseng (1998); Sunehag et al. (2009); Ghadimi and Lan (2010); Xiao (2010)
  – Can improve constants, but still have sublinear O(1/t) rate
• Constant step-size stochastic gradient (SG), accelerated SG
  – Kesten (1958); Delyon and Juditsky (1993); Solodov (1998); Nedic and Bertsekas (2000)
  – Linear convergence, but only up to a fixed tolerance
• Stochastic methods in the dual
  – Shalev-Shwartz and Zhang (2012)
  – Similar linear rate but limited choice for the f_i’s
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration
  – Keep in memory the gradients of all functions f_i, i = 1, …, n
  – Random selection i(t) ∈ {1, …, n} with replacement
  – Iteration: θ_t = θ_{t−1} − (γ_t/n) Σ_{i=1}^n y_i^t
    with y_i^t = f′_i(θ_{t−1}) if i = i(t), and y_i^t = y_i^{t−1} otherwise
Stochastic average gradient
(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration
  – Keep in memory the gradients of all functions f_i, i = 1, …, n
  – Random selection i(t) ∈ {1, …, n} with replacement
  – Iteration: θ_t = θ_{t−1} − (γ_t/n) Σ_{i=1}^n y_i^t
    with y_i^t = f′_i(θ_{t−1}) if i = i(t), and y_i^t = y_i^{t−1} otherwise
• Stochastic version of incremental average gradient (Blatt et al., 2008)
• Extra memory requirement
  – Supervised machine learning
  – If f_i(θ) = ℓ_i(y_i, Φ(x_i)⊤θ), then f′_i(θ) = ℓ′_i(y_i, Φ(x_i)⊤θ) Φ(x_i)
  – Only need to store n real numbers
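A minimal sketch of the SAG iteration for ℓ₂-regularized logistic regression; storing the scalar ℓ′_i for each example is enough to maintain the gradient sum, as the slide notes. The data, step size, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, mu = 1000, 10, 1e-2
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sign(X @ rng.normal(size=d))

theta = np.zeros(d)
scalars = np.zeros(n)            # stored loss derivatives l'_i: n real numbers
grad_sum = np.zeros(d)           # maintains sum_i y_i^t = X^T scalars
# practical constant step size ~ 1/L; L = smoothness of one regularized f_i
gamma = 1.0 / (0.25 * (np.linalg.norm(X, axis=1) ** 2).max() + mu)

for t in range(20 * n):          # several passes over the data
    i = rng.integers(n)
    new = -y[i] / (1.0 + np.exp(y[i] * (X[i] @ theta)))  # fresh l'_i at theta
    grad_sum += (new - scalars[i]) * X[i]                # update the running sum
    scalars[i] = new
    theta -= gamma * (grad_sum / n + mu * theta)         # SAG step (+ l2 penalty)
```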