Sparse Convex Optimization Methods for Machine Learning
PhD Defense Talk, 2011/10/04
Martin Jaggi
Examiner: Emo Welzl
Co-Examiners: Bernd Gärtner, Elad Hazan, Joachim Giesen, Joachim Buhmann
Convex Optimization

The problem:  min_{x ∈ D} f(x),  over a convex domain D ⊂ R^n
The Linearized Problem

  min_{y ∈ D}  f(x) + ⟨y − x, d_x⟩

Algorithm 1: Greedy on a Compact Convex Set
  Pick an arbitrary starting point x^(0) ∈ D
  for k = 0 ... ∞ do
    Let d_x ∈ ∂f(x^(k)) be a subgradient to f at x^(k)
    Compute s := approx. arg min_{y ∈ D} ⟨y, d_x⟩
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α (s − x^(k))
  end for

Theorem: The algorithm obtains ε accuracy after O(1/ε) steps.
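A minimal runnable sketch of Algorithm 1, assuming access to a gradient oracle `grad` and a linear minimization oracle `lmo` over D (both hypothetical helper names supplied by the caller); it uses the step size α = 2/(k+2) stated above.

```python
import numpy as np

def greedy_on_convex_set(grad, lmo, x0, num_steps=100):
    """Greedy (Frank-Wolfe-style) minimization of a convex f over a compact convex set D.

    grad(x) -- returns a (sub)gradient d_x of f at x
    lmo(d)  -- returns s = (approx.) arg min_{y in D} <y, d>
    x0      -- arbitrary starting point in D
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_steps):
        d_x = grad(x)                  # subgradient at the current iterate
        s = lmo(d_x)                   # solve the linearized problem on D
        alpha = 2.0 / (k + 2)          # step size from the slide
        x = x + alpha * (s - x)        # convex combination keeps x inside D
    return x
```

The same loop covers all the domains on the following slides; only the oracle `lmo` changes.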
The Linearized Problem:  min_{y ∈ D}  f(x) + ⟨y − x, d_x⟩

                              Our Method                                  Gradient Descent
Cost per step                 approx. solve the linearized problem on D   projection back to D
Convergence                   1/k                                         1/k
Sparse / low-rank solutions   ✓ (depending on the domain)                 ✗
History & Related Work

                        Domain                                    Known Stepsize   Approx. Subproblem   Primal-Dual Guarantee
Frank & Wolfe 1956      linear inequality constraints             ✗                ✗                    ✗
Dunn 1978, 1980         general bounded convex domain             ✓                ✗                    ✗
Zhang 2003              convex hulls                              ✓                ✗                    ✗
Clarkson 2008, 2010     unit simplex                              ✓                ✓                    ✗
Hazan 2008              semidefinite matrices of bounded trace    ✓                ✓                    ✓
J. PhD Thesis           general bounded convex domain             ✓                ✓                    ✓
Sparse Approximation

  min_{x ∈ Δ_n} f(x),   D := conv({e_i | i ∈ [n]})  (the unit simplex)

  for k = 0 ... ∞ do
    Let d_x ∈ ∂f(x^(k)) be a subgradient to f at x^(k)
    Compute i := arg min_i (d_x)_i
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α (e_i − x^(k))
  end for

Corollary: The algorithm gives an ε-approximate solution of sparsity O(1/ε).  [Clarkson SODA '08]
Lower bound: Ω(1/ε)  — "coresets": sparsity as a function of the approximation quality.
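A sketch of the simplex specialization, reusing the `greedy_on_convex_set` sketch from above: the linearized problem is solved by the vertex e_i with the smallest gradient entry, so each step adds at most one new nonzero coordinate. The least-squares objective is purely an illustrative assumption.

```python
import numpy as np

def simplex_lmo(d):
    # Best vertex of the unit simplex: e_i with i = arg min_i (d_x)_i
    s = np.zeros_like(d)
    s[np.argmin(d)] = 1.0
    return s

# Illustrative problem (an assumption): min_{x in simplex} ||Ax - b||^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 50)), rng.standard_normal(20)
grad = lambda x: 2 * A.T @ (A @ x - b)

x0 = np.zeros(50); x0[0] = 1.0     # start at a vertex, so sparsity grows by at most 1 per step
x = greedy_on_convex_set(grad, simplex_lmo, x0, num_steps=30)
print("nonzeros after 30 steps:", np.count_nonzero(x))   # at most 31
```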
Applications
• Smallest enclosing ball
• Linear classifiers (such as Support Vector Machines, ℓ2-loss):  min_{x ∈ Δ_n}  x^T (K + t·1) x
• Model Predictive Control
• Mean-variance portfolio optimization:  min_{x ∈ Δ_n}  x^T C x − t · b^T x
Sparse Approximation

  min_{‖x‖₁ ≤ 1} f(x),   D := conv({±e_i | i ∈ [n]})  (the ℓ1-ball)

  for k = 0 ... ∞ do
    Let d_x ∈ ∂f(x^(k)) be a subgradient to f at x^(k)
    Compute i := arg max_i |(d_x)_i|, and let s := e_i · sign((−d_x)_i)
    Let α := 2/(k+2)
    Update x^(k+1) := x^(k) + α (s − x^(k))
  end for

Corollary: The algorithm gives an ε-approximate solution of sparsity O(1/ε).
Lower bound: Ω(1/ε)  — "coresets": sparsity as a function of the approximation quality.
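For the ℓ1-ball only the oracle changes. A sketch matching the selection rule above (the sign flip comes from s = e_i · sign((−d_x)_i)):

```python
import numpy as np

def l1_ball_lmo(d, radius=1.0):
    # Best vertex of the l1-ball of the given radius: +/- e_i for the
    # largest-magnitude gradient entry, with the sign that decreases f.
    i = np.argmax(np.abs(d))
    s = np.zeros_like(d)
    s[i] = -radius * np.sign(d[i])     # equals radius * sign((-d)_i)
    return s
```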
Applications
• ℓ1-regularized regression:  min_{‖x‖₁ ≤ t}  ‖Ax − b‖₂²   (sparse recovery)
Low Rank Approximation

  min_{X ∈ D} f(X),   D := conv({vv^T | v ∈ R^n, ‖v‖₂ = 1}) = {X ∈ Sym_{n×n} | X ⪰ 0, Tr(X) = 1}  (the spectahedron)

  for k = 0 ... ∞ do
    Let D_X ∈ ∂f(X^(k)) be a subgradient to f at X^(k)
    Let α := 2/(k+2)
    Compute v := v^(k) = ApproxEV(D_X, α C_f)
    Update X^(k+1) := X^(k) + α (vv^T − X^(k))
  end for

Corollary: The algorithm gives an ε-approximate solution of rank O(1/ε).  [Hazan LATIN '08]
Lower bound: Ω(1/ε)
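A sketch of the spectahedron oracle: the linearized problem over {X ⪰ 0, Tr(X) = 1} is minimized by the rank-one matrix vv^T, where v is an eigenvector for the smallest eigenvalue of the gradient matrix D_X. The slide's ApproxEV only needs this eigenvector approximately (e.g. a few Lanczos or power iterations); an exact dense eigensolver is used here purely for clarity.

```python
import numpy as np

def spectahedron_lmo(D_X):
    # min_{X in spectahedron} <X, D_X> is attained at vv^T, where v is an
    # eigenvector for the smallest eigenvalue of (the symmetric part of) D_X.
    S = (D_X + D_X.T) / 2              # symmetrize for numerical safety
    w, V = np.linalg.eigh(S)           # eigenvalues in ascending order
    v = V[:, 0]
    return np.outer(v, v)              # rank-one update vv^T
```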
Applications
• Trace norm regularized problems:  min_{‖X‖_* ≤ t}  f(X)   (low-rank matrix recovery)
• Max norm regularized problems
Matrix Factorizations for Recommender Systems

The Netflix challenge: 17k movies, 500k customers, 100M observed entries (≈ 1% of the matrix).
Approximate the ratings matrix by a rank-k factorization Y ≈ U V^T, with factors u^(1), ..., u^(k) and v^(1), ..., v^(k).

  min_{X ⪰ 0} f(X)   s.t. Tr(X) = t

is equivalent to

  min_{U,V}  Σ_{(i,j) ∈ Ω} (Y_ij − (UV^T)_ij)²   s.t. ‖U‖²_Fro + ‖V‖²_Fro = t

where X := [ UU^T  UV^T ; VU^T  VV^T ].

[J, Sulovský ICML 2010]
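A small numerical check of the reformulation above (dimensions are arbitrary assumptions): stacking Z = [U; V] gives X = ZZ^T, whose off-diagonal block is UV^T and whose trace is ‖U‖²_Fro + ‖V‖²_Fro.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((5, 2))        # 5 "customers", rank 2
V = rng.standard_normal((7, 2))        # 7 "movies",    rank 2
Z = np.vstack([U, V])
X = Z @ Z.T                            # X = [[UU^T, UV^T], [VU^T, VV^T]], X is PSD

assert np.allclose(X[:5, 5:], U @ V.T)                          # off-diagonal block is UV^T
assert np.isclose(np.trace(X), (U**2).sum() + (V**2).sum())     # Tr(X) = ||U||_Fro^2 + ||V||_Fro^2
```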
A Simple Alternative Optimization Duality

The problem:  min_{x ∈ D} f(x)

The dual:  ω(x) := min_{y ∈ D}  f(x) + ⟨y − x, d_x⟩

The duality gap:  g(x) := f(x) − ω(x)

Weak duality:  ω(x) ≤ f(x*) ≤ f(x′)   for all x, x′ ∈ D
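The gap is cheap to evaluate because the oracle output s already minimizes the linearization. A sketch, reusing the `lmo` convention from the earlier code:

```python
import numpy as np

def duality_gap(x, d_x, lmo):
    # g(x) = f(x) - omega(x) = max_{y in D} <x - y, d_x> = <x - s, d_x>,
    # which by weak duality upper-bounds f(x) - f(x*): a certificate for stopping.
    s = lmo(d_x)
    return float(np.vdot(x - s, d_x))
```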
Pathwise Optimization

The parameterized problem:  min_{x ∈ D} f_t(x)

(Plot: f_{t′}(x), ω_{t′}(x), and f_t(x*_t) along the parameter t)

"Better than necessary":        g_t(x) ≤ ε/2
"Continuity in the parameter":  the difference g_{t′}(x) − g_t(x) ≤ ε/2 whenever |t′ − t| ≤ ε · P_f^{-1}
"Still good enough":            g_{t′}(x) ≤ ε

Theorem: There are O(1/ε) many intervals of piecewise constant ε-approximate solutions.

[Giesen, J, Laue ESA 2010]
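A sketch of the resulting path-following loop under the stated assumptions, not the thesis' exact algorithm: `solve_to_gap(t, eps/2)` is a hypothetical solver returning an x with g_t(x) ≤ ε/2 (e.g. the greedy algorithm with the duality gap as stopping criterion), and `gap_at(x, t)` evaluates g_t(x).

```python
def approximate_path(solve_to_gap, gap_at, ts, eps):
    """Maintain an eps-approximate solution along a grid of parameter values ts."""
    path = []
    x = None
    for t in ts:
        if x is None or gap_at(x, t) > eps:    # current x no longer "good enough" at t
            x = solve_to_gap(t, eps / 2)       # re-solve "better than necessary"
        path.append((t, x))                    # x stays valid over a whole t-interval
    return path
```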
Applications
• Smallest enclosing ball of moving points
• SVMs, MKL (with 2 base kernels):  min_{x ∈ Δ_n}  x^T (K + t·1) x
• Model Predictive Control
• Robust PCA
• Mean-variance portfolio optimization:  min_{x ∈ Δ_n}  x^T C x − t · b^T x
• Recommender systems

(Plot: test accuracy along the regularization path t, ionosphere and breast-cancer datasets)
Thanks

Co-authors: Bernd Gärtner, Joachim Giesen, Soeren Laue, Marek Sulovský
3D visualization: Robert Carnecky