
Scalable Machine Learning
4. Optimization
Alex Smola, Yahoo! Research and ANU
http://alex.smola.org/teaching/berkeley2012 (Stat 260, SP 12)

Outline: basic techniques (gradient descent, Newton's method, conjugate gradient descent), constrained convex problems, online methods.


  1. Basic steps
  • Repeat until converged:
  • Map: compute function value and derivative at the given parameter
  • Reduce: aggregate the parts of the function value and derivative
  • Decide based on f(x) and f'(x) which interval to pursue
  • Send the updated parameter to all machines
  Gradient descent: given a starting point $x \in \operatorname{dom} f$,
  repeat
    1. $\Delta x := -\nabla f(x)$
    2. Line search: choose step size $t$ via exact or backtracking line search
    3. Update: $x := x + t\,\Delta x$
  until the stopping criterion is satisfied
  • Update the value in the search direction and feed it back; communicate the final value to each machine
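A minimal single-machine sketch of this loop (in a distributed setting the function/gradient evaluation would be the map and reduce steps); the function names and the quadratic test problem are illustrative:

```python
import numpy as np

def gradient_descent(f, grad, x, tol=1e-6, alpha=0.3, beta=0.5):
    """Gradient descent with backtracking line search."""
    while True:
        g = grad(x)
        if np.linalg.norm(g) < tol:                  # stopping criterion
            return x
        dx = -g                                      # descent direction
        t = 1.0
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx):
            t *= beta                                # backtrack until sufficient decrease
        x = x + t * dx                               # update in search direction

# example: minimize f(x) = 1/2 x'Qx - b'x
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x_opt = gradient_descent(lambda x: 0.5 * x @ Q @ x - b @ x,
                         lambda x: Q @ x - b,
                         x=np.zeros(2))
```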

  2. Scalability analysis
  • Linear time in the number of instances
  • Linear storage in the problem size, not the data size
  • Logarithmic in the accuracy: $O(\log\frac{1}{\epsilon})$ iterations
  • 'Perfect' scalability, but:
  • Tens of passes through the dataset per iteration (the line search is very expensive)
  • MapReduce loses state at each iteration
  • The single master is a bottleneck (important if the state space is several GB)

  3. A Better Algorithm
  • Avoid the line search; it is not used in the convergence proof anyway
  • Simply pick the update $x \leftarrow x - \frac{1}{M}\,\partial_x f(x)$, where $M$ upper-bounds the curvature ($m \preceq \partial_x^2 f(x) \preceq M$)
  • Only a single pass through the data per iteration
  • Only a single MapReduce pass per iteration
  • Logarithmic iteration bound (as before): $O\!\left(\frac{M}{m}\log\frac{f(x) - f(x^*)}{\epsilon}\right)$ iterations
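A sketch of the line-search-free update, assuming the smoothness constant M is known (for a quadratic it is the largest Hessian eigenvalue; Q and b are the toy problem from above):

```python
import numpy as np

Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: Q @ x - b

M = np.linalg.eigvalsh(Q).max()     # smoothness constant: Hessian <= M * I
x = np.zeros(2)
for _ in range(200):                # O((M/m) log(1/eps)) iterations suffice
    x = x - grad(x) / M             # fixed step size 1/M, no line search
```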

  4. Newton's Method [portrait of Isaac Newton]

  5. Newton's Method
  • Convex objective function f with nonnegative second derivative: $\partial_x^2 f(x) \succeq 0$
  • Taylor expansion: $f(x + \delta) = f(x) + \langle\delta, \partial_x f(x)\rangle + \frac{1}{2}\delta^\top \partial_x^2 f(x)\,\delta + O(\delta^3)$ (gradient and Hessian)
  • Minimize the approximation and iterate until converged: $x \leftarrow x - \left[\partial_x^2 f(x)\right]^{-1}\partial_x f(x)$
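A sketch of the update, solving the Newton system instead of forming the inverse explicitly; the smooth convex test objective is made up for illustration:

```python
import numpy as np

def newton(grad, hess, x, tol=1e-10, max_iter=50):
    """Newton's method: minimize the local quadratic model at each step."""
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        x = x + np.linalg.solve(hess(x), -g)   # solve H dx = -g
    return x

# example: f(x) = 1/2 x'Qx - b'x + log(1 + exp(c'x))
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
c = np.array([0.5, -0.25])

def grad(x):
    s = 1.0 / (1.0 + np.exp(-(c @ x)))         # sigmoid
    return Q @ x - b + s * c

def hess(x):
    s = 1.0 / (1.0 + np.exp(-(c @ x)))
    return Q + s * (1.0 - s) * np.outer(c, c)

x_opt = newton(grad, hess, np.zeros(2))
```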

  6. Convergence Analysis
  • If f is twice continuously differentiable, there exists a region around the optimum where Newton's method converges quadratically
  • For some region around $x^*$ the gradient is well approximated by its Taylor expansion:
  $\left\|\partial_x f(x^*) - \partial_x f(x) - \langle x^* - x, \partial_x^2 f(x)\rangle\right\| \le \gamma\,\|x^* - x\|^2$
  • Expand the Newton update:
  $\|x_{n+1} - x^*\| = \left\|x_n - x^* - \left[\partial_x^2 f(x_n)\right]^{-1}\left[\partial_x f(x_n) - \partial_x f(x^*)\right]\right\|$
  $= \left\|\left[\partial_x^2 f(x_n)\right]^{-1}\left[\partial_x^2 f(x_n)\,(x_n - x^*) - \partial_x f(x_n) + \partial_x f(x^*)\right]\right\|$
  $\le \gamma\,\left\|\left[\partial_x^2 f(x_n)\right]^{-1}\right\|\,\|x_n - x^*\|^2$

  7. Convergence Analysis
  • Two convergence regimes:
  • As slow as gradient descent outside the region where the Taylor expansion is good, i.e. where $\|\partial_x f(x^*) - \partial_x f(x) - \langle x^* - x, \partial_x^2 f(x)\rangle\| \le \gamma\,\|x^* - x\|^2$ holds
  • Quadratic convergence once the bound holds: $\|x_{n+1} - x^*\| \le \gamma\,\|[\partial_x^2 f(x_n)]^{-1}\|\,\|x_n - x^*\|^2$
  • Newton's method is affine invariant (proof by the chain rule)
  • See Boyd and Vandenberghe, Chapter 9.5, for much more

  8. Newton's method rescales space
  [figure: iterates $x^{(0)}, x^{(1)}, x^{(2)}$ under the wrong metric; from Boyd & Vandenberghe]

  9. Newton's method rescales space
  [figure: locally adaptive metric; steepest descent step $x + \Delta x_{\mathrm{nsd}}$ vs. Newton step $x + \Delta x_{\mathrm{nt}}$; from Boyd & Vandenberghe]

  10. Parallel Newton Method
  • Good rate of convergence; few passes through the data needed
  • Parallel aggregation of gradient and Hessian
  • The gradient requires $O(d)$ data, the Hessian $O(d^2)$ data
  • The update step is $O(d^3)$ and nontrivial to parallelize
  • Use it only for low-dimensional problems

  11. Conjugate Gradient Descent

  12. Key Idea
  • Minimizing the quadratic function ($K \succeq 0$) $f(x) = \frac{1}{2}x^\top K x - l^\top x + c$ takes cubic time (e.g. Cholesky factorization)
  • Instead, use only matrix-vector products and orthogonalization
  • Vectors $x, x'$ are K-orthogonal if $x^\top K x' = 0$
  • m mutually K-orthogonal vectors $x_i \in \mathbb{R}^m$:
  • form a basis
  • allow the expansion $z = \sum_{i=1}^m \frac{x_i^\top K z}{x_i^\top K x_i}\,x_i$
  • solve the linear system $y = Kz$ via $z = \sum_{i=1}^m \frac{x_i^\top y}{x_i^\top K x_i}\,x_i$

  13. Proof
  • Claims: m mutually K-orthogonal vectors $x_i \in \mathbb{R}^m$ form a basis, allow the expansion $z = \sum_i \frac{x_i^\top K z}{x_i^\top K x_i}\,x_i$, and solve the linear system $z = \sum_i \frac{x_i^\top y}{x_i^\top K x_i}\,x_i$ for $y = Kz$
  • Linear independence by contradiction: if $\sum_i \alpha_i x_i = 0$ then $0 = x_j^\top K \sum_i \alpha_i x_i = x_j^\top K x_j\,\alpha_j$, hence $\alpha_j = 0$ for every j
  • Reconstruction: expand z in the basis, $z = \sum_i \alpha_i x_i$; then $x_j^\top K z = x_j^\top K \sum_i \alpha_i x_i = x_j^\top K x_j\,\alpha_j$, which gives the expansion coefficients
  • For the linear system, plug in $y = Kz$

  14. ???
  • We need vectors $x_i$; we need to orthogonalize them; how do we select them?
  • K-orthogonal vectors whiten the space, since $f(x) = \frac{1}{2}x^\top x - l^\top x + c$ has the trivial solution $x = l$

  15. Conjugate Gradient Descent
  • Gradient computation: $f(x) = \frac{1}{2}x^\top K x - l^\top x + c$, hence $g(x) = Kx - l$
  • Algorithm:
    initialize $x_0$ and $v_0 = g_0 = Kx_0 - l$ and $i = 0$
    repeat
      $x_{i+1} = x_i - \frac{g_i^\top v_i}{v_i^\top K v_i}\,v_i$   (deflation step)
      $g_{i+1} = K x_{i+1} - l$
      $v_{i+1} = -g_{i+1} + \frac{g_{i+1}^\top K v_i}{v_i^\top K v_i}\,v_i$   (keeps the directions K-orthogonal)
      $i \leftarrow i + 1$
    until $g_i = 0$
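A direct transcription of the iteration above, assuming K is symmetric positive definite:

```python
import numpy as np

def conjugate_gradient(K, l, x, tol=1e-10):
    """Minimize f(x) = 1/2 x'Kx - l'x by conjugate gradient descent."""
    g = K @ x - l                       # gradient
    v = g.copy()                        # initial search direction
    while np.linalg.norm(g) > tol:
        Kv = K @ v
        x = x - (g @ v) / (v @ Kv) * v              # deflation step
        g = K @ x - l
        v = -g + (g @ Kv) / (v @ Kv) * v            # keep v K-orthogonal
    return x

K = np.array([[4.0, 1.0], [1.0, 3.0]])
l = np.array([1.0, 2.0])
x = conjugate_gradient(K, l, np.zeros(2))           # solves Kx = l
```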

  16. Proof - Deflation property
  • First assume that the $v_j$ are K-orthogonal and show that $x_{i+1}$ is optimal in the span of $\{v_1, \dots, v_i\}$
  • It is enough to show that $v_j^\top g_{i+1} = 0$ for all $j \le i$
  • For $j = i$, expand:
  $v_i^\top g_{i+1} = v_i^\top\left[K x_i - l - \frac{g_i^\top v_i}{v_i^\top K v_i}\,K v_i\right] = v_i^\top g_i - \frac{g_i^\top v_i}{v_i^\top K v_i}\,v_i^\top K v_i = 0$
  • For smaller j this is a consequence of K-orthogonality

  17. Proof - K orthogonality
  • Need to check that $v_{i+1}$ is K-orthogonal to all $v_j$ (the rest is automatically true by construction):
  $v_j^\top K v_{i+1} = -v_j^\top K g_{i+1} + \frac{g_{i+1}^\top K v_i}{v_i^\top K v_i}\,v_j^\top K v_i$
  • For $j < i$ the first term vanishes by the deflation property and the second by K-orthogonality; for $j = i$ the two terms cancel by construction

  18. Properties
  • A subspace expansion method: after i steps the iterate is optimal on the Krylov subspace spanned by $(g, Kg, K^2 g, K^3 g, \dots)$
  • Focuses on the leading eigenvalues
  • Often a few steps are sufficient (whenever the eigenvalues decay rapidly)

  19. Extensions
  • Generic method: compute the Hessian $K_i := f''(x_i)$ at each iteration and update x and v with
  $\alpha_i = -\frac{g_i^\top v_i}{v_i^\top K_i v_i}, \qquad \beta_i = \frac{g_{i+1}^\top K_i v_i}{v_i^\top K_i v_i}$
  This requires calculation of the Hessian at each iteration.
  • Fletcher-Reeves [163]: find $\alpha_i$ via a line search, $\alpha_i = \operatorname{argmin}_\alpha f(x_i + \alpha v_i)$, and use $\beta_i = \frac{g_{i+1}^\top g_{i+1}}{g_i^\top g_i}$ (Theorem 6.20(iii))
  • Polak-Ribiere [398]: find $\alpha_i$ via a line search and use $\beta_i = \frac{(g_{i+1} - g_i)^\top g_{i+1}}{g_i^\top g_i}$
  • Experimentally, Polak-Ribiere tends to work better than Fletcher-Reeves.
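A sketch of the Polak-Ribiere variant with a backtracking line search standing in for the exact one; the clipping of beta at zero (PR+) and the descent-direction restart are common safeguards not stated on the slide:

```python
import numpy as np

def nonlinear_cg_pr(f, grad, x, tol=1e-8, max_iter=500):
    """Nonlinear CG with the Polak-Ribiere update and a backtracking line search."""
    g = grad(x)
    v = -g                                        # start with steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if g @ v >= 0:                            # not a descent direction: restart
            v = -g
        t = 1.0
        while f(x + t * v) > f(x) + 1e-4 * t * (g @ v):
            t *= 0.5                              # backtrack along v
        x = x + t * v
        g_new = grad(x)
        beta = max(0.0, (g_new - g) @ g_new / (g @ g))   # PR+: clip at 0
        v = -g_new + beta * v
        g = g_new
    return x
```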

  20. BFGS Algorithm (Broyden-Fletcher-Goldfarb-Shanno)

  21. Basic Idea
  • Newton-like method to compute the descent direction: $\delta_i = B_i^{-1}\,\partial_x f(x_{i-1})$
  • Line search on f in this direction: $x_{i+1} = x_i - \alpha_i \delta_i$
  • Update B with a rank-2 matrix: $B_{i+1} = B_i + u_i u_i^\top + v_i v_i^\top$
  • Require that the quasi-Newton condition holds: $B_{i+1}(x_{i+1} - x_i) = \partial_x f(x_{i+1}) - \partial_x f(x_i)$
  • This yields $B_{i+1} = B_i + \frac{g_i g_i^\top}{\alpha_i\,\delta_i^\top g_i} - \frac{B_i \delta_i \delta_i^\top B_i}{\delta_i^\top B_i \delta_i}$, where $g_i$ denotes the gradient difference from the quasi-Newton condition

  22. Properties
  • Simple rank-2 update for B
  • Use the matrix inversion lemma to update the inverse directly
  • Memory-limited versions: L-BFGS
  • Use a toolbox if possible (TAO, MATLAB); a self-made implementation is typically slower
  • Works well for nonlinear nonconvex objectives (often even for nonsmooth objectives)
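In Python the corresponding toolbox route is SciPy's L-BFGS implementation (the slide names TAO and MATLAB); a minimal example on the nonconvex Rosenbrock function:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
    200 * (x[1] - x[0]**2),
])

res = minimize(f, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print(res.x)   # close to the optimum [1, 1]
```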

  23. 4.2 Constrained Convex Problems

  24. Basic Convexity

  25. Constrained Convex Minimization
  • Optimization problem: $\operatorname{minimize}_x\; f(x)$ subject to $c_i(x) \le 0$ for all i
  • Common constraints:
  • linear inequality constraints $\langle w_i, x\rangle + b_i \le 0$
  • quadratic cone constraints $x^\top Q x + b^\top x \le c$ with $Q \succeq 0$
  • semidefinite constraints $M \succeq 0$ or $M_0 + \sum_i x_i M_i \succeq 0$

  26. Constrained Convex Minimization (continued)
  • Equality constraints are a special case. Why? $c_i(x) = 0$ is equivalent to the pair of inequalities $c_i(x) \le 0$ and $-c_i(x) \le 0$.

  27. Example - Support Vectors
  [figure: points with $y_i = +1$ and $y_i = -1$ separated by the hyperplanes $\{x \mid \langle w, x\rangle + b = +1\}$, $\{x \mid \langle w, x\rangle + b = 0\}$ and $\{x \mid \langle w, x\rangle + b = -1\}$]
  • Note: $\langle w, x_1\rangle + b = +1$ and $\langle w, x_2\rangle + b = -1$, hence $\langle w, x_1 - x_2\rangle = 2$ and therefore $\left\langle \frac{w}{\|w\|},\, x_1 - x_2 \right\rangle = \frac{2}{\|w\|}$, which is the margin
  • Optimization problem: $\operatorname{minimize}_{w,b}\; \frac{1}{2}\|w\|^2$ subject to $y_i[\langle w, x_i\rangle + b] \ge 1$

  28. Lagrange Multipliers
  • Lagrange function: $L(x, \alpha) := f(x) + \sum_{i=1}^n \alpha_i c_i(x)$ where $\alpha_i \ge 0$
  • Saddle point condition: if there are $x^*$ and nonnegative $\alpha^*$ such that $L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$ for all x and all $\alpha \ge 0$, then $x^*$ is an optimal solution to the constrained optimization problem

  29. Proof
  $L(x^*, \alpha) \le L(x^*, \alpha^*) \le L(x, \alpha^*)$
  • From the first inequality, $(\alpha_i - \alpha_i^*)\,c_i(x^*) \le 0$ for all $\alpha_i \ge 0$; letting $\alpha_i$ grow shows $c_i(x^*) \le 0$, i.e. $x^*$ is feasible
  • Setting some $\alpha_i = 0$ yields the KKT conditions $\alpha_i^* c_i(x^*) = 0$
  • Consequently $L(x^*, \alpha^*) = f(x^*) \le L(x, \alpha^*) = f(x) + \sum_i \alpha_i^* c_i(x) \le f(x)$ for all feasible x, which proves optimality

  30. Constraint gymnastics (all three conditions are equivalent)
  • Slater's condition: there exists some x such that $c_i(x) < 0$ for all i
  • Karlin's condition: for all nonnegative $\alpha$ there exists some x such that $\sum_i \alpha_i c_i(x) \le 0$
  • Strict constraint qualification: the feasible region contains at least two distinct elements, and there exists an x in X such that all $c_i(x)$ are strictly convex at x with respect to X

  31. Necessary Kuhn-Tucker Conditions
  • Assume the optimization problem satisfies the constraint qualifications and has a convex differentiable objective and constraints
  • Then the KKT conditions are necessary and sufficient:
  $\partial_x L(x^*, \alpha^*) = \partial_x f(x^*) + \sum_i \alpha_i^* \partial_x c_i(x^*) = 0$ (saddle point in $x^*$)
  $\partial_{\alpha_i} L(x^*, \alpha^*) = c_i(x^*) \le 0$ (saddle point in $\alpha^*$)
  $\sum_i \alpha_i^* c_i(x^*) = 0$ (vanishing KKT gap)
  • This yields an algorithm for solving optimization problems: solve for the saddle point and the KKT conditions

  32. Proof
  $f(x) - f(x^*) \ge [\partial_x f(x^*)]^\top (x - x^*)$ (by convexity)
  $= -\sum_i \alpha_i^* [\partial_x c_i(x^*)]^\top (x - x^*)$ (by the saddle point in $x^*$)
  $\ge -\sum_i \alpha_i^* (c_i(x) - c_i(x^*))$ (by convexity)
  $= -\sum_i \alpha_i^* c_i(x)$ (by the vanishing KKT gap)
  $\ge 0$ (for feasible x)

  33. Linear and Quadratic Programs

  34. Linear Programs
  • Objective: $\operatorname{minimize}_x\; c^\top x$ subject to $Ax + d \le 0$
  • Lagrange function: $L(x, \alpha) = c^\top x + \alpha^\top(Ax + d)$
  • Optimality conditions:
  $\partial_x L(x, \alpha) = A^\top \alpha + c = 0$
  $\partial_\alpha L(x, \alpha) = Ax + d \le 0$
  $0 = \alpha^\top(Ax + d)$ and $\alpha \ge 0$
  • Plugging the conditions back into L yields the dual problem: $\operatorname{maximize}_\alpha\; d^\top \alpha$ subject to $A^\top\alpha + c = 0$ and $\alpha \ge 0$
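To see the primal-dual pair concretely, a small made-up instance solved with SciPy (linprog expects the form Ax ≤ b, so b = −d, and defaults to x ≥ 0, so the variables are declared free):

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
d = np.array([0.0, 0.0, -4.0])                      # constraints: Ax + d <= 0

# primal: min c'x  s.t.  Ax <= -d
primal = linprog(c, A_ub=A, b_ub=-d, bounds=[(None, None)] * 2)

# dual: max d'a  s.t.  A'a + c = 0, a >= 0  (negate since linprog minimizes)
dual = linprog(-d, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)

print(primal.fun, -dual.fun)                        # equal by strong duality: -8.0
```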


  37. Linear Programs
  • Primal: $\operatorname{minimize}_x\; c^\top x$ subject to $Ax + d \le 0$
  • Dual: $\operatorname{maximize}_\alpha\; d^\top\alpha$ subject to $A^\top\alpha + c = 0$ and $\alpha \ge 0$
  • Free variables become equality constraints
  • Equality constraints become free variables
  • Inequalities become inequalities
  • The dual of the dual is the primal

  38. Quadratic Programs
  • Objective: $\operatorname{minimize}_x\; \frac{1}{2}x^\top Q x + c^\top x$ subject to $Ax + d \le 0$
  • Lagrange function: $L(x, \alpha) = \frac{1}{2}x^\top Q x + c^\top x + \alpha^\top(Ax + d)$
  • Optimality conditions (to be plugged back into L):
  $\partial_x L(x, \alpha) = Qx + A^\top\alpha + c = 0$
  $\partial_\alpha L(x, \alpha) = Ax + d \le 0$
  $0 = \alpha^\top(Ax + d)$ and $\alpha \ge 0$

  39. Quadratic Program
  • Eliminate x from the Lagrangian via $Qx + A^\top\alpha + c = 0$:
  $L(x, \alpha) = \frac{1}{2}x^\top Q x + c^\top x + \alpha^\top(Ax + d)$
  $= -\frac{1}{2}x^\top Q x + \alpha^\top d$
  $= -\frac{1}{2}(A^\top\alpha + c)^\top Q^{-1}(A^\top\alpha + c) + \alpha^\top d$
  $= -\frac{1}{2}\alpha^\top A Q^{-1} A^\top\alpha + \alpha^\top\left[d - A Q^{-1} c\right] - \frac{1}{2}c^\top Q^{-1} c$
  • This, subject to $\alpha \ge 0$, is the dual


  41. Quadratic Programs
  • Primal: $\operatorname{minimize}_x\; \frac{1}{2}x^\top Q x + c^\top x$ subject to $Ax + d \le 0$
  • Dual: $\operatorname{minimize}_\alpha\; \frac{1}{2}\alpha^\top A Q^{-1} A^\top \alpha + \alpha^\top\left[A Q^{-1} c - d\right]$ subject to $\alpha \ge 0$
  • The dual constraints are simpler
  • Possibly many fewer variables
  • The dual of the dual is not (always) the primal (e.g. in SVMs x lives in a Hilbert space)
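A numeric sanity check of this primal-dual pair with CVXOPT (listed among the solvers below), which solves min ½x'Px + q'x subject to Gx ≤ h; the instance is made up:

```python
import numpy as np
from cvxopt import matrix, solvers

Q = np.array([[2.0, 0.0], [0.0, 2.0]])
c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]])                    # one constraint: x1 + x2 - 2 <= 0
d = np.array([-2.0])

# primal: min 1/2 x'Qx + c'x  s.t.  Ax <= -d
primal = solvers.qp(matrix(Q), matrix(c), matrix(A), matrix(-d))

# dual: min 1/2 a'(A Q^-1 A')a + a'(A Q^-1 c - d)  s.t.  a >= 0
Qi = np.linalg.inv(Q)
dual = solvers.qp(matrix(A @ Qi @ A.T), matrix(A @ Qi @ c - d),
                  matrix(-np.eye(1)), matrix(np.zeros(1)))

a = np.array(dual['x']).ravel()
x = -Qi @ (A.T @ a + c)                       # recover the primal solution: [0.25, 1.75]
```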

  42. Interior Point Solvers

  43. Constrained Newton Method
  • Objective: $\operatorname{minimize}_x\; f(x)$ subject to $Ax = b$
  • Lagrange function and optimality conditions: $L(x, \alpha) = f(x) + \alpha^\top[Ax - b]$ yields
  $\partial_x L(x, \alpha) = \partial_x f(x) + A^\top\alpha = 0$
  $\partial_\alpha L(x, \alpha) = Ax - b = 0$
  • Taylor expansion of the gradient: $\partial_x f(x) = \partial_x f(x_0) + \partial_x^2 f(x_0)\,[x - x_0] + O(\|x - x_0\|^2)$
  • Plug this back into the optimality conditions and solve:
  $\begin{bmatrix} \partial_x^2 f(x_0) & A^\top \\ A & 0 \end{bmatrix} \begin{bmatrix} x \\ \alpha \end{bmatrix} = \begin{bmatrix} \partial_x^2 f(x_0)\,x_0 - \partial_x f(x_0) \\ b \end{bmatrix}$
  • No need to be initially feasible!
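One equality-constrained Newton step written out with NumPy (a sketch; H and g are the Hessian and gradient at the current iterate x0):

```python
import numpy as np

def newton_step_eq(H, g, A, b, x0):
    """One constrained Newton step: solve the KKT system for (x, alpha)."""
    n, m = H.shape[0], A.shape[0]
    KKT = np.block([[H, A.T],
                    [A, np.zeros((m, m))]])
    rhs = np.concatenate([H @ x0 - g, b])
    sol = np.linalg.solve(KKT, rhs)
    return sol[:n], sol[n:]                     # new iterate x, multipliers alpha

# example: minimize 1/2 x'x subject to x1 + x2 = 1 (quadratic, so one step is exact)
x0 = np.array([3.0, -1.0])                      # need not be feasible
g0 = x0.copy()                                  # gradient of 1/2 x'x at x0
x, alpha = newton_step_eq(np.eye(2), g0, np.array([[1.0, 1.0]]), np.array([1.0]), x0)
# x == [0.5, 0.5]
```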

  44. General Strategy
  • Optimality conditions:
  $\partial_x L(x^*, \alpha^*) = \partial_x f(x^*) + \sum_i \alpha_i^* \partial_x c_i(x^*) = 0$ (saddle point in $x^*$)
  $\partial_{\alpha_i} L(x^*, \alpha^*) = c_i(x^*) \le 0$ (saddle point in $\alpha^*$)
  $\sum_i \alpha_i^* c_i(x^*) = 0$ (vanishing KKT gap)
  • Solve these equations repeatedly
  • Yields primal and dual solution variables
  • Yields the size of the primal/dual gap
  • Feasibility is not necessary at the start
  • The KKT conditions are problematic and need approximation

  45. Quadratic Programs
  • Optimality conditions with slack variables $\xi$:
  $Qx + A^\top\alpha + c = 0$
  $Ax + d + \xi = 0$
  $\alpha_i \xi_i = 0$ and $\alpha, \xi \ge 0$
  • Relax the KKT conditions: $\alpha_i \xi_i = 0$ is relaxed to $\alpha_i \xi_i = \mu$
  • Solve a linearization of the nonlinear system:
  $\begin{bmatrix} Q & A^\top \\ A & -D \end{bmatrix} \begin{bmatrix} \delta_x \\ \delta_\alpha \end{bmatrix} = \begin{bmatrix} c_x \\ c_\alpha \end{bmatrix}$
  • Predictor/corrector step for the nonlinearity
  • Iterate until converged

  46. Implementation details
  • The dominant cost is solving the reduced KKT system $\begin{bmatrix} Q & A^\top \\ A & -D \end{bmatrix} \begin{bmatrix} \delta_x \\ \delta_\alpha \end{bmatrix} = \begin{bmatrix} c_x \\ c_\alpha \end{bmatrix}$, a linear system with (dense) Q and A
  • The linear system is solved twice per iteration (predictor / corrector)
  • Update steps are only taken far enough to keep the dual and slack variables nonnegative
  • The KKT constraints are tightened by decreasing $\mu$
  • Typically only 10-20 iterations are needed

  47. Solver Software (nontrivial to parallelize)
  • OOQP: object-oriented quadratic programming solver. http://pages.cs.wisc.edu/~swright/ooqp/
  • LOQO: interior point path-following solver. http://www.princeton.edu/~rvdb/loqo/LOQO.html
  • HOPDM: linear and nonlinear infeasible IP solver. http://www.maths.ed.ac.uk/~gondzio/software/hopdm.html
  • CVXOPT: Python package for convex optimization. http://abel.ee.ucla.edu/cvxopt/
  • SeDuMi: semidefinite programming solver. http://sedumi.ie.lehigh.edu/


  49. Bundle Methods (simple parallelization)

  50. Some optimization problems
  • Density estimation: $\operatorname{minimize}_\theta\; -\sum_{i=1}^m \log p(x_i \mid \theta) - \log p(\theta)$, or equivalently $\operatorname{minimize}_\theta\; \sum_{i=1}^m \left[g(\theta) - \langle\phi(x_i), \theta\rangle\right] + \frac{1}{2\sigma^2}\|\theta\|^2$
  • Penalized regression: $\operatorname{minimize}_\theta\; \sum_{i=1}^m l(y_i - \langle\phi(x_i), \theta\rangle) + \frac{1}{2\sigma^2}\|\theta\|^2$ with e.g. the squared loss and the norm as regularizer

  51. Basic Idea
  $\operatorname{minimize}_\theta\; \sum_{i=1}^m l_i(\theta) + \lambda\,\Omega[\theta]$
  • Loss: convex but expensive to compute; a line search is just as expensive as a fresh computation; the gradient is almost free once the function value has been computed; easy to compute in parallel
  • Regularizer: convex and cheap to compute and to optimize
  • Strategy: compute tangents on the loss; these provide a lower bound on the objective; solve the dual optimization problem (fewer parameters)

  52. Bundle Method
  [figure: piecewise linear lower bound (bundle of tangents) on the empirical risk]

  53. Lower bound
  • Regularized risk minimization: $\operatorname{minimize}_w\; R_{\mathrm{emp}}[w] + \lambda\,\Omega[w]$
  • Taylor approximation for $R_{\mathrm{emp}}[w]$:
  $R_{\mathrm{emp}}[w] \ge R_{\mathrm{emp}}[w_t] + \langle w - w_t, \partial_w R_{\mathrm{emp}}[w_t]\rangle = \langle a_t, w\rangle + b_t$
  where $a_t = \partial_w R_{\mathrm{emp}}[w_{t-1}]$ and $b_t = R_{\mathrm{emp}}[w_{t-1}] - \langle a_t, w_{t-1}\rangle$
  • Bundle bound: $R_{\mathrm{emp}}[w] \ge R_t[w] := \max_{i \le t}\, \langle a_i, w\rangle + b_i$
  • The regularizer $\Omega[w]$ solves stability problems

  54. Pseudocode
  Initialize $t = 0$, $w_0 = 0$, $a_0 = 0$, $b_0 = 0$
  repeat
    Find the minimizer $w_t := \operatorname{argmin}_w\, R_t(w) + \lambda\,\Omega[w]$
    Compute gradient $a_{t+1}$ and offset $b_{t+1}$
    Increment $t \leftarrow t + 1$
  until $\epsilon_t \le \epsilon$
  • Convergence monitor: $\epsilon_t = R_{t+1}[w_t] - R_t[w_t]$
  • Since $R_{t+1}[w_t] = R_{\mathrm{emp}}[w_t]$ (Taylor approximation), we have
  $R_{t+1}[w_t] + \lambda\,\Omega[w_t] \ge \min_w\, R_{\mathrm{emp}}[w] + \lambda\,\Omega[w] \ge R_t[w_t] + \lambda\,\Omega[w_t]$
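A compact sketch of this loop for Ω[w] = ½‖w‖², solving the small dual QP from the next slide with SciPy (illustrative, not tuned; `risk` and `risk_grad` are the user-supplied empirical risk and its (sub)gradient):

```python
import numpy as np
from scipy.optimize import minimize

def bundle_method(risk, risk_grad, dim, lam=1.0, eps=1e-6, max_iter=100):
    """Bundle method for min_w R_emp(w) + lam/2 ||w||^2 (a sketch)."""
    w = np.zeros(dim)
    A, b = [], []                                   # tangents: R_emp(w) >= a_i'w + b_i
    for _ in range(max_iter):
        a = risk_grad(w)
        A.append(a)
        b.append(risk(w) - a @ w)
        An, bn = np.array(A), np.array(b)
        n = len(b)
        # dual QP: min 1/(2 lam) beta'AA'beta - beta'b over the simplex
        obj = lambda beta: beta @ (An @ An.T) @ beta / (2 * lam) - beta @ bn
        res = minimize(obj, np.ones(n) / n, bounds=[(0, 1)] * n,
                       constraints={'type': 'eq', 'fun': lambda beta: beta.sum() - 1})
        w = -An.T @ res.x / lam                     # recover the primal from the dual
        gap = risk(w) - np.max(An @ w + bn)         # convergence monitor eps_t
        if gap <= eps:
            break
    return w
```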

  55. Dual Problem
  • Good news: the dual optimization problem for $\Omega[w] = \frac{1}{2}\|w\|^2$ is a quadratic program regardless of the choice of the empirical risk $R_{\mathrm{emp}}[w]$
  • Details: $\operatorname{minimize}_\beta\; \frac{1}{2\lambda}\beta^\top A A^\top\beta - \beta^\top b$ subject to $\beta_i \ge 0$ and $\|\beta\|_1 = 1$; the primal coefficient is given by $w = -\lambda^{-1} A^\top \beta$
  • General result: use the Fenchel-Legendre dual of $\Omega[w]$, e.g. $\|\cdot\|_1 \to \|\cdot\|_\infty$
  • Very cheap variant: one can even use a simple line search for the update (almost as good)

  56. Properties
  • Parallelization: the empirical risk is a sum of many terms: MapReduce. The gradient is also a sum of many terms, gathered from the cluster. Possible even for multivariate performance scores. Data stays local; one can combine data from competing entities.
  • Solver independent of the loss: no need to change the solver for a new loss
  • Loss independent of solver/regularizer: add a new regularizer without re-implementing the loss
  • Line search variant: the optimization does not require a QP solver at all; update along the gradient direction in the dual. We only need inner products of gradients.

  57. Implementation
  [diagram: empirical risk evaluated in parallel on the data shards, aggregated by reducers, and fed to the bundle solver]

  58. Guarantees
  • Theorem: the number of iterations to reach $\epsilon$ precision is bounded by
  $n \le \log_2\frac{\lambda\,R_{\mathrm{emp}}[0]}{G^2} + \frac{8 G^2}{\lambda\epsilon} - 4$ steps
  • If the Hessian of $R_{\mathrm{emp}}[w]$ is bounded by $H^*$, convergence to any $\epsilon \le \lambda/2$ takes at most
  $n \le \log_2\frac{\lambda\,R_{\mathrm{emp}}[0]}{G^2} + \max\left[0,\, 1 - \log_2\frac{8 G^2 H^*}{\lambda}\right] + \frac{4 H^*}{\lambda}\log_2\frac{4 G^2}{\epsilon} - 4$ steps
  • Advantages: linear convergence for smooth loss; for non-smooth loss almost as good in practice (as long as it is smooth on a coarse scale); does not require a primal line search

  59. Proof idea
  • Duality argument: the dual of $R_i[w] + \lambda\Omega[w]$ lower-bounds the minimum of the regularized risk $R_{\mathrm{emp}}[w] + \lambda\Omega[w]$, while $R_{i+1}[w_i] + \lambda\Omega[w_i]$ is an upper bound. Show that the gap $\gamma_i := R_{i+1}[w_i] - R_i[w_i]$ vanishes.
  • Dual improvement: lower-bound the increase in the dual problem in terms of $\gamma_i$ and the subgradient $\partial_w[R_{\mathrm{emp}}[w] + \lambda\Omega[w]]$. For an unbounded Hessian we have $\delta\gamma = O(\gamma^2)$; for a bounded Hessian, $\delta\gamma = O(\gamma)$.
  • Convergence: solve the difference equation in $\gamma_t$ to get the desired result.

  60. More
  • Dual decomposition methods: for optimization problems with many constraints, replicate variables and add equality constraints, solve the relaxed problem, and run gradient descent in the dual variables
  • Prox operators: for problems with a smooth plus a nonsmooth part in the objective; a generalization of Bregman projections

  61. 4.3 Online Methods

  62. The Perceptron

  63. The Perceptron
  [figure: classifying mail into ham and spam with a linear separator]


  65. The Perceptron
  initialize $w = 0$ and $b = 0$
  repeat
    if $y_i[\langle w, x_i\rangle + b] \le 0$ then
      $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
    end if
  until all points are classified correctly
  • Nothing happens if a point is classified correctly
  • The weight vector is a linear combination of the misclassified points: $w = \sum_{i \in I} y_i x_i$
  • The classifier is a linear combination of inner products: $f(x) = \sum_{i \in I} y_i\langle x_i, x\rangle + b$
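The loop above as a runnable sketch on a made-up, linearly separable toy dataset:

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train a perceptron: update only on mistakes, so w is a combination of errors."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:     # misclassified (or on the boundary)
                w += yi * xi               # w <- w + y_i x_i
                b += yi                    # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                  # all classified correctly
            break
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
```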

  66. Convergence Theorem
  • If there exists some $(w^*, b^*)$ with $\|w^*\| = 1$ such that $y_i[\langle x_i, w^*\rangle + b^*] \ge \rho$ for all i, then the perceptron converges to a linear separator after a number of steps bounded by
  $\left((b^*)^2 + 1\right)\left(r^2 + 1\right)\rho^{-2}$ where $\|x_i\| \le r$
  • Dimensionality independent
  • Order independent (i.e. it also holds in the worst case)
  • Scales with the 'difficulty' of the problem

  67. Proof
  • Starting point: we start from $w_1 = 0$ and $b_1 = 0$
  • Step 1: bound on the increase of alignment. Denote by $w_j$ the value of w at step j (analogously $b_j$), and define the alignment $\langle (w_j, b_j), (w^*, b^*)\rangle$. For an error on observation $(x_i, y_i)$ we get
  $\langle (w_{j+1}, b_{j+1}), (w^*, b^*)\rangle = \langle (w_j, b_j) + y_i(x_i, 1), (w^*, b^*)\rangle$
  $= \langle (w_j, b_j), (w^*, b^*)\rangle + y_i\langle (x_i, 1), (w^*, b^*)\rangle$
  $\ge \langle (w_j, b_j), (w^*, b^*)\rangle + \rho \ge j\rho$
  • The alignment increases with the number of errors

  68. Proof
  • Step 2: Cauchy-Schwarz for the inner product
  $\langle (w_{j+1}, b_{j+1}), (w^*, b^*)\rangle \le \|(w_{j+1}, b_{j+1})\|\,\|(w^*, b^*)\| = \|(w_{j+1}, b_{j+1})\|\sqrt{1 + (b^*)^2}$
  • Step 3: upper bound on $\|(w_j, b_j)\|$. If we make a mistake,
  $\|(w_{j+1}, b_{j+1})\|^2 = \|(w_j, b_j) + y_i(x_i, 1)\|^2 = \|(w_j, b_j)\|^2 + 2 y_i\langle (x_i, 1), (w_j, b_j)\rangle + \|(x_i, 1)\|^2$
  $\le \|(w_j, b_j)\|^2 + \|(x_i, 1)\|^2 \le j(r^2 + 1)$
  • Step 4: combining the first three steps,
  $j\rho \le \sqrt{1 + (b^*)^2}\,\|(w_{j+1}, b_{j+1})\| \le \sqrt{j(r^2 + 1)((b^*)^2 + 1)}$
  Solving for j proves the theorem.

  69. Consequences
  • Only the errors need to be stored; this gives a compression bound for the perceptron
  • It is stochastic gradient descent on the hinge loss $l(x_i, y_i, w, b) = \max(0, 1 - y_i[\langle w, x_i\rangle + b])$
  • Fails with noisy data: do NOT train your avatar with perceptrons (cf. the game Black & White)
