Structured Models in Computer Vision
Part 5: Structured Support Vector Machines

Sebastian Nowozin and Christoph H. Lampert
Colorado Springs, 25th June 2011
Problem (Loss-Minimizing Parameter Learning)

Let d(x, y) be the (unknown) true data distribution. Let D = {(x^1, y^1), …, (x^N, y^N)} be i.i.d. samples from d(x, y). Let φ : X × Y → ℝ^D be a feature function. Let ∆ : Y × Y → ℝ be a loss function.

◮ Find a weight vector w* that leads to minimal expected loss

    E_{(x,y)∼d(x,y)} { ∆(y, f(x)) }    for    f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
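As a minimal illustration (not part of the original slides), the prediction rule f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩ can be implemented by brute-force enumeration whenever Y is small; `phi` and `labels` are hypothetical placeholders for a concrete feature function and output space:

```python
import numpy as np

def predict(w, x, labels, phi):
    """Return f(x) = argmax over y in labels of <w, phi(x, y)>.

    phi(x, y) must return a vector of the same dimension D as w.
    """
    scores = [np.dot(w, phi(x, y)) for y in labels]
    return labels[int(np.argmax(scores))]
```

For structured output spaces (e.g. all labelings of a graph), this enumeration is replaced by a combinatorial argmax solver, such as max-product inference.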
Pro:
◮ We directly optimize for the quantity of interest: the expected loss.
◮ No expensive-to-compute partition function Z shows up.

Con:
◮ We need to know the loss function ∆ already at training time.
◮ We can't use probabilistic reasoning to find w*.
Reminder: learning by regularized risk minimization

For the compatibility function g(x, y; w) := ⟨w, φ(x, y)⟩, find w* that minimizes

    E_{(x,y)∼d(x,y)} ∆( y, argmax_{y′∈Y} g(x, y′; w) ).

Two major problems:
◮ d(x, y) is unknown.
◮ argmax_{y′} g(x, y′; w) maps into a discrete space, so ∆( y, argmax_{y′} g(x, y′; w) ) is discontinuous and piecewise constant as a function of w.
Task:

    min_w  E_{(x,y)∼d(x,y)} ∆( y, argmax_{y′∈Y} g(x, y′; w) ).

Problem 1:
◮ d(x, y) is unknown.

Solution:
◮ Replace the expectation E_{(x,y)∼d(x,y)} {·} by the empirical estimate over the training set, (1/N) Σ_n {·}|_{(x^n, y^n)}.
◮ To avoid overfitting: add a regularizer, e.g. λ‖w‖².

New task:

    min_w  λ‖w‖² + (1/N) Σ_{n=1}^{N} ∆( y^n, argmax_{y′∈Y} g(x^n, y′; w) ).
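To make the new task concrete, here is a sketch of the regularized empirical objective for an enumerable Y, using the same placeholder helpers as before. It also makes the upcoming problem visible: the loss term only changes when some argmax flips, so apart from the regularizer the objective is piecewise constant in w:

```python
import numpy as np

def regularized_empirical_risk(w, data, labels, phi, delta, lam):
    """lam * ||w||^2 + (1/N) * sum_n Delta(y_n, argmax_y <w, phi(x_n, y)>)."""
    risk = 0.0
    for x_n, y_n in data:  # data is a list of (x, y) training pairs
        scores = [np.dot(w, phi(x_n, y)) for y in labels]
        y_pred = labels[int(np.argmax(scores))]
        risk += delta(y_n, y_pred)
    return lam * np.dot(w, w) + risk / len(data)
```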
Task:

    min_w  λ‖w‖² + (1/N) Σ_{n=1}^{N} ∆( y^n, argmax_{y′∈Y} g(x^n, y′; w) ).

Problem 2:
◮ ∆( y, argmax_{y′} g(x, y′; w) ) is discontinuous with respect to w.

Solution:
◮ Replace ∆(y, y′) by a well-behaved surrogate loss ℓ(x, y, w).
◮ Typically: ℓ is an upper bound to ∆, continuous and convex with respect to w.

New task:

    min_w  λ‖w‖² + (1/N) Σ_{n=1}^{N} ℓ(x^n, y^n, w).
Regularized Risk Minimization

    min_w  λ‖w‖² + (1/N) Σ_{n=1}^{N} ℓ(x^n, y^n, w)

    =  Regularization + Loss on training data
Hinge loss: maximum margin training

    ℓ(x^n, y^n, w) := max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
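A sketch of the structured hinge loss by enumeration; the inner maximization is the loss-augmented decoding problem, and for structured Y it would be solved with the same machinery as prediction. `phi`, `delta`, and `labels` are placeholders:

```python
import numpy as np

def structured_hinge_loss(w, x_n, y_n, labels, phi, delta):
    """max over y of [ Delta(y_n, y) + <w, phi(x_n, y)> - <w, phi(x_n, y_n)> ]."""
    score_true = np.dot(w, phi(x_n, y_n))
    # Loss-augmented decoding: maximize score plus loss over all labels.
    values = [delta(y_n, y) + np.dot(w, phi(x_n, y)) - score_true for y in labels]
    return max(values)
```

Note that y = y^n contributes the value ∆(y^n, y^n) = 0 to the maximum, so the loss is never negative.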
Properties of the hinge loss:

◮ ℓ is a maximum over functions that are linear in w → continuous, convex.
◮ ℓ bounds ∆ from above.

Proof: Let ȳ = argmax_{y∈Y} g(x^n, y; w). Since ȳ maximizes g(x^n, ·; w), we have g(x^n, ȳ; w) − g(x^n, y^n; w) ≥ 0, so

    ∆(y^n, ȳ) ≤ ∆(y^n, ȳ) + g(x^n, ȳ; w) − g(x^n, y^n; w)
             ≤ max_{y∈Y} [ ∆(y^n, y) + g(x^n, y; w) − g(x^n, y^n; w) ].
Alternative — logistic loss: probabilistic training

    ℓ(x^n, y^n, w) := log Σ_{y∈Y} exp( ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ )
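The logistic loss admits an equally small sketch; the only numerical subtlety is evaluating log Σ exp stably, for which SciPy's logsumexp can be used. Helper names are placeholders as before:

```python
import numpy as np
from scipy.special import logsumexp

def structured_logistic_loss(w, x_n, y_n, labels, phi):
    """log sum over y of exp( <w, phi(x_n, y)> - <w, phi(x_n, y_n)> )."""
    score_true = np.dot(w, phi(x_n, y_n))
    margins = np.array([np.dot(w, phi(x_n, y)) - score_true for y in labels])
    return logsumexp(margins)  # numerically stable log-sum-exp
```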
Structured Output Support Vector Machine:

    min_w  (1/2)‖w‖² + (C/N) Σ_{n=1}^{N} max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]

Conditional Random Field:

    min_w  ‖w‖²/(2σ²) + Σ_{n=1}^{N} log Σ_{y∈Y} exp( ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ )

CRFs and SSVMs have more in common than usually assumed:
◮ both do regularized risk minimization,
◮ log Σ_y exp(·) can be interpreted as a soft-max.
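The soft-max remark can be checked numerically: for any vector a, (1/β) log Σ_y exp(β a_y) decreases towards max_y a_y as β grows. A tiny demonstration with made-up numbers:

```python
import numpy as np
from scipy.special import logsumexp

a = np.array([1.0, 2.5, 3.0])  # stand-in for the per-label terms
for beta in [1, 10, 100]:
    print(beta, logsumexp(beta * a) / beta)
# beta=1: 3.5549..., beta=10: 3.0007..., beta=100: 3.0000...
# -> tends to max(a) = 3.0, so log-sum-exp is a smoothed maximum.
```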
Solving the Training Optimization Problem Numerically

Structured Output Support Vector Machine:

    min_w  (1/2)‖w‖² + (C/N) Σ_{n=1}^{N} max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]

Unconstrained optimization; convex, non-differentiable objective.
Structured Output SVM (equivalent formulation):

    min_{w,ξ}  (1/2)‖w‖² + (C/N) Σ_{n=1}^{N} ξ^n

subject to, for n = 1, …, N,

    max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ ξ^n.

N non-linear constraints; convex, differentiable objective.
Structured Output SVM (also equivalent formulation):

    min_{w,ξ}  (1/2)‖w‖² + (C/N) Σ_{n=1}^{N} ξ^n

subject to, for n = 1, …, N,

    ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ≤ ξ^n    for all y ∈ Y.

N·|Y| linear constraints; convex, differentiable objective.
Example: Multiclass SVM

◮ Y = {1, 2, …, K},    ∆(y, y′) = 1 for y ≠ y′, 0 otherwise.
◮ φ(x, y) = ( ⟦y = 1⟧ φ(x), ⟦y = 2⟧ φ(x), …, ⟦y = K⟧ φ(x) )

Solve:

    min_{w,ξ}  (1/2)‖w‖² + (C/N) Σ_{n=1}^{N} ξ^n

subject to, for n = 1, …, N,

    ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ 1 − ξ^n    for all y ∈ Y \ {y^n}.

Classification: f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.

This is the Crammer–Singer multiclass SVM.
[K. Crammer, Y. Singer: "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines", JMLR, 2001]
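A sketch of the multiclass joint feature map above: φ(x, y) places φ(x) into the y-th of K blocks, so w decomposes into per-class weight vectors w_y with ⟨w, φ(x, y)⟩ = ⟨w_y, φ(x)⟩:

```python
import numpy as np

def phi_multiclass(x_feat, y, K):
    """Stack phi(x) into the y-th of K blocks: ( [y=1] phi(x), ..., [y=K] phi(x) )."""
    D = len(x_feat)
    out = np.zeros(K * D)
    out[(y - 1) * D : y * D] = x_feat  # labels y are assumed to lie in {1, ..., K}
    return out
```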
Example: Hierarchical SVM

Hierarchical multiclass loss:

    ∆(y, y′) := (1/2) · (distance in tree)

    ∆(cat, cat) = 0,    ∆(cat, dog) = 1,    ∆(cat, bus) = 2,    etc.

Solve:

    min_{w,ξ}  (1/2)‖w‖² + (C/N) Σ_{n=1}^{N} ξ^n

subject to, for n = 1, …, N,

    ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ ∆(y^n, y) − ξ^n    for all y ∈ Y.

[L. Cai, T. Hofmann: "Hierarchical Document Categorization with Support Vector Machines", ACM CIKM, 2004]
[A. Binder, K.-R. Müller, M. Kawanabe: "On taxonomies for multi-class image categorization", IJCV, 2011]
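A sketch of the hierarchical loss as half the number of edges on the tree path between two labels; the toy taxonomy below is invented so as to reproduce the cat/dog/bus values from the slide:

```python
def tree_loss(y, y_prime, parent):
    """Delta(y, y') = (1/2) * (number of edges on the tree path from y to y')."""
    def ancestors(v):
        path = [v]
        while v in parent:
            v = parent[v]
            path.append(v)
        return path
    a, b = ancestors(y), ancestors(y_prime)
    common = next(v for v in a if v in b)  # lowest common ancestor
    return 0.5 * (a.index(common) + b.index(common))

# Hypothetical taxonomy: cat/dog under "animal", bus under "vehicle".
parent = {"cat": "animal", "dog": "animal", "animal": "root",
          "bus": "vehicle", "vehicle": "root"}
assert tree_loss("cat", "cat", parent) == 0.0
assert tree_loss("cat", "dog", parent) == 1.0
assert tree_loss("cat", "bus", parent) == 2.0
```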
Solving the Training Optimization Problem Numerically

We can solve SSVM training like CRF training:

    min_w  (1/2)‖w‖² + (C/N) Σ_{n=1}^{N} max_{y∈Y} [ ∆(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]

◮ continuous ✓
◮ unconstrained ✓
◮ convex ✓
◮ non-differentiable ✓

→ we can't use gradient descent directly.
→ we'll have to use subgradients.
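Because the hinge term is a pointwise maximum of functions that are linear in w, a subgradient of the objective is obtained by taking the gradient of any maximizing linear piece (this uses the subgradient notion defined on the next slide). A sketch with the same placeholder helpers as before:

```python
import numpy as np

def ssvm_subgradient(w, data, labels, phi, delta, C):
    """A subgradient of (1/2)||w||^2 + (C/N) sum_n max_y [Delta + score difference]."""
    N = len(data)
    g = w.copy()  # gradient of the quadratic regularizer
    for x_n, y_n in data:
        # y_bar: a maximizer of the loss-augmented score (any maximizer works).
        y_bar = max(labels, key=lambda y: delta(y_n, y) + np.dot(w, phi(x_n, y)))
        g += (C / N) * (phi(x_n, y_bar) - phi(x_n, y_n))
    return g

# Plain subgradient descent would then iterate: w <- w - eta_t * ssvm_subgradient(...).
```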
Definition (Subgradient)

Let f : ℝ^D → ℝ be a convex, not necessarily differentiable, function. A vector v ∈ ℝ^D is called a subgradient of f at w₀ if

    f(w) ≥ f(w₀) + ⟨v, w − w₀⟩    for all w.

[Figure: a convex function f(w) together with the linear lower bound f(w₀) + ⟨v, w − w₀⟩, which touches f at w₀.]
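A one-dimensional sanity check of the definition: for f(w) = |w|, which is non-differentiable at w₀ = 0, every v ∈ [−1, 1] satisfies the subgradient inequality:

```python
import numpy as np

f = abs            # f(w) = |w|: convex, non-differentiable at w0 = 0
w0, v = 0.0, 0.3   # any v in [-1, 1] is a subgradient of |.| at 0
ws = np.linspace(-2, 2, 401)
assert all(f(w) >= f(w0) + v * (w - w0) for w in ws)  # f(w) >= f(w0) + <v, w - w0>
```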