
Learning with Structured Inputs and Outputs
Christoph H. Lampert
IST Austria (Institute of Science and Technology Austria), Vienna
ENS/INRIA Summer School, Paris, July 2013
Slides: http://www.ist.ac.at/~chl/


Milestone I: Probabilistic Training (Conditional Random Fields)

◮ p(y|x,w) is log-linear in w ∈ R^D.
◮ Training: minimize the negative conditional log-likelihood L(w).
◮ L(w) is differentiable and convex → gradient descent will find the global optimum, with ∇_w L(w) = 0.
◮ Same structure as multi-class logistic regression.

For logistic regression, this is where the textbook ends: we're done.
For conditional random fields, we're not in safe waters yet!

Solving the Training Optimization Problem Numerically

Task: compute v = ∇_w L(w_cur), evaluate L(w_cur + ηv):

  L(w) = (λ/2)‖w‖² + Σ_{n=1}^N [ ⟨w, φ(xⁿ,yⁿ)⟩ + log Σ_{y∈Y} e^{−⟨w, φ(xⁿ,y)⟩} ]

  ∇_w L(w) = λw + Σ_{n=1}^N [ φ(xⁿ,yⁿ) − Σ_{y∈Y} p(y|xⁿ,w) φ(xⁿ,y) ]

Problem: Y typically is very (exponentially) large:
◮ binary image segmentation: |Y| = 2^{640×480} ≈ 10^{92475}
◮ ranking N images: |Y| = N!, e.g. N = 1000: |Y| ≈ 10^{2568}.

We must use the structure in Y, or we're lost.

Solving the Training Optimization Problem Numerically

  ∇_w L(w) = λw + Σ_{n=1}^N [ φ(xⁿ,yⁿ) − E_{y∼p(y|xⁿ,w)} φ(xⁿ,y) ]

Computing the gradient (naively): O(K^M N D).

  L(w) = (λ/2)‖w‖² + Σ_{n=1}^N [ ⟨w, φ(xⁿ,yⁿ)⟩ + log Z(xⁿ,w) ]

Line search (naively): O(K^M N D) per evaluation of L.

◮ N: number of samples
◮ D: dimension of the feature space
◮ M: number of output nodes ≈ 100s to 1,000,000s
◮ K: number of possible labels of each output node ≈ 2 to 1000s

Solving the Training Optimization Problem Numerically

In a graphical model with factors F, the features decompose:

  φ(x,y) = ( φ_F(x, y_F) )_{F∈F}

  E_{y∼p(y|x,w)} φ(x,y) = ( E_{y∼p(y|x,w)} φ_F(x, y_F) )_{F∈F} = ( E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) )_{F∈F}

  E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) = Σ_{y_F∈Y_F} p(y_F|x,w) φ_F(x, y_F)   ← only K^{|F|} terms

The quantities μ_F = p(y_F|x,w) are the factor marginals. They
◮ are much smaller than the complete joint distribution p(y|x,w),
◮ can be computed/approximated, e.g., with (loopy) belief propagation.
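To make the decomposition concrete, here is a minimal sketch of the expected-feature computation from factor marginals. The names (expected_features, marginals, phi_F) are ours, and how the marginals are obtained, exact inference or loopy BP, is left abstract:

import numpy as np

def expected_features(factors, marginals, phi_F):
    """E_{y ~ p(y|x,w)}[phi(x,y)], computed factor by factor.

    factors:   list of factor index tuples, e.g. [(0,), (1,), (0, 1)]
    marginals: dict F -> array of shape (K,)*|F|, the factor marginal mu_F
    phi_F:     dict F -> array of shape (K,)*|F| + (D,), holding the
               feature vector phi_F(x, y_F) for every labeling y_F
    """
    D = next(iter(phi_F.values())).shape[-1]
    expect = np.zeros(D)
    for F in factors:
        mu = marginals[F]
        # sum over y_F of mu_F(y_F) * phi_F(x, y_F): K^{|F|} terms per factor
        expect += np.tensordot(mu, phi_F[F], axes=mu.ndim)
    return expect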

Solving the Training Optimization Problem Numerically

  ∇_w L(w) = λw + Σ_{n=1}^N [ φ(xⁿ,yⁿ) − E_{y∼p(y|xⁿ,w)} φ(xⁿ,y) ]

Computing the gradient: O(K^M N D) → O(M K^{|F_max|} N D).
Line search: O(K^M N D) → O(M K^{|F_max|} N D) per evaluation of L.

◮ N: number of samples ≈ 10s to 1,000,000s
◮ D: dimension of the feature space
◮ M: number of output nodes
◮ K: number of possible labels of each output node

Solving the Training Optimization Problem Numerically

What if the training set D is too large (e.g. millions of examples)?

Stochastic Gradient Descent (SGD)
◮ Minimize L(w), but without ever computing L(w) or ∇L(w) exactly.
◮ In each gradient descent step:
  ◮ pick a random subset D′ ⊂ D  ← often just 1–3 elements!
  ◮ follow the approximate gradient

      ∇̃L(w) = λw + (|D|/|D′|) Σ_{(xⁿ,yⁿ)∈D′} [ φ(xⁿ,yⁿ) − E_{y∼p(y|xⁿ,w)} φ(xⁿ,y) ]

◮ Avoid line search by using a fixed stepsize rule η (a new parameter).
◮ SGD converges to argmin_w L(w)! (if η is chosen right)
◮ SGD needs more iterations, but each one is much faster.

more: L. Bottou, O. Bousquet: "The Tradeoffs of Large Scale Learning", NIPS 2008.
also: http://leon.bottou.org/research/largescale
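As a concrete illustration, a minimal sketch of one SGD update in Python. The naming is ours; the sketch enumerates a small label set Y exactly, whereas a real structured model would obtain the expectation from factor marginals as on the previous slides:

import numpy as np

def sgd_step(w, batch, phi, Y, lam, eta, N):
    """One stochastic step on the negative conditional log-likelihood.

    batch is a small subset D' of the N training pairs; phi(x, y) maps
    to R^D; p(y|x,w) is proportional to exp(-<w, phi(x,y)>) as above.
    """
    g = lam * w
    scale = N / len(batch)                           # the |D| / |D'| factor
    for x_n, y_n in batch:
        feats = np.stack([phi(x_n, y) for y in Y])   # |Y| x D
        logp = -feats @ w
        logp -= logp.max()                           # stabilized softmax
        p = np.exp(logp)
        p /= p.sum()
        g += scale * (phi(x_n, y_n) - p @ feats)     # phi(x,y_n) - E[phi]
    return w - eta * g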

Solving the Training Optimization Problem Numerically

  ∇_w L(w) = λw + Σ_{n=1}^N [ φ(xⁿ,yⁿ) − E_{y∼p(y|xⁿ,w)} φ(xⁿ,y) ]

Computing the gradient: O(K^M N D) → O(M K² N D) (if BP is possible).
Line search: O(K^M N D) → O(M K² N D) per evaluation of L.

◮ N: number of samples
◮ D: dimension of the feature space: φ_{i,j} ≈ 1–10s, φ_i ≈ 100s to 10,000s
◮ M: number of output nodes
◮ K: number of possible labels of each output node

Solving the Training Optimization Problem Numerically

Typical feature functions in image segmentation:
◮ φ_i(y_i, x) ∈ R^{≈1000}: local image features, e.g. bag-of-words
  → ⟨w_i, φ_i(y_i, x)⟩: local classifier (like logistic regression)
◮ φ_{i,j}(y_i, y_j) = [y_i = y_j] ∈ R¹: test for same label
  → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for label changes (if w_{ij} > 0)
◮ combined: argmax_y p(y|x) is a smoothed version of the local cues

[figure: original image, local confidence, local + smoothness]
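As a sketch, the score ⟨w, φ(x,y)⟩ of a candidate segmentation under this feature design, assuming a single shared pairwise weight w_pair (the function name, sign handling, and the edge-list representation are ours):

import numpy as np

def segmentation_score(unary_scores, w_pair, labels, edges):
    """Local classifier scores plus a same-label term on neighbors.

    unary_scores: (M, K) array, entry [i, k] = <w_i, phi_i(k, x)>
    w_pair:       shared weight on phi_ij(y_i, y_j) = [y_i = y_j]
    labels:       (M,) int array, a candidate segmentation y
    edges:        list of (i, j) pairs, e.g. a 4-connected pixel grid
    """
    s = unary_scores[np.arange(len(labels)), labels].sum()
    s += w_pair * sum(int(labels[i] == labels[j]) for i, j in edges)
    return s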

Solving the Training Optimization Problem Numerically

Typical feature functions in pose estimation:
◮ φ_i(y_i, x) ∈ R^{≈1000}: local image representation, e.g. HoG
  → ⟨w_i, φ_i(y_i, x)⟩: local confidence map
◮ φ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R¹: test for geometric fit
  → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for unrealistic poses
◮ together: argmax_y p(y|x) is a sanitized version of the local cues

[figure: original image, local confidence, local + geometry]
[V. Ferrari, M. Marin-Jimenez, A. Zisserman: "Progressive Search Space Reduction for Human Pose Estimation", CVPR 2008]

Solving the Training Optimization Problem Numerically

Idea: split the learning of the unary potentials into two parts:
◮ local classifiers,
◮ their importance.

Two-Stage Training
◮ pre-train local classifiers f_i(x) ∈ R^K with f_i^y(x) ≈ log p(y_i = y | x)
◮ use φ̃_i(y_i, x) := f_i^{y_i}(x) as (low-dimensional) unary features
◮ keep φ_{ij}(y_i, y_j) as before
◮ perform CRF learning with φ̃_i and φ_{ij}

Advantage:
◮ lower-dimensional feature space during inference → faster
◮ f_i(x) can be any classifier, e.g. non-linear SVMs, deep networks, ...
Disadvantage:
◮ if the local classifiers are bad, CRF training cannot fix that.
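A sketch of stage one, with logistic regression standing in for "any classifier" (scikit-learn assumed; the function and variable names are ours):

import numpy as np
from sklearn.linear_model import LogisticRegression

def pretrain_unary(node_feats, node_labels, new_feats):
    """Stage 1: fit a local classifier on per-node data, then use its
    log-probabilities as the low-dimensional unary features phi~_i.

    node_feats (n, d), node_labels (n,): training data for the nodes.
    Returns an (m, K) array; row i approximates log p(y_i = . | x).
    """
    clf = LogisticRegression(max_iter=1000).fit(node_feats, node_labels)
    return clf.predict_log_proba(new_feats)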

Solving the Training Optimization Problem Numerically

CRF training is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use:

  ∇_w L(w) = λw + Σ_{n=1}^N [ φ(xⁿ,yⁿ) − Σ_{y∈Y} p(y|xⁿ,w) φ(xⁿ,y) ] ∈ R^D

A lot of research on accelerating CRF training:

problem       | "solution"        | method(s)
|Y| too large | exploit structure | (loopy) belief propagation
              | smart sampling    | contrastive divergence
              | use approximate L | e.g. pseudo-likelihood
N too large   | mini-batches      | stochastic gradient descent
D too large   | trained φ_unary   | two-stage training

CRFs with Latent Variables

So far, training was fully supervised: all variables were observed. In real life, some variables can be unobserved even during training.

◮ missing labels in the training data
◮ latent variables, e.g. part location
◮ latent variables, e.g. part occlusion
◮ latent variables, e.g. viewpoint

CRFs with Latent Variables

Three types of variables in the graphical model:
◮ x ∈ X: always observed (input),
◮ y ∈ Y: observed only during training (output),
◮ z ∈ Z: never observed (latent).

Example:
◮ x: image
◮ y: part positions
◮ z ∈ {0,1}: flag for front-view vs. side-view

images: [Felzenszwalb et al., "Object Detection with Discriminatively Trained Part Based Models", T-PAMI, 2010]

CRFs with Latent Variables

Marginalization over Latent Variables

Construct the conditional likelihood as usual:

  p(y,z|x,w) = (1/Z(x,w)) exp(−⟨w, φ(x,y,z)⟩)

Derive p(y|x,w) by marginalizing over z:

  p(y|x,w) = Σ_{z∈Z} p(y,z|x,w) = (1/Z(x,w)) Σ_{z∈Z} exp(−⟨w, φ(x,y,z)⟩)

Negative regularized conditional log-likelihood:

  L(w) = (λ/2)‖w‖² − Σ_{n=1}^N log p(yⁿ|xⁿ,w)
       = (λ/2)‖w‖² − Σ_{n=1}^N log Σ_{z∈Z} p(yⁿ,z|xⁿ,w)
       = (λ/2)‖w‖² − Σ_{n=1}^N log Σ_{z∈Z} exp(−⟨w, φ(xⁿ,yⁿ,z)⟩)
                   + Σ_{n=1}^N log Σ_{y∈Y, z∈Z} exp(−⟨w, φ(xⁿ,y,z)⟩)

◮ L is not convex in w → local minima are possible

How to train CRFs with latent variables is an active research question.
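Written as code, the objective is a difference of two log-partition terms, each convex in w but their difference not, which is exactly where the local minima come from. A brute-force sketch (names ours; enumeration is only viable for toy Y and Z):

import numpy as np
from scipy.special import logsumexp

def latent_nll(w, data, phi, Y, Z, lam):
    """Negative regularized conditional log-likelihood with latent z.

    phi(x, y, z) -> R^D. Real models replace the enumerations below
    by probabilistic inference over y and z.
    """
    obj = 0.5 * lam * np.dot(w, w)
    for x_n, y_n in data:
        s_num = np.array([-np.dot(w, phi(x_n, y_n, z)) for z in Z])
        s_den = np.array([-np.dot(w, phi(x_n, y, z)) for y in Y for z in Z])
        obj += logsumexp(s_den) - logsumexp(s_num)   # = -log p(y_n|x_n,w)
    return obj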

Summary – CRF Learning

Given:
◮ a training set {(x¹,y¹), ..., (x^N,y^N)} ⊂ X × Y,
◮ a feature function φ: X × Y → R^D that decomposes over factors: φ_F: X × Y_F → R^d for F ∈ F.

The overall model is log-linear (in the parameter w):

  p(y|x; w) ∝ e^{−⟨w, φ(x,y)⟩}

CRF training requires minimizing the negative conditional log-likelihood:

  w* = argmin_w (λ/2)‖w‖² + Σ_{n=1}^N [ ⟨w, φ(xⁿ,yⁿ)⟩ + log Σ_{y∈Y} e^{−⟨w, φ(xⁿ,y)⟩} ]

◮ convex optimization problem → (stochastic) gradient descent works
◮ training needs repeated runs of probabilistic inference
◮ latent variables are possible, but make training non-convex

Part 2: Structured Support Vector Machines

Supervised Learning Problem
◮ Training examples (x¹,y¹), ..., (x^N,y^N) ∈ X × Y
◮ Loss function ∆: Y × Y → R
◮ How do we make predictions g: X → Y?

Approach 2) Loss-minimizing Parameter Estimation
1) Use the training data to learn an energy function E(x,y).
2) Use f(x) := argmin_{y∈Y} E(x,y) to make predictions.

Slight variation (for historic reasons):
1) Learn a compatibility function g(x,y) (think: "g = −E").
2) Use f(x) := argmax_{y∈Y} g(x,y) to make predictions.

Loss-Minimizing Parameter Learning
◮ D = {(x¹,y¹), ..., (x^N,y^N)}: an i.i.d. training set,
◮ φ: X × Y → R^D: a feature function,
◮ ∆: Y × Y → R: a loss function.
◮ Find a weight vector w* that minimizes the expected loss

  E_{(x,y)} ∆(y, f(x))   for f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩.

Advantages:
◮ We directly optimize for the quantity of interest: the expected loss.
◮ No expensive-to-compute partition function Z shows up.
Disadvantages:
◮ We need to know the loss function already at training time.
◮ We can't use probabilistic reasoning to find w*.

Reminder: Regularized Risk Minimization

Task:  min_{w∈R^D} E_{(x,y)} ∆(y, f(x))   for f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩

Two major problems:
◮ the data distribution is unknown → we can't compute E,
◮ f: X → Y has output in a discrete space → f is piecewise constant w.r.t. w → ∆(y, f(x)) is discontinuous, piecewise constant w.r.t. w → we can't apply gradient-based optimization.

Reminder: Regularized Risk Minimization

Task:  min_{w∈R^D} E_{(x,y)} ∆(y, f(x))   for f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩

Problem 1: the data distribution is unknown.
Solution:
◮ Replace E_{(x,y)∼d(x,y)}[·] with the empirical estimate (1/N) Σ_{(xⁿ,yⁿ)}[·].
◮ To avoid overfitting, add a regularizer, e.g. (λ/2)‖w‖².

New task:

  min_{w∈R^D} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ∆(yⁿ, f(xⁿ)).

Reminder: Regularized Risk Minimization

Task:  for f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩,

  min_{w∈R^D} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ∆(yⁿ, f(xⁿ)).

Problem 2:
◮ ∆(yⁿ, f(xⁿ)) = ∆(yⁿ, argmax_y ⟨w, φ(xⁿ,y)⟩) is discontinuous w.r.t. w.
Solution:
◮ Replace ∆(y, y′) with a well-behaved surrogate ℓ(x, y, w).
◮ Typically: ℓ is an upper bound to ∆, continuous and convex w.r.t. w.

New task:

  min_{w∈R^D} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ℓ(xⁿ, yⁿ, w)

Reminder: Regularized Risk Minimization

  min_{w∈R^D} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ℓ(xⁿ, yⁿ, w)

Regularization + loss on the training data.

Hinge loss: maximum-margin training

  ℓ(xⁿ, yⁿ, w) := max_{y∈Y} [ ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ]

◮ ℓ is a maximum over linear functions → continuous, convex.
◮ ℓ is an upper bound to ∆: "small ℓ ⇒ small ∆".

Alternative, the logistic loss: probabilistic training

  ℓ(xⁿ, yⁿ, w) := log Σ_{y∈Y} exp( ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ )

Differentiable, convex, but not an upper bound to ∆(y, y′). A comparison of the two is sketched below.
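Side by side, the two surrogates differ only in replacing the hard, loss-augmented max by a plain soft-max. A brute-force sketch for a single example (names ours; Y must be small enough to enumerate):

import numpy as np
from scipy.special import logsumexp

def surrogate_losses(w, x_n, y_n, phi, delta, Y):
    """Hinge (max-margin) and logistic (probabilistic) loss for one pair."""
    score_true = np.dot(w, phi(x_n, y_n))
    scores = np.array([np.dot(w, phi(x_n, y)) for y in Y])
    losses = np.array([delta(y_n, y) for y in Y])
    hinge = np.max(losses + scores - score_true)     # loss-augmented max
    logistic = logsumexp(scores) - score_true        # soft-max, no loss
    return hinge, logistic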

Structured Output Support Vector Machine:

  min_w (λ/2)‖w‖² + (1/N) Σ_{n=1}^N max_{y∈Y} [ ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ]

Conditional Random Field:

  min_w (λ/2)‖w‖² + Σ_{n=1}^N log Σ_{y∈Y} exp( ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ )

(the second term is the negative conditional log-likelihood)

CRFs and SSVMs have more in common than usually assumed:
◮ log Σ_y exp(·) can be interpreted as a soft-max,
◮ but: the CRF doesn't take the loss function into account at training time.

Example: Multiclass SVM

◮ Y = {1, 2, ..., K},  ∆(y,y′) = 1 for y ≠ y′, 0 otherwise.
◮ φ(x,y) = ( [y=1] φ(x), [y=2] φ(x), ..., [y=K] φ(x) )

Solve:

  min_w (λ/2)‖w‖² + (1/N) Σ_{n=1}^N max_{y∈Y} [ ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ]

where the bracketed term equals 0 for y = yⁿ, and 1 + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ for y ≠ yⁿ.

Classification: f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩.

This is the Crammer-Singer multiclass SVM.
[K. Crammer, Y. Singer: "On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines", JMLR, 2001]
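The stacked feature map can be written down explicitly; a short sketch with 0-based labels (names ours):

import numpy as np

def joint_feature(phi_x, y, K):
    """phi(x, y): phi(x) copied into block y, zeros elsewhere."""
    d = len(phi_x)
    psi = np.zeros(K * d)
    psi[y * d:(y + 1) * d] = phi_x
    return psi

# With w = (w_0, ..., w_{K-1}) stacked the same way, <w, phi(x,y)> equals
# <w_y, phi(x)>, so the structured SVM reduces to the Crammer-Singer SVM.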

Example: Hierarchical Multiclass SVM

Hierarchical multiclass loss: ∆(y,y′) := ½ (distance in the tree),
∆(cat, cat) = 0, ∆(cat, dog) = 1, ∆(cat, bus) = 2, etc.

  min_w (λ/2)‖w‖² + (1/N) Σ_{n=1}^N max_{y∈Y} [ ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ]

e.g. if yⁿ = cat, the loss-scaled margins require:

  ⟨w, φ(xⁿ,cat)⟩ − ⟨w, φ(xⁿ,dog)⟩ ≥ 1
  ⟨w, φ(xⁿ,cat)⟩ − ⟨w, φ(xⁿ,car)⟩ ≥ 2
  ⟨w, φ(xⁿ,cat)⟩ − ⟨w, φ(xⁿ,bus)⟩ ≥ 2

◮ labels that cause more loss are pushed further away → lower chance of a high-loss mistake at test time

[L. Cai, T. Hofmann: "Hierarchical Document Categorization with Support Vector Machines", ACM CIKM, 2004]
[A. Binder, K.-R. Müller, M. Kawanabe: "On taxonomies for multi-class image categorization", IJCV, 2011]

Solving S-SVM Training Numerically

We can approach S-SVM training like CRF training:

  min_w (λ/2)‖w‖² + (1/N) Σ_{n=1}^N max_{y∈Y} [ ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ]

◮ continuous ✓
◮ unconstrained ✓
◮ convex ✓
◮ non-differentiable ✗

→ we can't use gradient descent directly; we'll have to use subgradients.

Solving S-SVM Training Numerically – Subgradient Method

Definition: Let f: R^D → R be a convex, not necessarily differentiable, function. A vector v ∈ R^D is called a subgradient of f at w₀ if

  f(w) ≥ f(w₀) + ⟨v, w − w₀⟩  for all w.

[figure: the affine lower bound f(w₀) + ⟨v, w−w₀⟩ touching f at w₀]

For differentiable f, the gradient v = ∇f(w₀) is the only subgradient.

Solving S-SVM Training Numerically – Subgradient Method

The subgradient method works basically like gradient descent:

Subgradient Method Minimization – minimize F(w)
◮ require: tolerance ε > 0, stepsizes η_t
◮ w_cur ← 0
◮ repeat
  ◮ pick v ∈ ∇_sub F(w_cur)
  ◮ w_cur ← w_cur − η_t v
◮ until F changed by less than ε
◮ return w_cur

Converges to the global minimum, but is rather inefficient if F is non-differentiable.
[Shor, "Minimization methods for non-differentiable functions", Springer, 1985]

Solving S-SVM Training Numerically – Subgradient Method

Computing a subgradient:

  min_w (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ℓⁿ(w)

with ℓⁿ(w) = max_y ℓⁿ_y(w) and ℓⁿ_y(w) := ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩.

◮ For each y ∈ Y, ℓⁿ_y(w) is a linear function of w.
◮ The max over the finite set Y is piecewise linear.
◮ Subgradient of ℓⁿ at w₀: find a maximal (active) y, use v = ∇ℓⁿ_y(w₀) = φ(xⁿ,y) − φ(xⁿ,yⁿ).

[figure: ℓⁿ(w) as the piecewise-linear maximum of the linear functions ℓⁿ_y, with the active linearization at w₀]

Solving S-SVM Training Numerically – Subgradient Method

Subgradient Method S-SVM Training
input: training pairs {(x¹,y¹), ..., (x^N,y^N)} ⊂ X × Y,
input: feature map φ(x,y), loss function ∆(y,y′), regularizer λ,
input: number of iterations T, stepsizes η_t for t = 1, ..., T
 1: w ← 0
 2: for t = 1, ..., T do
 3:   for n = 1, ..., N do
 4:     ŷ ← argmax_{y∈Y} ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩
 5:     vⁿ ← φ(xⁿ,yⁿ) − φ(xⁿ,ŷ)
 6:   end for
 7:   w ← w − η_t (λw − (1/N) Σₙ vⁿ)
 8: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩.

Observation: each update of w needs N argmax predictions (one per example).
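A direct Python transcription of the algorithm above, with a brute-force argmax standing in for loss-augmented MAP inference (all names are ours; eta is a stepsize function t -> eta_t):

import numpy as np

def ssvm_subgradient_train(data, phi, delta, Y, lam, T, eta, D):
    """Subgradient-method S-SVM training.

    data: list of (x_n, y_n); phi(x, y) -> R^D; delta: loss function.
    The argmax enumerates Y, so this only runs for small label sets;
    structured outputs need a real loss-augmented inference routine.
    """
    w = np.zeros(D)
    N = len(data)
    for t in range(1, T + 1):
        v = np.zeros(D)
        for x_n, y_n in data:
            # most violating output under the current w
            y_hat = max(Y, key=lambda y: delta(y_n, y) + w @ phi(x_n, y))
            v += phi(x_n, y_n) - phi(x_n, y_hat)
        w = w - eta(t) * (lam * w - v / N)
    return w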

Solving S-SVM Training Numerically – Subgradient Method

Same trick as for CRFs: stochastic updates.

Stochastic Subgradient Method S-SVM Training
input: training pairs {(x¹,y¹), ..., (x^N,y^N)} ⊂ X × Y,
input: feature map φ(x,y), loss function ∆(y,y′), regularizer λ,
input: number of iterations T, stepsizes η_t for t = 1, ..., T
 1: w ← 0
 2: for t = 1, ..., T do
 3:   (xⁿ,yⁿ) ← randomly chosen training example pair
 4:   ŷ ← argmax_{y∈Y} ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩
 5:   w ← w − η_t (λw − [φ(xⁿ,yⁿ) − φ(xⁿ,ŷ)])
 6: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩.

Observation: each update of w needs only 1 argmax prediction (but we'll need many iterations until convergence).

Example: Image Segmentation

◮ X: images, Y = {binary segmentation masks}.
◮ Training example(s): (xⁿ, yⁿ) = [figure: an image and its ground-truth mask]
◮ ∆(y, ȳ) = Σ_p [y_p ≠ ȳ_p] (Hamming loss)

Subgradient updates over iterations t, tracked by the sign of each region's entry in φ(yⁿ) − φ(ŷ):
t = 1: black +, white +, green −, blue −, gray −
t = 2: black +, white +, green =, blue =, gray −
t = 3: black =, white =, green −, blue −, gray −
t = 4: black =, white =, green −, blue =, gray =
t = 5: black =, white =, green =, blue =, gray =
t = 6, ...: no more changes.
[figure: the predicted mask ŷ at each iteration]

Images: [Carreira, Li, Sminchisescu, "Object Recognition by Sequential Figure-Ground Ranking", IJCV 2010]

Solving S-SVM Training Numerically

Structured Support Vector Machine:

  min_w (λ/2)‖w‖² + (1/N) Σ_{n=1}^N max_{y∈Y} [ ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ]

The subgradient method converges slowly. Can we do better?

Remember from SVMs: we can use inequalities and slack variables to encode the loss.

Solving S-SVM Training Numerically

Structured SVM (equivalent formulation), idea: slack variables

  min_{w,ξ} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ξⁿ

subject to, for n = 1, ..., N:

  max_{y∈Y} [ ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ] ≤ ξⁿ

Note: ξⁿ ≥ 0 holds automatically, because the left-hand side is non-negative (y = yⁿ contributes 0).

Differentiable objective, convex, N non-linear constraints.

Solving S-SVM Training Numerically

Structured SVM (also an equivalent formulation), idea: expand the max-constraint into individual cases

  min_{w,ξ} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ξⁿ

subject to, for n = 1, ..., N:

  ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩ − ⟨w, φ(xⁿ,yⁿ)⟩ ≤ ξⁿ,  for all y ∈ Y

Differentiable objective, convex, N·|Y| linear constraints.

Solving S-SVM Training Numerically

Solve an S-SVM like a linear SVM:

  min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ξⁿ

subject to, for n = 1, ..., N:

  ⟨w, φ(xⁿ,yⁿ)⟩ − ⟨w, φ(xⁿ,y)⟩ ≥ ∆(yⁿ,y) − ξⁿ,  for all y ∈ Y.

Introduce the feature vectors δφ(xⁿ,yⁿ,y) := φ(xⁿ,yⁿ) − φ(xⁿ,y).

Solving S-SVM Training Numerically

Solve

  min_{w∈R^D, ξ∈R^N_+} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ξⁿ

subject to, for n = 1, ..., N, for all y ∈ Y:

  ⟨w, δφ(xⁿ,yⁿ,y)⟩ ≥ ∆(yⁿ,y) − ξⁿ.

Same structure as an ordinary SVM!
◮ quadratic objective ✓
◮ linear constraints ✓

Question: Can we use an ordinary SVM/QP solver?
Answer: Almost! We could, if there weren't N·|Y| constraints.
◮ E.g. 100 binary 16×16 images: ≈ 10⁷⁹ constraints.

Solving S-SVM Training Numerically – Working Set

Solution: working set training
◮ It's enough to enforce the active constraints; the others will be fulfilled automatically.
◮ We don't know which ones are active for the optimal solution.
◮ But it's likely to be only a small number (this can of course be formalized).

Keep a set of potentially active constraints and update it iteratively:
◮ Start with the working set S = ∅ (no constraints).
◮ Repeat until convergence:
  ◮ solve the S-SVM training problem with only the constraints in S,
  ◮ check whether the solution violates any constraint of the full set:
    ◮ if no: we found the optimal solution, terminate;
    ◮ if yes: add the most violated constraints to S, iterate.

Good practical performance and theoretical guarantees:
◮ polynomial-time convergence to within ε of the global optimum.

Working Set S-SVM Training
input: training pairs {(x¹,y¹), ..., (x^N,y^N)} ⊂ X × Y,
input: feature map φ(x,y), loss function ∆(y,y′), regularizer λ
 1: w ← 0, S ← ∅
 2: repeat
 3:   (w, ξ) ← solution of the QP with only the constraints from S
 4:   for n = 1, ..., N do
 5:     ŷ ← argmax_{y∈Y} ∆(yⁿ,y) + ⟨w, φ(xⁿ,y)⟩
 6:     if ŷ ≠ yⁿ then
 7:       S ← S ∪ {(xⁿ, ŷ)}
 8:     end if
 9:   end for
10: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x,y)⟩.

Observation: each update of w needs N argmax predictions (one per example), but we solve globally for the next w, not by local steps.
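A compact sketch of the working-set loop, with cvxpy assumed as the solver for the restricted QP and a brute-force argmax standing in for inference (all names ours; labels are assumed hashable, e.g. ints or tuples):

import numpy as np
import cvxpy as cp

def working_set_train(data, phi, delta, Y, lam, D, max_rounds=50):
    """Working-set S-SVM training, following the pseudocode above."""
    N = len(data)
    S = [[] for _ in range(N)]                  # working set per example
    w_val = np.zeros(D)
    for _ in range(max_rounds):
        changed = False
        for n, (x_n, y_n) in enumerate(data):
            y_hat = max(Y, key=lambda y: delta(y_n, y) + w_val @ phi(x_n, y))
            if y_hat != y_n and y_hat not in S[n]:
                S[n].append(y_hat)              # most violated constraint
                changed = True
        if not changed:                         # no violations: optimal
            break
        w = cp.Variable(D)
        xi = cp.Variable(N, nonneg=True)
        cons = [w @ (phi(x_n, y_n) - phi(x_n, y)) >= delta(y_n, y) - xi[n]
                for n, (x_n, y_n) in enumerate(data) for y in S[n]]
        obj = cp.Minimize(lam / 2 * cp.sum_squares(w) + cp.sum(xi) / N)
        cp.Problem(obj, cons).solve()
        w_val = w.value
    return w_val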

Example: Object Localization

◮ X: images, Y = {object bounding boxes} ⊂ R⁴.
◮ Training examples: [figure: images with ground-truth bounding boxes]
◮ Goal: f: X → Y
◮ Loss function: area overlap, ∆(y,y′) = 1 − area(y ∩ y′) / area(y ∪ y′)

[Blaschko, Lampert: "Learning to Localize Objects with Structured Output Regression", ECCV 2008]
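The loss is one line of geometry; a sketch assuming boxes given as (left, top, right, bottom) coordinates (the convention is ours):

def area_overlap_loss(y, y_prime):
    """Delta(y, y') = 1 - intersection-over-union of two boxes."""
    l = max(y[0], y_prime[0]); t = max(y[1], y_prime[1])
    r = min(y[2], y_prime[2]); b = min(y[3], y_prime[3])
    inter = max(0.0, r - l) * max(0.0, b - t)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(y) + area(y_prime) - inter
    return 1.0 - inter / union if union > 0 else 1.0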

Example: Object Localization

Structured SVM:
◮ φ(x,y) := bag-of-words histogram of region y in image x

  min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) Σ_{n=1}^N ξⁿ

subject to, for n = 1, ..., N:

  ⟨w, φ(xⁿ,yⁿ)⟩ − ⟨w, φ(xⁿ,y)⟩ ≥ ∆(yⁿ,y) − ξⁿ,  for all y ∈ Y.

Interpretation:
◮ For every image, the correct bounding box yⁿ should have a higher score than any wrong bounding box.
◮ Less overlap between the boxes → bigger required difference in score.
