Optimization for Machine Learning
Lecture 4: SMO-MKL

S.V.N. (vishy) Vishwanathan
Purdue University
vishy@purdue.edu
July 11, 2012

Motivation: Binary Classification

Training points carry labels $y_i = +1$ or $y_i = -1$. Let $x_1$ and $x_2$ lie on the two margin hyperplanes:

\[ \langle w, x_1 \rangle + b = +1, \qquad \langle w, x_2 \rangle + b = -1 \]

Subtracting the two equations gives $\langle w, x_1 - x_2 \rangle = 2$, and normalizing by $\|w\|$ gives the margin:

\[ \left\langle \frac{w}{\|w\|},\; x_1 - x_2 \right\rangle = \frac{2}{\|w\|} \]

[Figure: the two classes separated by the hyperplane $\{x \mid \langle w, x \rangle + b = 0\}$, with margin hyperplanes $\{x \mid \langle w, x \rangle + b = 1\}$ and $\{x \mid \langle w, x \rangle + b = -1\}$.]
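
As a small numeric illustration of the margin formula $2 / \|w\|$, the sketch below evaluates it for a hypothetical weight vector. The function name `margin_width` and the vector are assumptions made for this example only.

```python
import math

def margin_width(w):
    """Distance between the hyperplanes <w, x> + b = +1 and <w, x> + b = -1."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm

# Hypothetical weight vector with ||w|| = 5, so the margin is 2 / 5.
w = [3.0, 4.0]
width = margin_width(w)
print(width)
```

A larger $\|w\|$ gives a narrower margin, which is why the SVM objective penalizes $\|w\|^2$.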

Motivation: Linear Support Vector Machines

Optimization Problem

\[
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i \quad \text{for all } i \\
& \xi_i \ge 0
\end{aligned}
\]
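
To make the role of the slack variables concrete, the following sketch evaluates the objective for a fixed $(w, b)$. It relies on the standard observation that, for fixed $(w, b)$, the smallest feasible slack is $\xi_i = \max(0,\, 1 - y_i(\langle w, x_i \rangle + b))$. The data, `w`, `b`, and the function name are hypothetical.

```python
def primal_objective(w, b, X, y, C):
    """Return (objective, slacks) of the soft-margin SVM for fixed w and b."""
    slacks = []
    for x_i, y_i in zip(X, y):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b
        # Smallest xi_i satisfying y_i * score >= 1 - xi_i and xi_i >= 0.
        slacks.append(max(0.0, 1.0 - y_i * score))
    objective = 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical toy data: the middle point sits on the decision boundary.
X = [[2.0, 0.0], [0.0, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
objective, slacks = primal_objective([1.0, 0.0], 0.0, X, y, C=1.0)
print(objective, slacks)
```

Only the point that violates the margin contributes a nonzero slack, and `C` sets how heavily that violation is charged.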

Motivation: The Kernel Trick

[Figure: data plotted against axes $x$ and $y$, then replotted against $x$ and $x^2 + y^2$.]
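
The idea behind the replotted axes can be sketched in a few lines, assuming ring-shaped data: one class near the origin and the other further out. The points below are hypothetical. After lifting each point to $x^2 + y^2$, a single threshold separates the classes.

```python
def lift(point):
    """Map a 2-D point (x, y) to the scalar feature x^2 + y^2."""
    x, y = point
    return x * x + y * y

# Hypothetical data: inner points are class -1, outer points are class +1.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]

max_inner = max(lift(p) for p in inner)
min_outer = min(lift(p) for p in outer)

# Any threshold strictly between the two values separates the lifted classes.
threshold = 0.5 * (max_inner + min_outer)
print(max_inner, threshold, min_outer)
```

No straight line in the original $(x, y)$ plane separates these points, but a threshold on the lifted feature does.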

Motivation: Kernel Trick

Optimization Problem (primal, with feature map $\phi$)

\[
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i \left( \langle w, \phi(x_i) \rangle + b \right) \ge 1 - \xi_i \quad \text{for all } i \\
& \xi_i \ge 0
\end{aligned}
\]

Optimization Problem (dual)

\[
\begin{aligned}
\max_{\alpha} \quad & -\frac{1}{2} \alpha^\top H \alpha + 1^\top \alpha \\
\text{s.t.} \quad & 0 \le \alpha_i \le C \\
& \sum_i \alpha_i y_i = 0
\end{aligned}
\]

\[ H_{ij} = y_i y_j \langle \phi(x_i), \phi(x_j) \rangle = y_i y_j \, k(x_i, x_j) \]
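
The dual only touches the data through $k(x_i, x_j)$, which the sketch below makes explicit: it builds $H$ from a kernel function and evaluates the dual objective and constraints for a candidate $\alpha$. All function names and the two-point data set are hypothetical; a linear kernel is used so the numbers are easy to check by hand.

```python
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def build_H(X, y, kernel):
    """H_ij = y_i * y_j * k(x_i, x_j)."""
    m = len(X)
    return [[y[i] * y[j] * kernel(X[i], X[j]) for j in range(m)] for i in range(m)]

def dual_objective(alpha, H):
    """-1/2 * alpha^T H alpha + 1^T alpha."""
    m = len(alpha)
    quad = sum(alpha[i] * H[i][j] * alpha[j] for i in range(m) for j in range(m))
    return -0.5 * quad + sum(alpha)

def is_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_i <= C and sum_i alpha_i * y_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(a * yi for a, yi in zip(alpha, y))) <= tol
    return in_box and balanced

# Hypothetical two-point problem.
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [1, -1]
H = build_H(X, y, linear_kernel)
alpha = [0.5, 0.5]
value = dual_objective(alpha, H)
print(H, value, is_feasible(alpha, y, C=1.0))
```

Swapping `linear_kernel` for any other positive semidefinite kernel changes `H` but leaves the rest of the code untouched.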

Motivation: Key Question

Which kernel should I use?

The Multiple Kernel Learning Answer
- Cook up as many (base) kernels as you can
- Compute a data dependent kernel function as a linear combination of base kernels

\[ k(x, x') = \sum_k d_k \, k_k(x, x') \quad \text{s.t. } d_k \ge 0 \]
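
A minimal sketch of the combined kernel, assuming three illustrative base kernels (linear, quadratic polynomial, RBF). The kernel choices, the weights `d`, and the helper name `combine_kernels` are assumptions for this example; in MKL the weights are learned rather than fixed by hand.

```python
import math

def linear(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly2(x, z):
    return (1.0 + linear(x, z)) ** 2

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def combine_kernels(kernels, d):
    """Return k(x, z) = sum_k d_k * k_k(x, z), requiring d_k >= 0."""
    if any(dk < 0 for dk in d):
        raise ValueError("kernel weights must be non-negative")
    def combined(x, z):
        return sum(dk * k(x, z) for dk, k in zip(d, kernels))
    return combined

# Hypothetical weights; a zero weight drops that base kernel entirely.
k = combine_kernels([linear, poly2, rbf], d=[0.5, 0.25, 0.0])
value = k([1.0, 0.0], [1.0, 0.0])
print(value)
```

The non-negativity check matters because a non-negative combination of positive semidefinite kernels is again positive semidefinite.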

Motivation: Object Detection

Localize a specified object of interest if it exists in a given image.

Motivation: Some Examples of MKL Detection

[Figure: example detections; the images are not recoverable from the extracted text.]

Motivation: Summary of Our Results

Sonar dataset with 800 kernels:

        Training Time (s)        # Kernels Selected
p       SMO-MKL    Shogun        SMO-MKL    Shogun
1.1     4.71       47.43         91.20      258.00
1.33    3.21       19.94         248.20     374.20
2.0     3.39       34.67         661.20     664.80

- Web dataset: ≈ 50,000 points and 50 kernels, ≈ 30 minutes
- Sonar with a hundred thousand kernels
  - Precomputed: ≈ 8 minutes
  - Kernels computed on-the-fly: ≈ 30 minutes

Motivation: Setting up the Optimization Problem - I

The Setup
- We are given $n$ kernel functions $k_1, \ldots, k_n$ with corresponding feature maps $\phi_1(\cdot), \ldots, \phi_n(\cdot)$
- We are interested in deriving the feature map

\[
\phi(x) = \begin{pmatrix} \sqrt{d_1}\, \phi_1(x) \\ \vdots \\ \sqrt{d_n}\, \phi_n(x) \end{pmatrix}
\quad \Longrightarrow \quad
w = \begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix}
\]
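
The stacked feature map is chosen so that its inner product reproduces the weighted kernel sum: $\langle \phi(x), \phi(x') \rangle = \sum_k d_k \langle \phi_k(x), \phi_k(x') \rangle = \sum_k d_k\, k_k(x, x')$. The sketch below checks this numerically for two kernels whose feature maps can be written out explicitly (linear, and the homogeneous quadratic kernel $\langle x, x' \rangle^2$ in two dimensions). The maps, weights, and points are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def phi_linear(x):
    """Feature map of the linear kernel <x, z>."""
    return list(x)

def phi_quadratic(x):
    """Feature map of the kernel <x, z>^2 for 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]

def stacked_feature_map(x, feature_maps, d):
    """Concatenate sqrt(d_k) * phi_k(x) over all k."""
    stacked = []
    for d_k, phi_k in zip(d, feature_maps):
        stacked.extend(math.sqrt(d_k) * v for v in phi_k(x))
    return stacked

x, z = [1.0, 2.0], [3.0, 1.0]
d = [2.0, 0.5]  # hypothetical kernel weights
maps = [phi_linear, phi_quadratic]

lhs = dot(stacked_feature_map(x, maps, d), stacked_feature_map(z, maps, d))
rhs = d[0] * dot(x, z) + d[1] * dot(x, z) ** 2
print(lhs, rhs)
```

The square roots on the weights are what make the combination linear in $d_k$ once the inner product is taken.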

Motivation: Setting up the Optimization Problem

Start from the kernelized SVM:

\[
\begin{aligned}
\min_{w, b, \xi} \quad & \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i \left( \langle w, \phi(x_i) \rangle + b \right) \ge 1 - \xi_i \quad \text{for all } i \\
& \xi_i \ge 0
\end{aligned}
\]

Plug in the stacked feature map and optimize over the kernel weights $d$ as well:

\[
\begin{aligned}
\min_{w, b, \xi, d} \quad & \frac{1}{2} \sum_k \|w_k\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{s.t.} \quad & y_i \left( \sum_k \sqrt{d_k}\, \langle w_k, \phi_k(x_i) \rangle + b \right) \ge 1 - \xi_i \quad \text{for all } i \\
& \xi_i \ge 0, \qquad d_k \ge 0 \quad \text{for all } k
\end{aligned}
\]

Add an $\ell_p$-norm regularizer on $d$:

\[
\begin{aligned}
\min_{w, b, \xi, d} \quad & \frac{1}{2} \sum_k \|w_k\|^2 + C \sum_{i=1}^{m} \xi_i + \frac{\rho}{2} \left( \sum_k d_k^p \right)^{2/p} \\
\text{s.t.} \quad & y_i \left( \sum_k \sqrt{d_k}\, \langle w_k, \phi_k(x_i) \rangle + b \right) \ge 1 - \xi_i \quad \text{for all } i \\
& \xi_i \ge 0, \qquad d_k \ge 0 \quad \text{for all } k
\end{aligned}
\]

Substitute $w_k \leftarrow \sqrt{d_k}\, w_k$, which moves $d_k$ out of the constraint and into the objective:

\[
\begin{aligned}
\min_{w, b, \xi, d} \quad & \frac{1}{2} \sum_k \frac{\|w_k\|^2}{d_k} + C \sum_{i=1}^{m} \xi_i + \frac{\rho}{2} \left( \sum_k d_k^p \right)^{2/p} \\
\text{s.t.} \quad & y_i \left( \sum_k \langle w_k, \phi_k(x_i) \rangle + b \right) \ge 1 - \xi_i \quad \text{for all } i \\
& \xi_i \ge 0, \qquad d_k \ge 0 \quad \text{for all } k
\end{aligned}
\]

Dualize over $w$, $b$, $\xi$ while keeping $d$ as a primal variable, with $(H_k)_{ij} = y_i y_j\, k_k(x_i, x_j)$:

\[
\begin{aligned}
\min_{d} \max_{\alpha} \quad & -\frac{1}{2} \sum_k d_k\, \alpha^\top H_k \alpha + 1^\top \alpha + \frac{\rho}{2} \left( \sum_k d_k^p \right)^{2/p} \\
\text{s.t.} \quad & 0 \le \alpha_i \le C \\
& \sum_i \alpha_i y_i = 0 \\
& d_k \ge 0
\end{aligned}
\]

Motivation: Saddle Point Problem

[Figure: surface plot of a saddle-shaped function over the axes $\alpha$ and $d$.]

Motivation: Solving the Saddle Point

Saddle Point Problem

\[
\begin{aligned}
\min_{d} \max_{\alpha} \quad & -\frac{1}{2} \sum_k d_k\, \alpha^\top H_k \alpha + 1^\top \alpha + \frac{\rho}{2} \left( \sum_k d_k^p \right)^{2/p} \\
\text{s.t.} \quad & 0 \le \alpha_i \le C \\
& \sum_i \alpha_i y_i = 0 \\
& d_k \ge 0
\end{aligned}
\]

Our Approach: The Key Insight

Eliminate $d$

\[
\begin{aligned}
D(\alpha) := \max_{\alpha} \quad & -\frac{1}{8\rho} \left( \sum_k \left( \alpha^\top H_k \alpha \right)^q \right)^{2/q} + 1^\top \alpha \\
\text{s.t.} \quad & 0 \le \alpha_i \le C \\
& \sum_i \alpha_i y_i = 0
\end{aligned}
\]

\[ \frac{1}{p} + \frac{1}{q} = 1 \]

Not a QP but very close to one!
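
The elimination of $d$ can be checked numerically. Writing $a_k = \alpha^\top H_k \alpha \ge 0$ and setting the gradient of the saddle-point objective with respect to $d$ to zero gives $d_k = \frac{1}{2\rho} \|a\|_q^{2-q}\, a_k^{q-1}$. This closed form is derived here and is not stated on the slides. The sketch below plugs it back in and compares the result with $-\frac{1}{8\rho} \|a\|_q^2$; the $1^\top \alpha$ term is omitted because it does not depend on $d$. The values of `a`, `rho`, and `p` are hypothetical.

```python
def lq_norm(a, q):
    return sum(a_k ** q for a_k in a) ** (1.0 / q)

def optimal_d(a, rho, p):
    """Stationary point in d of -1/2 * sum_k d_k a_k + rho/2 * ||d||_p^2 (derived, see lead-in)."""
    q = p / (p - 1.0)  # conjugate exponent: 1/p + 1/q = 1
    norm = lq_norm(a, q)
    if norm == 0.0:
        return [0.0 for _ in a]
    return [norm ** (2.0 - q) * a_k ** (q - 1.0) / (2.0 * rho) for a_k in a]

def d_dependent_value(a, d, rho, p):
    """The terms of the saddle-point objective that involve d."""
    reg = sum(d_k ** p for d_k in d) ** (2.0 / p)
    return -0.5 * sum(d_k * a_k for d_k, a_k in zip(d, a)) + 0.5 * rho * reg

def eliminated_value(a, rho, p):
    """-1/(8 rho) * ||a||_q^2, the term that appears after eliminating d."""
    q = p / (p - 1.0)
    return -lq_norm(a, q) ** 2 / (8.0 * rho)

a = [1.0, 2.0, 3.0]  # hypothetical values of alpha^T H_k alpha
rho, p = 0.5, 1.33
d_star = optimal_d(a, rho, p)
print(d_dependent_value(a, d_star, rho, p), eliminated_value(a, rho, p))
```

For $p = q = 2$ the formula reduces to $d_k = a_k / (2\rho)$, so kernels with a larger quadratic form receive proportionally larger weight.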

Our Approach: SMO-MKL: High Level Overview

\[
\begin{aligned}
D(\alpha) := \max_{\alpha} \quad & -\frac{1}{8\rho} \left( \sum_k \left( \alpha^\top H_k \alpha \right)^q \right)^{2/q} + 1^\top \alpha \\
\text{s.t.} \quad & 0 \le \alpha_i \le C \\
& \sum_i \alpha_i y_i = 0
\end{aligned}
\]

Algorithm
- Choose two variables $\alpha_i$ and $\alpha_j$ to optimize
- Solve the one dimensional reduced optimization problem
- Repeat until convergence
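
One pair update can be sketched as follows. Moving $\alpha_i$ by $\delta$ and $\alpha_j$ by $-y_i y_j \delta$ keeps $\sum_i \alpha_i y_i = 0$, and the box constraints turn into an interval for $\delta$, which leaves a one dimensional problem. This sketch solves that problem with a plain ternary search, a stand-in chosen for brevity that relies on the objective being concave along the line; it is not the solver described in the lecture. The matrices `Hs`, the pair `(i, j)`, and all function names are hypothetical.

```python
def quad_form(alpha, H):
    m = len(alpha)
    return sum(alpha[i] * H[i][j] * alpha[j] for i in range(m) for j in range(m))

def dual_value(alpha, Hs, rho, q):
    """-1/(8 rho) * (sum_k (alpha^T H_k alpha)^q)^(2/q) + 1^T alpha."""
    total = sum(max(quad_form(alpha, H), 0.0) ** q for H in Hs)
    return -(total ** (2.0 / q)) / (8.0 * rho) + sum(alpha)

def step_interval(a_i, a_j, s, C):
    """Range of delta keeping a_i + delta and a_j - s * delta inside [0, C]."""
    lo, hi = -a_i, C - a_i
    if s > 0:
        lo, hi = max(lo, a_j - C), min(hi, a_j)
    else:
        lo, hi = max(lo, -a_j), min(hi, C - a_j)
    return lo, hi

def smo_pair_update(alpha, y, i, j, Hs, rho, q, C, iters=100):
    s = y[i] * y[j]
    lo, hi = step_interval(alpha[i], alpha[j], s, C)

    def moved(delta):
        trial = list(alpha)
        trial[i] += delta
        trial[j] -= s * delta  # preserves sum_i alpha_i * y_i
        return trial

    # Ternary search for the maximizer of a concave function on [lo, hi].
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if dual_value(moved(m1), Hs, rho, q) < dual_value(moved(m2), Hs, rho, q):
            lo = m1
        else:
            hi = m2
    return moved(0.5 * (lo + hi))

# Hypothetical two-point, two-kernel problem.
y = [1, -1]
Hs = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, -0.5], [-0.5, 1.0]]]
alpha0 = [0.0, 0.0]
alpha1 = smo_pair_update(alpha0, y, 0, 1, Hs, rho=1.0, q=2.0, C=1.0)
print(alpha1, dual_value(alpha1, Hs, 1.0, 2.0))
```

In this toy problem the restricted objective is $2\delta - \frac{17}{8}\delta^4$, so the update should land at $\delta = (4/17)^{1/3}$ for both coordinates.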

Our Approach: SMO-MKL: High Level Overview

Selecting the Working Set
- Compute directional derivative and directional Hessian
- Greedily select the variables

Solving the Reduced Problem
- Analytic solution for $p = q = 2$ (one dimensional quartic)
- For other values of $p$ use Newton-Raphson
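
A Newton-Raphson iteration for a one dimensional reduced problem can be sketched as below. The routine takes the first and second derivative of the restricted objective as callables and clips each iterate to the feasible interval. The example objective $g(\delta) = 2\delta - \frac{17}{8}\delta^4$ is a hypothetical quartic, picked only because its maximizer $(4/17)^{1/3}$ is known in closed form and the result is therefore easy to verify; the function name, starting point, and stopping rule are likewise assumptions.

```python
def newton_maximize_1d(grad, hess, x0, lo, hi, tol=1e-12, max_iter=50):
    """Maximize a concave 1-D function on [lo, hi] with clipped Newton steps."""
    x = min(max(x0, lo), hi)
    for _ in range(max_iter):
        g, h = grad(x), hess(x)
        if h >= 0.0:
            # Not strictly concave here, so the Newton step is unreliable; stop.
            break
        x_new = min(max(x - g / h, lo), hi)
        if abs(x_new - x) < tol:
            x = x_new
            break
        x = x_new
    return x

# Hypothetical restricted objective g(delta) = 2*delta - (17/8)*delta^4.
grad = lambda t: 2.0 - 8.5 * t ** 3
hess = lambda t: -25.5 * t ** 2

delta_star = newton_maximize_1d(grad, hess, x0=1.0, lo=0.0, hi=1.0)
print(delta_star)
```

The start point is taken at the upper end of the interval because the second derivative of this particular objective vanishes at zero, where a Newton step is undefined.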

Experiments: Generalization Performance

[Figure: two panels, "Australian" and "ionosphere", plotting Test Accuracy (%) for SMO-MKL and Shogun; the horizontal axis ticks are 1.1, 1.33, 1.66, 2.0, 2.33, 2.66, 3.0.]

Experiments: Scaling with Training Set Size

Adult: 123 dimensions, 50 RBF kernels, $p = 1.33$, $C = 1$

[Figure: log-log plot of CPU Time in seconds against Number of Training Examples for SMO-MKL and Shogun.]