Optimization for Machine Learning
Lecture 3: Bundle Methods

S.V.N. (Vishy) Vishwanathan
Purdue University
vishy@purdue.edu
July 11, 2012
Outline

1. Motivation
2. Cutting Plane Methods
3. Non Smooth Functions
4. Bundle Methods
5. BMRM
6. Convergence Analysis
7. Experiments
8. Lower Bounds
9. References
Motivation: Regularized Risk Minimization

Objective Function

Training data: {x_1, ..., x_m}. Labels: {y_1, ..., y_m}. Learn a vector w:

$$\min_{w}\ J(w) := \underbrace{\lambda\,\Omega(w)}_{\text{regularizer}} + \underbrace{\frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, w)}_{\text{risk } R_{\mathrm{emp}}}$$
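As a concrete reading of this objective, here is a minimal NumPy sketch, assuming an L2 regularizer Ω(w) = ½‖w‖² and leaving the pointwise loss l as a plug-in function (both are illustrative choices, not fixed by the slide):

```python
import numpy as np

def regularized_risk(w, X, y, loss, lam):
    """J(w) = lam * Omega(w) + (1/m) * sum_i l(x_i, y_i, w).

    Omega is taken here to be the L2 regularizer 0.5 * ||w||^2 (an
    assumption for illustration); `loss` is any pointwise loss l(x, y, w).
    """
    m = X.shape[0]
    omega = 0.5 * np.dot(w, w)                               # Omega(w)
    r_emp = sum(loss(x, yi, w) for x, yi in zip(X, y)) / m   # R_emp(w)
    return lam * omega + r_emp
```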
Motivation: Binary Classification

[Figure, built up over three slides: training points with labels y_i = +1 and y_i = −1.]
Motivation: Binary Classification

For points x_1 and x_2 lying on the two margin hyperplanes:

$$\langle w, x_1 \rangle + b = +1, \qquad \langle w, x_2 \rangle + b = -1.$$

Subtracting gives ⟨w, x_1 − x_2⟩ = 2, and hence the margin width

$$\left\langle \frac{w}{\|w\|},\ x_1 - x_2 \right\rangle = \frac{2}{\|w\|}.$$

[Figure: the hyperplanes {x | ⟨w, x⟩ + b = 1}, {x | ⟨w, x⟩ + b = 0}, and {x | ⟨w, x⟩ + b = −1}, with x_1 and x_2 on the outer two.]
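A quick numerical check of this margin computation, with an arbitrarily chosen hyperplane and points placed on the two margin planes:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -1.0   # arbitrary hyperplane; ||w|| = 5
x1 = np.array([2.0 / 3.0, 0.0])     # satisfies <w, x1> + b = +1
x2 = np.array([0.0, 0.0])           # satisfies <w, x2> + b = -1
# Projection of x1 - x2 onto the unit normal equals the margin 2/||w||:
print(np.dot(w / np.linalg.norm(w), x1 - x2))  # 0.4
print(2.0 / np.linalg.norm(w))                 # 0.4
```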
Motivation: Linear Support Vector Machines

Optimization Problem (built up over a sequence of slides)

Maximize the margin subject to separating the data:

$$\max_{w,b}\ \frac{2}{\|w\|} \quad \text{s.t.}\quad y_i(\langle w, x_i \rangle + b) \ge 1 \ \text{ for all } i$$

Equivalently, minimize the squared norm:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\quad y_i(\langle w, x_i \rangle + b) \ge 1 \ \text{ for all } i$$

Introduce slack variables ξ_i ≥ 0 to allow margin violations, and penalize the average slack (rescaling the regularizer by λ):

$$\min_{w,b,\xi}\ \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\xi_i \quad \text{s.t.}\quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \ \text{ for all } i$$

Rewriting the constraints as ξ_i ≥ 1 − y_i(⟨w, x_i⟩ + b) shows that at the optimum ξ_i = max(0, 1 − y_i(⟨w, x_i⟩ + b)), so the slacks can be eliminated:

$$\min_{w,b}\ \underbrace{\frac{\lambda}{2}\|w\|^2}_{\lambda\,\Omega(w)} + \underbrace{\frac{1}{m}\sum_{i=1}^{m}\max(0,\ 1 - y_i(\langle w, x_i \rangle + b))}_{R_{\mathrm{emp}}(w)}$$
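A minimal sketch of this final unconstrained objective, assuming NumPy arrays; at w = 0, b = 0 every hinge term equals 1, so the objective is exactly 1, which makes a handy sanity check:

```python
import numpy as np

def svm_primal(w, b, X, y, lam):
    """lam/2 * ||w||^2 + (1/m) * sum_i max(0, 1 - y_i(<w, x_i> + b))."""
    margins = y * (X @ w + b)                 # y_i (<w, x_i> + b)
    hinge = np.maximum(0.0, 1.0 - margins)    # per-example hinge loss
    return 0.5 * lam * np.dot(w, w) + hinge.mean()

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), np.sign(rng.normal(size=100))
print(svm_primal(np.zeros(5), 0.0, X, y, lam=0.1))  # 1.0 at w = 0, b = 0
```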
Motivation: Binary Hinge Loss

[Figure: the hinge loss plotted against the margin y(⟨w, x⟩ + b); the loss is zero for margins at least 1 and grows linearly as the margin decreases below 1.]
Cutting Plane Methods: First Order Taylor Expansion

For a convex function, the first-order Taylor approximation globally lower bounds the function: for any x and x′,

$$f(x) \ge f(x') + \langle x - x',\ \nabla f(x') \rangle.$$
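A numerical illustration with the convex function f(x) = x² (an arbitrary choice): the tangent at any x′ stays below the graph everywhere.

```python
import numpy as np

f = lambda x: x**2              # convex
grad = lambda x: 2.0 * x

x_prime = 1.0
xs = np.linspace(-3.0, 3.0, 13)
tangent = f(x_prime) + (xs - x_prime) * grad(x_prime)  # first-order Taylor
assert np.all(f(xs) >= tangent - 1e-12)                # global lower bound
```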
Cutting Plane Methods: Cutting Plane Methods

[Figure, built up over nine slides: linear lower bounds (cutting planes) are added one per iteration, and their pointwise maximum forms an increasingly tight piecewise linear lower bound on the objective.]
Cutting Plane Methods: In a Nutshell

Cutting plane methods work by forming the piecewise linear lower bound

$$J(w) \ge J_t^{CP}(w) := \max_{1 \le i \le t}\ \{ J(w_{i-1}) + \langle w - w_{i-1},\ s_i \rangle \},$$

where s_i denotes the gradient ∇J(w_{i−1}).

At iteration t the set {w_i}_{i=0}^{t−1} is augmented by

$$w_t := \operatorname*{argmin}_{w}\ J_t^{CP}(w).$$

Stop when the duality gap

$$\epsilon_t := \min_{0 \le i \le t} J(w_i) - J_t^{CP}(w_t)$$

falls below a pre-specified threshold ε.
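Here is a minimal one-dimensional sketch of this loop, restricted to a box [lo, hi] so the piecewise linear model always has a bounded minimizer; each model minimization is a tiny linear program, and the test function is an arbitrary nonsmooth convex example:

```python
import numpy as np
from scipy.optimize import linprog

def cutting_plane(J, subgrad, w0, lo, hi, eps=1e-6, max_iter=100):
    """Minimize a convex function J of one variable on [lo, hi]."""
    ws, cuts = [w0], []                    # iterates; cuts as (slope, offset)
    for t in range(max_iter):
        w_prev = ws[-1]
        s = subgrad(w_prev)
        cuts.append((s, J(w_prev) - s * w_prev))  # cut: xi >= s*w + offset
        # LP in (w, xi): min xi  s.t.  s_i*w - xi <= -offset_i,  lo <= w <= hi
        A_ub = [[s_i, -1.0] for s_i, _ in cuts]
        b_ub = [-off for _, off in cuts]
        res = linprog(c=[0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                      bounds=[(lo, hi), (None, None)])
        w_t, model_min = res.x
        ws.append(w_t)
        # Duality gap: eps_t = min_i J(w_i) - J_t^CP(w_t)
        if min(J(w) for w in ws) - model_min <= eps:
            break
    return min(ws, key=J)

# Example: J(w) = |w - 1| + 0.2 w^2 on [-5, 5]; the minimizer is w = 1.
J = lambda w: abs(w - 1.0) + 0.2 * w**2
sg = lambda w: (1.0 if w >= 1.0 else -1.0) + 0.4 * w   # one valid subgradient
print(cutting_plane(J, sg, w0=-4.0, lo=-5.0, hi=5.0))
```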
Non Smooth Functions: What if the Function is Non-Smooth?

The piecewise linear function

$$J(w) := \max_i\ \langle u_i, w \rangle$$

is convex but not differentiable at the kinks!
Non Smooth Functions: Subgradients to the Rescue

A subgradient at w′ is any vector s which satisfies

$$J(w) \ge J(w') + \langle w - w',\ s \rangle \quad \text{for all } w.$$

The set of all subgradients at w′ is denoted ∂J(w′).
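For the piecewise linear example J(w) = max_i ⟨u_i, w⟩ above, any maximizing row u_i is a valid subgradient (at a kink several rows tie, and any convex combination of them would also do). A small sketch with a check of the subgradient inequality:

```python
import numpy as np

def J(w, U):
    """J(w) = max_i <u_i, w> over the rows u_i of U."""
    return float(np.max(U @ w))

def a_subgradient(w, U):
    """One valid subgradient of J at w: any row attaining the max."""
    return U[int(np.argmax(U @ w))]   # ties broken arbitrarily -- all valid

U = np.array([[1.0, 0.0], [-1.0, 0.0]])   # J(w) = |w_0|, kink at w_0 = 0
w_prime = np.zeros(2)
s = a_subgradient(w_prime, U)
for w in np.random.default_rng(0).normal(size=(5, 2)):
    assert J(w, U) >= J(w_prime, U) + (w - w_prime) @ s - 1e-12
```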
Non Smooth Functions: Good News!

Cutting plane methods work with subgradients: just choose an arbitrary one.

Then what is the bad news?
Non Smooth Functions: Bad News

[Figure: 3-D surface plot over the square [−1, 1] × [−1, 1].]
Bundle Methods