Learning Kernel-Based Halfspaces with the Zero-One Loss
Shai Shalev-Shwartz (1), Ohad Shamir (1) and Karthik Sridharan (2)
(1) The Hebrew University, (2) TTI Chicago
COLT, June 2010
Halfspaces
Hypothesis class: $\{ x \mapsto \phi_{0-1}(\langle w, x \rangle) \}$
[Figure: $\phi_{0-1}(\langle w, x \rangle)$ plotted against $\langle w, x \rangle$ on $[-1, 1]$]
Sample complexity: $O(d/\epsilon^2)$
Kernel-Based Halfspaces
Hypothesis class: $\{ x \mapsto \phi_{0-1}(\langle w, \varphi(x) \rangle) \}$
[Figure: $\phi_{0-1}(\langle w, \varphi(x) \rangle)$ plotted against $\langle w, \varphi(x) \rangle$ on $[-1, 1]$]
Sample complexity: $\infty$
Fuzzy Kernel-Based Halfspaces
Hypothesis class: $\{ x \mapsto \phi_{\mathrm{sig}}(\langle w, \varphi(x) \rangle) \}$
[Figure: $\phi_{\mathrm{sig}}(\langle w, \varphi(x) \rangle)$ plotted against $\langle w, \varphi(x) \rangle$ on $[-1, 1]$]
Sample complexity: $O(L^2/\epsilon^2)$
Time complexity: ??
Formal Results
Time complexity of learning fuzzy halfspaces:
Positive result: can be done in $\mathrm{poly}(1/\epsilon)$ time for any fixed $L$ (worst case)
  Do convex optimization, just use a different kernel...
Negative result: cannot be done in $\mathrm{poly}(L, 1/\epsilon)$ time
Related Work: Surrogates to 0-1 loss
Popular fix: replace the 0-1 loss with a convex loss (e.g., hinge loss)
  No finite-sample approximation guarantees!
  Asymptotic guarantees exist (Zhang 2004; Bartlett, Jordan, McAuliffe 2006)
Ben-David & Simon 2000: by a covering technique, can learn fuzzy halfspaces in $\exp(O(L^2/\epsilon^2))$ time
  Worst case = best case
  Exponentially worse than our bound (however, requires exponentially fewer examples)
Related Work: Directly for 0-1 loss
Agnostically learning halfspaces in $\mathrm{poly}(d^{1/\epsilon^4})$ time (Kalai, Klivans, Mansour, Servedio 2005; Blais, O'Donnell, Wimmer 2008)
  But only under distributional assumptions.
  Dimension-dependent (problematic for kernels)
Technique Idea
Original class: $H = \{ x \mapsto \phi(\langle w, x \rangle) : \|w\| = 1 \}$
Loss function: $\mathbb{E}_{\hat{y} \sim \phi(\langle w, x \rangle)} \mathbf{1}[\hat{y} \neq y] = |\phi(\langle w, x \rangle) - y|$
Problem: the loss is non-convex w.r.t. $w$
The main idea: work with a larger hypothesis class for which the loss becomes convex
[Figure: the larger class $x \mapsto \langle v, \psi(x) \rangle$ and the original class $x \mapsto \phi(\langle w, x \rangle)$]
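The loss identity above can be checked numerically. The sketch below assumes labels $y \in \{0, 1\}$ and a prediction $\hat{y}$ drawn as a Bernoulli variable with success probability $p = \phi(\langle w, x \rangle)$; the function names and the Monte Carlo setup are illustrative and not taken from the slides.

```python
import random

def expected_zero_one_loss(p, y, trials=200_000, seed=0):
    """Monte Carlo estimate of Pr[y_hat != y] when y_hat ~ Bernoulli(p)."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        y_hat = 1 if rng.random() < p else 0
        errors += (y_hat != y)
    return errors / trials

def absolute_loss(p, y):
    """Closed form |p - y| that the expectation should match."""
    return abs(p - y)

# Hypothetical values of p = phi(<w, x>) paired with both possible labels.
checks = [(p, y, expected_zero_one_loss(p, y), absolute_loss(p, y))
          for p in (0.1, 0.3, 0.8) for y in (0, 1)]
```

Each tuple in `checks` pairs the simulated error rate with $|p - y|$; the two should agree up to sampling noise.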
Technique Idea
Assume $\|x\| \le 1$, and suppose that $\phi(a)$ is a polynomial $\sum_{j=0}^{\infty} \beta_j a^j$. Then
$$\phi(\langle w, x \rangle) = \sum_{j=0}^{\infty} \beta_j (\langle w, x \rangle)^j = \sum_{j=0}^{\infty} \sum_{k_1, \ldots, k_j} \left( 2^{j/2} \beta_j w_{k_1} \cdots w_{k_j} \right) \left( 2^{-j/2} x_{k_1} \cdots x_{k_j} \right) = \langle v_w, \Psi(x) \rangle$$
$\Psi$ is the feature mapping of the RKHS corresponding to the infinite-dimensional polynomial kernel
$$k(x, x') = \frac{1}{1 - \frac{1}{2} \langle x, x' \rangle}$$
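To make the expansion concrete, here is a small sketch that builds the feature map explicitly in low dimension and truncates it at a finite degree. It checks that $\langle v_w, \Psi(x) \rangle$ reproduces $\phi(\langle w, x \rangle)$ for a finite polynomial, and that $\langle \Psi(x), \Psi(x') \rangle$ approaches the closed-form kernel. The helper names, the coefficients `beta`, and the test vectors are hypothetical choices made for illustration.

```python
import numpy as np

def tensor_power(u, j):
    """All degree-j monomials u[k1]*...*u[kj], flattened (a single 1.0 when j == 0)."""
    out = np.ones(1)
    for _ in range(j):
        out = np.kron(out, u)
    return out

def Psi(x, degree):
    """Feature map truncated at `degree`: block j holds 2^{-j/2} x_{k1}...x_{kj}."""
    return np.concatenate([2.0 ** (-j / 2) * tensor_power(x, j)
                           for j in range(degree + 1)])

def v_w(w, beta):
    """Weight vector for phi(a) = sum_j beta[j] a^j: block j holds 2^{j/2} beta_j w_{k1}...w_{kj}."""
    return np.concatenate([2.0 ** (j / 2) * b * tensor_power(w, j)
                           for j, b in enumerate(beta)])

def poly_kernel(x, xp):
    """Closed form k(x, x') = 1 / (1 - <x, x'>/2), for ||x||, ||x'|| <= 1."""
    return 1.0 / (1.0 - 0.5 * np.dot(x, xp))

beta = [0.5, 0.8, 0.0, -0.3]        # hypothetical polynomial coefficients
w = np.array([0.6, 0.0, 0.8])       # unit-norm weight vector
x = np.array([0.3, -0.5, 0.4])      # norm below 1
xp = np.array([0.1, 0.7, -0.2])     # norm below 1

# phi(<w, x>) evaluated directly, and as a linear function of Psi(x).
direct = sum(b * np.dot(w, x) ** j for j, b in enumerate(beta))
linear = np.dot(v_w(w, beta), Psi(x, len(beta) - 1))

# Truncated feature inner product versus the closed-form kernel.
truncated_k = np.dot(Psi(x, 8), Psi(xp, 8))
closed_k = poly_kernel(x, xp)
```

The explicit map grows as $d^j$ per degree, which is why the kernel form is what one would use in practice; the explicit version here only serves as a check.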
Technique Idea
Therefore, given a sample $(x_1, y_1), \ldots, (x_m, y_m)$,
$$\min_{w : \|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} |\phi(\langle w, x_i \rangle) - y_i|$$
is equivalent to
$$\min_{v_w : \|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} |\langle v_w, \Psi(x_i) \rangle - y_i|$$
Algorithm:
$$\arg\min_{v : \|v\| \le B} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \Psi(x_i) \rangle - y_i|,$$
using the infinite-dimensional polynomial kernel
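One way the convex problem in the algorithm could be solved is sketched below. The slides do not name a solver, so everything here is an assumption: the representation $v = \sum_i \alpha_i \Psi(x_i)$, projected subgradient descent with step size `lr / sqrt(t)`, the synthetic data, and the value of `B` (treated as a bound on $\|v\|$, as written on the slide) are all illustrative.

```python
import numpy as np

def gram(X, Z):
    """k(x, z) = 1 / (1 - <x, z>/2); rows of X and Z are assumed to have norm <= 1."""
    return 1.0 / (1.0 - 0.5 * X @ Z.T)

def fit(X, y, B, steps=2000, lr=0.5):
    """Projected subgradient descent on (1/m) sum_i |<v, Psi(x_i)> - y_i|
    subject to ||v|| <= B, with v = sum_i alpha_i Psi(x_i)."""
    m = len(y)
    K = gram(X, X)
    alpha = np.zeros(m)
    best_alpha, best_loss = alpha.copy(), np.mean(np.abs(y))
    for t in range(1, steps + 1):
        residual = K @ alpha - y
        # A subgradient of the objective is (1/m) sum_i sign(residual_i) Psi(x_i).
        alpha = alpha - (lr / np.sqrt(t)) * np.sign(residual) / m
        # Project back onto the ball ||v|| <= B, where ||v||^2 = alpha^T K alpha.
        norm_v = np.sqrt(max(alpha @ K @ alpha, 0.0))
        if norm_v > B:
            alpha = alpha * (B / norm_v)
        loss = np.mean(np.abs(K @ alpha - y))
        if loss < best_loss:
            best_alpha, best_loss = alpha.copy(), loss
    return best_alpha, best_loss

def predict(X_train, alpha, X_new):
    """Evaluate <v, Psi(x)> at new points through the kernel."""
    return gram(X_new, X_train) @ alpha

# Synthetic data (hypothetical): points in the unit ball, label 1 when x[0] > 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
X = X / np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
y = (X[:, 0] > 0).astype(float)

B = 3.0
alpha, train_loss = fit(X, y, B)
norm_v = np.sqrt(alpha @ gram(X, X) @ alpha)
scores = predict(X, alpha, X)
```

Because the objective is convex in $v$, any standard convex solver could replace the subgradient loop; it is used here only because it fits in a few lines.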
Technique Idea
Theorem: Let $H_B$ consist of all predictors of the form $x \mapsto \phi(\langle w, x \rangle)$, where
$$\phi(a) = \sum_{j=0}^{\infty} \beta_j a^j, \qquad \sum_{j=0}^{\infty} 2^j \beta_j^2 \le B.$$
With $O(B/\epsilon^2)$ examples, the returned predictor $\hat{v}$ satisfies, w.h.p.,
$$\mathrm{err}_D(\hat{v}) \le \min_{v \in H_B} \mathrm{err}_D(v) + \epsilon$$
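A short worked step may help explain where the coefficient condition comes from. Assuming the vector $v_w$ from the expansion, whose degree-$j$ block has entries $2^{j/2} \beta_j w_{k_1} \cdots w_{k_j}$, and $\|w\| = 1$, the sum in the theorem is the squared norm of $v_w$:

```latex
\begin{align*}
\|v_w\|^2
  &= \sum_{j=0}^{\infty} \sum_{k_1,\dots,k_j}
     \left( 2^{j/2} \beta_j\, w_{k_1} \cdots w_{k_j} \right)^2 \\
  &= \sum_{j=0}^{\infty} 2^{j} \beta_j^2 \Big( \sum_{k} w_k^2 \Big)^{j} \\
  &= \sum_{j=0}^{\infty} 2^{j} \beta_j^2 \, \|w\|^{2j}
   = \sum_{j=0}^{\infty} 2^{j} \beta_j^2 .
\end{align*}
```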
Technique Idea
Algorithm:
$$\arg\min_{v : \|v\| \le B} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \Psi(x_i) \rangle - y_i|,$$
using the infinite-dimensional polynomial kernel
The same algorithm is competitive against all $\phi$ with coefficient bound $B$, including the optimal one for the data distribution
[Figure: two plots over $[-1, 1]$]
In practice, the parameter $B$ is chosen by cross validation.
The algorithm can work much faster depending on the distribution
Example - Error Function
$$\phi_{\mathrm{erf}}(\langle w, x \rangle) = \frac{1 + \mathrm{erf}(\sqrt{\pi}\, L \langle w, x \rangle)}{2}$$
[Figure: $\phi_{\mathrm{erf}}(\langle w, \Psi(x) \rangle)$ plotted against $\langle w, \Psi(x) \rangle$ on $[-1, 1]$]
$\phi_{\mathrm{erf}}$ can be written as an infinite-degree polynomial
[Figure: the class $x \mapsto \langle v, \psi(x) \rangle$ and the class $x \mapsto \phi_{\mathrm{erf}}(\langle w, x \rangle)$]
Unfortunately, bad dependence on $L$. Can we get a better bound?
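The dependence on $L$ can be explored numerically. The sketch below uses the standard Taylor series of erf to obtain the coefficients $\beta_j$ of $\phi_{\mathrm{erf}}$, then sums $\sum_j 2^j \beta_j^2$ for a few small values of $L$. The function names, the truncation at 200 odd terms, and the chosen values of $L$ are assumptions made for illustration; the truncation is only adequate for small $L$.

```python
import math

def phi_erf(a, L):
    """phi_erf(a) = (1 + erf(sqrt(pi) * L * a)) / 2."""
    return 0.5 * (1.0 + math.erf(math.sqrt(math.pi) * L * a))

def beta_erf(j, L):
    """Taylor coefficient beta_j of phi_erf around 0, from
    erf(z) = (2/sqrt(pi)) * sum_n (-1)^n z^(2n+1) / (n! (2n+1))."""
    if j == 0:
        return 0.5
    if j % 2 == 0:
        return 0.0
    n = (j - 1) // 2
    c = math.sqrt(math.pi) * L
    # Work in log space so that c**j / n! does not overflow.
    log_mag = (j * math.log(c) - math.lgamma(n + 1)
               - math.log(j) - 0.5 * math.log(math.pi))
    return (-1) ** n * math.exp(log_mag)

def taylor_phi_erf(a, L, terms=40):
    """Truncated polynomial sum_j beta_j a^j (beta_0 plus `terms` odd terms)."""
    return beta_erf(0, L) + sum(beta_erf(2 * n + 1, L) * a ** (2 * n + 1)
                                for n in range(terms))

def coefficient_bound(L, terms=200):
    """Truncated sum_j 2^j beta_j^2; the truncation suits small L only."""
    total = beta_erf(0, L) ** 2
    for n in range(terms):
        j = 2 * n + 1
        total += 2.0 ** j * beta_erf(j, L) ** 2
    return total

bounds = {L: coefficient_bound(L) for L in (1.0, 2.0, 3.0)}
approx = taylor_phi_erf(0.5, 1.0)
exact = phi_erf(0.5, 1.0)
```

Printing `bounds` shows how quickly the coefficient sum grows as $L$ increases, which is one way to read the remark about the bad dependence on $L$.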
Sigmoid Function
$$\phi_{\mathrm{sig}}(\langle w, x \rangle) = \frac{1}{1 + \exp(-4 L \langle w, x \rangle)}$$
[Figure: $\phi_{\mathrm{sig}}(\langle w, \Psi(x) \rangle)$ plotted against $\langle w, \Psi(x) \rangle$ on $[-1, 1]$]
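The factor 4 in the exponent makes the steepest slope of the sigmoid equal to $L$, which suggests reading $L$ as a Lipschitz constant (an interpretation, since the slides do not state it). The sketch below estimates the maximum slope by central differences; the grid size, step `h`, and values of $L$ are arbitrary illustrative choices.

```python
import math

def phi_sig(a, L):
    """Sigmoid transfer function from the slide: 1 / (1 + exp(-4 L a))."""
    return 1.0 / (1.0 + math.exp(-4.0 * L * a))

def max_slope(phi, L, points=2001, h=1e-6):
    """Largest central-difference slope of phi(., L) over an even grid on [-1, 1]."""
    best = 0.0
    for i in range(points):
        a = -1.0 + 2.0 * i / (points - 1)
        slope = (phi(a + h, L) - phi(a - h, L)) / (2.0 * h)
        best = max(best, slope)
    return best

# Hypothetical values of L; the maximum slope is attained at a = 0.
slopes = {L: max_slope(phi_sig, L) for L in (1.0, 3.0, 10.0)}
```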