MAXIMUM MARGIN CLASSIFIERS
Matthieu R Bloch
Tuesday, February 11, 2020
LOGISTICS
- TAs and office hours
  - Tuesday: TJ (VL C449 Cubicle D), 1:30pm-2:45pm
  - Wednesday: Matthieu (TSRB 423), 12:00pm-1:15pm
  - Thursday: Hossein (VL C449 Cubicle B), 10:45am-12:00pm
  - Friday: Brighton (TSRB 523a), 12:00pm-1:15pm
- Homework 3
  - Due Wednesday, February 19, 2020, 11:59pm EST (Friday, February 21, 2020 for DL)
  - Please include a separate PDF with your plots and listings
  - Make sure you show your work; don't leave gaps in logic
- Honor code
  - Cite your sources
  - Refrain from using solutions from previous years
RECAP: MAXIMUM MARGIN HYPERPLANE
- "All separating hyperplanes are equal, but some are more equal than others."
- Margin: $\rho(\mathbf{w},b) \triangleq \min_i \frac{|\mathbf{w}^\intercal\mathbf{x}_i + b|}{\lVert\mathbf{w}\rVert_2}$
- The maximum margin hyperplane is the solution of $(\mathbf{w}^*, b^*) = \operatorname{argmax}_{\mathbf{w},b}\, \rho(\mathbf{w},b)$
- A larger margin leads to better generalization.
- Definition. The canonical form of a separating hyperplane $(\mathbf{w},b)$ is such that $\forall i\ y_i(\mathbf{w}^\intercal\mathbf{x}_i + b) \geq 1$ and $\exists i^*$ s.t. $y_{i^*}(\mathbf{w}^\intercal\mathbf{x}_{i^*} + b) = 1$.
- For canonical hyperplanes, the optimization problem is
  $$\operatorname{argmin}_{\mathbf{w},b}\ \frac{1}{2}\lVert\mathbf{w}\rVert_2^2 \quad \text{s.t.}\quad \forall i\ y_i(\mathbf{w}^\intercal\mathbf{x}_i + b) \geq 1$$
- This is a constrained quadratic program, which we know how to solve really well.
- We will come back to this when we talk about support vector machines.
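The canonical-form problem above is a small quadratic program. The following is a minimal sketch (not from the lecture) of how one might solve it numerically, assuming the cvxpy package and a synthetic, linearly separable toy dataset:

```python
# Minimal sketch (not from the slides): the canonical-form hard-margin QP,
# solved with cvxpy on a small linearly separable toy set.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(+2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))    # (1/2) ||w||_2^2
constraints = [cp.multiply(y, X @ w + b) >= 1]      # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()

margin = 1.0 / np.linalg.norm(w.value)              # canonical scaling: rho = 1 / ||w||_2
print("w =", w.value, "b =", b.value, "margin =", margin)
```

With the canonical scaling, the geometric margin of the solution is $1/\lVert\mathbf{w}^*\rVert_2$, which the last line reports.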
OPTIMAL SOFT-MARGIN HYPERPLANE
- What if our data is not linearly separable? The constraint $\forall i\ y_i(\mathbf{w}^\intercal\mathbf{x}_i + b) \geq 1$ cannot be satisfied.
- Introduce slack variables $\xi_i \geq 0$ such that $\forall i\ y_i(\mathbf{w}^\intercal\mathbf{x}_i + b) \geq 1 - \xi_i$.
- The optimal soft-margin hyperplane is the solution of
  $$\operatorname{argmin}_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}\lVert\mathbf{w}\rVert_2^2 + \frac{C}{N}\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad \forall i\ y_i(\mathbf{w}^\intercal\mathbf{x}_i + b) \geq 1 - \xi_i \text{ and } \xi_i \geq 0$$
- $C > 0$ is a cost set by the user, which controls the influence of outliers.
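A minimal sketch (again not from the lecture, assuming cvxpy and synthetic overlapping classes) that adds the slack variables and the $C$ term to the previous program; the $C/N$ scaling matches the objective as written above:

```python
# Minimal sketch (not from the slides): the soft-margin QP with slack
# variables xi_i and a user-chosen cost C, via cvxpy.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
N = 60
X = np.vstack([rng.normal(-1, 1.0, (N // 2, 2)), rng.normal(+1, 1.0, (N // 2, 2))])
y = np.hstack([-np.ones(N // 2), np.ones(N // 2)])   # overlapping classes

C = 1.0                                              # cost controlling outlier influence
w, b, xi = cp.Variable(2), cp.Variable(), cp.Variable(N)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + (C / N) * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print("margin violations (xi_i > 0):", int(np.sum(xi.value > 1e-6)))
```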
NONLINEAR FEATURES
- LDA, logistic regression, and the PLA are all linear classifiers: the boundaries of the classification regions are hyperplanes.
- Some datasets are not linearly separable!
- We can create nonlinear classifiers by transforming the data through a nonlinear map $\Phi : \mathbb{R}^d \to \mathbb{R}^p$:
  $$\Phi : \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} \mapsto \begin{bmatrix} \phi_1(\mathbf{x}) \\ \vdots \\ \phi_p(\mathbf{x}) \end{bmatrix}$$
- One can then apply linear methods to the transformed feature vector $\Phi(\mathbf{x})$.
- Example. Ring data.
- Challenges: if $p \gg n$, this gets computationally challenging and there is a risk of overfitting!
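A minimal sketch of this idea on ring-shaped data, assuming NumPy and scikit-learn; the explicit quadratic map $\Phi$ and the use of logistic regression as the linear method are illustrative choices, not prescribed by the slides:

```python
# Minimal sketch (not from the slides): an explicit quadratic feature map
# makes ring-shaped data linearly separable in the lifted space.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 200
theta = rng.uniform(0, 2 * np.pi, n)
r = np.where(np.arange(n) < n // 2, 1.0, 3.0) + rng.normal(0, 0.1, n)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]       # inner ring vs. outer ring
y = np.where(np.arange(n) < n // 2, -1, 1)

def Phi(X):
    """Quadratic lift: (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.c_[x1, x2, x1**2, x2**2, x1 * x2]

clf = LogisticRegression().fit(Phi(X), y)             # linear classifier in R^p
print("training accuracy in lifted space:", clf.score(Phi(X), y))  # ~1.0
```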
KERNEL METHODS - AN OBSERVATION
- Consider the maximum margin hyperplane with a nonlinear transform $\Phi : \mathbb{R}^d \to \mathbb{R}^p$:
  $$\operatorname{argmin}_{\mathbf{w},b}\ \frac{1}{2}\lVert\mathbf{w}\rVert_2^2 \quad \text{s.t.}\quad \forall i\ y_i(\mathbf{w}^\intercal\Phi(\mathbf{x}_i) + b) \geq 1$$
- One can show (later) that the optimal $\mathbf{w}$ is of the form $\mathbf{w} = \sum_{i=1}^{N} \alpha_i \Phi(\mathbf{x}_i)$. Then
  $$\lVert\mathbf{w}\rVert_2^2 = \mathbf{w}^\intercal\mathbf{w} = \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i\alpha_j \Phi(\mathbf{x}_i)^\intercal\Phi(\mathbf{x}_j) = \sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_i\alpha_j \langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle$$
  $$\mathbf{w}^\intercal\Phi(\mathbf{x}_j) = \sum_{i=1}^{N} \alpha_i \Phi(\mathbf{x}_i)^\intercal\Phi(\mathbf{x}_j) = \sum_{i=1}^{N} \alpha_i \langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle$$
- The only quantities we really care about are the dot products $\langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle$.
- There are only $N^2$ of them.
- The dimension of $\Phi(\mathbf{x})$ does not appear explicitly (it is hidden in the dot product); we only work in $\mathbb{R}^d$.
- The nonlinear features need not be computed explicitly to evaluate $\langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle$.
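A small numerical check of this observation, assuming NumPy; the map $\Phi$ and the coefficients $\alpha_i$ below are arbitrary placeholders (not the solution of the optimization), used only to verify that $\lVert\mathbf{w}\rVert_2^2$ and $\mathbf{w}^\intercal\Phi(\mathbf{x}_j)$ depend on the data only through the $N^2$ dot products:

```python
# Minimal sketch (not from the slides): with w = sum_i alpha_i Phi(x_i),
# both ||w||^2 and w^T Phi(x_j) depend on the data only through the N x N
# matrix of dot products G[i, j] = <Phi(x_i), Phi(x_j)>.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(5, 2))                          # N = 5 points in R^d, d = 2

def Phi(x):
    """An illustrative nonlinear map R^2 -> R^5."""
    return np.array([x[0], x[1], x[0]**2, x[1]**2, x[0] * x[1]])

F = np.array([Phi(x) for x in X])                    # rows are Phi(x_i)
G = F @ F.T                                          # G[i, j] = <Phi(x_i), Phi(x_j)>
alpha = rng.normal(size=len(X))                      # placeholder coefficients

w = F.T @ alpha                                      # w = sum_i alpha_i Phi(x_i)
print(np.allclose(w @ w, alpha @ G @ alpha))         # ||w||^2 = alpha^T G alpha -> True
print(np.allclose(F @ w, G @ alpha))                 # w^T Phi(x_j) for all j   -> True
```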
KERNEL METHODS - THE TRICK
- Implicitly define features through the choice of a kernel.
- Definition. (Inner product kernel) An inner product kernel is a mapping $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ for which there exists a Hilbert space $\mathcal{H}$ and a mapping $\Phi : \mathbb{R}^d \to \mathcal{H}$ such that
  $$\forall \mathbf{u},\mathbf{v} \in \mathbb{R}^d \quad k(\mathbf{u},\mathbf{v}) = \langle\Phi(\mathbf{u}),\Phi(\mathbf{v})\rangle_{\mathcal{H}}$$
- Example. Quadratic kernel $k(\mathbf{u},\mathbf{v}) = (\mathbf{u}^\intercal\mathbf{v})^2$.
- Definition. (Positive semidefinite kernel) A function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a positive semidefinite kernel if
  - $k$ is symmetric, i.e., $k(\mathbf{u},\mathbf{v}) = k(\mathbf{v},\mathbf{u})$;
  - for all $\{\mathbf{x}_i\}_{i=1}^{N}$, the Gram matrix $\mathbf{K} = [K_{i,j}]$ with $K_{i,j} \triangleq k(\mathbf{x}_i,\mathbf{x}_j)$ is positive semidefinite, i.e., $\mathbf{x}^\intercal\mathbf{K}\mathbf{x} \geq 0$ for all $\mathbf{x}$.
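A small sketch, assuming NumPy, that checks the quadratic kernel against an explicit feature map on $\mathbb{R}^2$ and numerically verifies the symmetry and positive semidefiniteness of its Gram matrix; the map $\Phi(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}\,x_1 x_2)$ is the standard explicit map for this kernel in two dimensions:

```python
# Minimal sketch (not from the slides): the quadratic kernel k(u, v) = (u^T v)^2
# on R^2 matches the explicit map Phi(x) = (x1^2, x2^2, sqrt(2) x1 x2),
# and its Gram matrix on random points is positive semidefinite.
import numpy as np

def k(u, v):
    return (u @ v) ** 2

def Phi(x):
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

rng = np.random.default_rng(4)
u, v = rng.normal(size=2), rng.normal(size=2)
print(np.isclose(k(u, v), Phi(u) @ Phi(v)))          # True: k is an inner product kernel

X = rng.normal(size=(10, 2))
K = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix K[i, j] = k(x_i, x_j)
print(np.allclose(K, K.T))                           # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-9)        # eigenvalues >= 0 (numerically PSD)
```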