Machine Learning, Fall 2017. Kernels (Kernels, Kernelized Perceptron and SVM). Professor Liang Huang (Chap. 12 of CIML)
Nonlinear Features
[Figure: XOR data with labels x1: +1, x2: −1, x3: +1, x4: −1]
• Concatenated (combined) features
  • XOR: x = (x_1, x_2, x_1 x_2)
  • income: add "degree + major"
• Perceptron
  • Map data into feature space x → φ(x)
  • Solution lies in the span of the φ(x_i)
Quadratic Features
• Separating surfaces are circles, hyperbolae, parabolae
Kernels as Dot Products

Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions already give roughly 5·10^5 numbers; for higher-order polynomial features it is much worse.

Solution: Don't compute the features; compute dot products implicitly. For some features this works...

Definition: A kernel function k : X × X → ℝ is a symmetric function of its arguments for which the following property holds:

    k(x, x') = ⟨Φ(x), Φ(x')⟩   for some feature map Φ.

If k(x, x') is much cheaper to compute than Φ(x)...
Quadratic Kernel

Quadratic features in ℝ²:

    Φ(x) := (x_1², √2 x_1 x_2, x_2²)

Dot product:

    ⟨Φ(x), Φ(x')⟩ = ⟨(x_1², √2 x_1 x_2, x_2²), (x'_1², √2 x'_1 x'_2, x'_2²)⟩
                  = ⟨x, x'⟩²
                  = k(x, x')

Cost for x in ℝⁿ:
• naive Φ(x): O(n²)
• Φ(x) ∙ Φ(x'): O(n²)
• kernel k(x, x'): O(n)

Insight: the trick works for any polynomial of order d via ⟨x, x'⟩^d.
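As a quick sanity check, here is a minimal NumPy sketch (not from the slides; function names are illustrative) comparing the explicit quadratic feature map in ℝ² with the implicit kernel ⟨x, x'⟩²:

```python
import numpy as np

def phi_quadratic(x):
    """Explicit quadratic feature map for x in R^2: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k_quadratic(x, xp):
    """Quadratic kernel: the same dot product computed implicitly in O(n)."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi_quadratic(x), phi_quadratic(xp)))  # explicit features: 1.0
print(k_quadratic(x, xp))                           # implicit kernel:   1.0
```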
The Perceptron on Features

    initialize w = 0, b = 0
    repeat
        pick (x_i, y_i) from data
        if y_i (w · Φ(x_i) + b) ≤ 0 then
            w ← w + y_i Φ(x_i)
            b ← b + y_i
    until y_i (w · Φ(x_i) + b) > 0 for all i

• Nothing happens if classified correctly
• Weight vector is a linear combination
      w = Σ_{i ∈ I} α_i φ(x_i)
• Classifier is (implicitly) a linear combination of inner products
      f(x) = Σ_{i ∈ I} α_i ⟨φ(x_i), φ(x)⟩
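The loop above translates almost line for line into code. A minimal sketch, assuming NumPy arrays and an explicit feature map phi (names are illustrative, not from the slides):

```python
import numpy as np

def perceptron_primal(X, y, phi, max_epochs=100):
    """Primal perceptron on explicit features phi(x); returns (w, b)."""
    d = len(phi(X[0]))
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (np.dot(w, phi(x_i)) + b) <= 0:  # misclassified (or zero margin)
                w += y_i * phi(x_i)                   # w <- w + y_i * phi(x_i)
                b += y_i                              # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                             # every example correct: converged
            break
    return w, b
```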
Kernelized Perceptron: Functional Form

    initialize f = 0
    repeat
        pick (x_i, y_i) from data
        if y_i f(x_i) ≤ 0 then              ("increase its vote by 1")
            f(·) ← f(·) + y_i k(x_i, ·)     equivalently: α_i ← α_i + y_i
    until y_i f(x_i) > 0 for all i

• Instead of updating w, now update α_i
• Weight vector is a linear combination
      w = Σ_{i ∈ I} α_i φ(x_i)
• Classifier is a linear combination of inner products
      f(x) = Σ_{i ∈ I} α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i ∈ I} α_i k(x_i, x)
Kernelized Perceptron: Primal Form vs. Dual Form

• Primal form: update weights  w ← w + y_i φ(x_i);  classify with f(x) = w · φ(x)
• Dual form: update linear coefficients  α_i ← α_i + y_i;  implicitly equivalent, since
      w = Σ_{i ∈ I} α_i φ(x_i)
• Nothing happens if classified correctly
• Classifier is a linear combination of inner products
      f(x) = Σ_{i ∈ I} α_i ⟨φ(x_i), φ(x)⟩ = Σ_{i ∈ I} α_i k(x_i, x)
Kernelized Perceptron: Primal Form vs. Dual Form

• Primal form: update weights  w ← w + y_i φ(x_i)
• Dual form: update linear coefficients  α_i ← α_i + y_i;  implicitly  w = Σ_{i ∈ I} α_i φ(x_i)
• Classify:
      f(x) = w · φ(x) = [Σ_{i ∈ I} α_i φ(x_i)] · φ(x)
           = Σ_{i ∈ I} α_i ⟨φ(x_i), φ(x)⟩      slow: O(d²)
           = Σ_{i ∈ I} α_i k(x_i, x)           fast: O(d)
Kernelized Perceptron: Dual Form

    initialize α_i = 0 for all i
    repeat
        pick (x_i, y_i) from data
        if y_i f(x_i) ≤ 0 then
            α_i ← α_i + y_i        (implicitly: w = Σ_{i ∈ I} α_i φ(x_i))
    until y_i f(x_i) > 0 for all i

Classify:
    f(x) = w · φ(x) = [Σ_{i ∈ I} α_i φ(x_i)] · φ(x)
         = Σ_{i ∈ I} α_i ⟨φ(x_i), φ(x)⟩      slow: O(d²)
         = Σ_{i ∈ I} α_i k(x_i, x)           fast: O(d)

If #features >> #examples, the dual is easier; otherwise the primal is easier.
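Putting the dual algorithm together: a minimal sketch of the kernelized perceptron, assuming NumPy and a kernel passed in as a callable (all names are illustrative, not a reference implementation):

```python
import numpy as np

def kernel_perceptron(X, y, k, max_epochs=100):
    """Dual (kernelized) perceptron: learns alpha, never forms phi(x) or w."""
    n = len(X)
    alpha = np.zeros(n)
    # Precompute the kernel (Gram) matrix K[i, j] = k(x_i, x_j) once.
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f_xi = np.dot(alpha, K[:, i])   # f(x_i) = sum_j alpha_j k(x_j, x_i)
            if y[i] * f_xi <= 0:            # misclassified (or zero margin)
                alpha[i] += y[i]            # alpha_i <- alpha_i + y_i
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(alpha, X_train, k, x):
    """f(x) = sum_i alpha_i k(x_i, x); its sign is the predicted label."""
    return sum(a * k(x_i, x) for a, x_i in zip(alpha, X_train))
```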
Kernelized Perceptron: Primal vs. Dual

• Primal perceptron: update weights  w ← w + y_i φ(x_i);  classify with f(x) = w · φ(x)
• Dual perceptron: update linear coefficients  α_i ← α_i + y_i;  implicitly  w = Σ_{i ∈ I} α_i φ(x_i)
• If #features >> #examples, the dual is easier; otherwise the primal is easier
• Q: when is #features >> #examples?
  A: with higher-order polynomial kernels or exponential kernels (infinite-dimensional feature space)
Kernelized Perceptron: Pros/Cons of the Kernel in the Dual

• Pros:
  • no need to compute φ(x)  (time)
  • no need to store φ(x) and w  (memory)
• Cons:
  • at test time, must sum over all misclassified training examples
  • need to store all misclassified training examples  (memory)
  • this stored set is called the "support vector set"
  • SVM will minimize this set!

Recap of the dual: update α_i ← α_i + y_i, implicitly w = Σ_{i ∈ I} α_i φ(x_i);
classify with f(x) = Σ_{i ∈ I} α_i k(x_i, x)  (fast, O(d)) rather than expanding
Σ_{i ∈ I} α_i ⟨φ(x_i), φ(x)⟩  (slow, O(d²)).
Kernelized Perceptron: Worked Example (Linear Kernel)

Data (linear kernel, identity feature map):
    x1 = (0, 1): −1      x2 = (2, 1): +1      x3 = (0, −1): +1

Updates (dual α with implicit w, vs. primal w):
• update on x1: −1    dual α = (−1, 0, 0),  w (implicit) = −x1              primal w = (0, −1)
• update on x2: +1    dual α = (−1, 1, 0),  w (implicit) = −x1 + x2         primal w = (2, 0)
• update on x3: +1    dual α = (−1, 1, 1),  w (implicit) = −x1 + x2 + x3    primal w = (2, −1)

Final implicit w = (2, −1).

Geometric interpretation of dual classification: the sum of dot products with x2 and x3 is bigger than the dot product with x1 (agreement with positives > agreement with the negative).
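A quick numeric check of this trace (a sketch, assuming NumPy; one pass in the given order mirrors the updates on the slide):

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 1.0], [0.0, -1.0]])
y = np.array([-1.0, 1.0, 1.0])
k = np.dot                      # linear kernel = ordinary dot product

alpha = np.zeros(3)
for i in range(3):              # one pass over x1, x2, x3 in order
    f_xi = sum(alpha[j] * k(X[j], X[i]) for j in range(3))
    if y[i] * f_xi <= 0:
        alpha[i] += y[i]

w = sum(a * x for a, x in zip(alpha, X))   # implicit weight vector
print(alpha, w)                 # expected: [-1.  1.  1.] and [ 2. -1.]
```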
XOR Example (Dual Perceptron)

Data: x1: +1, x2: −1, x3: +1, x4: −1 (the XOR configuration).
Kernel: k(x, x') = (x · x')²  ⇔  φ(x) = (x_1², x_2², √2 x_1 x_2)

• update on x1: +1    α = (+1, 0, 0, 0)     w (implicit) = φ(x1)
• update on x2: −1    α = (+1, −1, 0, 0)    w (implicit) = φ(x1) − φ(x2) = (0, 0, 2√2)

Classification rule (x · x1)² > (x · x2)²:
• in the dual, geometrically:  ⇒ cos² θ_1 > cos² θ_2  ⇒ |cos θ_1| > |cos θ_2|
• in the dual, algebraically:  ⇒ (x_1 + x_2)² > (x_1 − x_2)²  ⇒ x_1 x_2 > 0
(also verify in the primal)
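To make the trace concrete, here is a sketch that reruns the dual perceptron on XOR with the quadratic kernel. The coordinates x1 = (1, 1), x2 = (1, −1), x3 = (−1, −1), x4 = (−1, 1) are assumed from the usual XOR picture; the slide only gives the labels.

```python
import numpy as np

# Assumed XOR coordinates and labels (x1, x2, x3, x4 in slide order).
X = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])

def k(a, b):
    return np.dot(a, b) ** 2    # quadratic kernel (x . x')^2

alpha = np.zeros(4)
for i in range(4):              # one pass; only x1 and x2 trigger updates
    f_xi = sum(alpha[j] * k(X[j], X[i]) for j in range(4))
    if y[i] * f_xi <= 0:
        alpha[i] += y[i]

print(alpha)                    # expected: [ 1. -1.  0.  0.]
# Final classifier f(x) = k(x1, x) - k(x2, x) = (x.x1)^2 - (x.x2)^2 separates XOR:
for x_i, y_i in zip(X, y):
    f = sum(a * k(x_j, x_i) for a, x_j in zip(alpha, X))
    print(y_i, f)               # each f has the same sign as its label
```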
Circle Example??

Dual perceptron with the quadratic kernel:
    k(x, x') = (x · x')²  ⇔  φ(x) = (x_1², x_2², √2 x_1 x_2)

• update on x1: +1    α = (+1, 0, 0, 0)     w (implicit) = φ(x1)
• update on x2: −1    α = (+1, −1, 0, 0)    w (implicit) = φ(x1) − φ(x2)
Polynomial Kernels

Idea: We want to extend  k(x, x') = ⟨x, x'⟩²  to  k(x, x') = (⟨x, x'⟩ + c)^d  where c > 0 and d ∈ ℕ. Prove that such a kernel corresponds to a dot product.

Proof strategy: Simple and straightforward: compute the explicit sum given by the kernel, i.e.

    k(x, x') = (⟨x, x'⟩ + c)^d = Σ_{i=0}^{d} (d choose i) ⟨x, x'⟩^i c^{d−i}

Individual terms ⟨x, x'⟩^i are dot products for some Φ_i(x).

Simpler proof: adding c just augments the space; set the extra coordinate x_0 = √c.
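A sketch of the "simpler proof" for d = 2: augment each vector with the coordinate √c, and the plain quadratic kernel on the augmented vectors equals (⟨x, x'⟩ + c)², which we already know is a dot product of explicit quadratic features (NumPy, illustrative names):

```python
import numpy as np

def poly_kernel(x, xp, c=1.0, d=2):
    """Polynomial kernel (<x, x'> + c)^d."""
    return (np.dot(x, xp) + c) ** d

def augment(x, c=1.0):
    """Append the coordinate sqrt(c), so <aug(x), aug(x')> = <x, x'> + c."""
    return np.append(x, np.sqrt(c))

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, xp, c=1.0, d=2))          # 4.0
print(np.dot(augment(x), augment(xp)) ** 2)    # 4.0, the quadratic kernel on augmented vectors
```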
Circle Example

[Figure: five labeled points x1–x5 in the (x, y) plane; circle-separable data with +1 and −1 labels]

Dual perceptron updates:
• update on x1: +1    α = (+1, 0, 0, 0, 0)     w (implicit) = φ(x1)
• update on x2: −1    α = (+1, −1, 0, 0, 0)    w (implicit) = φ(x1) − φ(x2)
• update on x3: −1    α = (+1, −1, −1, 0, 0)

k(x, x') = (x · x')²      ⇔  φ(x) = (x_1², x_2², √2 x_1 x_2)
k(x, x') = (x · x' + 1)²  ⇔  φ(x) = ?
Examples of Kernels
(you only need to know polynomial and Gaussian)

    Linear              k(x, x') = ⟨x, x'⟩
    Laplacian RBF       k(x, x') = exp(−λ ‖x − x'‖)              (distorts distance)
    Gaussian RBF        k(x, x') = exp(−λ ‖x − x'‖²)
    Polynomial          k(x, x') = (⟨x, x'⟩ + c)^d,  c ≥ 0, d ∈ ℕ  (distorts angle)
    B-Spline            k(x, x') = B_{2n+1}(x − x')
    Cond. Expectation   k(x, x') = E_c[ p(x | c) p(x' | c) ]

Simple trick for checking Mercer's condition: compute the Fourier transform of the kernel and check that it is nonnegative.
Kernel Summary

• For a feature map φ, find a magic function k such that the dot product φ(x) ∙ φ(x') = k(x, x')
• This k(x, x') should be much faster to compute than φ(x):
  • k(x, x') should be computable in O(n) if x ∈ ℝⁿ
  • φ(x) is much slower: O(n^d) for a degree-d polynomial, and even more for the Gaussian kernel
• But for an arbitrary function k, is there a φ such that φ(x) ∙ φ(x') = k(x, x')?

Examples of kernels (previous slide): linear ⟨x, x'⟩, Laplacian RBF exp(−λ ‖x − x'‖), Gaussian RBF exp(−λ ‖x − x'‖²), polynomial (⟨x, x'⟩ + c)^d with c ≥ 0, d ∈ ℕ, B-Spline.
Mercer's Theorem

The Theorem: For any symmetric function k : X × X → ℝ which is square integrable in X × X and which satisfies

    ∫_{X × X} k(x, x') f(x) f(x') dx dx' ≥ 0   for all f ∈ L²(X),

there exist φ_i : X → ℝ and numbers λ_i ≥ 0 such that

    k(x, x') = Σ_i λ_i φ_i(x) φ_i(x')   for all x, x' ∈ X.

Interpretation: the double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have

    Σ_i Σ_j k(x_i, x_j) α_i α_j ≥ 0.
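In practice, the finite-sample analogue of Mercer's condition is that every kernel (Gram) matrix is positive semidefinite. A sketch that checks this numerically for the Gaussian RBF kernel (assumed names, NumPy):

```python
import numpy as np

def gaussian_rbf(x, xp, lam=0.5):
    """Gaussian RBF kernel exp(-lambda * ||x - x'||^2)."""
    return np.exp(-lam * np.sum((x - xp) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                      # 20 random points in R^3
K = np.array([[gaussian_rbf(a, b) for b in X] for a in X])

# Finite-sample Mercer check: the Gram matrix should be (numerically) PSD,
# i.e. all eigenvalues >= 0 up to floating-point error.
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)                    # True
```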
Properties

Distance in feature space: the distance between points in feature space is

    d(x, x')² := ‖Φ(x) − Φ(x')‖²
               = ⟨Φ(x), Φ(x)⟩ − 2⟨Φ(x), Φ(x')⟩ + ⟨Φ(x'), Φ(x')⟩
               = k(x, x) + k(x', x') − 2 k(x, x')

Kernel matrix: to compare observations we compute dot products, so we study the matrix K given by

    K_ij = ⟨Φ(x_i), Φ(x_j)⟩ = k(x_i, x_j)

where the x_i are the training patterns.

Similarity measure: the entries K_ij tell us the overlap between Φ(x_i) and Φ(x_j), so k(x_i, x_j) is a similarity measure.
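A small sketch of the distance formula: compute ‖Φ(x) − Φ(x')‖ for the quadratic kernel purely from kernel evaluations, never forming Φ (assumed names, NumPy):

```python
import numpy as np

def k(x, xp):
    return np.dot(x, xp) ** 2            # quadratic kernel (x . x')^2

def feature_distance(x, xp):
    """||Phi(x) - Phi(x')|| computed purely from kernel evaluations."""
    return np.sqrt(k(x, x) + k(xp, xp) - 2 * k(x, xp))

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(feature_distance(x, xp))           # matches the distance between explicit quadratic features
```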
Kernelized Pegasos for SVM. For HW2, you don't need to randomly choose training examples: just go over all training examples in the original order, and call that one epoch (same as HW1).
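A minimal sketch of kernelized Pegasos in the spirit of this note: loop over the examples in their original order instead of sampling at random, one pass = one epoch. The function and variable names, and the choice of λ, are illustrative, not the HW2 reference implementation.

```python
import numpy as np

def kernelized_pegasos(X, y, k, lam=0.1, epochs=10):
    """Kernelized Pegasos sketch: alpha[i] counts margin violations of example i."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    t = 0
    for _ in range(epochs):
        for i in range(n):               # deterministic pass, as in the HW2 note
            t += 1
            eta = 1.0 / (lam * t)        # step-size / scaling term
            margin = y[i] * eta * np.dot(alpha * y, K[:, i])
            if margin < 1:               # hinge-loss violation: give example i one more vote
                alpha[i] += 1
    # Resulting classifier: f(x) = (1 / (lam * t)) * sum_i alpha_i y_i k(x_i, x)
    return alpha
```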