
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 4



  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 4. Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams, Percy Liang)

  2. Kernel Regression

  3. Basis Function Regression. Linear regression: y(x, w) = wᵀx. Basis function regression: y(x, w) = wᵀφ(x). For N samples: y = Φw, with Φ := Φ(X). Polynomial regression: φ(x) = (1, x, x², …, x^M)ᵀ. A minimal sketch follows below.
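
As a concrete sketch of the previous slide (not code from the lecture; the toy data and the degree M = 3 are assumptions for illustration), basis function regression can be done by building a polynomial design matrix and solving least squares:

```python
import numpy as np

# Toy 1-D data (made up for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=20)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(20)

M = 3                                          # polynomial degree
Phi = np.vander(x, M + 1, increasing=True)     # N x (M+1) design matrix: 1, x, x^2, x^3

# Least-squares fit of y = Phi w
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Predict at new inputs using the same basis functions
x_new = np.linspace(-1, 1, 5)
y_pred = np.vander(x_new, M + 1, increasing=True) @ w
```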

  4. Basis Function Regression. [Plot: polynomial fit with M = 3; target t vs. input x.]

  5. The Kernel Trick. Define a kernel function k(x, x') := φ(x)ᵀφ(x') such that k can be cheaper to evaluate than φ!
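
A small numeric sketch of this point (assumptions: a degree-2 polynomial kernel and NumPy; not code from the slides): evaluating k(x, x') = (xᵀx')² costs O(d), whereas the equivalent explicit feature map contains all O(d²) products xᵢxⱼ:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all pairwise products x_i * x_j (O(d^2) features)."""
    return np.outer(x, x).ravel()

def k(x, xp):
    """Kernel trick: the same inner product evaluated in O(d)."""
    return float(x @ xp) ** 2

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(500), rng.standard_normal(500)

# Both compute <phi(x), phi(x')>, but k never builds the 250,000-dimensional feature vector.
assert np.isclose(phi(x) @ phi(xp), k(x, xp))
```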

  6. Kernel Ridge Regression. MAP / expected value for the weights (requires inversion of a D × D matrix): A := ΦᵀΦ + λI, E[w | y] = A⁻¹Φᵀy, with Φ := Φ(X). Alternate representation (requires inversion of an N × N matrix): A⁻¹Φᵀ = Φᵀ(K + λI)⁻¹, K := ΦΦᵀ. Predictive posterior (using the kernel function): E[f(x*) | y] = φ(x*)ᵀ E[w | y] = φ(x*)ᵀ Φᵀ (K + λI)⁻¹ y.

  7. Kernel Ridge Regression. MAP / expected value for the weights (requires inversion of a D × D matrix): A := ΦᵀΦ + λI, E[w | y] = A⁻¹Φᵀy, with Φ := Φ(X). Alternate representation (requires inversion of an N × N matrix): A⁻¹Φᵀ = Φᵀ(K + λI)⁻¹, K := ΦΦᵀ. Predictive posterior (using the kernel function): E[f(x*) | y] = φ(x*)ᵀ E[w | y] = φ(x*)ᵀ Φᵀ (K + λI)⁻¹ y. In components: E[f(x*) | y] = Σ_{n,m} k(x*, xₙ) [(K + λI)⁻¹]_{nm} y_m.

  8. Kernel Ridge Regression. Closed-form solution of the regularized objective f* = argmin_{f ∈ H} Σ_{i=1}^n (yᵢ − ⟨f, φ(xᵢ)⟩_H)² + λ‖f‖²_H. [Plots: fits for (λ = 0.1, σ = 0.6), (λ = 10, σ = 0.6), (λ = 1e−07, σ = 0.6).]
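
A minimal NumPy sketch of the closed-form kernel ridge regression predictor from the preceding slides, E[f(x*) | y] = k(x*, X)(K + λI)⁻¹y. The RBF kernel, the toy data, and the exact hyperparameter values are assumptions for illustration (chosen to mirror the λ = 0.1, σ = 0.6 panel):

```python
import numpy as np

def rbf(A, B, sigma=0.6):
    """Gram matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-0.5, 1.5, size=(30, 1))             # training inputs
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)

lam = 0.1
K = rbf(X, X)                                         # N x N Gram matrix
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)  # (K + lam I)^(-1) y

X_star = np.linspace(-0.5, 1.5, 7)[:, None]
f_star = rbf(X_star, X) @ alpha                       # E[f(x*) | y]
```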

  9. Gaussian Processes (a.k.a. Kernel Ridge Regression with Variance Estimates). Predictive distribution: p(y* | x*, x, y) = N( k(x*, x)ᵀ[K + σ²_noise I]⁻¹ y ,  k(x*, x*) + σ²_noise − k(x*, x)ᵀ[K + σ²_noise I]⁻¹ k(x*, x) ). [Plots: functions drawn from the prior and from the posterior; output f(x) vs. input x.] Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
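
The same computation extended with the predictive variance from this slide, as a sketch (again with an assumed RBF kernel and toy data; σ²_noise plays the role that λ played above):

```python
import numpy as np

def rbf(A, B, sigma=0.6):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)

noise_var = 0.1**2
K = rbf(X, X)
K_inv_y = np.linalg.solve(K + noise_var * np.eye(len(X)), y)

X_star = np.linspace(-5, 5, 9)[:, None]
K_star = rbf(X_star, X)                      # k(x*, X)
mean = K_star @ K_inv_y                      # posterior mean

# Predictive variance: k(x*, x*) + sigma_noise^2 - k(x*, X)(K + sigma_noise^2 I)^(-1) k(X, x*)
solve_term = np.linalg.solve(K + noise_var * np.eye(len(X)), K_star.T)
var = rbf(X_star, X_star).diagonal() + noise_var - np.sum(K_star * solve_term.T, axis=1)
```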

  10. Choosing Kernel Hyperparameters: Characteristic Lengthscales. The mean posterior predictive function is plotted for 3 different length scales ("too short", "about right", "too long") of the covariance function k(x, x') = v² exp(−(x − x')² / (2ℓ²)) + σ²_noise δ_{x,x'}. [Plot: function value y vs. input x for the three length scales.] Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
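
The covariance function on this slide, written out as a sketch (the hyperparameter values for v, ℓ, and σ_noise are arbitrary assumptions; "too short" / "about right" / "too long" correspond to different choices of ℓ):

```python
import numpy as np

def k(x, xp, v=1.0, ell=1.0, noise_sigma=0.1):
    """k(x, x') = v^2 exp(-(x - x')^2 / (2 ell^2)) + noise_sigma^2 * delta_{x, x'}."""
    return v**2 * np.exp(-(x - xp)**2 / (2 * ell**2)) + noise_sigma**2 * (x == xp)

# Shorter length scales make k(x, x') drop off faster with |x - x'|, giving wigglier
# posterior mean functions; longer length scales give smoother, stiffer fits.
for ell in (0.5, 2.0, 10.0):
    print(ell, k(0.0, 1.0, ell=ell))
```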

  11. Intermezzo: Kernels. Borrowing from: Arthur Gretton (Gatsby, UCL)

  12. Hilbert Spaces. Definition (Inner product): Let H be a vector space over R. A function ⟨·,·⟩_H : H × H → R is an inner product on H if (1) it is linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H; (2) it is symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H; (3) ⟨f, f⟩_H ≥ 0, with ⟨f, f⟩_H = 0 if and only if f = 0. Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H.

  13. Example: Fourier Bases

  14. Example: Fourier Bases

  15. Example: Fourier Bases

  16. Example: Fourier Bases Fourier modes define a vector space

  17. Kernels. Definition: Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that for all x, x' ∈ X, k(x, x') := ⟨φ(x), φ(x')⟩_H. Almost no conditions are placed on X (e.g., X itself doesn't need an inner product; it could be a set of documents). A single kernel can correspond to several possible feature maps. A trivial example for X := R: φ₁(x) = x and φ₂(x) = [x/√2, x/√2]ᵀ.
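
The trivial example can be checked numerically (a sketch; the test points are made up): both feature maps induce the same kernel k(x, x') = x·x' on X = R, even though they map into different Hilbert spaces:

```python
import numpy as np

phi1 = lambda x: np.array([x])                               # H = R
phi2 = lambda x: np.array([x / np.sqrt(2), x / np.sqrt(2)])  # H = R^2

x, xp = 1.7, -0.3
# Different feature spaces, identical kernel values: k(x, x') = x * x'.
assert np.isclose(phi1(x) @ phi1(xp), phi2(x) @ phi2(xp))
```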

  18. Sums, Transformations, Products. Theorem (Sums of kernels are kernels): Given α > 0 and kernels k, k₁, k₂ on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?). Theorem (Mappings between spaces): Let X and X̃ be sets, define a map A : X → X̃ and a kernel k on X̃. Then k(A(x), A(x')) is a kernel on X. Example: k(x, x') = x² (x')². Theorem (Products of kernels are kernels): Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X. Proof: main idea only!
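
A sketch of these closure rules written as plain Python functions (the base kernels and test points are assumptions): scalings, sums, products, and pre-composition with a map A all produce new kernels from old ones:

```python
import numpy as np

k_lin = lambda x, xp: x * xp                      # linear kernel on R
k_rbf = lambda x, xp: np.exp(-(x - xp)**2)        # Gaussian kernel on R

scale = lambda a, k: (lambda x, xp: a * k(x, xp))              # alpha * k
add   = lambda k1, k2: (lambda x, xp: k1(x, xp) + k2(x, xp))   # k1 + k2
mul   = lambda k1, k2: (lambda x, xp: k1(x, xp) * k2(x, xp))   # k1 x k2
remap = lambda k, A: (lambda x, xp: k(A(x), A(xp)))            # k(A(x), A(x'))

# The example from the slide: k(x, x') = x^2 (x')^2 is the linear kernel after A(x) = x^2.
k_sq = remap(k_lin, lambda x: x**2)
print(k_sq(2.0, 3.0))   # 36.0
```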

  19. Polynomial Kernels. Theorem (Polynomial kernels): Let x, x' ∈ R^d for d ≥ 1, let m ≥ 1 be an integer and c ≥ 0 a non-negative real. Then k(x, x') := (⟨x, x'⟩ + c)^m is a valid kernel. To prove it: expand into a sum (with non-negative scalars) of kernels ⟨x, x'⟩ raised to integer powers. These individual terms are valid kernels by the product rule.
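
For example, with d = 2, m = 2, c = 1 (a worked sketch, not on the slide), expanding (⟨x, x'⟩ + 1)² gives the explicit feature map φ(x) = (x₁², x₂², √2 x₁x₂, √2 x₁, √2 x₂, 1):

```python
import numpy as np

def k_poly(x, xp, c=1.0, m=2):
    return (x @ xp + c) ** m

def phi(x):
    """Explicit features for the (<x, x'> + 1)^2 kernel in d = 2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x, xp = np.array([0.5, -1.0]), np.array([2.0, 0.25])
assert np.isclose(k_poly(x, xp), phi(x) @ phi(xp))
```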

  20. Infinite Sequences. Definition: The space ℓ² (square-summable sequences) comprises all sequences a := (aᵢ)_{i≥1} for which ‖a‖²_{ℓ²} = Σ_{i=1}^∞ aᵢ² < ∞. Definition: Given a sequence of functions (φᵢ(x))_{i≥1} in ℓ², where φᵢ : X → R is the i-th coordinate of φ(x), then k(x, x') := Σ_{i=1}^∞ φᵢ(x) φᵢ(x') is a kernel.

  21. Infinite Sequences. Why square summable? By Cauchy-Schwarz, |Σ_{i=1}^∞ φᵢ(x) φᵢ(x')| ≤ ‖φ(x)‖_{ℓ²} ‖φ(x')‖_{ℓ²}, so the sum defining the inner product converges for all x, x' ∈ X.

  22. Taylor Series Kernels. Definition (Taylor series kernel): For r ∈ (0, ∞], with aₙ ≥ 0 for all n ≥ 0, let f(z) = Σ_{n=0}^∞ aₙ zⁿ for |z| < r, z ∈ R. Define X to be the √r-ball in R^d, so ‖x‖ < √r; then k(x, x') := f(⟨x, x'⟩) = Σ_{n=0}^∞ aₙ ⟨x, x'⟩ⁿ is a kernel. Example (Exponential kernel): k(x, x') := exp(⟨x, x'⟩).
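
A sketch of the exponential kernel viewed as a Taylor-series kernel (the truncation order and the test points are arbitrary): the partial sums Σₙ ⟨x, x'⟩ⁿ / n! converge to exp(⟨x, x'⟩):

```python
import numpy as np
from math import factorial

x, xp = np.array([0.3, -0.2]), np.array([0.1, 0.4])
z = x @ xp

k_exact = np.exp(z)
# Taylor-series kernel with coefficients a_n = 1/n!, truncated at 10 terms.
k_truncated = sum(z**n / factorial(n) for n in range(10))

print(k_exact, k_truncated)   # agree to many decimal places for small |z|
```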

  23. Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel). Example (Gaussian kernel): The Gaussian kernel on R^d is defined as k(x, x') := exp(−γ ‖x − x'‖²). Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.
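
The hint can be spot-checked numerically (a sketch; γ and the points are arbitrary): the Gaussian kernel factors as exp(−γ‖x‖²) · exp(2γ⟨x, x'⟩) · exp(−γ‖x'‖²), i.e. an exponential kernel multiplied by a function of x times the same function of x', which the product and mapping rules cover:

```python
import numpy as np

gamma = 0.7
x, xp = np.array([1.0, -0.5]), np.array([0.2, 0.8])

k_gauss = np.exp(-gamma * np.sum((x - xp)**2))
# Factorization used in the proof sketch: f(x) * exp-kernel(2*gamma*<x, x'>) * f(x')
factored = (np.exp(-gamma * x @ x) * np.exp(2 * gamma * (x @ xp)) * np.exp(-gamma * xp @ xp))

assert np.isclose(k_gauss, factored)
```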

  24. Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel). Example (Gaussian kernel): The Gaussian kernel on R^d is defined as k(x, x') := exp(−γ ‖x − x'‖²). Proof: an exercise! Use the product rule, mapping rule, and exponential kernel. [Plots: Squared Exponential (SE) and Automatic Relevance Determination (ARD) variants.]

  25. Products of Kernels. Base kernels: Squared-exp (SE): k(x, x') = σ²_f exp(−(x − x')² / (2ℓ²)); Periodic (Per): k(x, x') = σ²_f exp(−(2/ℓ²) sin²(π(x − x')/p)); Linear (Lin): k(x, x') = σ²_f (x − c)(x' − c). [Plots: the base kernels and their products Lin × Lin, SE × Per, Lin × SE, Lin × Per, shown as functions of x − x' or of x with x' = 1.] Source: David Duvenaud (PhD thesis).
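
A sketch of the three base kernels on this slide and of taking their products (the hyperparameter values are arbitrary assumptions):

```python
import numpy as np

def se(x, xp, sigma_f=1.0, ell=1.0):
    return sigma_f**2 * np.exp(-(x - xp)**2 / (2 * ell**2))

def per(x, xp, sigma_f=1.0, ell=1.0, p=1.0):
    return sigma_f**2 * np.exp(-2 * np.sin(np.pi * (x - xp) / p)**2 / ell**2)

def lin(x, xp, sigma_f=1.0, c=0.0):
    return sigma_f**2 * (x - c) * (xp - c)

# Products of kernels are kernels: e.g. locally periodic structure (SE x Per)
# and periodic structure with growing amplitude (Lin x Per).
x, xp = 0.3, 1.4
print(se(x, xp) * per(x, xp), lin(x, xp) * per(x, xp))
```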

  26. Positive Definiteness. Definition (Positive definite functions): A symmetric function k : X × X → R is positive definite if for all n ≥ 1, all (a₁, …, aₙ) ∈ Rⁿ, and all (x₁, …, xₙ) ∈ Xⁿ, Σ_{i=1}^n Σ_{j=1}^n aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0. The function k(·, ·) is strictly positive definite if, for mutually distinct xᵢ, equality holds only when all the aᵢ are zero.
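
This condition can be spot-checked numerically for the Gaussian kernel (a sketch; the data and γ are arbitrary): the Gram matrix should have no negative eigenvalues, so the quadratic form aᵀKa is non-negative for any a:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
gamma = 0.5

sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-gamma * sq_dists)                 # Gram matrix of the Gaussian kernel

eigvals = np.linalg.eigvalsh(K)
a = rng.standard_normal(50)
print(eigvals.min())          # >= 0 up to floating-point error
print(a @ K @ a)              # the quadratic form sum_ij a_i a_j k(x_i, x_j) >= 0
```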

  27. Mercer's Theorem. Theorem: Let H be a Hilbert space, X a non-empty set and φ : X → H. Then k(x, y) := ⟨φ(x), φ(y)⟩_H is positive definite. Proof: Σ_{i=1}^n Σ_{j=1}^n aᵢ aⱼ k(xᵢ, xⱼ) = ⟨Σ_{i=1}^n aᵢ φ(xᵢ), Σ_{j=1}^n aⱼ φ(xⱼ)⟩_H = ‖Σ_{i=1}^n aᵢ φ(xᵢ)‖²_H ≥ 0. The reverse also holds: a positive definite k(x, x') is an inner product in a unique H (Moore-Aronszajn: coming later!).

  28. Dimensionality Reduction. Borrowing from: Percy Liang (Stanford)

  29. Linear Dimensionality Reduction. Idea: project a high-dimensional vector onto a lower-dimensional space, e.g. x ∈ R^361, z = Uᵀx, z ∈ R^10.

  30. Problem Setup. Given n data points in d dimensions: x₁, …, xₙ ∈ R^d, collected as X = (x₁ ··· xₙ) ∈ R^(d×n) (the transpose of the X used in regression!).

  31. Problem Setup. Given n data points in d dimensions: x₁, …, xₙ ∈ R^d, X = (x₁ ··· xₙ) ∈ R^(d×n). Want to reduce the dimensionality from d to k. Choose k directions u₁, …, u_k, collected as U = (u₁ ··· u_k) ∈ R^(d×k).

  32. Problem Setup. Given n data points in d dimensions: x₁, …, xₙ ∈ R^d, X = (x₁ ··· xₙ) ∈ R^(d×n). Want to reduce the dimensionality from d to k. Choose k directions u₁, …, u_k, collected as U = (u₁ ··· u_k) ∈ R^(d×k). For each u_j, compute the "similarity" z_j = u_jᵀ x.

  33. Problem Setup. Given n data points in d dimensions: x₁, …, xₙ ∈ R^d, X = (x₁ ··· xₙ) ∈ R^(d×n). Want to reduce the dimensionality from d to k. Choose k directions u₁, …, u_k, collected as U = (u₁ ··· u_k) ∈ R^(d×k). For each u_j, compute the "similarity" z_j = u_jᵀ x. Project x down to z = (z₁, …, z_k)ᵀ = Uᵀx. How to choose U?

  34. Principal Component Analysis. x ∈ R^361, z = Uᵀx, z ∈ R^10. Optimize two equivalent objectives: 1. Minimize the reconstruction error. 2. Maximize the projected variance.

  35. PCA Objective 1: Reconstruction Error. U serves two functions: • Encode: z = Uᵀx, z_j = u_jᵀ x. • Decode: x̃ = Uz = Σ_{j=1}^k z_j u_j.
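
A sketch of the whole pipeline (PCA computed via the SVD of the centered data matrix; the toy data and the choice k = 2 are assumptions), including the encode step on this slide, the decode step, and the two objectives from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 10, 200, 2
X = rng.standard_normal((d, n)) * np.arange(1, d + 1)[:, None]  # d x n data matrix

Xc = X - X.mean(axis=1, keepdims=True)        # center each dimension
U_full, S, _ = np.linalg.svd(Xc, full_matrices=False)
U = U_full[:, :k]                             # d x k: top-k principal directions

Z = U.T @ Xc                                  # encode: z = U^T x for every column
X_hat = U @ Z                                 # decode: x_hat = U z
recon_error = np.sum((Xc - X_hat)**2)         # objective 1: reconstruction error
proj_var = np.sum(Z**2)                       # objective 2: projected variance (up to 1/n)
```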
