

  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 3. Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams)

  2. Linear Regression

  3. Linear Regression: Assume f is a linear combination of D features, y = wᵀx + ε, with noise ε ∼ Norm(0, σ²). For N points we write y_n = wᵀx_n + ε_n, n = 1, …, N. Learning task: estimate w.

  4. Linear Regression

  5. Error Measure: Sum of Squares. Mean Squared Error (MSE): E(w) = (1/N) ∑_{n=1}^{N} (wᵀx_n − y_n)² = (1/N) ‖Xw − y‖², where X is the N × D matrix whose rows are x_1ᵀ, …, x_Nᵀ and y = (y_1, …, y_N)ᵀ.

  6. Minimizing the Error: E(w) = (1/N) ‖Xw − y‖². Setting the gradient to zero, ∇E(w) = (2/N) Xᵀ(Xw − y) = 0, gives XᵀXw = Xᵀy, so w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X.

  7. Minimizing the Error: ∇E(w) = (2/N) Xᵀ(Xw − y) = 0 gives XᵀXw = Xᵀy, so w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X. Reference: Matrix Cookbook (on course website).

  8. Ordinary Least Squares: Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} (each x includes x_0 = 1), stacking the rows x_nᵀ into X and the targets y_n into y. Compute X† = (XᵀX)⁻¹Xᵀ. Return w = X†y.
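
A minimal sketch of this procedure in NumPy (the toy data and variable names are illustrative, not from the slides):

```python
import numpy as np

def ols_fit(X_raw, y):
    """Ordinary least squares: w = (X^T X)^{-1} X^T y."""
    N = X_raw.shape[0]
    # Prepend the constant feature x_0 = 1 to each row, as on the slide.
    X = np.hstack([np.ones((N, 1)), X_raw])
    # Solving the normal equations is more stable than forming the inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage: y ≈ 2 + 3x with a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(50, 1))
y = 2 + 3 * x[:, 0] + 0.1 * rng.standard_normal(50)
print(ols_fit(x, y))  # roughly [2, 3]
```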

  9. Basis Function Regression: Linear regression models y_n = wᵀx_n + ε_n; basis function regression replaces the raw features with basis functions, y_n = wᵀφ(x_n) + ε_n for the N samples. Polynomial regression is the special case φ(x) = (1, x, x², …, x^M).
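
A sketch of polynomial basis regression under this setup (the degree choices and helper names are assumptions for illustration):

```python
import numpy as np

def poly_fit(x, t, M):
    """Fit degree-M polynomial regression by least squares on the basis phi(x) = (1, x, ..., x^M)."""
    Phi = np.vander(x, M + 1, increasing=True)   # N x (M+1) design matrix of basis functions
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # least-squares weights
    return w

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(1).standard_normal(10)
w3 = poly_fit(x, t, M=3)   # reasonable fit
w9 = poly_fit(x, t, M=9)   # interpolates the noise (overfits), cf. the next slides
```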

  10. Polynomial Regression: [figure: polynomial fits of degree M = 0, 1, 3, 9 to the same data, t plotted against x]

  11. Polynomial Regression: [same figure; the low-degree fits M = 0 and M = 1 underfit]

  12. Polynomial Regression: [same figure; the high-degree fit M = 9 overfits]

  13. Regularization: L2 regularization (ridge regression) minimizes E(w) = (1/N) ‖Xw − y‖² + λ‖w‖², where λ ≥ 0 and ‖w‖² = wᵀw. L1 regularization (LASSO) minimizes E(w) = (1/N) ‖Xw − y‖² + λ|w|₁, where λ ≥ 0 and |w|₁ = ∑_{i=1}^{D} |w_i|.

  14. Regularization

  15. Regularization: L2 has a closed-form solution, w = (XᵀX + λI)⁻¹Xᵀy. L1 has no closed-form solution; use quadratic programming: minimize ‖Xw − y‖² subject to ‖w‖₁ ≤ s.
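
A minimal sketch of the L2 closed-form solution (whether the bias column should be penalized is not addressed on the slides and is left as-is here):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: w = (X^T X + lambda I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(100)
print(ridge_fit(X, y, lam=0.0))   # ordinary least squares
print(ridge_fit(X, y, lam=10.0))  # coefficients shrunk toward zero
```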

  16. Review: Probability

  17. Examples: Independent Events. 1. What is the probability of getting the sequence 1, 2, 3, 4, 5, 6 if we roll a die six times? 2. A school survey found that 9 out of 10 students like pizza. If three students are chosen at random with replacement, what is the probability that all three students like pizza?
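
For reference, both answers follow directly from independence; a quick check:

```python
# 1. Six independent rolls of a fair die, each outcome fixed in advance:
p_sequence = (1 / 6) ** 6      # ≈ 2.1e-05

# 2. Three students sampled with replacement, each liking pizza with probability 9/10:
p_all_pizza = (9 / 10) ** 3    # = 0.729

print(p_sequence, p_all_pizza)
```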

  18. Dependent Events: [figure: apples and oranges distributed over a red bin and a blue bin] If I take a fruit from the red bin, what is the probability that I get an apple?

  19. Dependent Events: Conditional probability: P(fruit = apple | bin = red) = 2/8

  20. Dependent Events: Joint probability: P(fruit = apple, bin = red) = 2/12

  21. Dependent Events: Joint probability: P(fruit = apple, bin = blue) = ?

  22. Dependent Events: Joint probability: P(fruit = apple, bin = blue) = 3/12

  23. Dependent Events: Joint probability: P(fruit = orange, bin = blue) = ?

  24. Dependent Events: Joint probability: P(fruit = orange, bin = blue) = 1/12

  25. Two Rules of Probability: 1. Sum rule (marginal probabilities): P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = ?

  26. Two Rules of Probability: 1. Sum rule (marginal probabilities): P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = 3/12 + 2/12 = 5/12

  27. Two Rules of Probability: 2. Product rule: P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = ?

  28. Two Rules of Probability: 2. Product rule: P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = 2/8 × 8/12 = 2/12

  29. Two Rules of Probability: 2. Product rule (reversed): P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = ?

  30. Two Rules of Probability: 2. Product rule (reversed): P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = 2/5 × 5/12 = 2/12

  31. Bayes' Rule: posterior ∝ likelihood × prior, i.e. P(A | B) = P(B | A) P(A) / P(B). The numerator comes from the product rule, P(A, B) = P(B | A) P(A); the denominator from the sum rule, P(B) = ∑_A P(B | A) P(A).

  32. Bayes' Rule: Probability of a rare disease: 0.005. Probability of detection: 0.99. Probability of a false positive: 0.05. What is the probability of disease when the test is positive?

  33. Bayes' Rule: P(disease | positive) = P(positive | disease) P(disease) / P(positive). Numerator: 0.99 × 0.005 = 0.00495. Denominator: 0.99 × 0.005 + 0.05 × 0.995 = 0.0547. Posterior: 0.00495 / 0.0547 ≈ 0.09.
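
The same calculation as a short script (using the numbers from the slide):

```python
p_disease = 0.005              # prior probability of the rare disease
p_pos_given_disease = 0.99     # detection rate (likelihood)
p_pos_given_healthy = 0.05     # false positive rate

# Sum rule for the evidence, then Bayes' rule for the posterior.
p_positive = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
p_disease_given_positive = p_pos_given_disease * p_disease / p_positive
print(p_disease_given_positive)  # ≈ 0.09
```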

  34. Normal Distribution: x ∼ Norm(μ, σ²), with density p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).

  35. Multivariate Normal: x ∼ Norm(μ, Σ), with density p(x) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ)). Parameters: mean vector μ and covariance matrix Σ, with entries Σ_ij = E[(x_i − μ_i)(x_j − μ_j)].

  36. Covariance Matrices: [figure: density contours of three bivariate Gaussians with different covariance matrices] Question: which covariance matrix Σ corresponds to which plot?

  37. Marginals and Conditionals: Suppose that x and y are jointly Gaussian: z = (x, y) ∼ N((a, b), [[A, C], [Cᵀ, B]]). Question: what are the marginal distributions p(x) and p(y)? Answer: x ∼ N(a, A) and y ∼ N(b, B). [figure: joint density with its two marginals]

  38. Marginals and Conditionals: Suppose that x and y are jointly Gaussian: z = (x, y) ∼ N((a, b), [[A, C], [Cᵀ, B]]). Question: what are the conditional distributions p(x | y) and p(y | x)? Answer: x | y ∼ N(a + CB⁻¹(y − b), A − CB⁻¹Cᵀ) and y | x ∼ N(b + CᵀA⁻¹(x − a), B − CᵀA⁻¹C). [figure: joint density with a conditional slice]
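
A small sketch that evaluates the conditional p(x | y) from the block decomposition above (the 1-D example numbers are made up for illustration):

```python
import numpy as np

def gaussian_conditional(a, b, A, B, C, y):
    """Parameters of x | y ~ N(a + C B^{-1}(y - b), A - C B^{-1} C^T)."""
    B_inv = np.linalg.inv(B)
    mean = a + C @ B_inv @ (y - b)
    cov = A - C @ B_inv @ C.T
    return mean, cov

# Unit variances, covariance 0.8 between x and y.
a, b = np.array([0.0]), np.array([0.0])
A, B, C = np.array([[1.0]]), np.array([[1.0]]), np.array([[0.8]])
print(gaussian_conditional(a, b, A, B, C, y=np.array([2.0])))
# mean ≈ [1.6], covariance ≈ [[0.36]]
```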

  39. Maximum Likelihood

  40. Regression: Probabilistic Interpretation: What is the probability of the observed targets y given the inputs and the weights w?

  41. Regression: Probabilistic Interpretation: Least-squares objective E(w) = (1/N) ‖Xw − y‖²; likelihood p(y | X, w) = ∏_{n=1}^{N} Norm(y_n ; wᵀx_n, σ²).

  42. Maximum Likelihood: Under the Gaussian noise model, the log-likelihood is (up to constants) proportional to the negative sum of squared errors, so maximizing the likelihood minimizes the sum of squares, i.e. the least-squares objective.
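
A short derivation sketch of this equivalence, assuming the Gaussian noise model y_n = wᵀx_n + ε_n with ε_n ∼ Norm(0, σ²):

```latex
\log p(\mathbf{y} \mid X, \mathbf{w})
  = \sum_{n=1}^{N} \log \mathrm{Norm}(y_n \mid \mathbf{w}^\top \mathbf{x}_n, \sigma^2)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (\mathbf{w}^\top \mathbf{x}_n - y_n)^2
    - \frac{N}{2} \log(2\pi\sigma^2)
```

The second term does not depend on w, so maximizing the log-likelihood is the same as minimizing ∑_n (wᵀx_n − y_n)², the least-squares objective.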

  43. Maximum a Posteriori

  44. Regression with Priors: Can we maximize p(w | X, y)? (This is known as maximum a posteriori estimation.)

  45. Regression with Priors: From Bayes' rule, p(w | X, y) ∝ p(y | X, w) p(w).

  46. Maximum a Posteriori: Maximum a posteriori estimation is equivalent to ridge regression.

  47. Maximum a Posteriori: Maximum a posteriori estimation is equivalent to ridge regression.
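
A derivation sketch, assuming a zero-mean Gaussian prior w ∼ Norm(0, τ²I) (the prior variance τ² is notation introduced here, not taken from the slides):

```latex
-\log p(\mathbf{w} \mid X, \mathbf{y})
  = \frac{1}{2\sigma^2} \|X\mathbf{w} - \mathbf{y}\|^2
  + \frac{1}{2\tau^2} \|\mathbf{w}\|^2 + \text{const}
```

The MAP estimate therefore minimizes ‖Xw − y‖² + (σ²/τ²)‖w‖², which is the ridge regression objective with λ proportional to σ²/τ².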

  48. Basis Function Regression: [figure: degree M = 3 polynomial fit, t plotted against x]

  49. Basis Function Regression: [same figure, M = 3]

  50. Predictive Posterior

  51. Priors on Functions: [figure: functions sampled from the prior for polynomial degrees M = 0, 1, 2, 3, 5, 17] Idea: sampling w ∼ p(w) defines a function wᵀφ(x), so p(w) is equivalent to a prior on functions. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
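
A sketch of the sampling idea (the polynomial basis and unit-variance prior are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
xs = np.linspace(-1, 2, 200)

for M in (1, 3, 5):
    Phi = np.vander(xs, M + 1, increasing=True)   # basis functions phi(x) = (1, x, ..., x^M)
    for _ in range(3):
        w = rng.standard_normal(M + 1)            # sample w ~ Norm(0, I)
        f = Phi @ w                               # each sample of w defines a function w^T phi(x)
        # plotting (xs, f) would show one random function drawn from the prior
```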

  52. Posterior Uncertainty: Can we reason about the posterior on functions? [figure: functions sampled from the posterior, for increasing λ] Idea: sample w ∼ p(w | X, y) and plot the resulting functions. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

  53. The Predictive Distribution: Given the prior on w and the likelihood of the observations, the predictive distribution on the function value f* = f(x*) is f* | x*, X, y ∼ N(σ_n⁻² φ(x*)ᵀ A⁻¹ Φ y, φ(x*)ᵀ A⁻¹ φ(x*)), where Φ = Φ(X) collects the training features, A = σ_n⁻² Φ Φᵀ + Σ_p⁻¹, and Σ_p is the prior covariance of w. To evaluate this expression we need to invert the matrix A. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

  54. The Predictive Distribution: Idea: average over all possible values of w. [figure: predictive mean and uncertainty bands, for increasing λ] Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
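
A minimal sketch of the predictive computation from slide 53, for a polynomial basis with noise variance σ_n² and prior covariance Σ_p (the toy data and function names are assumptions):

```python
import numpy as np

def predictive(Phi, y, phi_star, sigma_n2, Sigma_p):
    """Mean and variance of f* | x*, X, y for Bayesian basis-function regression."""
    A = Phi @ Phi.T / sigma_n2 + np.linalg.inv(Sigma_p)   # A = sigma_n^{-2} Phi Phi^T + Sigma_p^{-1}
    A_inv = np.linalg.inv(A)
    mean = phi_star @ A_inv @ Phi @ y / sigma_n2
    var = phi_star @ A_inv @ phi_star
    return mean, var

# Toy usage with degree-3 polynomial features; Phi is D x N with columns phi(x_n).
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 20)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(20)
Phi = np.vander(x, 4, increasing=True).T
phi_star = np.vander([0.5], 4, increasing=True)[0]
print(predictive(Phi, y, phi_star, sigma_n2=0.01, Sigma_p=np.eye(4)))
```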

  55. The Kernel Trick

  56. Cost of Feature Computation Example: Mapping with linear and quadratic terms

  57. Cost of Feature Computation: Example: a mapping with linear and quadratic terms has 1 + d + d²/2 terms.

  58. Cost of Feature Computation: Example: mapping with linear and quadratic terms.
  Polynomial φ(x)              | Number of terms | Cost     | Cost at d = 100 features
  Quadratic (up to degree 2)   | > d²/2          | d²N²/4   | 2,500 N²
  Cubic (up to degree 3)       | > d³/6          | d³N²/12  | 83,000 N²
  Quartic (up to degree 4)     | > d⁴/24         | d⁴N²/48  | 1,960,000 N²

  59. The Kernel Trick: Define a kernel function k(x, x′) = φ(x)ᵀφ(x′) such that k can be cheaper to evaluate than φ!
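
A sketch for the quadratic case: the polynomial kernel k(x, z) = (1 + xᵀz)² equals the inner product of explicit feature maps with linear and quadratic terms, but costs O(d) instead of O(d²) to evaluate. (The √2 scalings in the explicit map below are one standard construction, not taken from the slides.)

```python
import numpy as np

def phi_quadratic(x):
    """Explicit feature map whose inner product equals (1 + x^T z)^2 -- O(d^2) features."""
    d = len(x)
    feats = [1.0]
    feats += list(np.sqrt(2) * x)                             # linear terms
    feats += [x[i] * x[j] * (1.0 if i == j else np.sqrt(2))
              for i in range(d) for j in range(i, d)]         # quadratic terms
    return np.array(feats)

def k_quadratic(x, z):
    """Kernel evaluation -- O(d) work, no explicit features."""
    return (1.0 + x @ z) ** 2

rng = np.random.default_rng(5)
x, z = rng.standard_normal(100), rng.standard_normal(100)
print(np.allclose(phi_quadratic(x) @ phi_quadratic(z), k_quadratic(x, z)))  # True
```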
