Data Mining Techniques CS 6220 - Section 2 - Spring 2017. Lecture 3. Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams)
Linear Regression
Linear Regression
Assume f is a linear combination of D features: y = f(x) + ε = wᵀx + ε, with ε ∼ Norm(0, σ²).
For N points we write yₙ = wᵀxₙ + εₙ, n = 1, …, N.
Learning task: estimate w.
Error Measure: Sum of Squares
Mean Squared Error (MSE):
E(w) = (1/N) Σₙ₌₁ᴺ (wᵀxₙ − yₙ)² = (1/N) ‖Xw − y‖²
where X is the N × D matrix whose n-th row is xₙᵀ and y = (y₁, y₂, …, y_N)ᵀ.
Minimizing the Error
E(w) = (1/N) ‖Xw − y‖²
∇E(w) = (2/N) Xᵀ(Xw − y) = 0
⇒ XᵀXw = Xᵀy
⇒ w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X
(see the Matrix Cookbook on the course website)
Ordinary Least Squares
Construct the matrix X and the vector y from the dataset {(x₁, y₁), (x₂, y₂), …, (x_N, y_N)} (each x includes the constant feature x₀ = 1): X has rows x₁ᵀ, x₂ᵀ, …, x_Nᵀ and y = (y₁, y₂, …, y_N)ᵀ.
Compute X† = (XᵀX)⁻¹Xᵀ.
Return w = X†y.
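A minimal NumPy sketch of this procedure; the toy data, true weights, and noise level below are made up for illustration, and np.linalg.solve is used instead of forming the pseudo-inverse explicitly.

```python
import numpy as np

# Toy data: N points with D = 2 features (made up for illustration).
rng = np.random.default_rng(0)
N, D = 100, 2
X_raw = rng.normal(size=(N, D))
w_true = np.array([1.0, 2.0, -0.5])              # bias + two weights
y = w_true[0] + X_raw @ w_true[1:] + 0.1 * rng.normal(size=N)

# Prepend the constant feature x_0 = 1 to every row.
X = np.hstack([np.ones((N, 1)), X_raw])

# Normal equations: w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                         # close to w_true
```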
Basis Function Regression
Linear regression: y = wᵀx + ε
Basis function regression: y = wᵀφ(x) + ε
For N samples: yₙ = wᵀφ(xₙ) + εₙ
Polynomial regression: φ(x) = (1, x, x², …, x^M)
Polynomial Regression
[Figure: polynomial fits of degree M = 0, 1, 3, and 9 to the same data, t plotted against x.]
Low-degree fits (M = 0, M = 1) underfit the data; the M = 9 fit overfits.
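A sketch of the experiment behind these figures, assuming (as in the figure) a small noisy sample from a sine curve; the sample size and noise level are guesses for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = rng.uniform(0, 1, size=N)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=N)   # noisy sine, as in the figure

def poly_features(x, M):
    """Map scalar inputs to (1, x, x^2, ..., x^M)."""
    return np.vander(x, M + 1, increasing=True)

for M in (0, 1, 3, 9):
    Phi = poly_features(x, M)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)         # least-squares fit
    train_mse = np.mean((Phi @ w - t) ** 2)
    print(f"M = {M}: training MSE = {train_mse:.4f}")

# Training error keeps dropping with M, but M = 9 interpolates the noise (overfitting).
```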
Regularization
L2 regularization (ridge regression) minimizes:
E(w) = (1/N) ‖Xw − y‖² + λ ‖w‖², where λ ≥ 0 and ‖w‖² = wᵀw
L1 regularization (LASSO) minimizes:
E(w) = (1/N) ‖Xw − y‖² + λ |w|₁, where λ ≥ 0 and |w|₁ = Σᵢ₌₁ᴰ |wᵢ|
Regularization
L2: closed-form solution w = (XᵀX + λI)⁻¹Xᵀy
L1: no closed-form solution; use quadratic programming:
minimize ‖Xw − y‖² subject to ‖w‖₁ ≤ s
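A minimal sketch of the L2 closed-form solution, reusing the hypothetical noisy-sine setup above with degree-9 polynomial features; larger λ visibly shrinks the weight vector.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution w = (X^T X + lambda I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=10)
Phi = np.vander(x, 10, increasing=True)          # degree-9 polynomial features

for lam in (0.0, 1e-3, 1.0):
    w = ridge_fit(Phi, t, lam)
    print(lam, np.linalg.norm(w))                # larger lambda -> smaller ||w||
```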
Review: Probability
Examples: Independent Events
1. What is the probability of getting the sequence 1, 2, 3, 4, 5, 6 if we roll a die six times?
2. A school survey found that 9 out of 10 students like pizza. If three students are chosen at random with replacement, what is the probability that all three students like pizza?
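Worked answers (not on the original slide): because the rolls and the sampled students are independent, the individual probabilities multiply.

```latex
P(1,2,3,4,5,6) = \left(\tfrac{1}{6}\right)^{6} \approx 2.1 \times 10^{-5},
\qquad
P(\text{all three like pizza}) = \left(\tfrac{9}{10}\right)^{3} = 0.729
```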
Dependent Events
[Figure: two bins of fruit. The red bin contains 2 apples and 6 oranges; the blue bin contains 3 apples and 1 orange, for 12 pieces of fruit in total.]
If I take a fruit from the red bin, what is the probability that I get an apple?
Conditional probability: P(fruit = apple | bin = red) = 2/8
Joint probability: P(fruit = apple, bin = red) = 2/12
Joint probability: P(fruit = apple, bin = blue) = 3/12
Joint probability: P(fruit = orange, bin = blue) = 1/12
Two Rules of Probability
1. Sum rule (marginal probabilities):
P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = 3/12 + 2/12 = 5/12
2. Product rule:
P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = 2/8 × 8/12 = 2/12
Product rule (reversed):
P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = 2/5 × 5/12 = 2/12
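Both rules can be checked directly on the joint probability table of this example; a small NumPy sketch:

```python
import numpy as np

# Joint probability table P(fruit, bin) for the example:
# rows = fruit (apple, orange), columns = bin (red, blue); entries sum to 1.
P = np.array([[2/12, 3/12],      # apple
              [6/12, 1/12]])     # orange

# Sum rule: marginalize out the bin.
P_fruit = P.sum(axis=1)                    # [5/12, 7/12]
P_bin = P.sum(axis=0)                      # [8/12, 4/12]

# Product rule: P(apple, red) = P(apple | red) * P(red).
P_apple_given_red = P[0, 0] / P_bin[0]     # 2/8
print(P_apple_given_red * P_bin[0], P[0, 0])           # both 2/12

# Product rule reversed / Bayes: P(red | apple) = P(apple | red) P(red) / P(apple).
print(P_apple_given_red * P_bin[0] / P_fruit[0])       # 2/5
```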
Bayes' Rule
P(A | B) = P(B | A) P(A) / P(B)   (posterior ∝ likelihood × prior)
Sum rule: P(B) = Σ_A P(A, B)
Product rule: P(A, B) = P(B | A) P(A)
Bayes' Rule: Example
Probability of a rare disease: 0.005
Probability of detection (positive test given disease): 0.98
Probability of a false positive: 0.05
What is the probability of disease when the test is positive?

P(disease | positive) = P(positive | disease) P(disease) / P(positive)
= (0.98 × 0.005) / (0.98 × 0.005 + 0.05 × 0.995)
= 0.0049 / (0.0049 + 0.0498)
≈ 0.09
Normal Distribution
x ∼ Norm(μ, σ²) ⇒ density: p(x) = (2πσ²)^(−1/2) exp(−(x − μ)² / 2σ²)
Parameters: mean μ, variance σ²
Multivariate Normal
x ∼ Norm(μ, Σ) ⇒ density: p(x) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − μ)ᵀ Σ⁻¹ (x − μ))
Parameters: mean vector μ and covariance matrix Σ, with Σᵢⱼ = E[(xᵢ − μᵢ)(xⱼ − μⱼ)]
Covariance Matrices
[Figure: plots of bivariate Gaussian densities with different covariance matrices, including diagonal and positively/negatively correlated examples.]
Question: Which covariance matrix Σ corresponds to which plot?
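A sketch of how such plots are generated: sample from bivariate Gaussians with a few example covariance matrices. These particular matrices are illustrative assumptions, not necessarily the ones on the slide.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.zeros(2)

# Example covariance matrices (illustrative choices):
covs = {
    "isotropic":       np.array([[1.0, 0.0], [0.0, 1.0]]),
    "correlated":      np.array([[1.0, 0.5], [0.5, 1.0]]),
    "anti-correlated": np.array([[1.0, -0.5], [-0.5, 1.0]]),
}

for name, Sigma in covs.items():
    samples = rng.multivariate_normal(mu, Sigma, size=1000)
    # The empirical covariance of the samples should be close to Sigma.
    print(name, np.cov(samples.T).round(2))
```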
Marginals and Conditionals
Suppose that x and y are jointly Gaussian:
z = (x, y) ∼ Norm( (a, b), [A C; Cᵀ B] )
Question: What are the marginal distributions p(x) and p(y)?
x ∼ Norm(a, A)
y ∼ Norm(b, B)
Marginals and Conditionals
Suppose that x and y are jointly Gaussian:
z = (x, y) ∼ Norm( (a, b), [A C; Cᵀ B] )
Question: What are the conditional distributions p(x | y) and p(y | x)?
x | y ∼ Norm( a + CB⁻¹(y − b), A − CB⁻¹Cᵀ )
y | x ∼ Norm( b + CᵀA⁻¹(x − a), B − CᵀA⁻¹C )
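A numerical sanity check of the conditioning formula on a hypothetical 2-D joint Gaussian (blocks of size one, values chosen for illustration): the closed-form conditional moments are compared with samples from the joint that happen to land near the conditioning value.

```python
import numpy as np

rng = np.random.default_rng(4)

# Joint Gaussian over z = (x, y) with block mean (a, b) and covariance [A C; C^T B].
a, b = np.array([0.0]), np.array([1.0])
A, B, C = np.array([[2.0]]), np.array([[1.0]]), np.array([[0.8]])
y_obs = np.array([2.0])

# Conditional x | y from the formulas above.
mean_x_given_y = a + C @ np.linalg.solve(B, y_obs - b)
cov_x_given_y = A - C @ np.linalg.solve(B, C.T)
print(mean_x_given_y, cov_x_given_y)          # [0.8], [[1.36]]

# Check by sampling the joint and keeping samples with y close to y_obs.
mean = np.concatenate([a, b])
cov = np.block([[A, C], [C.T, B]])
z = rng.multivariate_normal(mean, cov, size=200_000)
mask = np.abs(z[:, 1] - y_obs[0]) < 0.05
print(z[mask, 0].mean(), z[mask, 0].var())    # roughly 0.8 and 1.36
```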
Maximum Likelihood
Regression: Probabilistic Interpretation
What is the probability p(yₙ | xₙ, w) of observing yₙ given xₙ and w?
Since yₙ = wᵀxₙ + εₙ with εₙ ∼ Norm(0, σ²), we have p(yₙ | xₙ, w) = Norm(yₙ | wᵀxₙ, σ²).
Regression: Probabilistic Interpretation
Least squares objective: E(w) = (1/N) Σₙ (wᵀxₙ − yₙ)²
Likelihood: p(y | X, w) = Πₙ Norm(yₙ | wᵀxₙ, σ²)
Maximum Likelihood
Least squares objective: E(w) = (1/N) Σₙ (wᵀxₙ − yₙ)²
Log-likelihood: log p(y | X, w) = −(1/2σ²) Σₙ (yₙ − wᵀxₙ)² − (N/2) log(2πσ²)
Maximizing the likelihood minimizes the sum of squares.
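The intermediate step, written out (a standard derivation consistent with the slide's claim):

```latex
\log p(y \mid X, w)
  = \sum_{n=1}^{N} \log \mathrm{Norm}(y_n \mid w^\top x_n, \sigma^2)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2
    - \frac{N}{2}\log(2\pi\sigma^2)
```

The second term does not depend on w, so maximizing the log-likelihood over w is exactly minimizing the sum of squared errors.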
Maximum a Posteriori
Regression with Priors
Can we maximize p(w | X, y), the posterior over the weights given the data? (This is known as maximum a posteriori estimation.)
Regression with Priors
From Bayes' rule: p(w | X, y) = p(y | X, w) p(w) / p(y | X)
Maximum a Posteriori
With a Gaussian prior p(w) = Norm(0, τ²I), the log posterior is (up to constants)
log p(w | X, y) = −(1/2σ²) ‖Xw − y‖² − (1/2τ²) ‖w‖²
so maximizing the posterior is the same as minimizing ‖Xw − y‖² + λ‖w‖² with λ = σ²/τ².
Maximum a posteriori estimation is equivalent to ridge regression.
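A small numerical check of this equivalence on hypothetical data; here λ = σ²/τ² matches the unnormalized objective ‖Xw − y‖² + λ‖w‖² used in the closed-form ridge solution.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

sigma2, tau2 = 0.1**2, 1.0        # noise variance and prior variance (assumed known)
lam = sigma2 / tau2               # ridge penalty implied by the Gaussian prior

# Negative log posterior (up to constants): ||Xw - y||^2 / (2 sigma^2) + ||w||^2 / (2 tau^2)
def neg_log_posterior(w):
    return np.sum((X @ w - y) ** 2) / (2 * sigma2) + np.sum(w ** 2) / (2 * tau2)

def neg_log_posterior_grad(w):
    return X.T @ (X @ w - y) / sigma2 + w / tau2

w_map = minimize(neg_log_posterior, np.zeros(D), jac=neg_log_posterior_grad).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-3))   # True: MAP coincides with ridge
```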
Basis Function Regression
[Figure: degree M = 3 polynomial fit to the data, t plotted against x.]
Predictive Posterior
Priors on Functions
[Figure: functions wᵀφ(x) sampled from the prior for polynomial bases of degree M = 0, 1, 2, 3, 5, and 17.]
Idea: sampling w ∼ p(w) defines a function wᵀφ(x), so p(w) is equivalent to a prior on functions.
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
Posterior Uncertainty
Can we reason about the posterior on functions?
Idea: sample w ∼ p(w | X, y) and plot the resulting functions.
[Figure: three sets of posterior function samples, for increasing λ.]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
The Predictive Distribution
Predictive distribution on the function value f* = f(x*):
f* | x*, X, y ∼ Norm( σₙ⁻² φ(x*)ᵀ A⁻¹ Φ y, φ(x*)ᵀ A⁻¹ φ(x*) )
where A = σₙ⁻² Φ Φᵀ + Σₚ⁻¹, Φ = Φ(X) is the matrix of training features, σₙ² is the observation noise variance, and Σₚ is the prior covariance of w. The predictive distribution on a new observation y* adds σₙ² to the variance. To evaluate this we need to invert the D × D matrix A.
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
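A sketch of this predictive distribution in NumPy, following the notation above with Φ stored as a D × N matrix whose columns are φ(xₙ); the basis, data, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def predictive(Phi, y, phi_star, sigma_n2, Sigma_p):
    """Predictive mean and variance of f* for Bayesian basis-function regression.

    Phi      : (D, N) matrix whose columns are phi(x_n)
    y        : (N,)   targets
    phi_star : (D,)   features of the test input x*
    sigma_n2 : observation noise variance
    Sigma_p  : (D, D) prior covariance of the weights
    """
    A = Phi @ Phi.T / sigma_n2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)
    mean = phi_star @ A_inv @ Phi @ y / sigma_n2
    var = phi_star @ A_inv @ phi_star
    return mean, var

# Hypothetical 1-D example with polynomial features up to degree 3.
rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=20)
y = np.sin(2 * x) + 0.1 * rng.normal(size=20)
Phi = np.vander(x, 4, increasing=True).T                  # (D=4, N=20)
phi_star = np.vander(np.array([0.5]), 4, increasing=True)[0]
print(predictive(Phi, y, phi_star, sigma_n2=0.1**2, Sigma_p=np.eye(4)))
```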
The Predictive Distribution
Idea: average over all possible values of w.
[Figure: predictive mean and uncertainty bands for increasing λ.]
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
The Kernel Trick
Cost of Feature Computation
Example: mapping with linear and quadratic terms
φ(x) = (1, x₁, …, x_d, x₁², x₁x₂, …, x_d²), which has roughly 1 + d + d²/2 terms.
Cost of Feature Computation
Polynomial | φ(x)                          | Cost      | d = 100 features
Quadratic  | > d²/2 terms up to degree 2   | d²N²/4    | 2,500 N²
Cubic      | > d³/6 terms up to degree 3   | d³N²/12   | 83,000 N²
Quartic    | > d⁴/24 terms up to degree 4  | d⁴N²/48   | 1,960,000 N²
The Kernel Trick
Define a kernel function k(x, x′) = φ(x)ᵀφ(x′) such that k can be cheaper to evaluate than φ!
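A quick check of the idea for the degree-2 polynomial kernel k(x, z) = (1 + xᵀz)², whose explicit feature map differs from the quadratic map above only by constant √2 factors; shown here for d = 2. Evaluating the kernel costs O(d), while building the explicit features costs O(d²).

```python
import numpy as np

def quad_kernel(x, z):
    """Polynomial kernel of degree 2: k(x, z) = (1 + x.z)^2, computed in O(d)."""
    return (1.0 + x @ z) ** 2

def quad_features(x):
    """Explicit feature map for d = 2 with phi(x).phi(z) = (1 + x.z)^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 1.0])
print(quad_kernel(x, z), quad_features(x) @ quad_features(z))   # identical values
```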