Machine Learning - Regressions Amir H. Payberah payberah@kth.se 07/11/2018
The Course Web Page https://id2223kth.github.io 1 / 81
Where Are We? 2 / 81
Where Are We? 3 / 81
Let’s Start with an Example 4 / 81
The Housing Price Example (1/3)
◮ Given a dataset of m houses:

Living area   No. of bedrooms   Price
2104          3                 400
1600          3                 330
2400          3                 369
...           ...               ...

◮ Predict the price of other houses as a function of the size of the living area and the number of bedrooms.
5 / 81
The Housing Price Example (2/3)

Living area   No. of bedrooms   Price
2104          3                 400
1600          3                 330
2400          3                 369
...           ...               ...

x^(1) = [2104, 3]ᵀ, y^(1) = 400    x^(2) = [1600, 3]ᵀ, y^(2) = 330    x^(3) = [2400, 3]ᵀ, y^(3) = 369

X = [x^(1)ᵀ; x^(2)ᵀ; x^(3)ᵀ; ...] = [2104 3; 1600 3; 2400 3; ...]    y = [400; 330; 369; ...]

◮ x^(i) ∈ R²: x1^(i) is the living area, and x2^(i) is the number of bedrooms of the i-th house in the training set.
6 / 81
The Housing Price Example (3/3)

Living area   No. of bedrooms   Price
2104          3                 400
1600          3                 330
2400          3                 369
...           ...               ...

◮ Predict the price ŷ of other houses as a function of the size of their living areas x1 and the number of bedrooms x2, i.e., ŷ = f(x1, x2).
◮ E.g., what is ŷ if x1 = 4000 and x2 = 4?
◮ As an initial choice: ŷ = f_w(x) = w1x1 + w2x2
7 / 81
Linear Regression 8 / 81
Linear Regression (1/2)
◮ Our goal: to build a system that takes an input x ∈ Rⁿ and predicts an output ŷ ∈ R.
◮ In linear regression, the output ŷ is a linear function of the input x:

ŷ = f_w(x) = w1x1 + w2x2 + ··· + wnxn
ŷ = wᵀx

• ŷ: the predicted value
• n: the number of features
• xi: the i-th feature value
• wj: the j-th model parameter (w ∈ Rⁿ)
9 / 81
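A minimal sketch of this prediction in Scala with Breeze (the linear algebra library used later in these slides); the weights and the feature vector are made-up values, only for illustration:

import breeze.linalg.DenseVector

// hypothetical weights for n = 2 features (living area, number of bedrooms)
val w = DenseVector(0.064, 103.4)
// one example input x
val x = DenseVector(2104.0, 3.0)
// the prediction is the inner product yHat = w'x
val yHat = w dot x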
Linear Regression (2/2)
◮ Linear regression often has one additional parameter, called the intercept (or bias) b:

ŷ = wᵀx + b

◮ Instead of adding the bias parameter b, we can augment x with an extra entry that is always set to 1:

ŷ = f_w(x) = w0x0 + w1x1 + w2x2 + ··· + wnxn, where x0 = 1
10 / 81
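A small sketch of this bias trick with Breeze; the parameter values are again assumptions for illustration, with w(0) playing the role of b:

import breeze.linalg.DenseVector

// hypothetical parameters: w(0) is the bias, w(1) and w(2) weight the two features
val w = DenseVector(-70.4, 0.064, 103.4)
val x = DenseVector(2104.0, 3.0)
// prepend x0 = 1 so that w'xAug = w0 * 1 + w1 * x1 + w2 * x2
val xAug = DenseVector.vertcat(DenseVector(1.0), x)
val yHat = w dot xAug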
Linear Regression - Model Parameters
◮ The parameters w ∈ Rⁿ are values that control the behavior of the model.
◮ w is a set of weights that determine how each feature affects the prediction:
• wi > 0: increasing the value of the feature xi increases the value of our prediction ŷ.
• wi < 0: increasing the value of the feature xi decreases the value of our prediction ŷ.
• wi = 0: the value of the feature xi has no effect on the prediction ŷ.
11 / 81
How to Learn Model Parameters w ? 12 / 81
Linear Regression - Cost Function (1/2)
◮ A reasonable model should make ŷ close to y, at least for the training dataset.
◮ Residual: the difference between the dependent variable y and the predicted value ŷ:

r^(i) = y^(i) − ŷ^(i)
13 / 81
Linear Regression - Cost Function (2/2)
◮ Cost function J(w):
• For each value of w, it measures how close the predictions ŷ^(i) are to the corresponding y^(i).
• We can define J(w) as the mean squared error (MSE):

J(w) = MSE(w) = (1/m) Σᵢ (ŷ^(i) − y^(i))² = E[(ŷ − y)²] = (1/m) ||ŷ − y||₂²
14 / 81
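A minimal sketch of the residuals and the MSE cost with Breeze; yHat and y below are assumed example vectors of predictions and labels:

import breeze.linalg.DenseVector

// assumed predictions and true labels for m = 3 examples
val yHat = DenseVector(410.0, 320.0, 380.0)
val y = DenseVector(400.0, 330.0, 369.0)
// residuals r(i) = y(i) - yHat(i)
val r = y - yHat
// J(w) = (1/m) * sum_i (yHat(i) - y(i))^2
val mse = (r dot r) / r.length.toDouble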
How to Learn Model Parameters? ◮ We want to choose w so as to minimize J ( w ). ◮ Two approaches to find w : • Normal equation • Gradient descent 15 / 81
Normal Equation 16 / 81
Derivatives and Gradient (1/3)
◮ The first derivative of f(x), written f′(x), gives the slope of the tangent line to the function at the point x.
◮ f(x) = x² ⇒ f′(x) = 2x
◮ If f(x) is increasing, then f′(x) > 0.
◮ If f(x) is decreasing, then f′(x) < 0.
◮ If f(x) is at a local minimum/maximum, then f′(x) = 0.
17 / 81
Derivatives and Gradient (2/3)
◮ What if a function has multiple arguments, e.g., f(x1, x2, ···, xn)?
◮ Partial derivative: the derivative with respect to one particular argument.
• ∂f/∂x1: the derivative with respect to x1
• ∂f/∂x2: the derivative with respect to x2
◮ ∂f/∂xi shows how much the function f will change if we change xi.
◮ Gradient: the vector of all partial derivatives of a function f:

∇x f(x) = [∂f/∂x1, ∂f/∂x2, ···, ∂f/∂xn]ᵀ
18 / 81
Derivatives and Gradient (3/3)
◮ What is the gradient of f(x1, x2, x3) = x1 − x1x2 + x3²?

∇x f(x) = [∂(x1 − x1x2 + x3²)/∂x1, ∂(x1 − x1x2 + x3²)/∂x2, ∂(x1 − x1x2 + x3²)/∂x3]ᵀ = [1 − x2, −x1, 2x3]ᵀ
19 / 81
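A short sketch in plain Scala that checks this analytic gradient numerically with central finite differences; the point (2, 3, 4) and the step size h are arbitrary choices for illustration:

// f(x1, x2, x3) = x1 - x1*x2 + x3^2
def f(x: Array[Double]): Double = x(0) - x(0) * x(1) + x(2) * x(2)

// numerical partial derivative of f with respect to x(i)
def partial(x: Array[Double], i: Int, h: Double = 1e-6): Double = {
  val xPlus = x.clone; xPlus(i) += h
  val xMinus = x.clone; xMinus(i) -= h
  (f(xPlus) - f(xMinus)) / (2 * h)
}

val x = Array(2.0, 3.0, 4.0)
// approximately (1 - x2, -x1, 2*x3) = (-2.0, -2.0, 8.0)
val grad = (0 until 3).map(i => partial(x, i))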
Normal Equation (1/2)
◮ To minimize J(w), we can simply solve for where its gradient is 0: ∇w J(w) = 0.

ŷ = wᵀx

X = [x^(1)ᵀ; x^(2)ᵀ; ...; x^(m)ᵀ], where x^(i)ᵀ = [x1^(i), x2^(i), ···, xn^(i)]    ŷ = [ŷ^(1); ŷ^(2); ...; ŷ^(m)]

ŷᵀ = wᵀXᵀ, or equivalently ŷ = Xw
20 / 81
Normal Equation (2/2)
◮ To minimize J(w), we can simply solve for where its gradient is 0: ∇w J(w) = 0.

J(w) = (1/m) ||ŷ − y||₂²

∇w (1/m) ||ŷ − y||₂² = 0
⇒ ∇w (1/m) ||Xw − y||₂² = 0
⇒ ∇w (Xw − y)ᵀ(Xw − y) = 0
⇒ ∇w (wᵀXᵀXw − 2wᵀXᵀy + yᵀy) = 0
⇒ 2XᵀXw − 2Xᵀy = 0
⇒ w = (XᵀX)⁻¹Xᵀy
21 / 81
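The closed-form solution lends itself to a one-line Breeze helper; this is only a sketch (it assumes XᵀX is invertible), and normalEquation is an illustrative name rather than a library function:

import breeze.linalg.{DenseMatrix, DenseVector, inv}

// w = (X'X)^(-1) X'y
def normalEquation(X: DenseMatrix[Double], y: DenseVector[Double]): DenseVector[Double] =
  inv(X.t * X) * (X.t * y)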
Normal Equation - Example (1/7)

Living area   No. of bedrooms   Price
2104          3                 400
1600          3                 330
2400          3                 369
1416          2                 232
3000          4                 540

◮ Predict the value of ŷ when x1 = 4000 and x2 = 4.
◮ We should find w0, w1, and w2 in ŷ = w0 + w1x1 + w2x2.
◮ w = (XᵀX)⁻¹Xᵀy.
22 / 81
Normal Equation - Example (2/7)

Living area   No. of bedrooms   Price
2104          3                 400
1600          3                 330
2400          3                 369
1416          2                 232
3000          4                 540

X = [1 2104 3; 1 1600 3; 1 2400 3; 1 1416 2; 1 3000 4]    y = [400; 330; 369; 232; 540]

import breeze.linalg._

// the data array is in column-major order: the column of ones, then the living areas, then the bedrooms
val X = new DenseMatrix(5, 3, Array(1.0, 1.0, 1.0, 1.0, 1.0,
                                    2104.0, 1600.0, 2400.0, 1416.0, 3000.0,
                                    3.0, 3.0, 3.0, 2.0, 4.0))
val y = new DenseVector(Array(400.0, 330.0, 369.0, 232.0, 540.0))
23 / 81
Normal Equation - Example (3/7)

XᵀX = [1 1 1 1 1; 2104 1600 2400 1416 3000; 3 3 3 2 4] × [1 2104 3; 1 1600 3; 1 2400 3; 1 1416 2; 1 3000 4] = [5 10520 15; 10520 23751872 33144; 15 33144 47]

val Xt = X.t
val XtX = Xt * X
24 / 81
Normal Equation - Example (4/7)

(XᵀX)⁻¹ = [4.90366455e+00 7.48766737e−04 −2.09302326e+00; 7.48766737e−04 2.75281889e−06 −2.18023256e−03; −2.09302326e+00 −2.18023256e−03 2.22674419e+00]

val XtXInv = inv(XtX)
25 / 81
Normal Equation - Example (5/7)

Xᵀy = [1 1 1 1 1; 2104 1600 2400 1416 3000; 3 3 3 2 4] × [400; 330; 369; 232; 540] = [1871; 4203712; 5921]

val Xty = Xt * y
26 / 81
Normal Equation - Example (6/7)

w = (XᵀX)⁻¹Xᵀy = [4.90366455e+00 7.48766737e−04 −2.09302326e+00; 7.48766737e−04 2.75281889e−06 −2.18023256e−03; −2.09302326e+00 −2.18023256e−03 2.22674419e+00] × [1871; 4203712; 5921] = [−7.04346018e+01; 6.38433756e−02; 1.03436047e+02]

val w = XtXInv * Xty
27 / 81
Normal Equation - Example (7/7)
◮ Predict the value of ŷ when x1 = 4000 and x2 = 4.

ŷ = −7.04346018e+01 + 6.38433756e−02 × 4000 + 1.03436047e+02 × 4 ≈ 599

val test = new DenseVector(Array(1.0, 4000.0, 4.0))
// the prediction is the inner product of w and the augmented test point
val yHat = w dot test
28 / 81
Normal Equation in Spark

// toDF on a Seq of case classes needs the SparkSession implicits (already imported in spark-shell)
import spark.implicits._

case class house(x1: Long, x2: Long, y: Long)

val trainData = Seq(house(2104, 3, 400), house(1600, 3, 330), house(2400, 3, 369),
                    house(1416, 2, 232), house(3000, 4, 540)).toDF
val testData = Seq(house(4000, 4, 0)).toDF

// assemble the raw columns x1 and x2 into a single vector column "features"
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setInputCols(Array("x1", "x2")).setOutputCol("features")
val train = va.transform(trainData)
val test = va.transform(testData)

// fit a linear regression using the normal equation solver
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setFeaturesCol("features").setLabelCol("y").setSolver("normal")
val lrModel = lr.fit(train)
lrModel.transform(test).show
29 / 81
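After fitting, the learned parameters can also be read off the model directly; this is a short usage sketch with the same lrModel, whose printed values should roughly match the normal-equation solution above:

// the learned weights for x1 and x2, and the intercept w0
println(lrModel.coefficients)   // roughly [0.0638, 103.44]
println(lrModel.intercept)      // roughly -70.43
// training RMSE from the model summary
println(lrModel.summary.rootMeanSquaredError)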