1. 15-388/688 - Practical Data Science: Intro to Machine Learning & Linear Regression. J. Zico Kolter, Carnegie Mellon University, Fall 2019.

2. Announcements: HW 3 released, due 10/24 (not 10/17). Feedback on the tutorial has been sent to everyone who submitted by the deadline; it will be sent to the remaining people by tomorrow evening.

3. Outline:
• Least squares regression: a simple example
• Machine learning notation
• Linear regression revisited: matrix/vector notation and analytic solutions
• Implementing linear regression

4. Outline:
• Least squares regression: a simple example
• Machine learning notation
• Linear regression revisited: matrix/vector notation and analytic solutions
• Implementing linear regression

5. A simple example: predicting electricity use. What will peak power consumption be in Pittsburgh tomorrow? It is difficult to build an “a priori” model from first principles to answer this question, but it is relatively easy to record past days of consumption, plus additional features that affect consumption (e.g., weather):

Date         High Temperature (F)   Peak Demand (GW)
2011-06-01   84.0                   2.651
2011-06-02   73.0                   2.081
2011-06-03   75.2                   1.844
2011-06-04   84.9                   1.959
…            …                      …
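As a purely illustrative sketch (not the course's dataset or loading code), the rows above could be held in a pandas DataFrame; the column names are assumptions chosen to match the table:

```python
import pandas as pd

# A few rows from the table above, as an illustrative DataFrame.
df = pd.DataFrame({
    "Date": ["2011-06-01", "2011-06-02", "2011-06-03", "2011-06-04"],
    "High_Temperature": [84.0, 73.0, 75.2, 84.9],   # degrees Fahrenheit
    "Peak_Demand": [2.651, 2.081, 1.844, 1.959],    # gigawatts
})
```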

6. Plot of consumption vs. temperature: high temperature vs. peak demand for the summer months (June – August) of the past six years. [Figure: scatter plot; x-axis: High Temperature (F), y-axis: Peak Demand (GW)]
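A minimal matplotlib sketch of how such a scatter plot could be drawn, assuming the illustrative df from the previous snippet (the slide's figure uses the full six years of data):

```python
import matplotlib.pyplot as plt

# Scatter plot of daily high temperature vs. peak demand.
plt.scatter(df["High_Temperature"], df["Peak_Demand"], marker="x")
plt.xlabel("High Temperature (F)")
plt.ylabel("Peak Demand (GW)")
plt.show()
```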

7. Hypothesis: linear model. Let's suppose that the peak demand approximately fits a linear model: $\text{Peak\_Demand} \approx \theta_1 \cdot \text{High\_Temperature} + \theta_2$. Here $\theta_1$ is the “slope” of the line and $\theta_2$ is the intercept. How do we find a “good” fit to the data? There are many possibilities, but a natural objective is to minimize some measure of the difference between this line and the observed data, e.g. the squared loss
$$E(\theta) = \sum_{i \in \text{days}} \left(\theta_1 \cdot \text{High\_Temperature}^{(i)} + \theta_2 - \text{Peak\_Demand}^{(i)}\right)^2$$
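The squared-loss objective above translates almost directly into code. A minimal sketch, assuming temp and demand are NumPy arrays holding each day's high temperature and peak demand:

```python
import numpy as np

def squared_loss(theta1, theta2, temp, demand):
    """E(theta): sum of squared errors of the line theta1 * temp + theta2."""
    pred = theta1 * temp + theta2
    return np.sum((pred - demand) ** 2)
```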

8. How do we find parameters? How do we find the parameters $\theta_1, \theta_2$ that minimize the function
$$E(\theta) = \sum_{i \in \text{days}} \left(\theta_1 \cdot \text{High\_Temperature}^{(i)} + \theta_2 - \text{Peak\_Demand}^{(i)}\right)^2 \equiv \sum_{i \in \text{days}} \left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right)^2?$$
General idea: suppose we want to minimize some function $f(\theta)$. The derivative is the slope of the function, so the negative derivative points “downhill.” [Figure: a one-dimensional function $f(\theta)$ with its tangent line of slope $f'(\theta_0)$ at a point $\theta_0$]

9. Computing the derivatives. What are the derivatives of the error function with respect to each parameter $\theta_1$ and $\theta_2$?
$$\begin{aligned}
\frac{\partial E(\theta)}{\partial \theta_1} &= \frac{\partial}{\partial \theta_1} \sum_{i \in \text{days}} \left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right)^2 \\
&= \sum_{i \in \text{days}} \frac{\partial}{\partial \theta_1} \left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right)^2 \\
&= \sum_{i \in \text{days}} 2\left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right) \cdot \frac{\partial}{\partial \theta_1}\left(\theta_1 \cdot x^{(i)}\right) \\
&= \sum_{i \in \text{days}} 2\left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right) \cdot x^{(i)}
\end{aligned}$$
$$\frac{\partial E(\theta)}{\partial \theta_2} = \sum_{i \in \text{days}} 2\left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right)$$
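These two derivatives can be computed in a couple of lines; a sketch using the same assumed temp and demand arrays as above:

```python
import numpy as np

def gradient(theta1, theta2, temp, demand):
    """Partial derivatives of E(theta) with respect to theta1 and theta2."""
    err = theta1 * temp + theta2 - demand        # per-day residuals
    return np.sum(2 * err * temp), np.sum(2 * err)
```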

10. Finding the best $\theta$. To find a good value of $\theta$, we can repeatedly take steps in the direction of the negative derivative for each parameter. Repeat:
$$\theta_1 := \theta_1 - \alpha \sum_{i \in \text{days}} 2\left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right) \cdot x^{(i)}$$
$$\theta_2 := \theta_2 - \alpha \sum_{i \in \text{days}} 2\left(\theta_1 \cdot x^{(i)} + \theta_2 - y^{(i)}\right)$$
where $\alpha$ is some small positive number called the step size. This is the gradient descent algorithm, the workhorse of modern machine learning.
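Putting the update rule into a loop gives the whole algorithm. A minimal sketch that reuses the hypothetical gradient function above; the step size and iteration count are arbitrary placeholders, not the values used to produce the iterates shown on the following slides:

```python
def gradient_descent(temp, demand, alpha=1e-3, iters=1000):
    """Repeatedly step each parameter in the direction of its negative derivative."""
    theta1, theta2 = 0.0, 0.0
    for _ in range(iters):
        d1, d2 = gradient(theta1, theta2, temp, demand)
        theta1 -= alpha * d1
        theta2 -= alpha * d2
    return theta1, theta2
```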

11. Gradient descent. [Figure: the scatter plot of Peak Demand (GW) vs. High Temperature (F), before fitting]

12. Gradient descent. Normalize the input by subtracting the mean and dividing by the standard deviation. [Figure: the same scatter plot, now with Peak Demand (GW) plotted against Normalized Temperature]
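The normalization step described on the slide, as a short sketch (the function name is mine; it would be applied to the assumed temp array before running gradient descent):

```python
import numpy as np

def normalize(x):
    """Subtract the mean and divide by the standard deviation."""
    return (x - np.mean(x)) / np.std(x)

# e.g. temp_norm = normalize(temp)
```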

13. Gradient descent – Iteration 1. $\theta = (0.00,\ 0.00)$, $E(\theta) = 1427.53$, $\left(\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}\right) = (-151.20,\ -1243.10)$. [Figure: observed days and the current squared-loss fit, Peak Demand (GW) vs. Normalized Temperature]

14. Gradient descent – Iteration 2. $\theta = (0.15,\ 1.24)$, $E(\theta) = 292.18$, $\left(\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}\right) = (-67.74,\ -556.91)$. [Figure: observed days and the updated squared-loss fit]

15. Gradient descent – Iteration 3. $\theta = (0.22,\ 1.80)$, $E(\theta) = 64.31$, $\left(\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}\right) = (-30.35,\ -249.50)$. [Figure: observed days and the updated squared-loss fit]

16. Gradient descent – Iteration 4. $\theta = (0.25,\ 2.05)$, $E(\theta) = 18.58$, $\left(\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}\right) = (-13.60,\ -111.77)$. [Figure: observed days and the updated squared-loss fit]

17. Gradient descent – Iteration 5. $\theta = (0.26,\ 2.16)$, $E(\theta) = 9.40$, $\left(\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}\right) = (-6.09,\ -50.07)$. [Figure: observed days and the updated squared-loss fit]

18. Gradient descent – Iteration 10. $\theta = (0.27,\ 2.25)$, $E(\theta) = 7.09$, $\left(\frac{\partial E(\theta)}{\partial \theta_1}, \frac{\partial E(\theta)}{\partial \theta_2}\right) = (-0.11,\ -0.90)$. [Figure: observed days and the near-final squared-loss fit]

19. Fitted line in “original” coordinates. [Figure: observed days and the final squared-loss fit, Peak Demand (GW) vs. High Temperature (F)]
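Mapping the parameters fit on normalized temperature back to the original Fahrenheit scale is a small algebraic step. A sketch under the normalization assumed above, where theta1 and theta2 come from the gradient descent sketch and mu and sigma are the mean and standard deviation of temp:

```python
# If temp_norm = (temp - mu) / sigma, then
#   theta1 * temp_norm + theta2 = (theta1 / sigma) * temp + (theta2 - theta1 * mu / sigma)
mu, sigma = temp.mean(), temp.std()
theta1_orig = theta1 / sigma
theta2_orig = theta2 - theta1 * mu / sigma
```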

20. Making predictions. Importantly, our model also lets us make predictions about new days. What will the peak demand be tomorrow? If we know the high temperature will be 72 degrees (ignoring for now that this is also a prediction), then we can predict the peak demand to be $\text{Predicted\_Demand} = \theta_1 \cdot 72 + \theta_2 = 1.821$ GW (this requires that we rescale $\theta$ back to “normal” coordinates after solving). This is equivalent to just “finding the point on the line.”
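In code, the prediction is just evaluating the fitted line; a sketch using the hypothetical rescaled parameters from the previous snippet:

```python
# Predict peak demand (GW) for a forecast high of 72 F.
predicted_demand = theta1_orig * 72 + theta2_orig
print(f"Predicted peak demand: {predicted_demand:.3f} GW")
```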

21. Extensions. What if we want to add additional features, e.g. day of week, instead of just temperature? What if we want to use a different loss function instead of squared error (e.g., absolute error)? What if we want to use a non-linear prediction instead of a linear one? We can easily reason about all these things by adopting some additional notation…

22. Outline:
• Least squares regression: a simple example
• Machine learning notation
• Linear regression revisited: matrix/vector notation and analytic solutions
• Implementing linear regression

23. Machine learning. This has been an example of a machine learning algorithm. Basic idea: in many domains it is difficult to hand-build a predictive model, but easy to collect lots of data; machine learning provides a way to automatically infer the predictive model from the data. The basic process (supervised learning): training data $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), \ldots$ is fed to a machine learning algorithm, which produces a hypothesis function $h$ such that $y^{(i)} \approx h(x^{(i)})$; given a new example $x$, the prediction is $\hat{y} = h(x)$.

24. Terminology.
Input features: $x^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, m$. E.g.: $x^{(i)} = \left(\text{High\_Temperature}^{(i)},\ \text{Is\_Weekday}^{(i)},\ 1\right)$.
Outputs: $y^{(i)} \in \mathcal{Y}$, $i = 1, \ldots, m$. E.g.: $y^{(i)} \in \mathbb{R} = \text{Peak\_Demand}^{(i)}$.
Model parameters: $\theta \in \mathbb{R}^n$.
Hypothesis function: $h_\theta : \mathbb{R}^n \to \mathcal{Y}$, predicts output given input. E.g.: $h_\theta(x) = \sum_{j=1}^{n} \theta_j \cdot x_j$.
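To make the notation concrete, a small illustration with made-up numbers (the feature ordering and parameter values are my own assumptions, not from the slides):

```python
import numpy as np

# One day's feature vector: high temperature, weekday indicator, constant term.
x_i = np.array([72.0, 1.0, 1.0])
theta = np.array([0.02, 0.1, 0.5])   # made-up parameters, for illustration only

def h(theta, x):
    """Linear hypothesis: h_theta(x) = sum_j theta_j * x_j."""
    return theta @ x
```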

25. Terminology.
Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$, measures the difference between a prediction and an actual output. E.g.: $\ell(\hat{y}, y) = (\hat{y} - y)^2$.
The canonical machine learning optimization problem:
$$\operatorname*{minimize}_{\theta} \; \sum_{i=1}^{m} \ell\left(h_\theta(x^{(i)}),\ y^{(i)}\right)$$
Virtually every machine learning algorithm has this form; we just specify:
• What is the hypothesis function?
• What is the loss function?
• How do we solve the optimization problem?
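The canonical problem can be written generically in code, with the hypothesis, loss, and data all passed in as parameters; a hedged sketch with names of my own choosing:

```python
def objective(theta, h, loss, xs, ys):
    """Canonical ML objective: total loss of the hypothesis over the training set."""
    return sum(loss(h(theta, x), y) for x, y in zip(xs, ys))

def squared_error(y_hat, y):
    """The squared error loss used by least squares."""
    return (y_hat - y) ** 2
```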

26. Example machine learning algorithms. Note: we (machine learning researchers) have not been consistent in naming conventions; many machine learning algorithms actually only specify some of these three elements.
• Least squares: {linear hypothesis, squared loss, (usually) analytical solution}
• Linear regression: {linear hypothesis, *, *}
• Support vector machine: {linear or kernel hypothesis, hinge loss, *}
• Neural network: {composed non-linear function, *, (usually) gradient descent}
• Decision tree: {hierarchical axis-aligned half-planes, *, greedy optimization}
• Naïve Bayes: {linear hypothesis, joint probability under certain independence assumptions, analytical solution}

27. Outline:
• Least squares regression: a simple example
• Machine learning notation
• Linear regression revisited: matrix/vector notation and analytic solutions
• Implementing linear regression

28. Least squares revisited. Using our new terminology, plus matrix notation, let's revisit how to solve linear regression with a squared error loss. Setup:
• Linear hypothesis function: $h_\theta(x) = \sum_{j=1}^{n} \theta_j \cdot x_j$
• Squared error loss: $\ell(\hat{y}, y) = (\hat{y} - y)^2$
• Resulting machine learning optimization problem:
$$\operatorname*{minimize}_{\theta} \; \sum_{i=1}^{m} \left(\sum_{j=1}^{n} \theta_j \cdot x_j^{(i)} - y^{(i)}\right)^2 \equiv \operatorname*{minimize}_{\theta} \; E(\theta)$$
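As a preview of where the matrix/vector formulation leads, here is a hedged sketch that solves the same problem with NumPy's built-in least squares routine; this is not necessarily how the course implements it, just one standard way to compute the minimizer:

```python
import numpy as np

def least_squares(X, y):
    """Solve minimize_theta ||X @ theta - y||^2 for an m x n matrix X and length-m vector y."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta
```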
