Linear Regression — Jia-Bin Huang, Virginia Tech, Spring 2019 (ECE-5424G / CS-5824)


  1. Linear Regression Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824

  2. Administrative • Office hour • Chen Gao • Shih-Yang Su • Feedback (Thanks!) • Notation? • More descriptive slides? • Video/audio recording? • TA hours (uniformly spread over the week)?

  3. Recap: Machine learning algorithms
  • Supervised learning — discrete output: classification; continuous output: regression
  • Unsupervised learning — discrete output: clustering; continuous output: dimensionality reduction

  4. Recap: Nearest neighbor classifier
  • Training data: $(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})$
  • Learning: do nothing.
  • Testing: $h(x) = y^{(k)}$, where $k = \arg\min_i d(x, x^{(i)})$
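
As a concrete illustration (not from the slides), a minimal 1-nearest-neighbor classifier in NumPy, assuming Euclidean distance for $d$:

```python
import numpy as np

def nn_predict(X_train, y_train, x_query):
    """1-nearest-neighbor prediction: return the label of the closest training point."""
    # d(x_query, x^(i)) for every training example, here Euclidean distance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    k = np.argmin(dists)   # k = argmin_i d(x, x^(i))
    return y_train[k]      # h(x) = y^(k)
```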

  5. Recap: Instance/Memory-based Learning 1. A distance metric • Continuous? Discrete? PDF? Gene data? Learn the metric? 2. How many nearby neighbors to look at? • 1? 3? 5? 15? 3. A weighting function (optional) • Closer neighbors matter more 4. How to fit with the local points? • Kernel regression Slide credit: Carlos Guestrin
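
A minimal sketch of the weighting function and kernel regression mentioned in items 3–4 above, assuming a Gaussian weighting function with bandwidth `sigma` (my choice, not specified on the slide):

```python
import numpy as np

def kernel_regression(X_train, y_train, x_query, sigma=1.0):
    """Nadaraya-Watson kernel regression: a weighted average of training targets,
    where closer neighbors get larger weights."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    weights = np.exp(-dists ** 2 / (2.0 * sigma ** 2))   # Gaussian weighting function
    return np.sum(weights * y_train) / np.sum(weights)
```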

  6. Validation set • Splitting the training set: hold out a fake test set to tune hyper-parameters Slide credit: CS231 @ Stanford

  7. Cross-validation • 5-fold cross-validation -> split the training data into 5 equal folds • 4 of them for training and 1 for validation Slide credit: CS231 @ Stanford
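
A minimal sketch of the 5-fold split described above (plain NumPy, no scikit-learn); `train_and_evaluate` is a hypothetical user-supplied function that fits on the training folds and scores on the validation fold:

```python
import numpy as np

def cross_validate(X, y, train_and_evaluate, n_folds=5, seed=0):
    """k-fold cross-validation: train on (n_folds - 1) folds, validate on the held-out fold."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    folds = np.array_split(indices, n_folds)
    scores = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        scores.append(train_and_evaluate(X[train_idx], y[train_idx],
                                         X[val_idx], y[val_idx]))
    return np.mean(scores)   # average validation score across the folds
```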

  8. Things to remember • Supervised Learning • Training/testing data; classification/regression; Hypothesis • k-NN • Simplest learning algorithm • With sufficient data, very hard to beat this “strawman” approach • Kernel regression/classification • Set k to n (number of data points) and choose the kernel width • Smoother than k-NN • Problems with k-NN • Curse of dimensionality • Not robust to irrelevant features • Slow NN search: must remember the (very large) dataset for prediction

  9. Today’s plan : Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation

  10. Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation

  11. Regression
  • Training set (real-valued outputs) → Learning Algorithm → Hypothesis $h$
  • Prediction: input $x$ (size of house) → $h$ → estimated price

  12. House pricing prediction
  [Scatter plot: size in feet² (500–2500) vs. price ($) in 1000's (100–400)]

  13. Training set
  Size in feet² (x) | Price ($) in 1000's (y)
  2104 | 460
  1416 | 232
  1534 | 315
  852 | 178
  … | …
  • Notation:
  • $m$ = number of training examples ($m = 47$)
  • $x$ = input variable / features
  • $y$ = output variable / target variable
  • $(x, y)$ = one training example
  • $(x^{(i)}, y^{(i)})$ = $i$-th training example
  • Examples: $x^{(1)} = 2104$, $x^{(2)} = 1416$, $y^{(1)} = 460$
  Slide credit: Andrew Ng

  14. Model representation
  • Hypothesis: $y = h_\theta(x) = \theta_0 + \theta_1 x$ (shorthand: $h(x)$)
  • Training set → Learning Algorithm → Hypothesis $h$: size of house → estimated price
  • [Plot: training data with a fitted line, size in feet² vs. price ($) in 1000's]
  • This is univariate (single-variable) linear regression.
  Slide credit: Andrew Ng
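
In code, the univariate hypothesis is just a line; a minimal sketch with illustrative (made-up) parameter values:

```python
def h(theta0, theta1, x):
    """Univariate hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Illustrative parameters only: predict the price of a 2104 ft^2 house
price_in_thousands = h(50.0, 0.15, 2104)   # 50 + 0.15 * 2104 = 365.6
```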

  15. Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation

  16. Training set
  Size in feet² (x) | Price ($) in 1000's (y)
  2104 | 460
  1416 | 232
  1534 | 315
  852 | 178
  … | …   ($m = 47$)
  • Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
  • $\theta_0, \theta_1$: parameters/weights
  • How do we choose the $\theta_j$'s?
  Slide credit: Andrew Ng

  17. [Plots: $h_\theta(x) = \theta_0 + \theta_1 x$ for three parameter settings: $\theta_0 = 1.5, \theta_1 = 0$; $\theta_0 = 0, \theta_1 = 0.5$; $\theta_0 = 1, \theta_1 = 0.5$]
  Slide credit: Andrew Ng

  18. Cost function
  • Idea: choose $\theta_0, \theta_1$ so that $h_\theta(x)$ is close to $y$ for our training examples $(x, y)$
  • $h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$
  • Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
  • Goal: $\underset{\theta_0, \theta_1}{\text{minimize}}\; J(\theta_0, \theta_1)$
  • [Plot: training data, size in feet² vs. price ($) in 1000's]
  Slide credit: Andrew Ng
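
A direct translation of this cost function into NumPy (a sketch of mine, not code from the lecture), evaluated on the four examples listed in the slide's table:

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(x)
    predictions = theta0 + theta1 * x        # h_theta(x^(i)) for every i
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(compute_cost(0.0, 0.2, x, y))   # cost for one particular choice of (theta0, theta1)
```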

  19. Simplified
  • Full model — hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$; parameters: $\theta_0, \theta_1$; cost: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$; goal: $\underset{\theta_0, \theta_1}{\text{minimize}}\; J(\theta_0, \theta_1)$
  • Simplified ($\theta_0 = 0$) — hypothesis: $h_\theta(x) = \theta_1 x$; parameter: $\theta_1$; cost: $J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$; goal: $\underset{\theta_1}{\text{minimize}}\; J(\theta_1)$
  Slide credit: Andrew Ng

  20.–24. [Plots: left, $h_\theta(x)$ as a function of $x$ for a fixed $\theta_1$; right, $J(\theta_1)$ as a function of $\theta_1$ with the corresponding point marked. Repeated for several values of $\theta_1$ to trace out the cost curve.]
  Slide credit: Andrew Ng
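
The cost curve traced out on slides 20–24 can be reproduced by sweeping $\theta_1$ and evaluating the simplified cost; a sketch assuming the toy points (1, 1), (2, 2), (3, 3) suggested by the plot axes (an assumption on my part):

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data assumed from the plot axes: the points (1, 1), (2, 2), (3, 3)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

theta1_values = np.linspace(-0.5, 2.5, 100)
costs = [np.sum((t1 * x - y) ** 2) / (2 * m) for t1 in theta1_values]  # J(theta1), theta0 = 0

plt.plot(theta1_values, costs)
plt.xlabel("theta_1")
plt.ylabel("J(theta_1)")
plt.show()   # bowl-shaped curve whose minimum sits at theta_1 = 1
```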

  25.
  • Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
  • Parameters: $\theta_0, \theta_1$
  • Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
  • Goal: $\underset{\theta_0, \theta_1}{\text{minimize}}\; J(\theta_0, \theta_1)$
  Slide credit: Andrew Ng

  26. Cost function Slide credit: Andrew Ng

  27. How do we find good $\theta_0, \theta_1$ that minimize $J(\theta_0, \theta_1)$? Slide credit: Andrew Ng

  28. Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation

  29. Gradient descent
  • Have some function $J(\theta_0, \theta_1)$; want $\underset{\theta_0, \theta_1}{\text{minimize}}\; J(\theta_0, \theta_1)$
  • Outline: start with some $\theta_0, \theta_1$; keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$ until we hopefully end up at a minimum
  Slide credit: Andrew Ng

  30. Slide credit: Andrew Ng

  31. Gradient descent
  Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$  (simultaneously for $j = 0$ and $j = 1$)
  }
  • $\alpha$: learning rate (step size)
  • $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: partial derivative (rate of change)
  Slide credit: Andrew Ng
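
To make the update rule concrete (and to preview the learning-rate discussion on slide 34), here is a sketch of my own: gradient descent on a one-parameter quadratic whose derivative is known in closed form.

```python
def gradient_descent_1d(dJ, theta_init, alpha, n_iters=100):
    """Repeat: theta := theta - alpha * dJ(theta)."""
    theta = theta_init
    for _ in range(n_iters):
        theta = theta - alpha * dJ(theta)
    return theta

# Example cost J(theta) = (theta - 3)^2, so dJ/dtheta = 2 * (theta - 3)
dJ = lambda theta: 2.0 * (theta - 3.0)
print(gradient_descent_1d(dJ, theta_init=0.0, alpha=0.1))   # converges to ~3
print(gradient_descent_1d(dJ, theta_init=0.0, alpha=1.1))   # step too large: diverges
```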

  32. Gradient descent
  Correct (simultaneous update):
    temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
    temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
    $\theta_0$ := temp0
    $\theta_1$ := temp1
  Incorrect:
    temp0 := $\theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
    $\theta_0$ := temp0
    temp1 := $\theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
    $\theta_1$ := temp1
  Slide credit: Andrew Ng
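
The difference between the two columns is easy to get wrong in code; a sketch of the correct simultaneous update, with illustrative stand-in gradient functions (the real ones are derived on slide 36):

```python
# Illustrative gradients only, so the snippet runs on its own
def dJ_dtheta0(t0, t1): return t0 + t1 - 1.0
def dJ_dtheta1(t0, t1): return t0 - t1 + 2.0

theta0, theta1, alpha = 0.0, 0.0, 0.1

# Correct: both temporaries are computed from the OLD (theta0, theta1) ...
temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
# ... and only then are the parameters overwritten
theta0, theta1 = temp0, temp1

# Incorrect variant: assigning theta0 before evaluating dJ_dtheta1 means the second
# update sees a partially updated parameter vector, which is a different algorithm.
```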

  33. $\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)$
  • Where $\frac{d}{d\theta_1} J(\theta_1) < 0$ (negative slope), the update increases $\theta_1$; where $\frac{d}{d\theta_1} J(\theta_1) > 0$ (positive slope), it decreases $\theta_1$. In both cases $\theta_1$ moves toward the minimum.
  • [Plot: $J(\theta_1)$ with tangent lines of negative and positive slope]
  Slide credit: Andrew Ng

  34. Learning rate
  • If $\alpha$ is too small, gradient descent is slow; if $\alpha$ is too large, it can overshoot the minimum and fail to converge, or even diverge.

  35. Gradient descent for linear regression
  Repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$  (for $j = 0$ and $j = 1$)
  }
  • Linear regression model: $h_\theta(x) = \theta_0 + \theta_1 x$, with $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
  Slide credit: Andrew Ng

  36. Computing the partial derivatives
  • $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 = \frac{\partial}{\partial \theta_j} \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)^2$
  • $j = 0$: $\frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
  • $j = 1$: $\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
  Slide credit: Andrew Ng
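
The chain-rule step behind these two results (not written out on the slide) is

$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot \frac{\partial}{\partial \theta_j} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right)$

and since $\frac{\partial}{\partial \theta_0} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) = 1$ while $\frac{\partial}{\partial \theta_1} \left( \theta_0 + \theta_1 x^{(i)} - y^{(i)} \right) = x^{(i)}$, the factor of 2 cancels the $\frac{1}{2}$ and leaves the $\frac{1}{m}$ sums above.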

  37. Gradient descent for linear regression
  Repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
  }
  Update $\theta_0$ and $\theta_1$ simultaneously.
  Slide credit: Andrew Ng

  38. Batch gradient descent
  • “Batch”: each step of gradient descent uses all $m$ training examples
  Repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
    $\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$
  }
  Slide credit: Andrew Ng
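
Putting slides 35–38 together, a minimal sketch of batch gradient descent for univariate linear regression (feature scaling and convergence checks omitted; the tiny learning rate is my choice to keep the unscaled example stable):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=1e-7, n_iters=1000):
    """Univariate linear regression fit by batch gradient descent:
    every step uses all m training examples."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = theta0 + theta1 * x - y          # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(errors) / m                # dJ/dtheta0
        grad1 = np.sum(errors * x) / m            # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([2104, 1416, 1534, 852], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
print(batch_gradient_descent(x, y))   # theta1 heads toward ~0.2; theta0 barely moves
                                      # because x is unscaled, hence the tiny alpha
```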

  39. Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation

  40. Training dataset
  Size in feet² (x) | Price ($) in 1000's (y)
  2104 | 460
  1416 | 232
  1534 | 315
  852 | 178
  … | …
  $h_\theta(x) = \theta_0 + \theta_1 x$
  Slide credit: Andrew Ng
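
Looking ahead to the “Features and polynomial regression” and “Normal equation” items in the outline, a minimal sketch (my own, not from the slides) of adding a hand-crafted squared-size feature and fitting the parameters in closed form:

```python
import numpy as np

x = np.array([2104, 1416, 1534, 852], dtype=float)   # size in feet^2
y = np.array([460, 232, 315, 178], dtype=float)      # price in $1000's

# Design matrix: bias column, the raw size, and a squared-size feature
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Least-squares solve (equivalent to the normal equation, but numerically safer)
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = X @ theta   # h_theta(x) = theta0 + theta1 * x + theta2 * x^2
```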
