Linear Regression Jia-Bin Huang Virginia Tech Spring 2019 ECE-5424G / CS-5824
Administrative • Office hour • Chen Gao • Shih-Yang Su • Feedback (Thanks!) • Notation? • More descriptive slides? • Video/audio recording? • TA hours (uniformly spread over the week)?
Recap: Machine learning algorithms

             Supervised Learning    Unsupervised Learning
  Discrete   Classification         Clustering
  Continuous Regression             Dimensionality reduction
Recap: Nearest neighbor classifier
• Training data: (x^(1), y^(1)), (x^(2), y^(2)), …, (x^(N), y^(N))
• Learning: do nothing.
• Testing: h(x) = y^(k), where k = argmin_i D(x, x^(i))
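A minimal runnable sketch of this recap, assuming NumPy and Euclidean distance as the metric D (the names nn_predict, X_train, y_train are illustrative, not from the slides):

```python
import numpy as np

def nn_predict(X_train, y_train, x_query):
    """1-nearest-neighbor: return the label of the closest training point."""
    # Euclidean distance from the query to every training example
    dists = np.linalg.norm(X_train - x_query, axis=1)
    k = np.argmin(dists)          # index of the nearest neighbor
    return y_train[k]             # "do nothing" at training time; all work happens here

# Toy usage
X_train = np.array([[1.0, 1.0], [2.0, 2.0], [5.0, 5.0]])
y_train = np.array([0, 0, 1])
print(nn_predict(X_train, y_train, np.array([4.5, 4.8])))  # -> 1
```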
Recap: Instance/Memory-based Learning 1. A distance metric • Continuous? Discrete? PDF? Gene data? Learn the metric? 2. How many nearby neighbors to look at? • 1? 3? 5? 15? 3. A weighting function (optional) • Closer neighbors matter more 4. How to fit with the local points? • Kernel regression Slide credit: Carlos Guestrin
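For the weighting-function and local-fit points above, a hedged sketch of kernel regression with a Gaussian weighting function (the bandwidth and data values are made up for illustration):

```python
import numpy as np

def kernel_regression(x_train, y_train, x_query, bandwidth=1.0):
    """Nadaraya-Watson kernel regression with a Gaussian weighting function."""
    # Closer neighbors get exponentially larger weights
    w = np.exp(-(x_train - x_query) ** 2 / (2 * bandwidth ** 2))
    return np.sum(w * y_train) / np.sum(w)

x_train = np.array([1.0, 2.0, 3.0, 4.0])
y_train = np.array([1.2, 1.9, 3.1, 3.9])
print(kernel_regression(x_train, y_train, 2.5))  # weighted average of targets near x = 2.5
```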
Validation set • Splitting the training set: a fake test set to tune hyper-parameters Slide credit: CS231 @ Stanford
Cross-validation • 5-fold cross-validation -> split the training data into 5 equal folds • 4 of them for training and 1 for validation Slide credit: CS231 @ Stanford
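A hedged sketch of the fold construction, assuming NumPy (k_fold_indices is an illustrative helper, not a named routine from the course):

```python
import numpy as np

def k_fold_indices(n_examples, k=5, seed=0):
    """Split example indices into k roughly equal folds for cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    return np.array_split(idx, k)

folds = k_fold_indices(20, k=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # train on train_idx, evaluate the hyper-parameter choice on val_idx
    print(f"fold {i}: {len(train_idx)} train / {len(val_idx)} validation examples")
```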
Things to remember
• Supervised learning: training/testing data; classification/regression; hypothesis
• k-NN: simplest learning algorithm; with sufficient data, a very hard to beat “strawman” approach
• Kernel regression/classification: set k to n (the number of data points) and choose a kernel width; smoother than k-NN
• Problems with k-NN: curse of dimensionality; not robust to irrelevant features; slow NN search: must store the (very large) dataset for prediction
Today’s plan: Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation
Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation
Regression (real-valued output)
[Diagram: Training set → Learning Algorithm → Hypothesis h; size of house (x) → h → estimated price (y)]
House pricing prediction
[Scatter plot: Size in feet^2 on the x-axis vs. Price ($) in 1000’s on the y-axis]
Training set (m = 47)

  Size in feet^2 (x)    Price ($) in 1000’s (y)
  2104                  460
  1416                  232
  1534                  315
  852                   178
  …                     …

• Notation:
  • m = number of training examples
  • x = input variable / features
  • y = output variable / target variable
  • (x, y) = one training example
  • (x^(i), y^(i)) = i-th training example
• Examples: x^(1) = 2104, x^(2) = 1416, y^(1) = 460
Slide credit: Andrew Ng
Model representation
• y = h_θ(x) = θ₀ + θ₁ x (shorthand: h(x))
• [Diagram: Training set → Learning Algorithm → Hypothesis h; size of house (x) → h → estimated price; plot of Price ($) in 1000’s vs. Size in feet^2]
• Univariate linear regression
Slide credit: Andrew Ng
Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation
Training set (m = 47)

  Size in feet^2 (x)    Price ($) in 1000’s (y)
  2104                  460
  1416                  232
  1534                  315
  852                   178
  …                     …

• Hypothesis: h_θ(x) = θ₀ + θ₁ x
• θ₀, θ₁: parameters/weights
• How to choose the θ_j’s?
Slide credit: Andrew Ng
h_θ(x) = θ₀ + θ₁ x
[Three example fits: θ₀ = 1.5, θ₁ = 0 (horizontal line); θ₀ = 0, θ₁ = 0.5 (line through the origin); θ₀ = 1, θ₁ = 0.5 (shifted line)]
Slide credit: Andrew Ng
Cost function
• Idea: choose θ₀, θ₁ so that h_θ(x) is close to y for our training examples (x, y)
• h_θ(x^(i)) = θ₀ + θ₁ x^(i)
• J(θ₀, θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
• Goal: minimize_{θ₀, θ₁} J(θ₀, θ₁)
[Plot: Price ($) in 1000’s vs. Size in feet^2, with a candidate line h_θ(x)]
Slide credit: Andrew Ng
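A small NumPy sketch of this cost computation, using the four housing rows from the table above (the function name cost is illustrative):

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(x)
    h = theta0 + theta1 * x          # hypothesis evaluated on all examples
    return np.sum((h - y) ** 2) / (2 * m)

# Housing rows from the slide's table (sizes in ft^2, prices in $1000's)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(cost(0.0, 0.2, x, y))  # cost of the candidate line h(x) = 0.2 x
```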
Simplified
• Hypothesis: h_θ(x) = θ₀ + θ₁ x  →  simplified (set θ₀ = 0): h_θ(x) = θ₁ x
• Parameters: θ₀, θ₁  →  simplified: θ₁
• Cost function: J(θ₀, θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²  →  simplified: J(θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
• Goal: minimize_{θ₀, θ₁} J(θ₀, θ₁)  →  simplified: minimize_{θ₁} J(θ₁)
Slide credit: Andrew Ng
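To make the simplified cost concrete, a hedged numeric check on a small toy dataset; the points (1,1), (2,2), (3,3) are an assumption for illustration:

```python
import numpy as np

# Assumed toy dataset: (1,1), (2,2), (3,3)
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(theta1):
    """Simplified cost with theta0 fixed at 0."""
    return np.sum((theta1 * x - y) ** 2) / (2 * len(x))

print(J(1.0))   # 0.0   -> this line passes through every point
print(J(0.5))   # ~0.58
print(J(0.0))   # ~2.33
```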
[Figure sequence: left panel plots h_θ(x) as a function of x for a fixed θ₁; right panel plots the resulting cost J(θ₁) as a function of θ₁. Sweeping θ₁ traces out the bowl-shaped curve J(θ₁).]
Slide credit: Andrew Ng
• Hypothesis: h_θ(x) = θ₀ + θ₁ x
• Parameters: θ₀, θ₁
• Cost function: J(θ₀, θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
• Goal: minimize_{θ₀, θ₁} J(θ₀, θ₁)
Slide credit: Andrew Ng
Cost function Slide credit: Andrew Ng
How do we find good θ₀, θ₁ that minimize J(θ₀, θ₁)? Slide credit: Andrew Ng
Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation
Gradient descent
• Have some function J(θ₀, θ₁)
• Want argmin_{θ₀, θ₁} J(θ₀, θ₁)
• Outline:
  • Start with some θ₀, θ₁
  • Keep changing θ₀, θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum
Slide credit: Andrew Ng
Gradient descent
Repeat until convergence {
  θ_j := θ_j − α ∂J(θ₀, θ₁)/∂θ_j   (for j = 0 and j = 1)
}
• α: learning rate (step size)
• ∂J(θ₀, θ₁)/∂θ_j: partial derivative (rate of change)
Slide credit: Andrew Ng
Gradient descent
Correct: simultaneous update
  temp0 := θ₀ − α ∂J(θ₀, θ₁)/∂θ₀
  temp1 := θ₁ − α ∂J(θ₀, θ₁)/∂θ₁
  θ₀ := temp0
  θ₁ := temp1
Incorrect:
  temp0 := θ₀ − α ∂J(θ₀, θ₁)/∂θ₀
  θ₀ := temp0
  temp1 := θ₁ − α ∂J(θ₀, θ₁)/∂θ₁   (this derivative now sees the already-updated θ₀)
  θ₁ := temp1
Slide credit: Andrew Ng
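A hedged sketch of the simultaneous update on a toy bowl-shaped cost J(θ₀, θ₁) = θ₀² + θ₁² (the cost, starting point, and step size are assumptions for illustration, not the housing cost):

```python
def dJ_dtheta0(t0, t1):
    # Gradient of the toy cost J = t0^2 + t1^2 with respect to t0
    return 2 * t0

def dJ_dtheta1(t0, t1):
    return 2 * t1

theta0, theta1, alpha = 3.0, -2.0, 0.1
for _ in range(100):
    # Simultaneous update: both derivatives use the old (theta0, theta1)
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0, theta1 = temp0, temp1   # commit both updates together
print(theta0, theta1)  # both approach 0, the minimizer of the toy cost
```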
θ₁ := θ₁ − α dJ(θ₁)/dθ₁
[Plot of J(θ₁): where dJ(θ₁)/dθ₁ > 0 the update decreases θ₁; where dJ(θ₁)/dθ₁ < 0 it increases θ₁ — in either case θ₁ moves toward the minimum]
Slide credit: Andrew Ng
Learning rate
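A hedged toy illustration of how the learning rate affects gradient descent, run on the assumed 1-D cost J(t) = t² with illustrative α values:

```python
def descend(alpha, steps=25):
    """Run gradient descent on the toy cost J(t) = t^2 with step size alpha."""
    t = 5.0
    for _ in range(steps):
        t = t - alpha * 2 * t      # dJ/dt = 2t
    return t

for alpha in [0.01, 0.1, 1.1]:
    print(alpha, descend(alpha))
# A small alpha converges slowly, a moderate alpha converges quickly,
# and a too-large alpha overshoots and diverges (|t| grows each step).
```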
Gradient descent for linear regression
Repeat until convergence {
  θ_j := θ_j − α ∂J(θ₀, θ₁)/∂θ_j   (for j = 0 and j = 1)
}
• Linear regression model: h_θ(x) = θ₀ + θ₁ x
• J(θ₀, θ₁) = (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))²
Slide credit: Andrew Ng
Computing the partial derivatives
• ∂J(θ₀, θ₁)/∂θ_j = ∂/∂θ_j [ (1/2m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))² ]
                  = ∂/∂θ_j [ (1/2m) Σ_{i=1}^{m} (θ₀ + θ₁ x^(i) − y^(i))² ]
• j = 0:  ∂J(θ₀, θ₁)/∂θ₀ = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
• j = 1:  ∂J(θ₀, θ₁)/∂θ₁ = (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
Slide credit: Andrew Ng
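These derivatives can be double-checked symbolically; a hedged sketch assuming SymPy is available (differentiating a single summand of J, since the 1/m factor is constant and carries through the sum):

```python
import sympy as sp

theta0, theta1, x, y = sp.symbols('theta0 theta1 x y')
term = (theta0 + theta1 * x - y) ** 2 / 2   # one summand of J, before the 1/m average

print(sp.diff(term, theta0))   # matches (h(x) - y)
print(sp.diff(term, theta1))   # matches (h(x) - y) * x
```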
Gradient descent for linear regression
Repeat until convergence {
  θ₀ := θ₀ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ₁ := θ₁ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
}
Update θ₀ and θ₁ simultaneously
Slide credit: Andrew Ng
Batch gradient descent
• “Batch”: each step of gradient descent uses all m training examples (m = number of training examples)
Repeat until convergence {
  θ₀ := θ₀ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i))
  θ₁ := θ₁ − α (1/m) Σ_{i=1}^{m} (h_θ(x^(i)) − y^(i)) x^(i)
}
Slide credit: Andrew Ng
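Putting the update rules together, a hedged NumPy sketch of batch gradient descent on the four housing rows from the table (the learning rate, iteration count, and function name are illustrative assumptions):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=1e-7, iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.

    Every iteration uses all m training examples, as on the slide."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        h = theta0 + theta1 * x
        grad0 = np.sum(h - y) / m            # dJ/dtheta0
        grad1 = np.sum((h - y) * x) / m      # dJ/dtheta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1

# Housing rows from the slide's table
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
theta0, theta1 = batch_gradient_descent(x, y)
print(theta0, theta1)   # alpha must be tiny here because the raw feature scale is ~10^3
```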
Linear Regression • Model representation • Cost function • Gradient descent • Features and polynomial regression • Normal equation
Training dataset

  Size in feet^2 (x)    Price ($) in 1000’s (y)
  2104                  460
  1416                  232
  1534                  315
  852                   178
  …                     …

h_θ(x) = θ₀ + θ₁ x
Slide credit: Andrew Ng