Linear Regression
Aarti Singh & Barnabas Poczos
Machine Learning 10-701/15-781
Jan 23, 2014

So far …
• Learning distributions
  – Maximum Likelihood Estimation (MLE)
  – Maximum A Posteriori (MAP)
• Learning classifiers
  – Naïve Bayes

Discrete to Continuous Labels
Classification (discrete labels):
• X = Document, Y = Topic (Sports, Science, News)
• X = Cell Image, Y = Diagnosis (Anemic cell, Healthy cell)
Regression (continuous labels):
• Stock Market Prediction: X = Feb01, Y = ?

Regression Tasks
• Weather Prediction: X = 7 pm, Y = Temp
• Estimating Contamination: X = new location, Y = sensor reading

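Supervised Learning
Goal: learn a predictor $f: X \to Y$ that minimizes the expected value of a loss function (performance measure), e.g. predict Y = topic (Sports, Science, News) from X = document, or Y = ? from X = Feb01.
• Classification: probability of error, $P(f(X) \neq Y)$
• Regression: mean squared error, $E[(f(X) - Y)^2]$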

Regression algorithms
Learning algorithms:
• Linear Regression
• Regularized Linear Regression: Ridge regression, Lasso
• Polynomial Regression
• Kernel Regression
• Regression Trees, Splines, Wavelet estimators, …

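Replace Expectation with Empirical Mean
Optimal predictor: $f^* = \arg\min_f E[(f(X) - Y)^2]$
Empirical minimizer: $\hat{f}_n = \arg\min_f \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$ (the expectation is replaced by the empirical mean)
Law of Large Numbers: $\frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2 \to E[(f(X) - Y)^2]$ as $n \to \infty$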
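The convergence of the empirical mean to the expectation can be checked numerically. The sketch below is only an illustration: the data model (Y = 2X plus Gaussian noise with standard deviation 0.5) and the names `empirical_risk` and `f` are assumptions, not part of the lecture. Under that assumed model the true mean squared error of `f` is 0.25, and the empirical value should approach it as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_risk(f, n):
    """Mean squared error of predictor f on n freshly drawn samples."""
    x = rng.uniform(0.0, 1.0, size=n)
    # Assumed data model (illustrative only): Y = 2X + Gaussian noise, sd 0.5.
    y = 2.0 * x + rng.normal(0.0, 0.5, size=n)
    return float(np.mean((f(x) - y) ** 2))

def f(x):
    """A fixed predictor; under the assumed model its true risk is 0.5**2 = 0.25."""
    return 2.0 * x

# The empirical mean fluctuates for small n and settles near 0.25 for large n.
for n in (10, 1_000, 100_000):
    print(n, empirical_risk(f, n))
```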

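Restrict class of predictors
Optimal predictor: $f^* = \arg\min_f E[(f(X) - Y)^2]$
Empirical minimizer: $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$, where $\mathcal{F}$ is the class of predictors
Why? Overfitting! Without a restriction, the empirical loss is minimized by any function that passes through every training point, i.e. any function of the form $f(X_i) = Y_i$ for all i, however it behaves elsewhere.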

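Restrict class of predictors
Optimal predictor: $f^* = \arg\min_f E[(f(X) - Y)^2]$
Empirical minimizer: $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$
Choices for the class of predictors $\mathcal{F}$:
– Class of linear functions
– Class of polynomial functions
– Class of nonlinear functions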

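Linear Regression
Least Squares Estimator: $\hat{f}_n = \arg\min_{f \in \mathcal{F}_L} \frac{1}{n} \sum_{i=1}^n (f(X_i) - Y_i)^2$, where $\mathcal{F}_L$ is the class of linear functions
• Univariate case: $f(X) = \beta_1 + \beta_2 X$, with $\beta_1$ = intercept and $\beta_2$ = slope
• Multivariate case: $f(X) = \beta_1 X^{(1)} + \beta_2 X^{(2)} + \dots + \beta_p X^{(p)} = X\beta$, where $X = [X^{(1)} \dots X^{(p)}]$ and $\beta = [\beta_1 \dots \beta_p]^T$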

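Least Squares Estimator
With $f(X_i) = X_i \beta$:
$\hat{\beta} = \arg\min_\beta \frac{1}{n} \sum_{i=1}^n (X_i\beta - Y_i)^2 = \arg\min_\beta \frac{1}{n} (A\beta - Y)^T (A\beta - Y)$
where $A$ is the $n \times p$ matrix whose i-th row is $X_i$, and $Y = [Y_1 \dots Y_n]^T$.
Writing $J(\beta) = (A\beta - Y)^T (A\beta - Y)$, the minimizer $\hat{\beta}$ satisfies $\partial J(\beta) / \partial \beta = 0$.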

Normal Equations
$(A^T A)\,\hat{\beta} = A^T Y$, with dimensions (p x p)(p x 1) = (p x 1)
If $A^T A$ is invertible, $\hat{\beta} = (A^T A)^{-1} A^T Y$
When is $A^T A$ invertible? Recall: full rank matrices are invertible. What is the rank of $A^T A$?
What if $A^T A$ is not invertible? Regularization (later)
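As a rough illustration of the normal equations $(A^T A)\hat{\beta} = A^T Y$, the sketch below solves them on synthetic data with NumPy. The data, the sizes `n` and `p`, and `beta_true` are made up for the example. It uses `np.linalg.solve` rather than forming the inverse explicitly, and compares against `np.linalg.lstsq` as a cross-check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3                                  # assumed sizes, illustrative only
A = rng.normal(size=(n, p))                    # row i of A is the sample X_i
beta_true = np.array([1.0, -2.0, 0.5])         # hypothetical ground truth
Y = A @ beta_true + 0.1 * rng.normal(size=n)   # linear signal plus small noise

# rank(A^T A) equals rank(A); A^T A is invertible when that rank is p.
print(np.linalg.matrix_rank(A) == p)

# Solve (A^T A) beta = A^T Y without forming the inverse explicitly.
beta_hat = np.linalg.solve(A.T @ A, A.T @ Y)

# Cross-check with NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(beta_hat, beta_lstsq)
```

Solving the linear system is generally preferred over computing $(A^T A)^{-1}$ directly, since it is cheaper and numerically better behaved.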

Gradient Descent
Even when $A^T A$ is invertible, solving the normal equations might be computationally expensive if A is huge.
Treat as an optimization problem: $\hat{\beta} = \arg\min_\beta J(\beta)$
Observation: J(β) is convex in β.
How to find the minimizer?
[Figure: a convex J(β₁) plotted against β₁, and a convex J(β₁, β₂) plotted over (β₁, β₂)]

Gradient Descent
Even when $A^T A$ is invertible, solving the normal equations might be computationally expensive if A is huge.
Since J(β) is convex, move along the negative of the gradient:
• Initialize: $\beta^0$
• Update: $\beta^{t+1} = \beta^t - \alpha \left. \frac{\partial J(\beta)}{\partial \beta} \right|_{\beta^t}$, where α is the step size; the gradient is 0 if $\beta^t = \hat{\beta}$
• Stop: when some criterion is met, e.g. a fixed number of iterations, or $\left\| \left. \frac{\partial J(\beta)}{\partial \beta} \right|_{\beta^t} \right\| < \varepsilon$
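The initialize/update/stop steps can be sketched as follows for the least-squares objective, assuming $J(\beta) = (A\beta - Y)^T (A\beta - Y)$, whose gradient is $2 A^T (A\beta - Y)$. The function name `gradient_descent`, the synthetic data, and the step-size rule (half the reciprocal of the largest eigenvalue of $A^T A$) are assumptions chosen for the example, not prescriptions from the lecture.

```python
import numpy as np

def gradient_descent(A, Y, alpha, eps=1e-8, max_iter=10_000):
    """Minimize J(beta) = ||A beta - Y||^2 by gradient descent."""
    beta = np.zeros(A.shape[1])              # Initialize: beta^0 = 0
    for _ in range(max_iter):                # Stop: fixed number of iterations ...
        grad = 2.0 * A.T @ (A @ beta - Y)    # gradient of J at the current beta
        if np.linalg.norm(grad) < eps:       # ... or gradient norm below eps
            break
        beta = beta - alpha * grad           # Update: move along the negative gradient
    return beta

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))                # synthetic data, illustrative only
Y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

# Assumed step-size rule: half of 1 / (largest eigenvalue of A^T A).
alpha = 0.5 / np.linalg.eigvalsh(A.T @ A).max()
beta_gd = gradient_descent(A, Y, alpha)

# Reference solution from a direct least-squares solve.
beta_ls, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(beta_gd, beta_ls)
```

Each iteration costs two matrix-vector products with A, which is why this route can be attractive when A is too large to factor.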

Effect of step-size α
• Large α => Fast convergence but larger residual error. Also possible oscillations.
• Small α => Slow convergence but small residual error.
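A small experiment can make the trade-off concrete. The sketch below runs a fixed budget of 50 gradient steps on a synthetic least-squares problem with three step sizes and reports the distance to the least-squares solution. The data, the three values of α, and the helper `distance_after` are assumptions for illustration; the "too large" case shows the oscillating, diverging behaviour rather than a residual-error plateau.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 3))                # synthetic data, illustrative only
Y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
beta_ls, *_ = np.linalg.lstsq(A, Y, rcond=None)
lam_max = np.linalg.eigvalsh(A.T @ A).max()  # largest eigenvalue of A^T A

def distance_after(alpha, n_iter=50):
    """Distance to the least-squares solution after n_iter gradient steps."""
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        beta = beta - alpha * 2.0 * A.T @ (A @ beta - Y)
    return float(np.linalg.norm(beta - beta_ls))

dist = {
    "small": distance_after(0.01 / lam_max),     # stable but slow
    "moderate": distance_after(0.5 / lam_max),   # stable and fast
    "too large": distance_after(1.1 / lam_max),  # overshoots: iterates oscillate and grow
}
print(dist)
```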

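Least Squares and MLE
Intuition: signal plus (zero-mean) noise model
$Y_i = f^*(X_i) + \epsilon_i = X_i \beta^* + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$ independent, so $Y_i \sim N(X_i \beta^*, \sigma^2)$
$\hat{\beta}_{MLE} = \arg\max_\beta \log p(Y_1, \dots, Y_n \mid \beta)$ (log likelihood) $= \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2$
The Least Squares Estimate is the same as the Maximum Likelihood Estimate under a Gaussian model!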
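The equivalence can be worked out in a few lines. The derivation below assumes independent Gaussian noise with a known, fixed variance $\sigma^2$, so that $Y_i \sim N(X_i\beta, \sigma^2)$; terms that do not depend on β drop out of the maximization.

```latex
\begin{align*}
\log p(Y_1,\dots,Y_n \mid \beta)
  &= \sum_{i=1}^n \log\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left(-\frac{(X_i\beta - Y_i)^2}{2\sigma^2}\right)\right] \\
  &= -\frac{n}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i\beta - Y_i)^2, \\
\hat{\beta}_{\mathrm{MLE}}
  &= \arg\max_\beta \log p(Y_1,\dots,Y_n \mid \beta)
   = \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2 .
\end{align*}
```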

Regularized Least Squares and MAP
What if $A^T A$ is not invertible?
$\hat{\beta}_{MAP} = \arg\max_\beta \; \log p(Y_1, \dots, Y_n \mid \beta) + \log p(\beta)$ (log likelihood + log prior)
I) Gaussian Prior: $\beta \sim N(0, \tau^2 I)$, a density centered at 0
$\hat{\beta}_{MAP} = \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2 + \lambda \|\beta\|_2^2$ (Ridge Regression), where λ is a constant determined by $\sigma^2$ and $\tau^2$
$\hat{\beta}_{MAP} = (A^T A + \lambda I)^{-1} A^T Y$
The prior belief that β is Gaussian with zero mean biases the solution toward "small" β.
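The ridge closed form $(A^T A + \lambda I)^{-1} A^T Y$ remains well defined even when $A^T A$ is singular. The sketch below builds such a case by duplicating a feature column; the data and the value `lam = 1.0` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=n)
# Duplicate the first feature so that A (and hence A^T A) is rank deficient.
A = np.column_stack([x, x, rng.normal(size=n)])
Y = 3.0 * x + 0.1 * rng.normal(size=n)
p = A.shape[1]

# rank(A^T A) equals rank(A), which is below p here, so A^T A is not invertible.
print(np.linalg.matrix_rank(A), "<", p)

lam = 1.0  # assumed regularization strength, illustrative only
# A^T A + lam * I is positive definite for lam > 0, so this system has a unique solution.
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)
print(beta_ridge)
```

In this example the two identical columns receive equal weights, because the ridge objective is symmetric in them and its minimizer is unique.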

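Regularized Least Squares and MAP
What if $A^T A$ is not invertible?
$\hat{\beta}_{MAP} = \arg\max_\beta \; \log p(Y_1, \dots, Y_n \mid \beta) + \log p(\beta)$ (log likelihood + log prior)
II) Laplace Prior: $p(\beta) \propto \exp(-\|\beta\|_1 / t)$
$\hat{\beta}_{MAP} = \arg\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2 + \lambda \|\beta\|_1$ (Lasso)
The prior belief that β is Laplace with zero mean biases the solution toward "small" β.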

Ridge Regression vs Lasso
Ridge Regression: $\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2 + \lambda \|\beta\|_2^2$
Lasso: $\min_\beta \sum_{i=1}^n (X_i\beta - Y_i)^2 + \lambda \|\beta\|_1$ (HOT!)
Ideally an l0 penalty, but the optimization becomes non-convex.
[Figure: in the (β₁, β₂) plane, the level sets of J(β) (βs with constant J(β)) drawn against the sets of βs with constant l2 norm, constant l1 norm, and constant l0 norm]
Lasso (l1 penalty) results in sparse solutions: vectors with more zero coordinates.
Good for high-dimensional problems: don't have to store all coordinates!
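The sparsity claim can be illustrated numerically. The lecture does not say how to solve the Lasso; the sketch below swaps in proximal gradient descent (ISTA, i.e. a gradient step followed by soft-thresholding) as one possible solver, and compares it with the ridge closed form. The data, `lam = 20.0`, and the function names `soft_threshold` and `lasso_ista` are assumptions for the example.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1: shrink toward 0, clipping small entries to 0."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_ista(A, Y, lam, n_iter=2_000):
    """Minimize ||A beta - Y||^2 + lam * ||beta||_1 by proximal gradient (ISTA)."""
    # Step size 1/L, where L = 2 * (largest eigenvalue of A^T A).
    alpha = 0.5 / np.linalg.eigvalsh(A.T @ A).max()
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ beta - Y)                       # gradient of the squared loss
        beta = soft_threshold(beta - alpha * grad, alpha * lam)  # gradient step, then shrink
    return beta

rng = np.random.default_rng(0)
n, p = 100, 10
A = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:2] = [3.0, -2.0]                  # hypothetical sparse ground truth
Y = A @ beta_true + 0.1 * rng.normal(size=n)

lam = 20.0                                   # assumed penalty strength
beta_lasso = lasso_ista(A, Y, lam)
beta_ridge = np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)

print("nonzeros (lasso):", np.count_nonzero(beta_lasso))
print("nonzeros (ridge):", np.count_nonzero(beta_ridge))
```

Ridge shrinks every coordinate but leaves them nonzero, while the soft-thresholding step sets small coordinates exactly to zero.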

Beyond Linear Regression
• Polynomial regression
• Regression with nonlinear features
• Later …: Kernel regression (local/weighted regression)

Polynomial Regression
Univariate (1-dim) case, degree m:
$f(X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_m X^m = \mathbf{X}\beta$, where $\mathbf{X} = [1 \; X \; X^2 \dots X^m]$ and $\beta = [\beta_0 \dots \beta_m]^T$
Multivariate (p-dim) case:
$f(X) = \beta_0 + \beta_1 X^{(1)} + \beta_2 X^{(2)} + \dots + \beta_p X^{(p)} + \sum_{i=1}^p \sum_{j=1}^p \beta_{ij} X^{(i)} X^{(j)} + \sum_{i=1}^p \sum_{j=1}^p \sum_{k=1}^p \beta_{ijk} X^{(i)} X^{(j)} X^{(k)} + \dots$ terms up to degree m
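In the univariate case, polynomial regression is ordinary least squares on the feature vector $[1, X, X^2, \dots, X^m]$. A minimal sketch, assuming noiseless synthetic data from a known cubic so that the fit can be checked; `beta_true` and the sample size are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3                                        # polynomial degree
x = rng.uniform(0.0, 1.0, size=50)
beta_true = np.array([1.0, -2.0, 0.5, 3.0])  # hypothetical coefficients beta_0..beta_3

# Design matrix with columns 1, x, x^2, ..., x^m.
Phi = np.vander(x, m + 1, increasing=True)
y = Phi @ beta_true                          # noiseless targets, for a checkable example

# The model is linear in beta, so the usual least-squares solve applies.
beta_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(beta_hat)
```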

Polynomial Regression
Polynomial of order k, equivalently of degree up to k-1.
[Figure: four panels showing polynomial fits of order k = 1, 2, 3 and 7, each over the horizontal range 0 to 1]
What is the right order? Recall overfitting! More later …
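One reason the order cannot be chosen from training error alone: the model classes are nested, so the training error never increases with k. The sketch below fits orders k = 1, 2, 3, 7 (degree up to k-1) to toy data; the sine-plus-noise data set is an assumption and is not the data behind the lecture's plots.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
# Assumed toy data: one period of a sine wave plus small Gaussian noise.
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.normal(size=x.size)

train_mse = {}
for k in (1, 2, 3, 7):
    Phi = np.vander(x, k, increasing=True)   # order k: columns 1, x, ..., x^(k-1)
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    train_mse[k] = float(np.mean((Phi @ beta - y) ** 2))

# Training error shrinks as k grows, even when the fit starts chasing noise.
print(train_mse)
```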

Regression with nonlinear features
$f(X) = \sum_{j=0}^m \beta_j \phi_j(X)$, where $\beta_j$ is the weight of each feature and the $\phi_j(X)$ are the nonlinear features
In general, use any nonlinear features, e.g. $e^X$, $\log X$, $1/X$, $\sin(X)$, …
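Because the model stays linear in the weights, the same least-squares machinery applies once the features are computed. A minimal sketch, assuming the feature set [1, log X, sin X] and noiseless synthetic data; the helper `features` and the weights `w_true` are hypothetical.

```python
import numpy as np

def features(x):
    """Map each scalar x to the assumed nonlinear features [1, log x, sin x]."""
    return np.column_stack([np.ones_like(x), np.log(x), np.sin(x)])

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 5.0, size=100)          # positive inputs so that log x is defined
w_true = np.array([0.5, 2.0, -1.0])          # hypothetical feature weights
y = features(x) @ w_true                     # noiseless targets, for a checkable example

# f(X) is nonlinear in X but linear in the weights, so least squares still works.
w_hat, *_ = np.linalg.lstsq(features(x), y, rcond=None)
print(w_hat)
```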
