Linear Regression and the Bias Variance Tradeoff


  1. Linear Regression and the Bias Variance Tradeoff. Guest Lecturer: Joseph E. Gonzalez. Slides available here: http://tinyurl.com/reglecture

  2. Simple Linear Regression: response Y plotted against covariate X. Linear model: Y = mX + b, where m is the slope and b is the intercept (bias).

  3. Motivation
     • One of the most widely used techniques
     • Fundamental to many larger models
       – Generalized Linear Models
       – Collaborative filtering
     • Easy to interpret
     • Efficient to solve

  4. Multiple Linear Regression

  5. The Regression Model
     • For a single data point (x, y): x is the independent variable (a vector, x ∈ R^p) and y is the response variable (a scalar, y ∈ R). We observe x and condition on it.
     • Joint probability (discriminative model): p(x, y) = p(x) p(y | x)

  6. The Linear Model
     • y = θ^T x + ε, where y is the scalar real-valued response, θ is the vector of parameters, x is the vector of covariates, and ε is noise.
     • Noise model: ε ∼ N(0, σ^2)
     • θ^T x = Σ_{i=1}^p θ_i x_i is a linear combination of the covariates.
     • What about the bias/intercept term? Define x_{p+1} = 1, then redefine p := p + 1 for notational simplicity.

  7. Conditional Likelihood p(y | x)
     • Conditioned on x, θ^T x is a constant and ε ∼ N(0, σ^2) is normally distributed, so the mean is θ^T x and the variance is σ^2.
     • Conditional distribution of Y: Y ∼ N(θ^T x, σ^2), i.e.
       p(y | x) = (1 / (√(2π) σ)) exp( −(y − θ^T x)^2 / (2σ^2) )
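
For concreteness, here is a minimal R sketch that evaluates this conditional density both explicitly and via dnorm. The parameter vector, noise level, and data point below are made up purely for illustration; they are not from the lecture.

    # hypothetical parameters and a single data point
    theta <- c(1.5, -0.7)        # parameter vector
    sigma <- 0.5                 # noise standard deviation
    x <- c(2.0, 1.0)             # covariate vector
    y <- 2.1                     # observed response

    mu <- sum(theta * x)         # conditional mean theta^T x

    # density written out explicitly ...
    p_manual <- exp(-(y - mu)^2 / (2 * sigma^2)) / (sqrt(2 * pi) * sigma)
    # ... and via the built-in normal density
    p_dnorm  <- dnorm(y, mean = mu, sd = sigma)

    c(p_manual, p_dnorm)         # the two values should agree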

  8. Parameters and Random Variables
     • Conditional distribution of y: y ∼ N(θ^T x, σ^2)
     • Bayesian view: parameters as random variables, p(y | x, θ, σ^2)
     • Frequentist view: parameters as (unknown) constants, p_{θ,σ^2}(y | x)

  9. So far … [figure: a single data point ("I'm lonely") in the space spanned by X_1, X_2, and Y]

  10. Independent and Identically Distributed (iid) Data
     • For n data points: D = {(x_1, y_1), …, (x_n, y_n)} = {(x_i, y_i)}_{i=1}^n
     • Plate diagram: x_i is the independent variable (a vector, x_i ∈ R^p) and y_i the response variable (a scalar, y_i ∈ R), for i ∈ {1, …, n}.

  11. Joint Probability
     • For n data points, independent and identically distributed (iid):
       p(D) = Π_{i=1}^n p(x_i, y_i) = Π_{i=1}^n p(x_i) p(y_i | x_i)

  12. Rewriting with Matrix Notation
     • Represent the data D = {(x_i, y_i)}_{i=1}^n as the covariate (design) matrix X ∈ R^{n×p}, whose i-th row is x_i^T, and the response vector Y = (y_1, …, y_n) ∈ R^n.
     • Assume X has rank p (not degenerate).

  13. Rewriting with Matrix Notation
     • Rewriting the model using matrix operations: Y = Xθ + ε, where Y is n×1, X is n×p, θ is p×1, and ε is n×1.
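
As a quick illustration, here is a hedged R sketch that simulates data from this matrix model. The dimensions n and p, the "true" theta, and sigma are all invented for the example; later snippets reuse X, Y, sigma, and theta_true.

    set.seed(1)
    n <- 100; p <- 3
    X <- cbind(matrix(rnorm(n * (p - 1)), n, p - 1), 1)  # last column of 1s = intercept term
    theta_true <- c(2, -1, 0.5)                           # made-up "true" parameters
    sigma <- 0.3
    eps <- rnorm(n, mean = 0, sd = sigma)                 # Gaussian noise
    Y <- X %*% theta_true + eps                           # Y = X theta + eps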

  14. Estimating the Model
     • Given data, how can we estimate θ in Y = Xθ + ε?
     • Construct the maximum likelihood estimator (MLE):
       – Derive the log-likelihood
       – Find the θ_MLE that maximizes the log-likelihood
         • Analytically: take the derivative and set it to 0
         • Iteratively: (stochastic) gradient descent

  15. Joint Probability (discriminative model)
     • For n data points: p(D) = Π_{i=1}^n p(x_i, y_i) = Π_{i=1}^n p(x_i) p(y_i | x_i)
     • The discriminative model treats p(x_i) as a constant ("1") and models only p(y_i | x_i).

  16. Defining the Likelihood
     • p_θ(y | x) = (1 / (√(2π) σ)) exp( −(y − θ^T x)^2 / (2σ^2) )
     • L(θ | D) = Π_{i=1}^n p_θ(y_i | x_i)
                = Π_{i=1}^n (1 / (√(2π) σ)) exp( −(y_i − θ^T x_i)^2 / (2σ^2) )
                = (1 / (σ^n (2π)^{n/2})) exp( −(1 / (2σ^2)) Σ_{i=1}^n (y_i − θ^T x_i)^2 )
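
To make the algebra concrete, a small R sketch (reusing the simulated X, Y, and sigma from the earlier snippet, plus an arbitrary candidate theta of my own choosing) computes the log-likelihood both as a sum of log densities and from the closed form above.

    theta <- c(1, 1, 1)                       # arbitrary candidate parameters
    resid <- Y - X %*% theta                  # y_i - theta^T x_i

    # log-likelihood as a sum of log Gaussian densities
    ll_dnorm <- sum(dnorm(Y, mean = X %*% theta, sd = sigma, log = TRUE))

    # same quantity from the closed form: -(n/2) log(2 pi sigma^2) - SSE / (2 sigma^2)
    ll_manual <- -length(Y) / 2 * log(2 * pi * sigma^2) - sum(resid^2) / (2 * sigma^2)

    c(ll_dnorm, ll_manual)                    # should match up to rounding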

  17. Maximizing the Likelihood
     • Want to compute: θ̂_MLE = arg max_{θ ∈ R^p} L(θ | D)
     • To simplify the calculations we take the log: θ̂_MLE = arg max_{θ ∈ R^p} log L(θ | D), which does not affect the maximization because log is a monotone function.

  18. Recall: L(θ | D) = (1 / (σ^n (2π)^{n/2})) exp( −(1 / (2σ^2)) Σ_{i=1}^n (y_i − θ^T x_i)^2 )
     • Take the log: log L(θ | D) = −log( σ^n (2π)^{n/2} ) − (1 / (2σ^2)) Σ_{i=1}^n (y_i − θ^T x_i)^2
     • Removing additive constants and positive factors that do not depend on θ (which does not change the maximizer):
       log L(θ) = −Σ_{i=1}^n (y_i − θ^T x_i)^2   (easy to maximize)

  19. log L(θ) = −Σ_{i=1}^n (y_i − θ^T x_i)^2
     • Want to compute: θ̂_MLE = arg max_{θ ∈ R^p} log L(θ | D)
     • Plugging in the log-likelihood: θ̂_MLE = arg max_{θ ∈ R^p} −Σ_{i=1}^n (y_i − θ^T x_i)^2

  20. θ̂_MLE = arg max_{θ ∈ R^p} −Σ_{i=1}^n (y_i − θ^T x_i)^2
     • Dropping the sign and flipping from maximization to minimization:
       θ̂_MLE = arg min_{θ ∈ R^p} Σ_{i=1}^n (y_i − θ^T x_i)^2
     • Minimize the sum of squared errors.
     • Gaussian noise model ⇒ squared loss: least squares regression.

  21. Pictorial Interpretation of Squared Error [figure: squared vertical residuals between the observed points and the fitted line in the (x, y) plane]

  22. Maximizing the Likelihood (Minimizing the Squared Error)
     • θ̂_MLE = arg min_{θ ∈ R^p} Σ_{i=1}^n (y_i − θ^T x_i)^2
     • −log L(θ) is a convex function of θ; its minimum θ̂_MLE is where the slope is 0.
     • Take the gradient and set it equal to zero.

  23. Minimizing the Squared Error
     • θ̂_MLE = arg min_{θ ∈ R^p} Σ_{i=1}^n (y_i − θ^T x_i)^2
     • Taking the gradient:
       −∇_θ log L(θ) = ∇_θ Σ_{i=1}^n (y_i − θ^T x_i)^2
                     = −2 Σ_{i=1}^n (y_i − θ^T x_i) x_i        (chain rule)
                     = −2 Σ_{i=1}^n y_i x_i + 2 Σ_{i=1}^n (θ^T x_i) x_i

  24. Rewriting the gradient in matrix form:
       −∇_θ log L(θ) = −2 Σ_{i=1}^n y_i x_i + 2 Σ_{i=1}^n (θ^T x_i) x_i = −2 X^T Y + 2 X^T X θ
     • To check that the negative log-likelihood is convex, compute the second derivative (Hessian): −∇^2 log L(θ) = 2 X^T X
     • If X is full rank then X^T X is positive definite, and therefore θ_MLE is the unique minimizer.
       – Address the degenerate cases with regularization.
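
As a sanity check, a small R sketch (again reusing the simulated X, Y, and the candidate theta from the earlier snippets) compares the summed and matrix forms of the gradient numerically.

    # gradient as an explicit sum over data points: -2 sum_i (y_i - theta^T x_i) x_i
    grad_sum <- -2 * Reduce(`+`, lapply(seq_len(nrow(X)), function(i) {
      (Y[i] - sum(theta * X[i, ])) * X[i, ]
    }))

    # gradient in matrix form: -2 X^T Y + 2 X^T X theta
    grad_mat <- -2 * t(X) %*% Y + 2 * t(X) %*% X %*% theta

    cbind(grad_sum, grad_mat)   # the two columns should agree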

  25. Setting the gradient equal to 0 and solving for θ_MLE:
       −∇_θ log L(θ) = −2 X^T Y + 2 X^T X θ = 0
     • This gives the normal equations: (X^T X) θ̂_MLE = X^T Y
     • Therefore: θ̂_MLE = (X^T X)^{-1} X^T Y   (a p×1 vector)
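
A hedged R sketch, run on the simulated X and Y from before, solves the normal equations and compares the estimate to the invented theta_true.

    XtX <- t(X) %*% X              # p x p
    XtY <- t(X) %*% Y              # p x 1
    theta_mle <- solve(XtX, XtY)   # solve (X^T X) theta = X^T Y

    cbind(theta_mle, theta_true)   # estimate vs. the simulated "truth"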

  26. Geometric Interpretation
     • View the MLE as finding the projection of Y onto col(X).
     • Define the estimator Ŷ = Xθ; observe that Ŷ is in col(X), a linear combination of the columns of X.
     • We want the Ŷ closest to Y, which implies (Y − Ŷ) is normal (orthogonal) to col(X):
       X^T (Y − Ŷ) = X^T (Y − Xθ) = 0  ⇒  X^T X θ = X^T Y
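
To see this orthogonality numerically, a brief R sketch using theta_mle from the previous snippet:

    Y_hat <- X %*% theta_mle       # projection of Y onto col(X)
    resid <- Y - Y_hat             # residual vector

    t(X) %*% resid                 # numerically zero: the residual is orthogonal to col(X)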

  27. Connection to the Pseudo-Inverse
     • θ̂_MLE = (X^T X)^{-1} X^T Y = X† Y, where X† = (X^T X)^{-1} X^T is the Moore-Penrose pseudoinverse (for X with full column rank).
     • The pseudoinverse is a generalization of the inverse. Consider the case when X is square and invertible:
       X† = (X^T X)^{-1} X^T = X^{-1} (X^T)^{-1} X^T = X^{-1}
     • This implies θ_MLE = X^{-1} Y, the solution to Xθ = Y when X is square and invertible.
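
For illustration, a sketch comparing the explicit formula with a library pseudoinverse on the same simulated X and Y. The MASS package comparison is left commented out since it assumes MASS is installed; it is my addition, not something from the slides.

    X_dagger <- solve(t(X) %*% X) %*% t(X)   # (X^T X)^{-1} X^T, valid when X has full column rank

    # if the MASS package is available, ginv() computes the Moore-Penrose pseudoinverse via the SVD:
    # library(MASS)
    # max(abs(X_dagger - ginv(X)))           # should be numerically tiny

    theta_via_pinv <- X_dagger %*% Y
    cbind(theta_via_pinv, theta_mle)         # matches the normal-equation solution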

  28. Computing the MLE
     • θ̂_MLE = (X^T X)^{-1} X^T Y is not typically computed by inverting X^T X.
     • Solved using direct methods:
       – Cholesky factorization: up to a factor of 2 faster
       – QR factorization: more numerically stable
       – Or use the built-in solver in your math library, e.g. in R: solve(Xt %*% X, Xt %*% y)
     • Solved using various iterative methods:
       – Krylov subspace methods
       – (Stochastic) Gradient Descent
     http://www.seas.ucla.edu/~vandenbe/103/lectures/qr.pdf
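
A short R sketch fits the same model through three routes on the simulated X and Y. Using base R's qr.solve and lm here is my own choice of functions, not something specified on the slide.

    y_vec <- as.vector(Y)                            # plain numeric response vector

    theta_normal <- solve(t(X) %*% X, t(X) %*% y_vec)   # normal equations, as on the slide
    theta_qr     <- qr.solve(X, y_vec)                  # least squares via a QR factorization
    theta_lm     <- coef(lm(y_vec ~ X - 1))             # built-in fit (no extra intercept column)

    cbind(theta_normal, theta_qr, theta_lm)             # all three should agree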

  29. Cholesky Factorization
     Solve (X^T X) θ̂_MLE = X^T Y:
     • Compute the symmetric matrix C = X^T X                      O(np^2)
     • Compute the vector d = X^T Y                                O(np)
     • Cholesky factorization L L^T = C, with L lower triangular   O(p^3)
     • Forward substitution to solve L z = d                       O(p^2)
     • Backward substitution to solve L^T θ̂_MLE = z                O(p^2)
     Connections to graphical model inference: http://ssg.mit.edu/~willsky/publ_pdfs/185_pub_MLR.pdf and http://yaroslavvb.blogspot.com/2011/02/junction-trees-in-numerical-analysis.html (with illustrations)
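
The same pipeline as a minimal R sketch, continuing from the simulated X and Y. Note that R's chol() returns the upper-triangular factor, so the lower factor L is its transpose.

    C <- t(X) %*% X                 # symmetric p x p matrix, O(n p^2)
    d <- t(X) %*% Y                 # p-vector, O(n p)

    U <- chol(C)                    # upper triangular, C = U^T U, so L = t(U)
    L <- t(U)

    z <- forwardsolve(L, d)         # solve L z = d by forward substitution
    theta_chol <- backsolve(U, z)   # solve L^T theta = U theta = z by back substitution

    cbind(theta_chol, theta_mle)    # matches the normal-equation solution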

  30. Solving a Triangular System
       [ A11 A12 A13 A14 ] [ x1 ]   [ b1 ]
       [  0  A22 A23 A24 ] [ x2 ] = [ b2 ]
       [  0   0  A33 A34 ] [ x3 ]   [ b3 ]
       [  0   0   0  A44 ] [ x4 ]   [ b4 ]

  31. Solving a Triangular System (back substitution, from the last row up):
       x4 = b4 / A44
       x3 = (b3 − A34 x4) / A33
       x2 = (b2 − A23 x3 − A24 x4) / A22
       x1 = (b1 − A12 x2 − A13 x3 − A14 x4) / A11
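
A generic back-substitution loop in R, written from the recurrence above. The random upper-triangular A and right-hand side b are invented for the test, and base R's backsolve is used only to check the result.

    # solve A x = b for an upper-triangular A by back substitution
    back_substitute <- function(A, b) {
      n <- length(b)
      x <- numeric(n)
      for (i in n:1) {
        s <- if (i < n) sum(A[i, (i + 1):n] * x[(i + 1):n]) else 0
        x[i] <- (b[i] - s) / A[i, i]
      }
      x
    }

    set.seed(2)
    A <- matrix(rnorm(16), 4, 4)
    A[lower.tri(A)] <- 0                            # keep only the upper triangle
    diag(A) <- abs(diag(A)) + 1                     # ensure a well-conditioned diagonal
    b <- rnorm(4)

    cbind(back_substitute(A, b), backsolve(A, b))   # the two columns should agree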

  32. Distributed Direct Solution (Map-Reduce)
     • θ̂_MLE = (X^T X)^{-1} X^T Y
     • Distribute the computation of the sums:
       C = X^T X = Σ_{i=1}^n x_i x_i^T   (p×p)   O(np^2)
       d = X^T Y = Σ_{i=1}^n x_i y_i     (p×1)   O(np)
     • Solve the system C θ_MLE = d on the master.   O(p^3)
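
A toy R sketch of the map-reduce pattern: each "mapper" computes the partial sums of x_i x_i^T and x_i y_i over its chunk of rows, and the "reducer" adds them and solves the small p×p system. The chunking is simulated in a single process, purely for illustration.

    chunks <- split(seq_len(nrow(X)), rep(1:4, length.out = nrow(X)))   # pretend 4 workers

    # "map": per-chunk sufficient statistics
    partials <- lapply(chunks, function(idx) {
      Xc <- X[idx, , drop = FALSE]
      yc <- Y[idx]
      list(C = t(Xc) %*% Xc, d = t(Xc) %*% yc)
    })

    # "reduce": add the partial sums, then solve the p x p system on the master
    C <- Reduce(`+`, lapply(partials, `[[`, "C"))
    d <- Reduce(`+`, lapply(partials, `[[`, "d"))
    theta_mr <- solve(C, d)

    cbind(theta_mr, theta_mle)   # matches the centralized solution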

  33. Gradient Descent: What if p is large? (e.g., p = n/2)
     • The cost of O(np^2) = O(n^3) could be prohibitive.
     • Solution: iterative methods.
     • Gradient descent: for τ from 0 until convergence,
       θ^(τ+1) = θ^(τ) − ρ(τ) ∇_θ [ −log L(θ^(τ) | D) ]
       where ρ(τ) is the learning rate.

  34. Gradient Descent Illustrated
     [figure: the convex function −log L(θ) with iterates θ^(0), θ^(1), θ^(2), θ^(3) stepping toward the point of zero slope, θ^(3) ≈ θ̂_MLE]

  35. Gradient Descent: What if p is large? (e.g., p = n/2)
     • The cost of O(np^2) = O(n^3) could be prohibitive.
     • Solution: iterative methods.
     • Gradient descent: for τ from 0 until convergence,
       θ^(τ+1) = θ^(τ) − ρ(τ) ∇_θ [ −log L(θ^(τ) | D) ]
                = θ^(τ) + ρ(τ) (1/n) Σ_{i=1}^n (y_i − θ^(τ)T x_i) x_i     O(np) per iteration
       (constant factors are absorbed into the learning rate ρ)
     • The sum is a full-data estimate of the gradient. Can we do better?
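
A minimal R sketch of this batch update follows. The step size, iteration count, and zero initialization are arbitrary choices for the toy data, not values from the lecture.

    gradient_descent <- function(X, y, rho = 0.1, iters = 500) {
      n <- nrow(X)
      theta <- rep(0, ncol(X))                            # start from zero
      for (tau in seq_len(iters)) {
        resid <- as.vector(y - X %*% theta)               # y_i - theta^T x_i
        grad_step <- (1 / n) * as.vector(t(X) %*% resid)  # averaged gradient direction, O(np)
        theta <- theta + rho * grad_step
      }
      theta
    }

    theta_gd <- gradient_descent(X, Y)
    cbind(theta_gd, theta_mle)   # should be close after enough iterations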

  36. Stochastic Gradient Descent
     • Construct a noisy estimate of the gradient. For τ from 0 until convergence:
       1) pick a random i
       2) θ^(τ+1) = θ^(τ) + ρ(τ) (y_i − θ^(τ)T x_i) x_i        O(p) per update
     • Sensitive to the choice of ρ(τ); typically ρ(τ) = 1/τ.
     • Also known as Least-Mean-Squares (LMS).
     • Applies to streaming data; O(p) storage.
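
And a corresponding stochastic sketch. The 1/τ schedule follows the slide, but the offset of 10 in the denominator (to tame the first few steps) and the number of steps are my own choices for the toy data.

    sgd <- function(X, y, steps = 5000) {
      theta <- rep(0, ncol(X))
      for (tau in seq_len(steps)) {
        i <- sample(nrow(X), 1)                        # 1) pick a random data point
        resid_i <- as.numeric(y[i] - sum(theta * X[i, ]))
        rho <- 1 / (tau + 10)                          # decaying learning rate, roughly 1/tau
        theta <- theta + rho * resid_i * X[i, ]        # 2) O(p) update
      }
      theta
    }

    set.seed(3)
    theta_sgd <- sgd(X, Y)
    cbind(theta_sgd, theta_mle)   # noisy, but should land near the exact solution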

  37. Fitting Non-linear Data
     • What if Y has a non-linear response?
       [figure: scatter plot of data with a clearly non-linear response in x]
     • Can we still use a linear model?
