Linear Regression and the Bias-Variance Tradeoff
Guest Lecturer: Joseph E. Gonzalez
Slides available here: http://tinyurl.com/reglecture
Simple Linear Regression
• Linear model: Y = mX + b, where Y is the response variable, X is the covariate, m is the slope, and b is the intercept (bias).
Motivation
• One of the most widely used techniques
• Fundamental to many larger models
  – Generalized Linear Models
  – Collaborative filtering
• Easy to interpret
• Efficient to solve
Multiple Linear Regression
The Regression Model
• For a single data point (x, y): $x \in \mathbb{R}^p$ is the independent variable (a vector), $y \in \mathbb{R}$ is the response variable (a scalar); we observe (condition on) x.
• Joint probability (discriminative model): $p(x, y) = p(x)\, p(y \mid x)$
The Linear Model
• $y = \theta^T x + \epsilon$, where $\theta$ is the vector of parameters, $x$ is the vector of covariates, $y$ is the scalar real-valued response, and $\epsilon$ is the noise.
• Noise model: $\epsilon \sim N(0, \sigma^2)$
• $\theta^T x = \sum_{i=1}^{p} \theta_i x_i$ is a linear combination of the covariates.
• What about the bias/intercept term? Define $x_{p+1} = 1$, then redefine $p := p + 1$ for notational simplicity.
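A minimal NumPy sketch (not from the slides; the array values are made up) of the intercept trick just described: append a constant feature of 1 to every row of the design matrix so the intercept is absorbed into θ.

    import numpy as np

    # Hypothetical design matrix with n = 3 points and p = 2 covariates.
    X = np.array([[0.5, 1.2],
                  [1.0, 0.3],
                  [2.0, 2.5]])
    # Append the constant feature x_{p+1} = 1, so p becomes p + 1.
    X_with_bias = np.hstack([X, np.ones((X.shape[0], 1))])
    print(X_with_bias.shape)  # (3, 3)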
Conditional Likelihood p(y|x)
• Conditioned on x: $y = \theta^T x + \epsilon$, with $\theta^T x$ a constant and $\epsilon \sim N(0, \sigma^2)$.
• Conditional distribution of Y: $Y \sim N(\theta^T x, \sigma^2)$ (mean $\theta^T x$, variance $\sigma^2$), i.e.
  $p(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( \frac{-(y - \theta^T x)^2}{2\sigma^2} \right)$
Parameters and Random Variables
• Conditional distribution of y: $y \sim N(\theta^T x, \sigma^2)$, with parameters $\theta$ and $\sigma^2$.
  – Bayesian view: parameters as random variables, $p(y \mid x, \theta, \sigma^2)$
  – Frequentist view: parameters as (unknown) constants, $p_{\theta, \sigma^2}(y \mid x)$
So far … (figure: a single observed data point plotted against X₁, X₂, and Y, captioned "I'm lonely")
Independent and Identically Distributed (iid) Data
• For n data points: $D = \{(x_1, y_1), \ldots, (x_n, y_n)\} = \{(x_i, y_i)\}_{i=1}^{n}$
• Plate diagram: for each $i \in \{1, \ldots, n\}$, $x_i \in \mathbb{R}^p$ is the independent variable (a vector) and $y_i \in \mathbb{R}$ is the response variable (a scalar).
Joint Probability
• For n independent and identically distributed (iid) data points:
  $p(D) = \prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(x_i)\, p(y_i \mid x_i)$
Rewriting with Matrix Notation
• Represent the data $D = \{(x_i, y_i)\}_{i=1}^{n}$ as a covariate (design) matrix and a response vector:
  $X = \begin{bmatrix} x_1^T \\ \vdots \\ x_n^T \end{bmatrix} \in \mathbb{R}^{n \times p}, \qquad Y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^{n}$
• Assume X has rank p (not degenerate).
Rewriting with Matrix Notation
• Rewriting the model using matrix operations: $Y = X\theta + \epsilon$, where $Y \in \mathbb{R}^{n}$, $X \in \mathbb{R}^{n \times p}$, $\theta \in \mathbb{R}^{p}$, and $\epsilon \in \mathbb{R}^{n}$.
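An illustrative NumPy sketch (the dimensions, seed, and θ values are my own) that generates synthetic data from this matrix form of the model, $Y = X\theta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I)$:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma = 100, 3, 0.5
    X = rng.normal(size=(n, p))                # design matrix, n x p
    theta_true = np.array([1.0, -2.0, 0.5])    # "true" parameters (assumed for illustration)
    epsilon = rng.normal(scale=sigma, size=n)  # Gaussian noise
    Y = X @ theta_true + epsilon               # response vector, length n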
Estimating the Model
• Given data, how can we estimate θ in $Y = X\theta + \epsilon$?
• Construct the maximum likelihood estimator (MLE):
  – Derive the log-likelihood
  – Find the $\theta_{MLE}$ that maximizes the log-likelihood
    • Analytically: take the derivative and set it equal to 0
    • Iteratively: (stochastic) gradient descent
Joint Probability
• For n data points: $p(D) = \prod_{i=1}^{n} p(x_i, y_i) = \prod_{i=1}^{n} p(x_i)\, p(y_i \mid x_i)$
• Discriminative model: the marginal $p(x_i)$ does not depend on θ, so it can be treated as a constant.
Defining the Likelihood
• $p_\theta(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( \frac{-(y - \theta^T x)^2}{2\sigma^2} \right)$
• $L(\theta \mid D) = \prod_{i=1}^{n} p_\theta(y_i \mid x_i) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( \frac{-(y_i - \theta^T x_i)^2}{2\sigma^2} \right) = \frac{1}{\sigma^{n} (2\pi)^{n/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \right)$
Maximizing the Likelihood
• Want to compute: $\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} L(\theta \mid D)$
• To simplify the calculations we take the log: $\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} \log L(\theta \mid D)$, which does not affect the maximization because log is a monotone function.
• $L(\theta \mid D) = \frac{1}{\sigma^{n} (2\pi)^{n/2}} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 \right)$
• Take the log: $\log L(\theta \mid D) = -\log\!\left(\sigma^{n} (2\pi)^{n/2}\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• Removing the terms and positive factors that are constant with respect to θ (a monotone transformation, easy to maximize):
  $\log L(\theta) = -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
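A small sketch (the helper name is my own, using the X, Y, sigma conventions above) that evaluates this Gaussian log-likelihood directly, before any constants are dropped:

    import numpy as np

    def log_likelihood(theta, X, Y, sigma):
        # log L = -n*log(sigma) - (n/2)*log(2*pi) - sum_i (y_i - theta^T x_i)^2 / (2*sigma^2)
        residuals = Y - X @ theta
        n = len(Y)
        return (-n * np.log(sigma)
                - 0.5 * n * np.log(2.0 * np.pi)
                - np.sum(residuals ** 2) / (2.0 * sigma ** 2))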
• $\log L(\theta) = -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• Want to compute: $\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} \log L(\theta \mid D)$
• Plugging in the log-likelihood: $\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• $\hat{\theta}_{MLE} = \arg\max_{\theta \in \mathbb{R}^p} -\sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• Dropping the sign and flipping from maximization to minimization:
  $\hat{\theta}_{MLE} = \arg\min_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$ (minimize the sum of squared errors)
• Gaussian noise model ⇒ squared loss ⇒ least squares regression
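For concreteness, a tiny sketch (the helper name is my own) of the squared-error objective that $\hat{\theta}_{MLE}$ minimizes:

    import numpy as np

    def squared_error(theta, X, Y):
        # Sum of squared residuals: sum_i (y_i - theta^T x_i)^2
        residuals = Y - X @ theta
        return np.sum(residuals ** 2)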
Pictorial Interpretation of Squared Error (figure: y versus x, illustrating the squared errors between the data and the fitted line)
Maximizing the Likelihood (Minimizing the Squared Error)
• $\hat{\theta}_{MLE} = \arg\min_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• $-\log L(\theta)$ is a convex function of θ; its slope is 0 at $\hat{\theta}_{MLE}$.
• Take the gradient and set it equal to zero.
Minimizing the Squared Error
• $\hat{\theta}_{MLE} = \arg\min_{\theta \in \mathbb{R}^p} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$
• Taking the gradient:
  $-\nabla_\theta \log L(\theta) = \nabla_\theta \sum_{i=1}^{n} (y_i - \theta^T x_i)^2 = -2 \sum_{i=1}^{n} (y_i - \theta^T x_i)\, x_i$ (chain rule)
  $= -2 \sum_{i=1}^{n} y_i x_i + 2 \sum_{i=1}^{n} (\theta^T x_i)\, x_i$
• Rewriting the gradient in matrix form:
  $-\nabla_\theta \log L(\theta) = -2 \sum_{i=1}^{n} y_i x_i + 2 \sum_{i=1}^{n} (\theta^T x_i)\, x_i = -2 X^T Y + 2 X^T X \theta$
• To make sure the negative log-likelihood is convex, compute the second derivative (Hessian): $-\nabla^2 \log L(\theta) = 2 X^T X$
• If X is full rank then $X^T X$ is positive definite, and therefore $\hat{\theta}_{MLE}$ is the minimum
  – Address the degenerate cases with regularization
• Setting the gradient equal to 0 and solving for $\hat{\theta}_{MLE}$:
  $-\nabla_\theta \log L(\theta) = -2 X^T Y + 2 X^T X \theta = 0$
  $(X^T X)\,\hat{\theta}_{MLE} = X^T Y$ (the normal equations)
  $\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y$
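A hedged NumPy sketch of this closed form (function name and example data are my own; assumes X has full column rank): solve the normal equations with a linear solver rather than forming the explicit inverse.

    import numpy as np

    def fit_normal_equations(X, Y):
        # Solve (X^T X) theta = X^T Y for theta.
        return np.linalg.solve(X.T @ X, X.T @ Y)

    # Example on made-up data:
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)
    theta_mle = fit_normal_equations(X, Y)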
Geometric Interpretation
• View the MLE as finding a projection onto col(X)
  – Define the estimator: $\hat{Y} = X\theta$
  – Observe that $\hat{Y}$ is in col(X) (a linear combination of the columns of X)
  – Want $\hat{Y}$ to be closest to Y
    • Implies $(Y - \hat{Y})$ is normal to col(X):
      $X^T (Y - \hat{Y}) = X^T (Y - X\theta) = 0 \;\Rightarrow\; X^T X \theta = X^T Y$
Connection to the Pseudo-Inverse
• $\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y = X^{\dagger} Y$, where $X^{\dagger} = (X^T X)^{-1} X^T$ is the Moore-Penrose pseudoinverse.
• Generalization of the inverse:
  – Consider the case when X is square and invertible: $X^{\dagger} = (X^T X)^{-1} X^T = X^{-1} (X^T)^{-1} X^T = X^{-1}$
  – Which implies $\hat{\theta}_{MLE} = X^{-1} Y$, the solution to $X\theta = Y$ when X is square and invertible.
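As a sanity check (the example data are my own), NumPy exposes the pseudoinverse as np.linalg.pinv, and on a full-column-rank X it agrees with the normal-equations solution:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 4))
    Y = rng.normal(size=50)
    theta_pinv = np.linalg.pinv(X) @ Y            # pseudoinverse route
    theta_ne = np.linalg.solve(X.T @ X, X.T @ Y)  # normal-equations route
    print(np.allclose(theta_pinv, theta_ne))      # True when X has full column rank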
Computing the MLE
• $\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y$ is not typically computed by inverting $X^T X$.
• Solved using direct methods:
  – Cholesky factorization: up to a factor of 2 faster
  – QR factorization: more numerically stable
  – Or use the built-in solver in your math library, e.g. in R: solve(Xt %*% X, Xt %*% y)
• Solved using various iterative methods:
  – Krylov subspace methods
  – (Stochastic) gradient descent
http://www.seas.ucla.edu/~vandenbe/103/lectures/qr.pdf
Cholesky Factorization
• Solve $(X^T X)\,\hat{\theta}_{MLE} = X^T Y$, i.e. $C\,\hat{\theta}_{MLE} = d$:
  – Compute the symmetric matrix $C = X^T X$: O(np²)
  – Compute the vector $d = X^T Y$: O(np)
  – Cholesky factorization $L L^T = C$, where L is lower triangular: O(p³)
  – Forward substitution to solve $L z = d$: O(p²)
  – Backward substitution to solve $L^T \hat{\theta}_{MLE} = z$: O(p²)
Connections to graphical model inference: http://ssg.mit.edu/~willsky/publ_pdfs/185_pub_MLR.pdf and http://yaroslavvb.blogspot.com/2011/02/junction-trees-in-numerical-analysis.html (with illustrations)
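A sketch of that Cholesky route on made-up data (using SciPy's solve_triangular for the two triangular solves, which is an assumption about the available library):

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 5))
    Y = rng.normal(size=200)

    C = X.T @ X                                    # O(n p^2)
    d = X.T @ Y                                    # O(n p)
    L = np.linalg.cholesky(C)                      # O(p^3), L lower triangular, L L^T = C
    z = solve_triangular(L, d, lower=True)         # forward substitution, O(p^2)
    theta = solve_triangular(L.T, z, lower=False)  # backward substitution, O(p^2)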
Solving a Triangular System
$\begin{bmatrix} A_{11} & A_{12} & A_{13} & A_{14} \\ & A_{22} & A_{23} & A_{24} \\ & & A_{33} & A_{34} \\ & & & A_{44} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}$
Solving a Triangular System (solve from the bottom row up)
• $b_1 = A_{11} x_1 + A_{12} x_2 + A_{13} x_3 + A_{14} x_4 \;\Rightarrow\; x_1 = (b_1 - A_{12} x_2 - A_{13} x_3 - A_{14} x_4) / A_{11}$
• $b_2 = A_{22} x_2 + A_{23} x_3 + A_{24} x_4 \;\Rightarrow\; x_2 = (b_2 - A_{23} x_3 - A_{24} x_4) / A_{22}$
• $b_3 = A_{33} x_3 + A_{34} x_4 \;\Rightarrow\; x_3 = (b_3 - A_{34} x_4) / A_{33}$
• $b_4 = A_{44} x_4 \;\Rightarrow\; x_4 = b_4 / A_{44}$
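The same backward substitution as a short sketch (the function name and test data are my own):

    import numpy as np

    def backward_substitution(A, b):
        # Solve the upper-triangular system A x = b from the last row up.
        n = len(b)
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        return x

    A = np.triu(np.random.default_rng(4).normal(size=(4, 4))) + 4.0 * np.eye(4)
    b = np.array([1.0, 2.0, 3.0, 4.0])
    x = backward_substitution(A, b)
    print(np.allclose(A @ x, b))  # True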
Distributed Direct Solution (Map-Reduce)
• $\hat{\theta}_{MLE} = (X^T X)^{-1} X^T Y$
• Distribute the computation of the sums:
  – $C = X^T X = \sum_{i=1}^{n} x_i x_i^T$: O(np²)
  – $d = X^T Y = \sum_{i=1}^{n} x_i y_i$: O(np)
• Solve the system $C\,\hat{\theta}_{MLE} = d$ on the master: O(p³)
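A single-machine sketch of the map-reduce idea (the partitioning is simulated; the worker count, data, and names are my own): each partition contributes its partial sums of $x_i x_i^T$ and $x_i y_i$, and the master adds them and solves the small p × p system.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 1000, 5
    X = rng.normal(size=(n, p))
    Y = X @ np.arange(1.0, p + 1.0) + rng.normal(scale=0.1, size=n)

    # "Map": each of 4 simulated workers computes partial sums over its partition.
    partitions = np.array_split(np.arange(n), 4)
    partials = [(X[idx].T @ X[idx], X[idx].T @ Y[idx]) for idx in partitions]

    # "Reduce" on the master: add the partial sums, then solve the p x p system.
    C = sum(c for c, _ in partials)
    d = sum(d_ for _, d_ in partials)
    theta = np.linalg.solve(C, d)   # O(p^3) on the master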
Gradient Descent: What if p is large? (e.g., p = n/2)
• The cost of O(np²) = O(n³) could be prohibitive.
• Solution: iterative methods
  – Gradient descent on the negative log-likelihood: for τ from 0 until convergence,
    $\theta^{(\tau+1)} = \theta^{(\tau)} - \rho^{(\tau)} \nabla_\theta \left[ -\log L(\theta^{(\tau)} \mid D) \right]$, where $\rho^{(\tau)}$ is the learning rate.
Gradient Descent Illustrated (figure: iterates $\theta^{(0)}, \theta^{(1)}, \theta^{(2)}, \theta^{(3)}$ descending the convex function $-\log L(\theta)$ toward the point of zero slope, with $\theta^{(3)} = \hat{\theta}_{MLE}$)
Gradient Descent: What if p is large? (e.g., p = n/2)
• The cost of O(np²) = O(n³) could be prohibitive.
• Solution: iterative methods
  – Gradient descent on the negative log-likelihood: for τ from 0 until convergence,
    $\theta^{(\tau+1)} = \theta^{(\tau)} - \rho^{(\tau)} \nabla_\theta \left[ -\log L(\theta^{(\tau)} \mid D) \right] = \theta^{(\tau)} + \rho^{(\tau)} \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta^{(\tau)T} x_i)\, x_i$
    (the sum is an estimate of the gradient, with constant factors absorbed into the learning rate $\rho^{(\tau)}$); each iteration costs O(np).
• Can we do better?
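A hedged sketch of this batch gradient descent loop (the synthetic data, step size, and iteration count are my own choices):

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 500, 3
    X = rng.normal(size=(n, p))
    Y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)

    theta = np.zeros(p)
    rho = 0.5                                     # fixed learning rate (assumed)
    for _ in range(1000):
        grad = (1.0 / n) * X.T @ (Y - X @ theta)  # averaged gradient estimate, O(np)
        theta = theta + rho * grad                # move uphill on log L (downhill on the loss)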
Stochastic Gradient Descent
• Construct a noisy estimate of the gradient. For τ from 0 until convergence:
  1) pick a random i
  2) $\theta^{(\tau+1)} = \theta^{(\tau)} + \rho^{(\tau)} (y_i - \theta^{(\tau)T} x_i)\, x_i$, an O(p) update
• Sensitive to the choice of ρ(τ); typically ρ(τ) = 1/τ
• Also known as Least-Mean-Squares (LMS)
• Applies to streaming data with O(p) storage
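A matching SGD/LMS sketch (same synthetic setup as above; the iteration count is arbitrary), with the decaying step size ρ(τ) = 1/τ:

    import numpy as np

    rng = np.random.default_rng(7)
    n, p = 500, 3
    X = rng.normal(size=(n, p))
    Y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)

    theta = np.zeros(p)
    for tau in range(1, 20000):
        i = rng.integers(n)                                 # 1) pick a random i
        rho = 1.0 / tau                                     #    decaying learning rate
        theta = theta + rho * (Y[i] - theta @ X[i]) * X[i]  # 2) O(p) update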
Fitting Non-linear Data
• What if Y has a non-linear response? (figure: data exhibiting a non-linear trend in x)
• Can we still use a linear model?