Regression and the Bias-Variance Decomposition
William Cohen
10-601, April 2008
Readings: Bishop 3.1, 3.2
Regression

• Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
  – Replace livesInSquirrelHill(x1, x2, …, xn) with averageCommuteDistanceInMiles(x1, x2, …, xn)
  – Replace userLikesMovie(u, m) with usersRatingForMovie(u, m)
  – …
Example: univariate linear regression

• Example: predict age from number of publications

[Scatter plot: Age in Years (0–50) vs. Number of Publications (0–160)]
Linear regression

• Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
• Training data: $(x_1, y_1), \ldots, (x_n, y_n)$
• Goal: estimate a, b with $\hat w = (\hat a, \hat b)$

$$
\begin{aligned}
\hat w &= \arg\max_w \Pr(w \mid D) \\
       &= \arg\max_w \Pr(D \mid w)\,\Pr(w) \\
       &= \arg\max_w \log \Pr(D \mid w) \qquad \text{(assume MLE, i.e. a flat prior on } w\text{)} \\
       &= \arg\max_w \sum_i \log \Pr(y_i \mid x_i, w) \\
       &= \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2, \qquad \hat\varepsilon_i(w) = y_i - (\hat a x_i + \hat b)
\end{aligned}
$$
Linear regression

• Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
• Training data: $(x_1, y_1), \ldots, (x_n, y_n)$
• Goal: estimate a, b with $\hat w = (\hat a, \hat b) = \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2$
• Ways to estimate the parameters:
  – Take the derivative with respect to the parameters a, b
  – Set it to zero and solve
  – Or use gradient descent to solve (see the sketch below)
  – Or …
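A minimal gradient-descent sketch for this estimation problem; the synthetic data, learning rate, and iteration count are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): y = 3x + 2 + Gaussian noise.
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 1.0, 100)

a, b, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    resid = y - (a * x + b)       # residuals eps_i(w) = y_i - (a*x_i + b)
    a += lr * np.mean(resid * x)  # step against the gradient of the mean squared error
    b += lr * np.mean(resid)

print(f"a = {a:.2f}, b = {b:.2f}")  # approaches a = 3, b = 2
```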
Linear regression

How to estimate the slope?

[Figure: fitted line through points (x1, y1), (x2, y2), … with residuals d1, d2, d3]

Any single point suggests a slope of $\frac{y_i - \bar y}{x_i - \bar x}$; least squares combines these, weighting each by $(x_i - \bar x)^2$:

$$\hat a = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{n \cdot \mathrm{cov}(X, Y)}{n \cdot \mathrm{var}(X)}$$
Linear regression

How to estimate the intercept?

[Figure: fitted line $\hat y = \hat a x + \hat b$ through the data]

Given the slope, choose the intercept so the line passes through $(\bar x, \bar y)$:

$$\hat b = \bar y - \hat a\,\bar x$$
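Putting the two formulas together, a small sketch of the closed-form fit (the helper name fit_univariate is mine, not the lecture's):

```python
import numpy as np

def fit_univariate(x, y):
    """Least squares: a_hat = cov(X,Y)/var(X); b_hat = y_bar - a_hat * x_bar."""
    x_bar, y_bar = x.mean(), y.mean()
    a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b_hat = y_bar - a_hat * x_bar
    return a_hat, b_hat
```

On the synthetic data above, fit_univariate(x, y) recovers essentially the same (a, b) as the gradient-descent loop, in a single step.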
Bias/Variance Decomposition of Error
Bias – Variance decomposition of error

• Return to the simple regression problem f: X → Y
  y = f(x) + ε, where f is deterministic and the noise is ε ~ N(0, σ)
• What is the expected error for a learned h?
Bias – Variance decomposition of error

$$E_D\!\left[ \iint \big( f(x) + \varepsilon - h_D(x) \big)^2 \,\Pr(\varepsilon)\,\Pr(x)\; d\varepsilon\, dx \right]$$

where $h_D$ is learned from the dataset D and f is the true function.

Experiment (the error of which I'd like to predict):
1. Draw a size-n sample D = (x1, y1), …, (xn, yn)
2. Train a linear function hD using D
3. Draw a test example (x, f(x) + ε)
4. Measure the squared error of hD on that example
Bias – Variance decomposition of error (2)

$$E_{D,\varepsilon}\!\left[ \big( f(x) + \varepsilon - h_D(x) \big)^2 \right]$$

where $h_D$ is learned from the dataset D and f is the true function.

Fix x, then do this experiment:
1. Draw a size-n sample D = (x1, y1), …, (xn, yn)
2. Train a linear function hD using D
3. Draw the test example (x, f(x) + ε)
4. Measure the squared error of hD on that example
Bias – Variance decomposition of error

Write $t = f(x) + \varepsilon$ for the test value and $\hat y = h_D(x)$ for the learner's prediction, and expand $E_{D,\varepsilon}[(t - \hat y)^2]$ by adding and subtracting $f$:

$$
\begin{aligned}
E\big[(t - \hat y)^2\big]
  &= E\big[\big((t - f) + (f - \hat y)\big)^2\big] \\
  &= E\big[(t - f)^2 + (f - \hat y)^2 + 2(t - f)(f - \hat y)\big] \\
  &= E\big[(t - f)^2 + (f - \hat y)^2 + 2(tf - t\hat y - f^2 + f\hat y)\big]
\end{aligned}
$$
Bias – Variance decomposition of error

$$
\begin{aligned}
E_{D,\varepsilon}\big[(t - \hat y)^2\big]
  &= E\big[(t - f)^2\big] + E\big[(f - \hat y)^2\big] + 2\big(E[tf] - E[t\hat y] - E[f^2] + E[f\hat y]\big) \\
  &= \underbrace{E[\varepsilon^2]}_{\text{intrinsic noise}} \;+\; \underbrace{E\big[(f - \hat y)^2\big]}_{\text{depends on how well the learner approximates } f} \;+\; 0
\end{aligned}
$$

The cross term vanishes: since $E[\varepsilon] = 0$ and $\varepsilon$ is independent of D, $E[tf] = E[(f + \varepsilon)f] = f^2$ and $E[t\hat y] = f\,E[\hat y] = E[f\hat y]$, so the four terms cancel in pairs.
Bias – Variance decomposition of error

Now expand the second term, $E[(f - \hat y)^2]$, around $\bar h = E_D\{h_D(x)\}$, again with $\hat y = h_D(x)$:

$$
\begin{aligned}
E\big[(f - \hat y)^2\big]
  &= E\big[\big((f - \bar h) + (\bar h - \hat y)\big)^2\big] \\
  &= E\big[(f - \bar h)^2 + (\bar h - \hat y)^2 + 2(f - \bar h)(\bar h - \hat y)\big] \\
  &= \underbrace{E\big[(f - \bar h)^2\big]}_{\text{BIAS}^2} + \underbrace{E\big[(\bar h - \hat y)^2\big]}_{\text{VARIANCE}} + 0
\end{aligned}
$$

The cross term again vanishes because $E_D[\hat y] = \bar h$.

BIAS²: the squared difference between the best possible prediction for x, f(x), and our "long-term" expectation for what the learner will do if averaged over many datasets D, $E_D[h_D(x)]$.

VARIANCE: the squared difference between our long-term expectation for the learner's performance, $E_D[h_D(x)]$, and what we expect in a representative run on a single dataset D ($\hat y$).
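A simulation sketch of the decomposition at a fixed test point. The true function, noise level, and sample size are assumptions for illustration; the loop repeats the draw-train-test experiment and checks that noise + bias² + variance matches the measured error:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function
sigma, n, trials, x0 = 0.3, 20, 5000, 0.5

preds = np.empty(trials)
errs = np.empty(trials)
for i in range(trials):
    xs = rng.uniform(0, 1, n)            # 1. draw a size-n sample D
    ys = f(xs) + rng.normal(0, sigma, n)
    a, b = np.polyfit(xs, ys, 1)         # 2. train a linear h_D on D
    preds[i] = a * x0 + b                # h_D(x0)
    t = f(x0) + rng.normal(0, sigma)     # 3. draw a test example at x0
    errs[i] = (t - preds[i]) ** 2        # 4. squared error of h_D

bias2 = (preds.mean() - f(x0)) ** 2      # (E_D[h_D(x0)] - f(x0))^2
variance = preds.var()                   # E_D[(h_D(x0) - E_D[h_D(x0)])^2]
print(f"noise + bias^2 + variance = {sigma**2 + bias2 + variance:.4f}")
print(f"measured expected error   = {errs.mean():.4f}")
```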
Bias-variance decomposition

• How can you reduce the bias of a learner? Make the long-term average $E_D[h_D(x)]$ better approximate the true function f(x).
• How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.
A generalization of bias-variance decomposition to other loss functions

• Consider an "arbitrary" real-valued loss L(t, y), with L(y, y') = L(y', y), L(y, y) = 0, and L(y, y') ≠ 0 if y ≠ y'
• Define the "optimal prediction": y* = argmin_{y'} E_t[L(t, y')]
• Define the "main prediction of the learner": y_m = y_{m,n} = argmin_{y'} E_D[L(y, y')], where y = h_D(x) and n = |D|
• Define the "bias of the learner": B(x) = L(y*, y_m)
• Define the "variance of the learner": V(x) = E_D[L(y_m, y)]
• Define the "noise for x": N(x) = E_t[L(t, y*)]

Claim: E_{D,t}[L(t, y)] = c1·N(x) + B(x) + c2·V(x), where for two-class zero-one loss c1 = 2·Pr_D[y = y*] − 1 and c2 = +1 if y_m = y*, −1 otherwise. A sketch under zero-one loss follows.
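For intuition, a small sketch under two-class zero-one loss. The per-dataset predictions are made-up stand-ins for classifiers trained on different samples D, and y* is an assumed optimal prediction:

```python
import numpy as np

# Hypothetical predictions h_D(x) at one fixed x, one per training set D.
preds = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
y_star = 1                              # assumed optimal prediction at x

vals, counts = np.unique(preds, return_counts=True)
y_m = vals[np.argmax(counts)]           # main prediction: the modal vote
bias = float(y_m != y_star)             # B(x) = L(y*, y_m)
variance = float(np.mean(preds != y_m)) # V(x) = E_D[L(y_m, y)]
print(f"y_m = {y_m}, B(x) = {bias}, V(x) = {variance:.2f}")
```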
Other regression methods
Example: univariate linear regression

• Example: predict age from number of publications
• Fitted line: $\hat y = \frac{1}{7} x + 26$
• Paul Erdős, Hungarian mathematician, 1913–1996: with x ≈ 1500 publications, the predicted age is about 240.

[Scatter plot: Age in Years (0–50) vs. Number of Publications (0–160), with the fitted line]
Linear regression

Summary:

$$\hat a = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat b = \bar y - \hat a\,\bar x$$

To simplify:
• assume zero-centered data, as we did for PCA
• let x = (x1, …, xn) and y = (y1, …, yn)
• then

$$\hat a = (x^T x)^{-1} x^T y, \qquad \hat b = 0$$
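As a sketch, the zero-centered shortcut in code (centering the data first is an assumption the formula requires; the helper name is mine):

```python
import numpy as np

def fit_centered(x, y):
    """Slope for zero-centered data: a_hat = (x^T x)^{-1} x^T y; b_hat = 0."""
    return np.dot(x, y) / np.dot(x, x)

# Usage: center, then fit.
# a_hat = fit_centered(x - x.mean(), y - y.mean())
```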
Onward: multivariate linear regression

Univariate (zero-centered): $\hat a = (x^T x)^{-1} x^T y$

Multivariate: each row of X is an example, each column is a feature:

$$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1^1 & \cdots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^k \end{pmatrix}$$

$$\hat y = \hat w^1 x^1 + \cdots + \hat w^k x^k, \qquad \hat w = (X^T X)^{-1} X^T \mathbf{y}$$

Equivalently, $\hat w = \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2$ with $\hat\varepsilon_i(w) = y_i - w^T x_i$.
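A sketch of the multivariate solve; np.linalg.solve is applied to the normal equations rather than forming an explicit inverse, a standard numerical-stability choice (the helper name is mine):

```python
import numpy as np

def fit_linear(X, y):
    """Solve (X^T X) w = X^T y, i.e. w_hat = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Note: append a column of ones to X if an intercept term is wanted.
```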
Onward: multivariate linear regression, regularized

Ordinary least squares:

$$\hat w = \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2, \quad \hat\varepsilon_i(w) = y_i - w^T x_i \quad\Rightarrow\quad \hat w = (X^T X)^{-1} X^T \mathbf{y}$$

Regularized (ridge):

$$\hat w = \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2 + \lambda\, w^T w \quad\Rightarrow\quad \hat w = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$
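The regularized estimate changes only the matrix being solved; a sketch (lam and the helper name are mine):

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Solve (lam*I + X^T X) w = X^T y, i.e. w_hat = (lam*I + X^T X)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)
```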
Onward: multivariate linear regression

Multivariate, multiple outputs:

$$Y = \begin{pmatrix} y_1^1 & \cdots & y_1^m \\ \vdots & & \vdots \\ y_n^1 & \cdots & y_n^m \end{pmatrix}, \qquad X = \begin{pmatrix} x_1^1 & \cdots & x_1^k \\ \vdots & & \vdots \\ x_n^1 & \cdots & x_n^k \end{pmatrix}$$

$$\hat y = \hat W x, \qquad \hat W = (X^T X)^{-1} X^T Y \qquad \text{(cf. } \hat w = (X^T X)^{-1} X^T \mathbf{y} \text{ for a single output)}$$
Onward: multivariate linear regression, regularized

$$\hat w = \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2, \quad \hat\varepsilon_i(w) = y_i - w^T x_i \quad\Rightarrow\quad \hat w = (X^T X)^{-1} X^T \mathbf{y}$$

$$\hat w = \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2 + \lambda\, w^T w \quad\Rightarrow\quad \hat w = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$

What does increasing λ do? (See the sketch below.)
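To see the effect empirically, a sketch on synthetic data (dimensions, true weights, and noise level are assumptions): as λ grows, the weights shrink toward zero, which lowers variance at the cost of added bias.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 50)

for lam in (0.0, 1.0, 10.0, 100.0):
    w = np.linalg.solve(lam * np.eye(3) + X.T @ X, X.T @ y)
    print(f"lambda = {lam:6.1f}  w = {np.round(w, 3)}  ||w|| = {np.linalg.norm(w):.3f}")
```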
Onward: multivariate linear regression, regularized

With w = (w1, w2):

$$\hat w = \arg\min_w \sum_i [\hat\varepsilon_i(w)]^2 + \lambda\, w^T w \quad\Rightarrow\quad \hat w = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$$

What does fixing w2 = 0 do (if λ = 0)?