Regression and the Bias-Variance Decomposition


  1. Regression and the Bias-Variance Decomposition
  William Cohen, 10-601, April 2008
  Readings: Bishop 3.1, 3.2

  2. Regression
  • Technically: learning a function f(x) = y where y is real-valued, rather than discrete.
    – Replace livesInSquirrelHill(x1, x2, …, xn) with averageCommuteDistanceInMiles(x1, x2, …, xn)
    – Replace userLikesMovie(u, m) with usersRatingForMovie(u, m)
    – …

  3. Example: univariate linear regression
  • Example: predict age from number of publications
  [Scatter plot: Age in Years (0-50) vs. Number of Publications (0-160)]

  4. Linear regression
  • Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
  • Training data: $(x_1, y_1), \dots, (x_n, y_n)$
  • Goal: estimate $a, b$ with $w = (\hat a, \hat b)$

  $$\hat w = \arg\max_w \Pr(w \mid D) = \arg\max_w \Pr(D \mid w)\,\Pr(w) = \arg\max_w \log \Pr(D \mid w) \quad \text{(assume MLE, i.e. ignore the prior)}$$
  $$= \arg\max_w \sum_i \log \Pr(y_i \mid x_i, w) = \arg\min_w \sum_i [\varepsilon_i(w)]^2, \qquad \text{where } \varepsilon_i(w) = y_i - (\hat a x_i + \hat b)$$

  5. Linear regression
  • Model: $y_i = a x_i + b + \varepsilon_i$ where $\varepsilon_i \sim N(0, \sigma)$
  • Training data: $(x_1, y_1), \dots, (x_n, y_n)$
  • Goal: estimate $a, b$ with $w = (\hat a, \hat b) = \arg\min_w \sum_i [\varepsilon_i(w)]^2$
  • Ways to estimate parameters:
    – Find the derivative with respect to the parameters a, b
    – Set it to zero and solve
    – Or use gradient ascent to solve
    – Or …
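The gradient-based route in the last bullet can be sketched in a few lines. This is not from the slides; it is a minimal illustration assuming NumPy, with made-up toy data and an illustrative learning rate (the inputs are kept in [0, 1] so a fixed-step gradient method is stable).

```python
import numpy as np

def fit_univariate_gd(x, y, lr=0.5, n_steps=2000):
    """Minimize the mean squared error of y ~ a*x + b by gradient steps."""
    a, b = 0.0, 0.0
    for _ in range(n_steps):
        resid = y - (a * x + b)                 # epsilon_i(w) = y_i - (a x_i + b)
        a -= lr * (-2.0 * np.mean(resid * x))   # d(MSE)/da
        b -= lr * (-2.0 * np.mean(resid))       # d(MSE)/db
    return a, b

# toy data (illustrative, not the publications/age data): y = 2x + 1 + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 200)
print(fit_univariate_gd(x, y))                  # roughly (2.0, 1.0)
```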

  6. Linear regression
  How to estimate the slope?
  [Figure: data points $(x_1, y_1), (x_2, y_2), \dots$ with deviations $d_1, d_2, d_3$ around the fitted line]

  $$\text{slope} \approx \frac{\Delta y}{\Delta x} \approx \frac{(y_1 - \bar y) + (y_2 - \bar y) + \dots}{(x_1 - \bar x) + (x_2 - \bar x) + \dots}$$

  $$\hat a = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2} = \frac{n \cdot \operatorname{cov}(X, Y)}{n \cdot \operatorname{var}(X)}$$

  7. Linear regression
  How to estimate the intercept?
  [Figure: fitted line through the data points]

  $$\bar y = \hat a \bar x + \hat b \quad\Rightarrow\quad \hat b = \bar y - \hat a \bar x$$
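The closed-form estimates on slides 6 and 7 translate directly to code. A minimal sketch, assuming NumPy; the data below is synthetic and only loosely mimics the publications-vs-age example.

```python
import numpy as np

def fit_univariate_ols(x, y):
    """a_hat = sum((x-xbar)*(y-ybar)) / sum((x-xbar)^2); b_hat = ybar - a_hat*xbar."""
    x_bar, y_bar = x.mean(), y.mean()
    a_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b_hat = y_bar - a_hat * x_bar
    return a_hat, b_hat

rng = np.random.default_rng(0)
pubs = rng.uniform(0, 150, 100)                   # "number of publications" (synthetic)
age = 0.15 * pubs + 25 + rng.normal(0, 3, 100)    # "age in years", with noise
print(fit_univariate_ols(pubs, age))              # roughly (0.15, 25)
```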

  8. Bias/Variance Decomposition of Error

  9. Bias-Variance decomposition of error
  • Return to the simple regression problem $f: X \to Y$ with $y = f(x) + \varepsilon$, where $f$ is deterministic and the noise $\varepsilon \sim N(0, \sigma)$.
  • What is the expected error for a learned $h$?

  10. Bias-Variance decomposition of error

  $$E_D\!\left[\iint \big(f(x) + \varepsilon - h_D(x)\big)^2 \,\Pr(\varepsilon)\,\Pr(x)\; d\varepsilon\, dx\right]$$

  where $f$ is the true function and $h_D$ is the hypothesis learned from dataset $D$.

  Experiment (the error of which I'd like to predict):
  1. Draw a size-n sample $D = (x_1, y_1), \dots, (x_n, y_n)$
  2. Train a linear function $h_D$ using $D$
  3. Draw a test example $(x, f(x) + \varepsilon)$
  4. Measure the squared error of $h_D$ on that example
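The four-step experiment can be simulated directly; averaging the squared error over many repetitions approximates the expectation above. A sketch assuming NumPy, with an assumed true function f and noise level (both are illustrative choices, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 1.0
f = lambda x: 2.0 * x + 1.0          # assumed "true" deterministic function

def one_run(n=20):
    # 1. draw a size-n sample D
    xs = rng.uniform(0, 1, n)
    ys = f(xs) + rng.normal(0, SIGMA, n)
    # 2. train a linear function h_D using D (closed-form univariate OLS)
    a = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b = ys.mean() - a * xs.mean()
    # 3. draw a test example (x, f(x) + eps)
    x = rng.uniform(0, 1)
    t = f(x) + rng.normal(0, SIGMA)
    # 4. measure squared error of h_D on that example
    return (t - (a * x + b)) ** 2

print(np.mean([one_run() for _ in range(10_000)]))   # Monte Carlo estimate of the expected error
```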

  11. Bias-Variance decomposition of error (2)

  $$E_{D,\varepsilon}\!\left[\big(f(x) + \varepsilon - h_D(x)\big)^2\right]$$

  where $f$ is the true function and $h_D$ is learned from dataset $D$.

  Fix $x$, then do this experiment:
  1. Draw a size-n sample $D = (x_1, y_1), \dots, (x_n, y_n)$
  2. Train a linear function $h_D$ using $D$
  3. Draw the test example $(x, f(x) + \varepsilon)$
  4. Measure the squared error of $h_D$ on that example

  12. Bias-Variance decomposition of error
  Relabel the pieces of $E_{D,\varepsilon}\big[(f(x)+\varepsilon - h_D(x))^2\big]$: write $t = f(x) + \varepsilon$ (the observed test value) and $\hat y$ (really $\hat y_D$) $= h_D(x)$ (the learner's prediction). Then:

  $$E_{D,\varepsilon}\big[(t - \hat y)^2\big] = E\big[\big([t - f] + [f - \hat y]\big)^2\big] \qquad \text{(why not?)}$$
  $$= E\big[[t - f]^2 + [f - \hat y]^2 + 2[t - f][f - \hat y]\big]$$
  $$= E\big[[t - f]^2 + [f - \hat y]^2 + 2\,(tf - t\hat y - f^2 + f\hat y)\big]$$

  13. Bias-Variance decomposition of error

  $$E_{D,\varepsilon}\big[(t - \hat y)^2\big] = E\big[\big([t - f] + [f - \hat y]\big)^2\big]$$
  $$= E\big[[t - f]^2 + [f - \hat y]^2 + 2[t - f][f - \hat y]\big]$$
  $$= E\big[[t - f]^2 + [f - \hat y]^2 + 2(tf - t\hat y - f^2 + f\hat y)\big]$$
  $$= E\big[\varepsilon^2\big] + E\big[(f - \hat y)^2\big] + 2\big(E[tf] - E[t\hat y] - E[f^2] + E[f\hat y]\big)$$

  The cross term vanishes: $t - f = \varepsilon$ has mean zero and is independent of $\hat y$, and $f$ is deterministic at the fixed $x$, so $E[tf] = f^2$, $E[t\hat y] = f\,E[\hat y]$, $E[f^2] = f^2$, and $E[f\hat y] = f\,E[\hat y]$. That leaves

  $$E_{D,\varepsilon}\big[(t - \hat y)^2\big] = \underbrace{E[\varepsilon^2]}_{\text{intrinsic noise}} \;+\; \underbrace{E\big[(f - \hat y)^2\big]}_{\text{depends on how well the learner approximates } f}$$

  14. Bias-Variance decomposition of error
  Now decompose $E\big[(f - \hat y)^2\big]$. Let $\bar h = E_D\{h_D(x)\}$ and recall $\hat y = \hat y_D = h_D(x)$:

  $$E\big[(f - \hat y)^2\big] = E\big[\big([f - \bar h] + [\bar h - \hat y]\big)^2\big]$$
  $$= E\big[[f - \bar h]^2 + [\bar h - \hat y]^2 + 2[f - \bar h][\bar h - \hat y]\big]$$
  $$= E\big[[f - \bar h]^2 + [\bar h - \hat y]^2 + 2(f\bar h - f\hat y - \bar h^2 + \bar h\hat y)\big]$$
  $$= E\big[(f - \bar h)^2\big] + E\big[(\bar h - \hat y)^2\big] + 2\big(E[f\bar h] - E[f\hat y] - E[\bar h^2] + E[\bar h\hat y]\big)$$

  Again the cross term is zero (since $E_D[\hat y] = \bar h$), leaving two pieces:

  BIAS²: $E\big[(f - \bar h)^2\big]$, the squared difference between the best possible prediction for $x$, $f(x)$, and our "long-term" expectation for what the learner will do if we averaged over many datasets $D$, $E_D[h_D(x)]$.

  VARIANCE: $E\big[(\bar h - \hat y)^2\big]$, the squared difference between our long-term expectation for the learner's performance, $E_D[h_D(x)]$, and what we expect in a representative run on a dataset $D$ ($\hat y$).
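The whole decomposition (noise + bias² + variance) can be checked numerically at a fixed x by retraining on many fresh datasets. A sketch assuming NumPy; the true function, noise level, and sample size are assumptions chosen so a linear learner has visible bias.

```python
import numpy as np

rng = np.random.default_rng(1)
SIGMA = 1.0
f = lambda x: np.sin(3.0 * x)        # assumed true function; nonlinear, so a line is biased
x0 = 0.8                             # fix x, as on slide 11

def h_D_at_x0(n=20):
    """Train a line on a fresh dataset D and return its prediction h_D(x0)."""
    xs = rng.uniform(0, 1, n)
    ys = f(xs) + rng.normal(0, SIGMA, n)
    a = np.sum((xs - xs.mean()) * (ys - ys.mean())) / np.sum((xs - xs.mean()) ** 2)
    b = ys.mean() - a * xs.mean()
    return a * x0 + b

preds = np.array([h_D_at_x0() for _ in range(20_000)])   # h_D(x0) across many datasets D
h_bar = preds.mean()                                      # E_D[h_D(x0)]

noise = SIGMA ** 2                                        # E[(t - f)^2]
bias2 = (f(x0) - h_bar) ** 2                              # (f(x0) - E_D[h_D(x0)])^2
variance = preds.var()                                    # E_D[(h_D(x0) - h_bar)^2]

# direct Monte Carlo estimate of E_{D,eps}[(t - h_D(x0))^2]
ts = f(x0) + rng.normal(0, SIGMA, preds.size)
direct = np.mean((ts - preds) ** 2)
print(direct, noise + bias2 + variance)                   # the two totals should nearly match
```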

  15. Bias-variance decomposition
  • How can you reduce the bias of a learner? Make the long-term average $E_D[h_D(x)]$ better approximate the true function $f(x)$.
  • How can you reduce the variance of a learner? Make the learner less sensitive to variations in the data.

  16. A generalization of bias-variance decomposition to other loss functions
  • "Arbitrary" real-valued loss $L(t, y)$, with $L(y, y') = L(y', y)$, $L(y, y) = 0$, and $L(y, y') \neq 0$ if $y \neq y'$
  • Define the "optimal prediction": $y^* = \arg\min_{y'} E_t[L(t, y')]$
  • Define the "main prediction of the learner": $y_m = y_{m,D} = \arg\min_{y'} E_D\{L(t, y')\}$
  • Define the "bias of the learner": $B(x) = L(y^*, y_m)$
  • Define the "variance of the learner": $V(x) = E_D[L(y_m, y)]$
  • Define the "noise for $x$": $N(x) = E_t[L(t, y^*)]$

  Claim: $E_{D,t}[L(t, y)] = c_1 N(x) + B(x) + c_2 V(x)$, where $c_1 = 2\Pr_D[y = y^*] - 1$ and $c_2 = 1$ if $y_m = y^*$, $-1$ otherwise.

  17. Other regression methods

  18. Example: univariate linear regression
  • Example: predict age from number of publications
  • Fitted line: $\hat y = \frac{1}{7}x + 26$
  • Paul Erdős, Hungarian mathematician, 1913-1996: $x \approx 1500$ publications, so the predicted age is about 240
  [Scatter plot: Age in Years (0-50) vs. Number of Publications (0-160), with the fitted line]

  19. Linear regression
  Summary:
  [Figure: fitted line with deviations $d_1, d_2, d_3$, as on slides 6-7]

  $$\hat a = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{\sum_i (x_i - \bar x)^2}, \qquad \hat b = \bar y - \hat a \bar x$$

  To simplify:
  • assume zero-centered data, as we did for PCA
  • let $\mathbf x = (x_1, \dots, x_n)$ and $\mathbf y = (y_1, \dots, y_n)$
  • then

  $$\hat a = (\mathbf x^T \mathbf x)^{-1}\, \mathbf x^T \mathbf y, \qquad \hat b = 0$$

  20. Onward: multivariate linear regression

  Univariate (as above):
  $$\hat w = (\mathbf x^T \mathbf x)^{-1}\, \mathbf x^T \mathbf y$$

  Multivariate: each row of $X$ is an example, each column is a feature:
  $$\mathbf y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad X = \begin{pmatrix} x_1^1, \dots, x_1^k \\ \vdots \\ x_n^1, \dots, x_n^k \end{pmatrix}$$

  $$\hat y = \hat w^1 x^1 + \dots + \hat w^k x^k = \hat{\mathbf w}^T \mathbf x$$

  $$\hat{\mathbf w} = \arg\min_{\mathbf w} \sum_i [\varepsilon_i(\mathbf w)]^2, \quad \varepsilon_i(\mathbf w) = y_i - \mathbf w^T \mathbf x_i, \qquad \hat{\mathbf w} = (X^T X)^{-1} X^T \mathbf y$$
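A minimal sketch of the multivariate estimator $\hat{\mathbf w} = (X^T X)^{-1} X^T \mathbf y$, assuming NumPy and synthetic data; `np.linalg.solve` is used instead of forming the inverse explicitly.

```python
import numpy as np

def fit_multivariate_ols(X, y):
    """w_hat = (X^T X)^{-1} X^T y, with rows of X as examples and columns as features."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
w_true = np.array([1.5, -2.0, 0.5])            # synthetic "true" weights
y = X @ w_true + rng.normal(0, 0.1, n)
print(fit_multivariate_ols(X, y))              # close to w_true
```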

  21. Onward: regularized multivariate linear regression
  With $X$ and $\mathbf y$ as before and $\hat y = \hat w^1 x^1 + \dots + \hat w^k x^k$:

  Unregularized:
  $$\hat{\mathbf w} = \arg\min_{\mathbf w} \sum_i [\varepsilon_i(\mathbf w)]^2, \quad \varepsilon_i(\mathbf w) = y_i - \mathbf w^T \mathbf x_i, \qquad \hat{\mathbf w} = (X^T X)^{-1} X^T \mathbf y$$

  Regularized:
  $$\hat{\mathbf w} = \arg\min_{\mathbf w} \sum_i [\varepsilon_i(\mathbf w)]^2 + \lambda\, \mathbf w^T \mathbf w, \qquad \hat{\mathbf w} = (\lambda I + X^T X)^{-1} X^T \mathbf y$$

  22. Onward: multivariate linear regression
  Multivariate, multiple outputs: each $y_i$ is now a vector $(y_i^1, \dots, y_i^m)$, so

  $$Y = \begin{pmatrix} y_1^1, \dots, y_1^m \\ \vdots \\ y_n^1, \dots, y_n^m \end{pmatrix}, \qquad X = \begin{pmatrix} x_1^1, \dots, x_1^k \\ \vdots \\ x_n^1, \dots, x_n^k \end{pmatrix}$$

  $$\hat{\mathbf y} = \hat W \mathbf x, \qquad \hat W = (X^T X)^{-1} X^T Y$$

  (compare the single-output case $\hat{\mathbf w} = (X^T X)^{-1} X^T \mathbf y$)
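The multiple-output case is the same normal-equations solve applied to a matrix of targets, one column per output. A sketch assuming NumPy; here the estimated weights are stored as a k-by-m matrix so that predictions are X times the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, m = 200, 3, 2
X = rng.normal(size=(n, k))
W_true = rng.normal(size=(k, m))                 # synthetic weights: k features -> m outputs
Y = X @ W_true + rng.normal(0, 0.1, (n, m))

W_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y, one solve for all outputs
print(np.abs(W_hat - W_true).max())              # small, since the noise is small
```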

  23. Onward: regularized multivariate linear regression
  As on slide 21:

  $$\hat{\mathbf w} = \arg\min_{\mathbf w} \sum_i [\varepsilon_i(\mathbf w)]^2 + \lambda\, \mathbf w^T \mathbf w, \qquad \hat{\mathbf w} = (\lambda I + X^T X)^{-1} X^T \mathbf y$$

  versus the unregularized $\hat{\mathbf w} = (X^T X)^{-1} X^T \mathbf y$.

  What does increasing λ do?
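One way to see what increasing λ does is to refit with several values and watch the weight vector. A sketch assuming NumPy with synthetic data; as λ grows, the weights shrink toward zero (more bias, less variance).

```python
import numpy as np

def fit_ridge(X, y, lam):
    """w_hat = (lam*I + X^T X)^{-1} X^T y."""
    k = X.shape[1]
    return np.linalg.solve(lam * np.eye(k) + X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.5, 3.0]) + rng.normal(0, 0.5, 50)   # synthetic targets

for lam in (0.0, 1.0, 10.0, 100.0):
    w = fit_ridge(X, y, lam)
    print(f"lambda={lam:6.1f}  ||w||={np.linalg.norm(w):.3f}  w={np.round(w, 2)}")
# the norm of w decreases as lambda increases
```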

  24. Onward: regularized multivariate linear regression
  Now each row of $X$ is just $(1, x_i)$, so $\mathbf w = (w_1, w_2)$:

  $$\hat{\mathbf w} = \arg\min_{\mathbf w} \sum_i [\varepsilon_i(\mathbf w)]^2 + \lambda\, \mathbf w^T \mathbf w, \quad \varepsilon_i(\mathbf w) = y_i - \mathbf w^T \mathbf x_i, \qquad \hat{\mathbf w} = (\lambda I + X^T X)^{-1} X^T \mathbf y$$

  What does fixing $w_2 = 0$ do (if λ = 0)?
