
Applied Machine Learning: Regularization - PowerPoint PPT Presentation

Applied Machine Learning: Regularization. Siamak Ravanbakhsh, COMP 551 (winter 2020). Learning objectives: the basic idea of overfitting and underfitting.


  1. Ridge regression: we can set the derivative to zero.
     J(w) = ½ (Xw − y)⊤(Xw − y) + (λ/2) w⊤w
     ∇J(w) = X⊤(Xw − y) + λw = 0
     (X⊤X + λI) w = X⊤y
     w = (X⊤X + λI)⁻¹ X⊤y
     When using gradient descent, the λw term reduces the weights at each step (weight decay). The λI term is the only part that differs due to regularization, and it makes X⊤X + λI invertible: we can have linearly dependent features (e.g., D > N) and the solution will still be unique.
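As a quick illustration of the closed form above, here is a minimal NumPy sketch (the function name and toy data are assumptions for this example, not from the slides):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for the ridge weights."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Even with more features than examples (D > N), the system is invertible for lam > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))   # N = 5 examples, D = 10 features
y = rng.normal(size=5)
w = ridge_closed_form(X, y, lam=0.1)
print(w.shape)                 # (10,)
```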

  2. Example: polynomial bases ϕ_k(x) = x^k.
     Without regularization, using D = 10 we can perfectly fit the data (high test error).
     [Figure: fits of degree 2 (D = 3), degree 4 (D = 5), and degree 9 (D = 10).]
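A minimal sketch of this kind of experiment (the toy dataset below is assumed, not the one on the slides): unregularized least squares on polynomial features of increasing degree.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=10))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)   # assumed toy data

for degree in (2, 4, 9):
    Phi = np.vander(x, degree + 1, increasing=True)      # columns phi_k(x) = x^k, so D = degree + 1
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)           # plain least squares, no regularization
    train_mse = np.mean((Phi @ w - y) ** 2)
    print(degree, train_mse)    # degree 9 (D = 10) drives the training error to ~0
```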

  3. Example: polynomial bases ϕ_k(x) = x^k.
     With regularization: fixed D = 10, changing the amount of regularization.
     [Figure: fits for λ = 0, λ = 0.1, and λ = 10.]
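The corresponding ridge fits can be sketched as follows (again with assumed toy data): the same degree-9 features, varying λ.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, size=10))
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)   # assumed toy data
Phi = np.vander(x, 10, increasing=True)                  # degree 9, so D = 10

for lam in (0.0, 0.1, 10.0):
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    w = np.linalg.lstsq(A, Phi.T @ y, rcond=None)[0]     # ridge solution (lstsq tolerates lam = 0)
    print(lam, np.linalg.norm(w))                        # larger lambda shrinks ||w||
```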

  4. Data normalization: what if we scale the input features using different factors, x̃_d^(n) = γ_d x_d^(n) for all d, n?
     If we have no regularization, setting w̃_d = w_d / γ_d for all d leaves everything the same, because ∣∣X̃w̃ − y∣∣² = ∣∣Xw − y∣∣².
     With regularization, ∣∣w̃∣∣² ≠ ∣∣w∣∣², so the optimal w will be different: features with different mean and variance are penalized differently.
     Normalization: μ_d = (1/N) ∑_n x_d^(n),  σ_d² = (1/(N−1)) ∑_n (x_d^(n) − μ_d)²,  then x_d^(n) ← (x_d^(n) − μ_d) / σ_d makes sure all features have the same mean and variance.
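A minimal sketch of this standardization step (toy data assumed; scikit-learn's StandardScaler does essentially the same thing, modulo the N vs N−1 convention):

```python
import numpy as np

def standardize(X):
    """Subtract the per-feature mean and divide by the per-feature standard deviation."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)     # the (N - 1) convention from the slide
    return (X - mu) / sigma, mu, sigma

rng = np.random.default_rng(2)
X = rng.normal(loc=[0.0, 100.0], scale=[1.0, 50.0], size=(200, 2))  # very different scales
Xn, mu, sigma = standardize(X)
print(Xn.mean(axis=0).round(6), Xn.std(axis=0, ddof=1).round(6))    # ~[0 0] and [1 1]
```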

  5. Maximum likelihood: previously, linear regression & logistic regression maximize the log-likelihood.
     Linear regression:
     w* = arg max_w p(y ∣ w) = arg max_w ∏_{n=1}^N 𝒩(y^(n); w⊤ϕ(x^(n)), σ²) ≡ arg min_w ∑_n L2(y^(n), w⊤ϕ(x^(n)))
     Logistic regression:
     w* = arg max_w p(y ∣ x, w) = arg max_w ∏_{n=1}^N Bernoulli(y^(n); σ(w⊤ϕ(x^(n)))) ≡ arg min_w ∑_n L_CE(y^(n), σ(w⊤ϕ(x^(n))))
     Idea: maximize the posterior instead of the likelihood, p(w ∣ y) = p(w) p(y ∣ w) / p(y).
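As a small sanity check of the linear-regression equivalence (toy data and noise level assumed): minimizing the Gaussian negative log-likelihood gives the same weights as ordinary least squares.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100)

sigma2 = 0.3 ** 2
def neg_log_likelihood(w):
    # -sum_n log N(y_n; w^T x_n, sigma^2), dropping the additive constant
    return np.sum((y - X @ w) ** 2) / (2 * sigma2)

w_mle = minimize(neg_log_likelihood, np.zeros(3)).x
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(w_mle, w_ols, atol=1e-4))   # True: same minimizer
```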

  6. Maximum a Posteriori (MAP): use the Bayes rule and find the parameters with maximum posterior probability.
     p(w ∣ y) = p(w) p(y ∣ w) / p(y), where p(y) is the same for all choices of w (ignore).
     MAP estimate:
     w* = arg max_w p(w) p(y ∣ w) ≡ arg max_w log p(y ∣ w) + log p(w)
     Here log p(y ∣ w) is the likelihood (the original objective) and log p(w) is the prior.
     Even better would be to estimate the posterior distribution p(w ∣ y); more on this later in the course!

  7. Gaussian prior: Gaussian likelihood and Gaussian prior.
     w* = arg max_w p(w) p(y ∣ w) ≡ arg max_w log p(y ∣ w) + log p(w)
     ≡ arg max_w log 𝒩(y; w⊤x, σ²) + ∑_{d=1}^D log 𝒩(w_d; 0, τ²)   (assuming an independent Gaussian prior, one per weight)
     ≡ arg max_w −(1/(2σ²)) (y − w⊤x)² − (1/(2τ²)) ∑_{d=1}^D w_d²
     ≡ arg min_w ½ (y − w⊤x)² + (σ²/(2τ²)) ∑_{d=1}^D w_d²
     For multiple data points:
     ≡ arg min_w ½ ∑_n (y^(n) − w⊤x^(n))² + (λ/2) ∑_{d=1}^D w_d²,   with λ = σ²/τ²   (L2 regularization)
     L2 regularization is assuming a Gaussian prior on the weights; the same is true for logistic regression (or any other cost function).
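A small numerical check of this equivalence (toy data; the particular σ² and τ² are assumed): minimizing the negative log-posterior under an independent Gaussian prior recovers the ridge closed-form solution with λ = σ²/τ².

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + 0.5 * rng.normal(size=50)

sigma2, tau2 = 0.25, 0.5               # assumed noise and prior variances
lam = sigma2 / tau2

def neg_log_posterior(w):
    # -log p(y | w) - log p(w), dropping additive constants
    return np.sum((y - X @ w) ** 2) / (2 * sigma2) + np.sum(w ** 2) / (2 * tau2)

w_map = minimize(neg_log_posterior, np.zeros(4)).x
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(w_map, w_ridge, atol=1e-4))   # True
```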

  8. Laplace prior: another notable choice of prior is the Laplace distribution.
     p(w_d; β) = (1/(2β)) exp(−∣w_d∣/β)   (notice the peak around zero)
     Minimizing the negative log-prior gives, up to an additive constant, −log p(w) = (1/β) ∑_d ∣w_d∣ ∝ ∣∣w∣∣₁, the L1 norm of w.
     L1 regularization: J(w) ← J(w) + λ ∣∣w∣∣₁, also called lasso (least absolute shrinkage and selection operator).
     image: https://stats.stackexchange.com/questions/177210/why-is-laplace-prior-producing-sparse-solutions
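To see the sparsity this penalty induces in practice, here is a small sketch with scikit-learn (the data and the value of alpha are only illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -3.0, 1.5]                 # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)            # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)            # L2 penalty
print((lasso.coef_ == 0).sum())               # many coefficients exactly zero
print((ridge.coef_ == 0).sum())               # typically none are exactly zero
```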

  9. L1 vs L2 regularization: the regularization path shows how the weights {w_d} change as we change λ.
     Lasso produces sparse weights (many are exactly zero, rather than just small).
     [Figure: regularization paths for Lasso and Ridge regression; each w_d is plotted against a decreasing regularization coefficient λ, and the red line marks the optimal λ from cross-validation.]
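A path like the one in the figure can be traced with scikit-learn's lasso_path (toy data assumed):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

alphas, coefs, _ = lasso_path(X, y, n_alphas=20)
# coefs has shape (n_features, n_alphas): one column of weights per value of alpha
for a, c in zip(alphas[:5], coefs.T[:5]):
    print(round(a, 3), int((c != 0).sum()))   # fewer nonzero weights at larger alpha
```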

  10. L1 vs L2 regularization: for an appropriate choice of λ̃,
     min_w J(w) + λ ∣∣w∣∣_p^p   is equivalent to   min_w J(w) subject to ∣∣w∣∣_p^p ≤ λ̃.
     The figures show the constraint set and the isocontours of J(w) (any convex cost function) for ∣∣w∣∣₁ ≤ λ̃ and ∣∣w∣∣₂² ≤ λ̃, with w_MLE and w_MAP marked in the (w₁, w₂) plane. Because the L1 ball has corners on the axes, the optimal solution with L1 regularization is more likely to have zero components.

  11. Subset selection.
     p-norms with p ≥ 1 are convex (easier to optimize); p-norms with p ≤ 1 induce sparsity.
     [Figure: contours of the penalty ∑_d ∣w_d∣^p for several values of p; the smaller p is, the closer the penalty is to the L0 norm.]
     The L0 norm penalizes the number of non-zero features:
     J(w) + λ ∣∣w∣∣₀ = J(w) + λ ∑_d I(w_d ≠ 0)
     i.e. a penalty of λ for each feature that is used, which performs feature selection. Optimizing this is a difficult combinatorial problem: a search over all 2^D subsets. L1 regularization is a viable alternative to L0 regularization.
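For very small D the 2^D search is actually feasible; below is a sketch of exhaustive best-subset selection under an L0-style penalty (the data, the helper names, and the value of λ are all illustrative assumptions, not from the slides):

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
D = 6
X = rng.normal(size=(80, D))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.1 * rng.normal(size=80)

def subset_cost(S, lam=0.05):
    """Least-squares cost using only the features in S, plus a penalty of lam per feature used."""
    if not S:
        return np.mean(y ** 2)
    w, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
    return np.mean((X[:, S] @ w - y) ** 2) + lam * len(S)

# brute-force search over all 2^D subsets -- only feasible for tiny D
all_subsets = itertools.chain.from_iterable(
    itertools.combinations(range(D), k) for k in range(D + 1))
best = min(all_subsets, key=lambda S: subset_cost(list(S)))
print(best)   # likely (0, 2) for this toy data
```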

  12. Bias-variance decomposition (for L2 loss).
     Assume a true distribution p(x, y); the regression function is f(x) = E_p[y ∣ x].
     Assume a dataset D = {(x^(n), y^(n))}_n is sampled from p(x, y), and let f̂_D be our model based on that dataset.
     What we care about is the expected loss (aka risk), E[(f̂_D(x) − y)²], where the dataset D, x, and y are all random variables.
     Writing y = f(x) + ϵ, and adding and subtracting E_D[f̂_D(x)]:
     E[(f̂_D(x) − y)²] = E[(f̂_D(x) − E_D[f̂_D(x)] + E_D[f̂_D(x)] − y)²]
     = E_D[(f̂_D(x) − E_D[f̂_D(x)])²] + E[(f(x) − E_D[f̂_D(x)])²] + E[ϵ²]
     (variance + bias² + noise; the remaining cross terms evaluate to zero: check for yourself!)
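A simulation sketch of this decomposition (the data-generating process, model class, and query point are assumptions for illustration): repeatedly sample datasets, fit a model, and estimate the three terms at a fixed x₀.

```python
import numpy as np

rng = np.random.default_rng(8)
f = lambda x: np.sin(2 * np.pi * x)          # assumed true regression function f(x) = E[y | x]
noise_std = 0.2
x0 = 0.3                                     # query point

def fit_and_predict(degree=3, n=20):
    """Sample a dataset D from p(x, y), fit a polynomial, and return its prediction at x0."""
    x = rng.uniform(0, 1, size=n)
    y = f(x) + noise_std * rng.normal(size=n)
    return np.polyval(np.polyfit(x, y, degree), x0)

preds = np.array([fit_and_predict() for _ in range(2000)])
variance = preds.var()                        # E_D[(f_D(x0) - E_D[f_D(x0)])^2]
bias_sq = (preds.mean() - f(x0)) ** 2         # (f(x0) - E_D[f_D(x0)])^2
noise = noise_std ** 2                        # E[eps^2]
print(bias_sq, variance, noise)               # their sum approximates the expected L2 risk at x0
```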
