
Bias-Variance Theory: Decompose Error Rate - PowerPoint PPT Presentation

Bias-Variance Theory. Decompose the error rate into components, some of which can be measured on unlabeled data.


  1. Bias-Variance Theory. Decompose the error rate into components, some of which can be measured on unlabeled data. Outline: Bias-Variance Decomposition for Regression; Bias-Variance Decomposition for Classification; Bias-Variance Analysis of Learning Algorithms; Effect of Bagging on Bias and Variance; Effect of Boosting on Bias and Variance; Summary and Conclusion.

  2. Bias-Variance Analysis in Regression. True function is y = f(x) + ε, where ε is normally distributed with zero mean and standard deviation σ. Given a set of training examples {(x_i, y_i)}, we fit an hypothesis h(x) = w · x + b to the data to minimize the squared error Σ_i [y_i − h(x_i)]².

  3. Example: 20 points, y = x + 2 sin(1.5x) + N(0, 0.2).
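A minimal Python sketch of this setup (the input range [0, 10] and the reading of 0.2 as the noise standard deviation are assumptions; the slides do not specify either):

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return x + 2.0 * np.sin(1.5 * x)

x = rng.uniform(0.0, 10.0, size=20)             # assumed input range
y = true_f(x) + rng.normal(0.0, 0.2, size=20)   # noise with sigma = 0.2 (assumed)

# Least-squares fit of the linear hypothesis h(x) = w*x + b
A = np.column_stack([x, np.ones_like(x)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"h(x) = {w:.3f} * x + {b:.3f}")
```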

  4. 50 fits (20 examples each).

  5. Bias-Variance Analysis. Now, given a new data point x* (with observed value y* = f(x*) + ε), we would like to understand the expected prediction error E[(y* − h(x*))²].

  6. Classical Statistical Analysis. Imagine that our particular training sample S is drawn from some population of possible training samples according to P(S). Compute E_P[(y* − h(x*))²] and decompose this into "bias", "variance", and "noise".

  7. Lemma. Let Z be a random variable with probability distribution P(Z), and let Z̄ = E_P[Z] be the average value of Z.
     Lemma: E[(Z − Z̄)²] = E[Z²] − Z̄².
     Proof: E[(Z − Z̄)²] = E[Z² − 2 Z Z̄ + Z̄²] = E[Z²] − 2 E[Z] Z̄ + Z̄² = E[Z²] − 2 Z̄² + Z̄² = E[Z²] − Z̄².
     Corollary: E[Z²] = E[(Z − Z̄)²] + Z̄².
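A quick numerical check of the lemma (illustrative only; the choice of distribution for Z is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.exponential(scale=2.0, size=1_000_000)  # any distribution works
Z_bar = Z.mean()

lhs = np.mean((Z - Z_bar) ** 2)                 # E[(Z - Zbar)^2]
rhs = np.mean(Z ** 2) - Z_bar ** 2              # E[Z^2] - Zbar^2
print(lhs, rhs)                                 # equal up to floating-point error
```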

  8. Bias-Variance-Noise Decomposition.
     E[(h(x*) − y*)²]
       = E[h(x*)² − 2 h(x*) y* + y*²]
       = E[h(x*)²] − 2 E[h(x*)] E[y*] + E[y*²]
       = E[(h(x*) − h̄(x*))²] + h̄(x*)²        (lemma, applied to h(x*))
         − 2 h̄(x*) f(x*)
         + E[(y* − f(x*))²] + f(x*)²          (lemma, applied to y*)
       = E[(h(x*) − h̄(x*))²]                  [variance]
         + (h̄(x*) − f(x*))²                   [bias²]
         + E[(y* − f(x*))²]                    [noise]

  9. Derivation (continued).
     E[(h(x*) − y*)²]
       = E[(h(x*) − h̄(x*))²] + (h̄(x*) − f(x*))² + E[(y* − f(x*))²]
       = Var(h(x*)) + Bias(h(x*))² + E[ε²]
       = Var(h(x*)) + Bias(h(x*))² + σ²
     Expected prediction error = Variance + Bias² + Noise.
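The decomposition can be checked by simulation. The sketch below repeatedly draws training sets for the running example, fits the linear hypothesis, and compares the directly estimated error at a test point with Variance + Bias² + Noise; the test point x* = 5.0, the input range, and the trial count are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, n, trials, x_star = 0.2, 20, 5000, 5.0

def true_f(x):
    return x + 2.0 * np.sin(1.5 * x)

def fit_linear(x, y):
    A = np.column_stack([x, np.ones_like(x)])
    return np.linalg.lstsq(A, y, rcond=None)[0]     # returns (w, b)

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0.0, 10.0, size=n)              # a fresh training set S
    y = true_f(x) + rng.normal(0.0, sigma, size=n)
    w, b = fit_linear(x, y)
    preds[t] = w * x_star + b                       # h(x*) for this training set

h_bar = preds.mean()                                # average prediction
variance = np.mean((preds - h_bar) ** 2)
bias2 = (h_bar - true_f(x_star)) ** 2
noise = sigma ** 2

# Direct Monte Carlo estimate of E[(y* - h(x*))^2] with fresh noisy observations y*
y_star = true_f(x_star) + rng.normal(0.0, sigma, size=trials)
direct = np.mean((y_star - preds) ** 2)
print(direct, variance + bias2 + noise)             # the two values should be close
```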

  10. Bias, Variance, and Noise. Variance: E[(h(x*) − h̄(x*))²] describes how much h(x*) varies from one training set S to another. Bias: [h̄(x*) − f(x*)] describes the average error of h(x*). Noise: E[(y* − f(x*))²] = E[ε²] = σ² describes how much y* varies from f(x*).

  11. 50 fits (20 examples each).

  12. Bias.

  13. Variance.

  14. Noise.

  15. 50 fits (20 examples each).

  16. Distribution of predictions at x = 2.0.

  17. 50 fits (20 examples each).

  18. Distribution of predictions at x = 5.0.

  19. Measuring Bias and Variance. In practice (unlike in theory), we have only ONE training set S. We can simulate multiple training sets by bootstrap replicates: S′ = {x | x is drawn at random with replacement from S}, with |S′| = |S|.

  20. Procedure for Measuring Bias and Variance. Construct B bootstrap replicates of S (e.g., B = 200): S_1, …, S_B. Apply the learning algorithm to each replicate S_b to obtain hypothesis h_b. Let T_b = S \ S_b be the data points that do not appear in S_b (the out-of-bag points). Compute the predicted value h_b(x) for each x in T_b.

  21. Estimating Bias and Variance (continued). For each data point x, we will now have the observed corresponding value y and several predictions y_1, …, y_K. Compute the average prediction h̄. Estimate bias as (h̄ − y). Estimate variance as Σ_k (y_k − h̄)² / (K − 1). Assume noise is 0.
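A sketch of the procedure from slides 20-21, using the linear fit from the regression example as the base learner; the function name, the default B, and the aggregation of per-point estimates into two summary numbers are our own choices, not prescribed by the slides:

```python
import numpy as np

def estimate_bias_variance(x, y, B=200, seed=0):
    """Bootstrap / out-of-bag estimates of bias and variance (noise assumed 0)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    preds = [[] for _ in range(n)]                  # out-of-bag predictions per point

    for _ in range(B):
        idx = rng.integers(0, n, size=n)            # bootstrap replicate S_b
        oob = np.setdiff1d(np.arange(n), idx)       # T_b = S \ S_b (out-of-bag points)
        A = np.column_stack([x[idx], np.ones(n)])   # base learner: h(x) = w*x + b
        w, b = np.linalg.lstsq(A, y[idx], rcond=None)[0]
        for i in oob:
            preds[i].append(w * x[i] + b)           # h_b(x) for each x in T_b

    bias, var = [], []
    for i in range(n):
        if len(preds[i]) < 2:                       # need at least 2 predictions
            continue
        p = np.asarray(preds[i])
        h_bar = p.mean()                            # average prediction
        bias.append(h_bar - y[i])                   # bias estimate for this point
        var.append(np.sum((p - h_bar) ** 2) / (len(p) - 1))
    # Averaging bias^2 and variance over points is one reasonable aggregation.
    return np.mean(np.square(bias)), np.mean(var)
```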

  22. Approximations in this Procedure. Bootstrap replicates are not real data. We ignore the noise: if we have multiple data points with the same x value, then we can estimate the noise; we can also estimate noise by pooling y values from nearby x values.

  23. Ensemble Learning Methods. Given a training sample S, generate multiple hypotheses h_1, h_2, …, h_L. Optionally, determine corresponding weights w_1, w_2, …, w_L. Classify new points according to Σ_l w_l h_l(x) > θ.
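A minimal sketch of the weighted-vote rule, assuming each hypothesis returns a label in {-1, +1}; the function name and default threshold are illustrative:

```python
def ensemble_classify(x, hypotheses, weights, theta=0.0):
    """Predict +1 iff the weighted vote sum_l w_l * h_l(x) exceeds theta, else -1."""
    score = sum(w * h(x) for h, w in zip(hypotheses, weights))
    return 1 if score > theta else -1
```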

  24. Bagging: Bootstrap Aggregating. For b = 1, …, B: let S_b be a bootstrap replicate of S, and apply the learning algorithm to S_b to learn h_b. Classify new points by unweighted vote: [Σ_b h_b(x)] / B > 0.
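A sketch of this loop under the same {-1, +1} label convention; `fit_learner` stands in for an arbitrary base learning algorithm (an assumption, not specified by the slides), and X, y are assumed to be NumPy arrays:

```python
import numpy as np

def bag(X, y, fit_learner, B=50, seed=0):
    """Train B hypotheses on bootstrap replicates; classify by unweighted vote."""
    rng = np.random.default_rng(seed)
    n = len(X)
    hypotheses = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)                  # S_b = bootstrap replicate of S
        hypotheses.append(fit_learner(X[idx], y[idx]))    # learn h_b from S_b

    def predict(x):
        avg = sum(h(x) for h in hypotheses) / B           # [sum_b h_b(x)] / B
        return 1 if avg > 0 else -1                       # unweighted vote
    return predict
```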

  25. Bagging. Bagging makes predictions according to y = Σ_b h_b(x) / B. Hence, bagging's predictions are h̄(x).
