  1. Evaluating Hypotheses
     IEEE Expert, October 1996

  2. Evaluating Hypotheses
     • Sample error, true error
     • Confidence intervals for observed hypothesis error
     • Estimators
     • Binomial distribution, Normal distribution, Central Limit Theorem
     • Paired t tests
     • Comparing learning methods

  3. Evaluating Hypotheses and Learners
     Consider hypotheses $H_1$ and $H_2$ learned by learners $L_1$ and $L_2$.
     • How do we learn a hypothesis $H$ and estimate its accuracy with limited data?
     • How well does the observed accuracy of $H$ over a limited sample estimate its accuracy over unseen data?
     • If $H_1$ outperforms $H_2$ on the sample, will $H_1$ outperform $H_2$ in general?
     • Does the same conclusion hold for $L_1$ and $L_2$?

  4. Two Definitions of Error
     The true error of hypothesis $h$ with respect to target function $f$ and distribution $\mathcal{D}$ is the probability that $h$ will misclassify an instance drawn at random according to $\mathcal{D}$:
       $error_{\mathcal{D}}(h) \equiv \Pr_{x \in \mathcal{D}}[f(x) \neq h(x)]$
     The sample error of $h$ with respect to target function $f$ and data sample $S$ is the proportion of examples $h$ misclassifies:
       $error_S(h) \equiv \frac{1}{n} \sum_{x \in S} \delta(f(x) \neq h(x))$
     where $\delta(f(x) \neq h(x))$ is 1 if $f(x) \neq h(x)$, and 0 otherwise.
     How well does $error_S(h)$ estimate $error_{\mathcal{D}}(h)$?
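As a concrete illustration (my own sketch, not from the slides), here is the sample-error definition in Python; the target $f$, hypothesis $h$, and sample $S$ below are hypothetical stand-ins:

```python
def sample_error(h, f, S):
    """Fraction of examples in S that h misclassifies relative to f."""
    return sum(1 for x in S if h(x) != f(x)) / len(S)

# Hypothetical example: h misclassifies 12 of 40 examples.
S = list(range(40))
f = lambda x: x % 2                                 # made-up target function
h = lambda x: (x % 2) if x >= 12 else 1 - (x % 2)   # wrong on the first 12
print(sample_error(h, f, S))                        # 0.3
```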

  5. Problems Estimating Error
     1. Bias: if $S$ is the training set, $error_S(h)$ is optimistically biased:
          $bias \equiv E[error_S(h)] - error_{\mathcal{D}}(h)$
        For an unbiased estimate, $h$ and $S$ must be chosen independently.
     2. Variance: even with an unbiased $S$, $error_S(h)$ may still vary from $error_{\mathcal{D}}(h)$.

  6. Example
     Hypothesis $h$ misclassifies 12 of the 40 examples in $S$:
       $error_S(h) = \frac{12}{40} = 0.30$
     What is $error_{\mathcal{D}}(h)$?

  7. Estimators
     Experiment:
     1. Choose sample $S$ of size $n$ according to distribution $\mathcal{D}$.
     2. Measure $error_S(h)$.
     $error_S(h)$ is a random variable (i.e., the result of an experiment), and it is an unbiased estimator for $error_{\mathcal{D}}(h)$.
     Given an observed $error_S(h)$, what can we conclude about $error_{\mathcal{D}}(h)$?

  8. Confidence Intervals
     If
     • $S$ contains $n$ examples, drawn independently of $h$ and of each other
     • $n \geq 30$
     then with approximately 95% probability, $error_{\mathcal{D}}(h)$ lies in the interval
       $error_S(h) \pm 1.96 \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$

  9. Confidence Intervals
     If
     • $S$ contains $n$ examples, drawn independently of $h$ and of each other
     • $n \geq 30$
     then with approximately N% probability, $error_{\mathcal{D}}(h)$ lies in the interval
       $error_S(h) \pm z_N \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$
     where
       N%:   50%   68%   80%   90%   95%   98%   99%
       z_N:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
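A minimal sketch (my addition) of this interval computation, using the $z_N$ table from the slide:

```python
import math

Z = {50: 0.67, 68: 1.00, 80: 1.28, 90: 1.64, 95: 1.96, 98: 2.33, 99: 2.58}

def confidence_interval(error_s, n, N=95):
    """N% interval: error_S(h) +/- z_N * sqrt(error_S(h)(1-error_S(h))/n)."""
    half_width = Z[N] * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half_width, error_s + half_width

# Slide 6's example: 12 of 40 misclassified, 95% confidence.
print(confidence_interval(12 / 40, 40))   # roughly (0.158, 0.442)
```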

  10. $error_S(h)$ is a Random Variable
      Rerun the experiment with a different randomly drawn $S$ (of size $n$). The probability of observing $r$ misclassified examples is
        $P(r) = \frac{n!}{r!(n-r)!} \, error_{\mathcal{D}}(h)^r \, (1 - error_{\mathcal{D}}(h))^{n-r}$
      [Plot: Binomial distribution for $n = 40$, $p = 0.3$]

  11. Binomial Probability Distribution
      $P(r) = \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r}$
      is the probability $P(r)$ of $r$ heads in $n$ coin flips, if $p = \Pr(heads)$.
      • Expected, or mean, value of $X$: $E[X] \equiv \sum_{i=0}^{n} i P(i) = np$
      • Variance of $X$: $Var(X) \equiv E[(X - E[X])^2] = np(1-p)$
      • Standard deviation of $X$: $\sigma_X \equiv \sqrt{E[(X - E[X])^2]} = \sqrt{np(1-p)}$
      [Plot: Binomial distribution for $n = 40$, $p = 0.3$]
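A short sketch (my addition) of these Binomial quantities for the slide's plotted case, $n = 40$, $p = 0.3$:

```python
from math import comb

n, p = 40, 0.3

def binom_pmf(r):
    """P(r) = C(n, r) * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

print(round(binom_pmf(12), 4))   # ~0.1366, the peak of the plot above
print(n * p)                     # mean: 12.0
print(n * p * (1 - p))           # variance: 8.4
```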

  12. Normal Distribution Approximates Binomial
      $error_S(h)$ follows a Binomial distribution, with
      • mean $\mu_{error_S(h)} = error_{\mathcal{D}}(h)$
      • standard deviation $\sigma_{error_S(h)} = \sqrt{\frac{error_{\mathcal{D}}(h)(1 - error_{\mathcal{D}}(h))}{n}}$
      Approximate this by a Normal distribution with
      • mean $\mu_{error_S(h)} = error_{\mathcal{D}}(h)$
      • standard deviation $\sigma_{error_S(h)} \approx \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$

  13. Normal Probability Distribution
      $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$
      The probability that $X$ will fall into the interval $(a, b)$ is given by $\int_a^b p(x)\,dx$.
      • Expected, or mean, value of $X$: $E[X] = \mu$
      • Variance of $X$: $Var(X) = \sigma^2$
      • Standard deviation of $X$: $\sigma_X = \sigma$
      [Plot: Normal distribution with mean 0, standard deviation 1]
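A minimal sketch (my addition) of the density defined above:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """p(x) = 1/sqrt(2*pi*sigma^2) * exp(-((x - mu)/sigma)^2 / 2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma**2)

print(round(normal_pdf(0.0), 4))   # 0.3989, the peak of the plot above
```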

  14. Normal Probability Distribution
      80% of the area (probability) lies in $\mu \pm 1.28\sigma$.
      N% of the area (probability) lies in $\mu \pm z_N \sigma$, where
        N%:   50%   68%   80%   90%   95%   98%   99%
        z_N:  0.67  1.00  1.28  1.64  1.96  2.33  2.58
      [Plot: standard Normal distribution, mean 0, standard deviation 1]

  15. Confidence Intervals, More Correctly
      If
      • $S$ contains $n$ examples, drawn independently of $h$ and of each other
      • $n \geq 30$
      then with approximately 95% probability, $error_S(h)$ lies in the interval
        $error_{\mathcal{D}}(h) \pm 1.96 \sqrt{\frac{error_{\mathcal{D}}(h)(1 - error_{\mathcal{D}}(h))}{n}}$
      Equivalently, $error_{\mathcal{D}}(h)$ lies in the interval
        $error_S(h) \pm 1.96 \sqrt{\frac{error_{\mathcal{D}}(h)(1 - error_{\mathcal{D}}(h))}{n}}$
      which is approximately
        $error_S(h) \pm 1.96 \sqrt{\frac{error_S(h)(1 - error_S(h))}{n}}$

  16. Two-Sided and One-Sided Bounds
      • If $\mu - z_N\sigma \leq y \leq \mu + z_N\sigma$ with confidence $N = 100(1 - \alpha)\%$,
      • then $-\infty \leq y \leq \mu + z_N\sigma$ with confidence $N = 100(1 - \alpha/2)\%$, and $\mu - z_N\sigma \leq y \leq +\infty$ with confidence $N = 100(1 - \alpha/2)\%$.
      • Example: $n = 40$, $r = 12$
        – Two-sided, 95% confidence ($\alpha = 0.05$): $P(0.16 \leq y \leq 0.44) = 0.95$
        – One-sided: $P(y \leq 0.44) = P(y \geq 0.16) = 1 - \alpha/2 = 0.975$
      [Plots: standard Normal density, two-sided vs. one-sided bounds]
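A quick numerical check (my addition) of the two-sided / one-sided relationship, using the standard Normal CDF $\Phi$ computed via the error function:

```python
import math

def phi(z):
    """Standard Normal CDF, Phi(z)."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z = 1.96                    # two-sided 95% bound: alpha = 0.05
print(round(phi(z) - phi(-z), 3))   # 0.95  (two-sided mass)
print(round(phi(z), 3))             # 0.975 (one-sided: 1 - alpha/2)
```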

  17. Calculating Confidence Intervals
      1. Pick the parameter $p$ to estimate: $error_{\mathcal{D}}(h)$.
      2. Choose an estimator: $error_S(h)$.
      3. Determine the probability distribution that governs the estimator: $error_S(h)$ is governed by a Binomial distribution, approximated by a Normal distribution when $n \geq 30$.
      4. Find the interval $(L, U)$ such that N% of the probability mass falls in the interval, using a table of $z_N$ values.

  18. Central Limit Theorem
      Consider a set of independent, identically distributed random variables $Y_1 \ldots Y_n$, all governed by an arbitrary probability distribution with mean $\mu$ and finite variance $\sigma^2$. Define the sample mean
        $\bar{Y} \equiv \frac{1}{n} \sum_{i=1}^{n} Y_i$
      Central Limit Theorem: as $n \to \infty$, the distribution governing $\bar{Y}$ approaches a Normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$.
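An added simulation (not from the slides) illustrating the theorem: sample means of $n$ draws from Uniform(0, 1), where $\mu = 0.5$ and $\sigma^2 = 1/12$, cluster around $\mu$ with variance $\sigma^2/n$:

```python
import random
import statistics

random.seed(0)
n, trials = 100, 10_000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

print(round(statistics.fmean(means), 3))     # ~0.5   (mu)
print(round(statistics.variance(means), 6))  # ~(1/12)/100 = 0.000833
```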

  19. Difference Between Hypotheses
      Test $h_1$ on sample $S_1$, test $h_2$ on $S_2$.
      1. Pick the parameter to estimate:
           $d \equiv error_{\mathcal{D}}(h_1) - error_{\mathcal{D}}(h_2)$
      2. Choose an estimator:
           $\hat{d} \equiv error_{S_1}(h_1) - error_{S_2}(h_2)$
      3. Determine the probability distribution that governs the estimator:
           $\sigma_{\hat{d}} \approx \sqrt{\frac{error_{S_1}(h_1)(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)(1 - error_{S_2}(h_2))}{n_2}}$
      4. Find the interval $(L, U)$ such that N% of the probability mass falls in the interval:
           $\hat{d} \pm z_N \sqrt{\frac{error_{S_1}(h_1)(1 - error_{S_1}(h_1))}{n_1} + \frac{error_{S_2}(h_2)(1 - error_{S_2}(h_2))}{n_2}}$
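A sketch (my addition) of this estimator and its interval:

```python
import math

def diff_interval(e1, n1, e2, n2, z_N=1.96):
    """N% confidence interval for d = error_D(h1) - error_D(h2)."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z_N * sigma, d_hat + z_N * sigma

# Using the numbers from the next slide's example:
print(diff_interval(0.30, 100, 0.20, 100))   # roughly (-0.02, 0.22)
```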

  20. Hypothesis Testing
      $P(error_{\mathcal{D}}(h_1) > error_{\mathcal{D}}(h_2)) = ?$
      • Example:
        ◦ $|S_1| = |S_2| = 100$
        ◦ $error_{S_1}(h_1) = 0.30$
        ◦ $error_{S_2}(h_2) = 0.20$
        ◦ $\hat{d} = 0.10$
        ◦ $\sigma_{\hat{d}} = 0.061$
      • $P(\hat{d} < \mu_{\hat{d}} + 0.10)$ = probability that $\hat{d}$ does not overestimate $d$ by more than 0.10
        ◦ $z_N \cdot \sigma_{\hat{d}} = 0.10$
        ◦ $z_N = 1.64$
      • $P(\hat{d} < \mu_{\hat{d}} + 1.64\,\sigma_{\hat{d}}) = 0.95$
      • I.e., we reject the null hypothesis at the 0.05 level of significance.
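A sketch (my addition) reproducing this one-sided test: how confident can we be that $d > 0$, given $\hat{d} = 0.10$?

```python
import math

e1, n1, e2, n2 = 0.30, 100, 0.20, 100
d_hat = e1 - e2
sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)   # ~0.061

z = d_hat / sigma                                    # ~1.64
confidence = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # Normal CDF at z
print(round(confidence, 2))                          # ~0.95
```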

  21. Paired t Test to Compare $h_A$, $h_B$
      1. Partition the data into $k$ disjoint test sets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
      2. For $i$ from 1 to $k$, do: $\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$
      3. Return the value $\bar{\delta}$, where
           $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$
      N% confidence interval estimate for $d$:
           $\bar{\delta} \pm t_{N,k-1} \, s_{\bar{\delta}}$, where $s_{\bar{\delta}} \equiv \sqrt{\frac{1}{k(k-1)} \sum_{i=1}^{k} (\delta_i - \bar{\delta})^2}$
      Note: the $\delta_i$ are approximately Normally distributed.
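A sketch (my addition) of the paired t procedure above. The critical value $t_{N,k-1}$ would come from a t table (scipy.stats.t.ppf could also supply it); here it is hard-coded to 2.776, the 95% two-sided value for $k - 1 = 4$ degrees of freedom, and the per-fold differences are hypothetical:

```python
import math
import statistics

def paired_t_interval(deltas, t_crit=2.776):
    """N% interval: delta_bar +/- t_crit * s, with s as defined above."""
    k = len(deltas)
    d_bar = statistics.fmean(deltas)
    s = math.sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return d_bar - t_crit * s, d_bar + t_crit * s

# Hypothetical per-fold differences error_Ti(hA) - error_Ti(hB), k = 5:
print(paired_t_interval([0.05, 0.02, 0.04, 0.01, 0.03]))
```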

  22. Comparing Learning Algorithms $L_A$ and $L_B$
      What we would like to estimate:
        $E_{S \subset \mathcal{D}}[error_{\mathcal{D}}(L_A(S)) - error_{\mathcal{D}}(L_B(S))]$
      where $L(S)$ is the hypothesis output by learner $L$ using training set $S$; i.e., the expected difference in true error between hypotheses output by learners $L_A$ and $L_B$, when trained using randomly selected training sets $S$ drawn according to distribution $\mathcal{D}$.
      But given limited data $D_0$, what is a good estimator?
      • We could partition $D_0$ into training set $S_0$ and test set $T_0$, and measure
          $error_{T_0}(L_A(S_0)) - error_{T_0}(L_B(S_0))$
      • Even better, repeat this many times and average the results (next slide).

  23. Comparing Learning Algorithms $L_A$ and $L_B$
      1. Partition data $D_0$ into $k$ disjoint test sets $T_1, T_2, \ldots, T_k$ of equal size, where this size is at least 30.
      2. For $i$ from 1 to $k$, use $T_i$ for the test set and the remaining data for training set $S_i$:
         • $S_i \leftarrow \{D_0 - T_i\}$
         • $h_A \leftarrow L_A(S_i)$
         • $h_B \leftarrow L_B(S_i)$
         • $\delta_i \leftarrow error_{T_i}(h_A) - error_{T_i}(h_B)$
      3. Return the value $\bar{\delta}$, where
           $\bar{\delta} \equiv \frac{1}{k} \sum_{i=1}^{k} \delta_i$
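A skeleton (my addition, with hypothetical interfaces) of this k-fold comparison procedure. `L_A` and `L_B` are assumed to be functions mapping a training set to a hypothesis, and `error(h, T)` computes the sample error of `h` on test set `T`:

```python
def compare_learners(L_A, L_B, D0, k, error):
    """Return delta_bar, the average per-fold test-error difference."""
    size = len(D0) // k
    folds = [D0[i * size:(i + 1) * size] for i in range(k)]
    deltas = []
    for i, T_i in enumerate(folds):
        # Train on everything except fold i, test on fold i.
        S_i = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_A, h_B = L_A(S_i), L_B(S_i)
        deltas.append(error(h_A, T_i) - error(h_B, T_i))
    return sum(deltas) / k
```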
