SLIDE 1 What you’ll learn today
- The difference between sample error and true error
- Confidence intervals for sample error
- How to estimate confidence intervals
- Binomial distribution, Normal distribution, Central Limit Theorem
- Paired t-tests and cross-validation
- Comparing learning methods
Slides largely pilfered from Tom
SLIDE 2 A practical problem Suppose you’ve trained a classifier h for your favorite problem (YFP), tested it on a sample S, and the error rate on the sample was 0.30.
- How good is that estimate?
- Should you throw away your old classifier for YFP, which has an error rate of 0.35 on
sample S, and replace it with h?
- Can you write a paper saying that you’ve reduced the best-known error rate for YFP
from 0.35 to 0.30? Will it get accepted?
SLIDE 3 Two Definitions of Error
The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:
errorD(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]
The sample error of h with respect to target function f and data sample S is the proportion of examples h misclassifies:
errorS(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))
where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
Usually, you don't know errorD(h). The big question is: how well does errorS(h) estimate errorD(h)?
SLIDE 4 Problems Estimating Error
- 1. Bias: if S is the training set, errorS(h) is (almost always) optimistically biased:
bias ≡ E[errorS(h)] − errorD(h)
This is also true if any part of the training procedure used any part of S, e.g. for feature engineering, feature selection, parameter tuning, . . . For an unbiased estimate, h and S must be chosen independently.
- 2. Variance: even with an unbiased S, errorS(h) may still vary from errorD(h). The variance of X is Var(X) ≡ E[(X − E[X])²].
SLIDE 5
Example Hypothesis h misclassifies 12 of the 40 examples in S: errorS(h) = 12/40 = .30. What is errorD(h)?
SLIDE 6 Example Hypothesis h misclassifies 12 of the 40 examples in S: errorS(h) = 12/40 = .30. What is errorD(h)? Some things we know:
- If θ = errorD(h), the number of errors R in the sample is binomial with parameters θ and |S| (i.e., it's like flipping a coin with bias θ exactly |S| times)
- Given r errors in n observations, θ̂ = r/n is the MLE for θ = errorD(h)
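As a quick illustration (a sketch; numpy and the true error rate here are assumptions), we can simulate drawing many samples S of size 40 from a hypothetical true error of 0.35 and watch how the MLE θ̂ = r/n scatters around it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_error = 40, 0.35            # |S| and a hypothetical errorD(h)

# Each draw of S yields r ~ Binomial(n, true_error) misclassifications
r = rng.binomial(n, true_error, size=10_000)
theta_hat = r / n                   # MLE for errorD(h) from each sample

print(theta_hat.mean())             # close to 0.35: the estimator is unbiased
print((theta_hat <= 0.30).mean())   # fraction of samples that look as good as .30
```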
SLIDE 7
The Binomial Distribution: probability P(R = r) of observing r misclassified examples:
P(r) = [n! / (r!(n − r)!)] · errorD(h)^r · (1 − errorD(h))^(n−r)
Question: what's the random event here? What's the experiment?
SLIDE 8
Aside: Credibility Intervals. From
P(R = r | Θ = θ) = [n! / (r!(n − r)!)] · θ^r (1 − θ)^(n−r)
we could try to compute
P(Θ = θ | R = r) = (1/Z) · P(R = r | Θ = θ) · P(Θ = θ)
to get a MAP estimate for θ, or an interval [θL, θU] that probably contains θ (probability taken over choices of Θ).
SLIDE 9 The Binomial Distribution Probability P(R = r) of observing r misclassified examples. The usual interpretation:
- h and errorD(h) are fixed quantities (not random)
- S is a random variable—i.e. the experiment is drawing the sample
- R = errorS(h) · |S| is a random variable depending on S
SLIDE 10
The Binomial Distribution: probability P(R = r) of observing r misclassified examples. Suppose |S| = 40 and errorS(h) = 12/40 = .30. How much would you bet that errorD(h) < 0.35?
Hint: the graph shows that P(R = 14) > 0.1, and 14/40 = 0.35. So it would not be that surprising to see a sample error of errorS(h) = .35 given a true error of errorD(h) = 0.30; likewise, a true error of 0.35 could quite plausibly produce the sample error of .30 that we observed.
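To put numbers on the bet (a sketch using scipy.stats, with the slide's n = 40):

```python
from scipy.stats import binom

n = 40
# If the true error were 0.30, seeing a sample error of 0.35 is unsurprising:
print(binom.pmf(14, n, 0.30))    # P(R = 14 | theta = 0.30), about 0.10

# Conversely, if the true error were 0.35, a sample error of .30 or better
# still happens roughly a third of the time:
print(binom.cdf(12, n, 0.35))    # P(R <= 12 | theta = 0.35), about 0.3
```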
SLIDE 11 Confidence Intervals for Estimators Experiment:
- 1. choose sample S of size n according to distribution D
- 2. measure errorS(h)
errorS(h) is a random variable (i.e., the result of an experiment), and it is an unbiased estimator for errorD(h). Given an observed errorS(h), what can we conclude about errorD(h)? It's probably not true that errorD(h) = errorS(h), but it probably is true that errorD(h) is "close to" errorS(h).
SLIDE 12 Confidence Intervals: Recipe 1
If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30
Then
- With approximately 95% probability, errorD(h) lies in the interval
errorS(h) ± 1.96 · √( errorS(h)(1 − errorS(h)) / n )
Another rule of thumb: if the interval above is within [0, 1], then it's reasonable to use this approximation.
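In code, for the running example (a sketch; n = 40 and errorS(h) = .30 come from slide 5):

```python
import math

n, err_s = 40, 12 / 40
half_width = 1.96 * math.sqrt(err_s * (1 - err_s) / n)
print(f"95% CI: [{err_s - half_width:.3f}, {err_s + half_width:.3f}]")
# -> roughly [0.158, 0.442]: far too wide to distinguish 0.30 from 0.35
```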
SLIDE 13 Confidence Intervals: Recipe 2
If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30
Then
- With approximately N% probability, errorD(h) lies in the interval
errorS(h) ± zN · √( errorS(h)(1 − errorS(h)) / n )
where
N%: 50% 68% 80% 90% 95% 98% 99%
zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58
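The zN values are just two-sided quantiles of the standard Normal; a short loop (scipy assumed) recovers the table:

```python
from scipy.stats import norm

for pct in (50, 68, 80, 90, 95, 98, 99):
    # N% of the mass lies within +/- z_N, so take the (1 + N/100)/2 quantile
    print(pct, round(norm.ppf((1 + pct / 100) / 2), 2))
# prints 0.67, 0.99, 1.28, 1.64, 1.96, 2.33, 2.58
# (the table's 1.00 for 68% corresponds to the exact 68.27% of mu +/- 1 sigma)
```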
Why does this work?
SLIDE 14 Facts about the Binomial Distribution
Probability P(r) of r heads in n coin flips, if p = Pr(heads):
P(r) = [n! / (r!(n − r)!)] · p^r (1 − p)^(n−r)
- Expected, or mean, value of X: E[X] ≡ Σ_{i=0..n} i · P(i) = np
- Variance of X: Var(X) ≡ E[(X − E[X])²] = np(1 − p)
- Standard deviation of X: σX ≡ √(E[(X − E[X])²]) = √(np(1 − p))
SLIDE 15 Another Fact: the Normal Approximates the Binomial
errorS(h) follows a Binomial distribution, with
- mean µ_errorS(h) = errorD(h)
- standard deviation σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )
For large enough n, the Binomial is well approximated by a Normal distribution with
- mean µ_errorS(h) = errorD(h)
- standard deviation σ_errorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )
SLIDE 16 Central Limit Theorem
Consider a set of independent, identically distributed random variables Y1 . . . Yn, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean
Ȳ ≡ (1/n) Σ_{i=1..n} Yi
Central Limit Theorem: as n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.
Notice that the standard deviation of each Yi is σ, but the standard deviation of Ȳ is σ/√n (a.k.a. the standard error of the mean).
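A quick simulation of the σ/√n behavior (a sketch; numpy and the Exponential(1) choice, which has µ = σ = 1, are assumptions): averaging n draws gives a Ȳ whose spread shrinks like σ/√n even though each Yi is far from Normal.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0                           # std of a single Exponential(1) draw
for n in (1, 4, 25, 100):
    # 10,000 simulated sample means, each averaging n draws
    ybar = rng.exponential(1.0, size=(10_000, n)).mean(axis=1)
    print(n, round(ybar.std(), 3), sigma / np.sqrt(n))  # empirical vs. sigma/sqrt(n)
```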
SLIDE 17 Fact about the Normal Distribution
p(x) = (1 / √(2πσ²)) · e^( −(1/2)((x − µ)/σ)² )
The probability that X will fall into the interval (a, b) is given by ∫_a^b p(x) dx
- Expected, or mean, value of X: E[X] = µ
- Variance of X: Var(X) = σ²
- Standard deviation of X: σX = σ
SLIDE 18
Facts about the Normal Probability Distribution
80% of the area (probability) lies in µ ± 1.28σ; more generally, N% of the area lies in µ ± zN·σ:
N%: 50% 68% 80% 90% 95% 98% 99%
zN: 0.67 1.00 1.28 1.64 1.96 2.33 2.58
SLIDE 19 Confidence Intervals, More Correctly
If
- S contains n examples, drawn independently of h and of each other
- n ≥ 30
Then
- With approximately 95% probability, errorS(h) lies in the interval
errorD(h) ± 1.96 · √( errorD(h)(1 − errorD(h)) / n )
equivalently, errorD(h) lies in the interval
errorS(h) ± 1.96 · √( errorD(h)(1 − errorD(h)) / n )
which is approximately
errorS(h) ± 1.96 · √( errorS(h)(1 − errorS(h)) / n )
SLIDE 20 Calculating Confidence Intervals: Recipe 2
- 1. Pick parameter p to estimate
- errorD(h)
- 2. Choose an unbiased estimator
- errorS(h)
- 3. Determine probability distribution that governs estimator
- errorS(h) governed by Binomial distribution, approximated by Normal when n ≥ 30
- 4. Find interval (L, U) such that N% of probability mass falls in the interval
- Use table of zN values
SLIDE 21 Estimating the Difference Between Hypotheses: Recipe 3
Test h1 on sample S1, test h2 on S2
- 1. Pick the parameter to estimate:
d ≡ errorD(h1) − errorD(h2)
- 2. Choose an unbiased estimator:
d̂ ≡ errorS1(h1) − errorS2(h2)
- 3. Determine the probability distribution that governs the estimator: d̂ is approximately Normal, with
σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
- 4. Find the interval (L, U) such that N% of the probability mass falls in the interval:
d̂ ± zN · √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
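A sketch of Recipe 3 applied to slide 2's numbers (the sample sizes n1 = n2 = 40 are assumptions):

```python
import math

def diff_ci(e1, n1, e2, n2, z=1.96):
    """N% confidence interval for errorD(h1) - errorD(h2)."""
    d_hat = e1 - e2
    sigma = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

print(diff_ci(0.30, 40, 0.35, 40))
# -> about (-0.26, 0.16): the interval straddles 0, so no confident winner
```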
SLIDE 22 A Tastier Version of Recipe 3: Paired z-test to compare hA, hB
- 1. Partition the data into k disjoint test sets T1, T2, . . . , Tk of equal size, where this size is at least 30.
- 2. For i from 1 to k, do
Yi ← errorTi(hA) − errorTi(hB)
- 3. Return the value Ȳ, where Ȳ ≡ (1/k) Σ_{i=1..k} Yi
By the Central Limit Theorem, Ȳ is approximately Normal, with estimated standard deviation
s_Ȳ ≡ √( (1/(k(k − 1))) Σ_{i=1..k} (Yi − Ȳ)² )
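In code, with hypothetical per-fold differences Yi (a sketch; only numpy assumed):

```python
import numpy as np

# Hypothetical values of Yi = errorTi(hA) - errorTi(hB) for k = 10 folds
y = np.array([-0.03, -0.01, -0.04, 0.02, -0.02, -0.05, 0.00, -0.03, -0.01, -0.02])
k = len(y)
y_bar = y.mean()
s_ybar = np.sqrt(((y - y_bar) ** 2).sum() / (k * (k - 1)))  # estimated sd of Y-bar

lo, hi = y_bar - 1.96 * s_ybar, y_bar + 1.96 * s_ybar
print(f"Y-bar = {y_bar:.3f}, approx 95% CI = [{lo:.3f}, {hi:.3f}]")
```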
SLIDE 23 Yet Another Version of Recipe 3: Paired t-test to compare hA, hB
- 1. Partition the data into k disjoint test sets T1, T2, . . . , Tk of equal size ("where this size is at least 30" is now crossed out: the t-test drops that requirement)
- 2. For i from 1 to k, do
yi ← errorTi(hA) − errorTi(hB)
- 3. Return the value ȳ, where ȳ ≡ (1/k) Σ_{i=1..k} yi
The standardized statistic (Ȳ − E[Y]) / s_Ȳ is approximately distributed as a t distribution with k − 1 degrees of freedom.
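Running the same hypothetical per-fold differences through a paired t-test (a sketch; scipy.stats assumed):

```python
import numpy as np
from scipy import stats

y = np.array([-0.03, -0.01, -0.04, 0.02, -0.02, -0.05, 0.00, -0.03, -0.01, -0.02])
k = len(y)
t = y.mean() / (y.std(ddof=1) / np.sqrt(k))   # y-bar / s_ybar
p = 2 * stats.t.sf(abs(t), df=k - 1)          # two-tailed p-value, k - 1 dof
print(t, p)

# Equivalent one-liner, since each yi pairs hA and hB on the same fold Ti:
print(stats.ttest_1samp(y, 0.0))
```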
SLIDE 24
The t-distribution
SLIDE 25 Yet Another Version of Recipe 3
- 1. Formulate the null hypothesis that the expected value of the difference is zero: i.e., for
Y = errorS(hA) − errorS(hB), assume E[Y] = 0
- 2. Use samples S1, . . . , Sk to generate samples y1, . . . , yk of Y, and then ȳ, a sample of Ȳ ∼ N(µ, σ), where
- σ is estimated with the sample
- µ = 0 by the null hypothesis
- 3. Assume ȳ > 0. You might compute
- the probability p1 of seeing Ȳ ≥ ȳ under the null hypothesis (one-tailed test)
- the probability p2 of seeing Ȳ ≥ ȳ or Ȳ ≤ −ȳ under the null hypothesis (two-tailed test)
- 4. If p1 (or p2) is low enough, then reject the null hypothesis
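Step 3 under the Normal approximation, with hypothetical ȳ and s_ȳ (a sketch; scipy assumed):

```python
from scipy.stats import norm

y_bar, s_ybar = 0.019, 0.008   # hypothetical sample mean and its standard error
z = y_bar / s_ybar             # standardized statistic; mu = 0 under the null

p1 = norm.sf(z)                # one-tailed: P(Y-bar >= y_bar | null)
p2 = 2 * norm.sf(abs(z))       # two-tailed: P(Y-bar >= y_bar or Y-bar <= -y_bar | null)
print(p1, p2)
```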
SLIDE 26 Recipe 4: Comparing learning algorithms LA and LB
What we'd like to estimate:
E_{S⊂D}[ errorD(LA(S)) − errorD(LB(S)) ]
where L(S) is the hypothesis output by learner L using training set S, i.e., the expected difference in true error between the hypotheses output by learners LA and LB when trained on randomly selected training sets S drawn according to distribution D.
But, given limited data D0, what is a good estimator?
- We could partition D0 into a training set S0 and a test set T0, and measure
errorT0(LA(S0)) − errorT0(LB(S0))
- Even better: repeat this many times and average the results (next slide)
SLIDE 27 Comparing learning algorithms LA and LB
- 1. Partition data D0 into k disjoint test sets T1, T2, . . . , Tk of equal size.
- 2. For i from 1 to k, use Ti as the test set and the remaining data as the training set Si:
- Si ← {D0 − Ti}
- hA ← LA(Si)
- hB ← LB(Si)
- yi ← errorTi(hA) − errorTi(hB)
- 3. Return the value ȳ, where ȳ ≡ (1/k) Σ_{i=1..k} yi
- 4. Note: (1/k) Σ_{i=1..k} errorTi(L(Si)) is the cross-validated error rate of learner L, and the procedure is called k-fold cross-validation. A special case: if k = |D0| and |Ti| = 1, this is leave-one-out cross-validation.
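A sketch of this procedure with scikit-learn (the dataset and the two learners here are stand-ins, not from the slides):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, labels = load_breast_cancer(return_X_y=True)   # stand-in for D0
y = []                                            # the per-fold differences yi
for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # Si = D0 - Ti: both learners train on the same split
    hA = LogisticRegression(max_iter=5000).fit(X[train], labels[train])
    hB = DecisionTreeClassifier(random_state=0).fit(X[train], labels[train])
    # yi = errorTi(hA) - errorTi(hB); score() is accuracy, so error = 1 - score
    y.append((1 - hA.score(X[test], labels[test])) -
             (1 - hB.score(X[test], labels[test])))

print(np.mean(y))   # y-bar: feed y into the paired t-test from slide 23
```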
SLIDE 28
Comparing learning algorithms LA and LB
Notice we'd like to use the paired t-test on ȳ to obtain a confidence interval (or to reject the null, etc.). In practice this is a good approximation, but it's not really correct: because the training sets in this procedure overlap, they are not independent, so the per-fold error rates are not independent either. It's more correct to view the algorithm as producing an estimate of
E_{S⊂D0}[ errorD(LA(S)) − errorD(LB(S)) ]
instead of
E_{S∼D}[ errorD(LA(S)) − errorD(LB(S)) ]
but even this approximation is better than no comparison.
SLIDE 29 Things to worry about In real life:
- Do you understand the assumptions behind your recipes?
- Is your sample representative?
- Are your test cases independent?