
CSCE 478/878 Lecture 5: Evaluating Hypotheses



Stephen D. Scott
(Adapted from Tom Mitchell's slides)

Outline

• Sample error vs. true error
• Confidence intervals for observed hypothesis error
• Estimators
• Binomial distribution, Normal distribution, Central Limit Theorem
• Paired t tests
• Comparing learning methods
• ROC analysis

Two Definitions of Error

• The true error of hypothesis h with respect to target function f and distribution D is the probability that h will misclassify an instance drawn at random according to D:

      error_D(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]

• The sample error of h with respect to target function f and data sample S (|S| = n) is the proportion of examples h misclassifies:

      error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x)),

  where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.

• How well does error_S(h) estimate error_D(h)?

Problems Estimating Error

• Bias: If S is the training set, error_S(h) is optimistically biased:

      bias ≡ E[error_S(h)] − error_D(h)

  For an unbiased estimate (bias = 0), h and S must be chosen independently ⇒ don't test on the training set! (Don't confuse this with inductive bias!)

• Variance: Even with an unbiased S, error_S(h) may still vary from error_D(h).

Estimators

Experiment:

1. Choose sample S of size n according to distribution D
2. Measure error_S(h)

error_S(h) is a random variable (i.e., the result of an experiment), and it is an unbiased estimator for error_D(h).

Given an observed error_S(h), what can we conclude about error_D(h)?

Confidence Intervals

If

• S contains n examples, drawn independently of h and of each other, and
• n ≥ 30,

then with approximately 95% probability, error_D(h) lies in the interval

      error_S(h) ± 1.96 √(error_S(h)(1 − error_S(h))/n)

E.g., if hypothesis h misclassifies 12 of the 40 examples in test set S:

      error_S(h) = 12/40 = 0.30

Then with approx. 95% confidence, error_D(h) ∈ [0.158, 0.442]. (A code sketch of this computation follows below.)
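A minimal sketch of that interval computation in Python (the helper name error_ci and its defaults are mine, not from the slides):

```python
import math

def error_ci(sample_error, n, z=1.96):
    """Approximate confidence interval for error_D(h), given the
    sample error measured on n independently drawn test examples
    (valid for roughly n >= 30; z = 1.96 gives ~95% confidence)."""
    half_width = z * math.sqrt(sample_error * (1 - sample_error) / n)
    return sample_error - half_width, sample_error + half_width

# The slides' example: h misclassifies 12 of 40 test examples.
sample_error = 12 / 40                 # = 0.30
lo, hi = error_ci(sample_error, n=40)
print(f"error_S(h) = {sample_error:.2f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
# -> [0.158, 0.442], matching the slide
```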

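To make the "random variable" view of the next slides concrete, here is a small simulation (my own illustration, not from the slides): draw many independent test sets of size n = 40 for a hypothesis whose true error is 0.3, and inspect how error_S(h) fluctuates around error_D(h):

```python
import random

random.seed(0)
true_error, n, trials = 0.3, 40, 10_000

# Each trial draws a fresh test set of size n: every example is
# misclassified with probability true_error, so the number of
# mistakes r is Binomial(n, true_error) and error_S(h) = r/n.
sample_errors = []
for _ in range(trials):
    r = sum(random.random() < true_error for _ in range(n))
    sample_errors.append(r / n)

mean = sum(sample_errors) / trials
print(f"mean of error_S(h) over {trials} runs: {mean:.4f}")   # close to 0.30
print(f"spread: min {min(sample_errors):.3f}, max {max(sample_errors):.3f}")
```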
error_S(h) is a Random Variable

Repeatedly run the experiment, each time with a different randomly drawn S (each of size n).

The probability of observing r misclassified examples is binomial:

      P(r) = (n! / (r!(n − r)!)) · error_D(h)^r (1 − error_D(h))^(n−r)

[Figure: Binomial distribution for n = 40, p = 0.3; x-axis r, y-axis P(r)]

I.e., let error_D(h) be the probability of heads for a biased coin; then P(r) is the probability of getting r heads in n flips.

Binomial Probability Distribution

      P(r) = (n! / (r!(n − r)!)) p^r (1 − p)^(n−r)

P(r) is the probability of r heads in n coin flips, if p = Pr(heads).

• Expected, or mean, value of X (= # heads on n flips = # mistakes on n test examples):

      E[X] ≡ Σ_{i=0}^{n} i·P(i) = np = n · error_D(h)

• Variance of X:

      Var(X) ≡ E[(X − E[X])²] = np(1 − p)

• Standard deviation of X:

      σ_X ≡ √(E[(X − E[X])²]) = √(np(1 − p))

Confidence Intervals (cont'd)

If

• S contains n examples, drawn independently of h and of each other, and
• n ≥ 30,

then with approximately N% probability, error_D(h) lies in the interval

      error_S(h) ± z_N √(error_S(h)(1 − error_S(h))/n)

where

      N%:  50%   68%   80%   90%   95%   98%   99%
      z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

Why? What kind of distribution is this?

Normal Probability Distribution

      p(x) = (1/√(2πσ²)) exp(−(1/2)((x − µ)/σ)²)

[Figure: Normal distribution with mean 0, standard deviation 1]

• Defined completely by µ and σ
• The probability that X will fall into the interval (a, b) is ∫_a^b p(x) dx
• Expected, or mean, value of X: E[X] = µ
• Variance of X: Var(X) = σ²
• Standard deviation of X: σ_X = σ

Normal Probability Distribution (cont'd)

80% of the area (probability) lies in µ ± 1.28σ; in general, N% of the area lies in µ ± z_N σ:

      N%:  50%   68%   80%   90%   95%   98%   99%
      z_N: 0.67  1.00  1.28  1.64  1.96  2.33  2.58

Can also have one-sided bounds: N% of the area lies below µ + z′_N σ (or above µ − z′_N σ), where z′_N = z_{2N−100}:

      N%:   50%   68%   80%   90%   95%   98%   99%
      z′_N: 0.00  0.47  0.84  1.28  1.64  2.05  2.33

Approximate Binomial Dist. with Normal

error_S(h) = r/n is binomially distributed, with

• mean µ_{error_S(h)} = error_D(h) (i.e., an unbiased estimate)
• standard deviation σ_{error_S(h)} = √(error_D(h)(1 − error_D(h))/n)

(i.e., increasing n decreases the variance)

We want to compute a confidence interval: an interval centered at error_D(h) containing N% of the weight under the distribution. This is difficult to do exactly for the binomial, so approximate it by the normal (Gaussian) distribution, with

• mean µ_{error_S(h)} = error_D(h)
• standard deviation σ_{error_S(h)} ≈ √(error_S(h)(1 − error_S(h))/n)
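A quick numeric check of the binomial quantities from the "Binomial Probability Distribution" slide, for the plotted example n = 40, p = 0.3 (standard library only; the helper name binom_P is mine):

```python
import math

n, p = 40, 0.3   # the plotted example: 40 test examples, error_D(h) = 0.3

def binom_P(r, n, p):
    """P(r) = n!/(r!(n-r)!) * p^r * (1-p)^(n-r)"""
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

print("E[X]  =", n * p)                       # np = 12 expected mistakes
print("Var   =", n * p * (1 - p))             # np(1-p) = 8.4
print("sigma =", math.sqrt(n * p * (1 - p)))  # ~2.898
print("P(12) =", round(binom_P(12, n, p), 4)) # ~0.14, the plot's peak
print("sum   =", sum(binom_P(r, n, p) for r in range(n + 1)))  # ~1.0 (sanity check)
```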

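The z_N and z′_N tables above can be recovered from the standard normal's quantile function; a sketch using Python's statistics.NormalDist (my choice of tool, not something the slides use):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

for N in (50, 68, 80, 90, 95, 98, 99):
    # Two-sided: N% of the mass lies within mu +/- z_N sigma,
    # so z_N is the 1 - (1 - N/100)/2 quantile.
    z_N = std_normal.inv_cdf(1 - (1 - N / 100) / 2)
    # One-sided: N% of the mass lies below mu + z'_N sigma,
    # so z'_N is simply the N/100 quantile (= z_{2N-100}).
    z1_N = std_normal.inv_cdf(N / 100)
    print(f"N = {N}%: z_N = {z_N:.2f}, z'_N = {z1_N:.2f}")
# Reproduces both tables, e.g. N = 95% gives z_N = 1.96 and z'_N = 1.64.
```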
Confidence Intervals Revisited

If

• S contains n examples, drawn independently of h and of each other, and
• n ≥ 30,

then with approximately 95% probability, error_S(h) lies in the interval

      error_D(h) ± 1.96 √(error_D(h)(1 − error_D(h))/n)

Equivalently, error_D(h) lies in the interval

      error_S(h) ± 1.96 √(error_D(h)(1 − error_D(h))/n)

which is approximately

      error_S(h) ± 1.96 √(error_S(h)(1 − error_S(h))/n)

(One-sided bounds yield upper or lower error bounds.)

Central Limit Theorem

How can we justify this approximation?

Consider a set of independent, identically distributed random variables Y_1, ..., Y_n, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

      Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i

Note that Ȳ is itself a random variable, i.e., the result of an experiment (e.g., error_S(h) = r/n).

Central Limit Theorem: As n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n.

Thus the distribution of error_S(h) is approximately normal for large n, and its expected value is error_D(h). (Rule of thumb: n ≥ 30 suffices when the estimator's distribution is binomial; n may need to be larger for other distributions.)

Calculating Confidence Intervals

1. Pick the parameter p to estimate
   • error_D(h)
2. Choose an estimator
   • error_S(h)
3. Determine the probability distribution that governs the estimator
   • error_S(h) is governed by the binomial distribution, approximated by the normal when n ≥ 30
4. Find the interval (L, U) such that N% of the probability mass falls in the interval
   • Could have L = −∞ or U = ∞
   • Use the table of z_N or z′_N values (if the distribution is normal)

Difference Between Hypotheses

Test h_1 on sample S_1 and h_2 on S_2, where S_1 ∩ S_2 = ∅.

1. Pick the parameter to estimate:

      d ≡ error_D(h_1) − error_D(h_2)

2. Choose an estimator (unbiased):

      d̂ ≡ error_{S_1}(h_1) − error_{S_2}(h_2)

3. Determine the probability distribution that governs the estimator. The difference between two normals is also normal, and the variances add:

      σ_d̂ ≈ √( error_{S_1}(h_1)(1 − error_{S_1}(h_1))/n_1 + error_{S_2}(h_2)(1 − error_{S_2}(h_2))/n_2 )

4. Find the interval (L, U) such that N% of the probability mass falls in the interval:

      d̂ ± z_N σ_d̂

(Can also use S = S_1 ∪ S_2 to test both h_1 and h_2, but this is not as accurate: the interval is overly conservative. A code sketch of this estimate appears after these slides.)

Paired t test to compare h_A, h_B

1. Partition the data into k disjoint test sets T_1, T_2, ..., T_k of equal size, where this size is at least 30.
2. For i from 1 to k, do:

      δ_i ← error_{T_i}(h_A) − error_{T_i}(h_B)

3. Return the value δ̄, where

      δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i

The N% confidence interval estimate for d is

      δ̄ ± t_{N,k−1} s_δ̄ ,   where   s_δ̄ ≡ √( (1/(k(k−1))) Σ_{i=1}^{k} (δ_i − δ̄)² )

Here t plays the role of z, and s plays the role of σ. The t test gives more accurate results, since the standard deviation is only approximated and the test sets for h_A and h_B are not independent.

Comparing Learning Algorithms L_A and L_B

What we'd like to estimate:

      E_{S⊂D}[ error_D(L_A(S)) − error_D(L_B(S)) ]

where L(S) is the hypothesis output by learner L using training set S. I.e., the expected difference in true error between the hypotheses output by learners L_A and L_B, when trained on randomly selected training sets S drawn according to distribution D.

But, given limited data D_0, what is a good estimator?

• Could partition D_0 into training set S_0 and test set T_0, and measure

      error_{T_0}(L_A(S_0)) − error_{T_0}(L_B(S_0))

• Even better: repeat this many times and average the results (next slide)
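A sketch of the two-hypothesis interval from the "Difference Between Hypotheses" slide, following its steps 1–4 (the function name and the example numbers are hypothetical):

```python
import math

def diff_ci(err1, n1, err2, n2, z=1.96):
    """Confidence interval for d = error_D(h1) - error_D(h2), where
    h1 was tested on n1 examples and h2 on n2 disjoint examples;
    z = 1.96 corresponds to N = 95%."""
    d_hat = err1 - err2
    sigma = math.sqrt(err1 * (1 - err1) / n1 + err2 * (1 - err2) / n2)
    return d_hat - z * sigma, d_hat + z * sigma

# Hypothetical numbers: h1 errs on 30% of 100 examples, h2 on 20%
# of 100 disjoint examples.
lo, hi = diff_ci(0.30, 100, 0.20, 100)
print(f"95% CI for d: [{lo:.3f}, {hi:.3f}]")
# An interval containing 0 means the observed difference could be chance.
```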

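And a sketch of the paired t test procedure above, assuming the per-test-set error rates have already been measured; the t value is looked up from a standard t table rather than computed, and the error-rate lists are hypothetical:

```python
import math

def paired_t_interval(errA, errB, t_value):
    """delta_bar +/- t_{N,k-1} * s_{delta_bar}, where errA[i] and errB[i]
    are error_{T_i}(h_A) and error_{T_i}(h_B) on k disjoint test sets."""
    k = len(errA)
    deltas = [a - b for a, b in zip(errA, errB)]   # step 2
    delta_bar = sum(deltas) / k                    # step 3
    s = math.sqrt(sum((d - delta_bar) ** 2 for d in deltas) / (k * (k - 1)))
    return delta_bar - t_value * s, delta_bar + t_value * s

# Hypothetical error rates of h_A and h_B on k = 5 disjoint test sets:
errA = [0.30, 0.28, 0.33, 0.31, 0.29]
errB = [0.25, 0.26, 0.30, 0.27, 0.24]
# t_{95, k-1} = t_{95, 4} = 2.776, from a standard t table.
lo, hi = paired_t_interval(errA, errB, t_value=2.776)
print(f"95% CI for error_D(h_A) - error_D(h_B): [{lo:.3f}, {hi:.3f}]")
```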