  Introduction

Choosing a data set
Extracting a test set of size > 30
Building a tree on the training set
Testing on the test set
Reporting the accuracy

Does the reported accuracy exactly match the generalization performance of the tree?

If a tree has error 10% and an ANN has error 11%, is the tree absolutely better?

Why or why not? How about the algorithms in general?

Outline

Goals of performance evaluation
Estimating error and confidence intervals
Paired t-tests and cross-validation to compare learning algorithms
Other performance measures
Confusion matrices
ROC analysis
Precision-recall curves

Setting Goals

Before setting up an experiment, need to understand exactly what the goal is

Estimate the generalization performance of a hypothesis
Tune a learning algorithm's parameters
Compare two learning algorithms on a specific task
Etc.

Will never be able to answer the question with 100% certainty

Due to variance in training set selection, test set selection, etc.

Will choose an estimator for the quantity in question, determine the probability distribution of the estimator, and bound the probability that the estimator is far from the true value

Estimator needs to work regardless of distribution of training/testing data

Setting Goals (cont'd)

For now, focus on straightforward, 0/1 classification error

Need to note that, in addition to statistical variations, what we determine is limited to the application that we are studying

E.g., if naïve Bayes is better than ID3 on spam filtering, that says nothing about which is better on face recognition

In planning experiments, need to ensure that training data not used for evaluation

I.e., don't test on the training set!

Will bias the performance estimator

Also holds for validation set used to prune DT, tune parameters, etc.

Validation set serves as part of training set, but not used for model building
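
One way to keep evaluation data away from model building is to cut the data into disjoint pieces once, before any training. The following is a minimal sketch only: the function name three_way_split, its parameters, and the toy data are hypothetical and do not come from the lecture.

```python
import random

def three_way_split(examples, n_validation, n_test, seed=0):
    """Shuffle a copy of `examples` and cut it into disjoint
    train / validation / test lists (hypothetical helper)."""
    if n_validation + n_test >= len(examples):
        raise ValueError("validation and test sets leave no training data")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test

# Toy data: 100 example ids; test set size chosen to be at least 30.
train, validation, test = three_way_split(range(100), n_validation=20, n_test=30)
```

The test list is touched only once, for the final error estimate; pruning and parameter tuning use the validation list.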

Types of Error

For hypothesis h, recall the two types of classification error from Chapter 2:

Empirical error (or sample error) is fraction of set V that h gets wrong:

error_V(h) ≡ 1/|V| Σ_{x∈V} δ(C(x) ≠ h(x)),

where δ(C(x) ≠ h(x)) is 1 if C(x) ≠ h(x), and 0 otherwise

Generalization error (or true error) is probability that a new, randomly selected, instance is misclassified by h

error_D(h) ≡ Pr_{x∈D}[C(x) ≠ h(x)],

where D is probability distribution instances are drawn from
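
The empirical error defined above can be computed directly from a labeled sample, whereas error_D(h) generally cannot, because D is unknown. A minimal sketch, assuming V is given as a list of (x, C(x)) pairs and h is a Python callable; the toy concept and hypothesis are made up for illustration.

```python
def empirical_error(h, labeled_examples):
    """Fraction of (x, C(x)) pairs in V that hypothesis h misclassifies."""
    if not labeled_examples:
        raise ValueError("V must contain at least one example")
    mistakes = sum(1 for x, label in labeled_examples if h(x) != label)
    return mistakes / len(labeled_examples)

# Hypothetical toy data: the target concept C is "x is at least 5".
V = [(x, x >= 5) for x in range(10)]

def h(x):
    # Hypothetical hypothesis; it disagrees with C on x = 5 and x = 6.
    return x >= 7

error = empirical_error(h, V)  # 2 mistakes out of 10 examples
```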

Why do we care about error_V(h)?

CSCE 478/878 Lecture 4: Experimental Design and Analysis, Stephen Scott

Estimating True Error

Bias: If T is training set, error_T(h) is optimistically biased

bias ≡ E[error_T(h)] − error_D(h)

For unbiased estimate (bias = 0), h and V must be chosen independently ⇒ Don't test on training set!

(Don't confuse with inductive bias!)

Variance: Even with unbiased V, error_V(h) may still vary from error_D(h)

Estimating True Error (cont'd)

Experiment:

Choose sample V of size N according to distribution D
Measure error_V(h)

error_V(h) is a random variable (i.e., result of an experiment)

error_V(h) is an unbiased estimator for error_D(h)

Given observed error_V(h), what can we conclude about error_D(h)?

Confidence Intervals

If

V contains N examples, drawn independently of h and each other
N ≥ 30

Then with approximately 95% probability, error_D(h) lies in

error_V(h) ± 1.96 √( error_V(h) (1 − error_V(h)) / N )

E.g., hypothesis h misclassifies 12 of the 40 examples in test set V:

error_V(h) = 12/40 = 0.30

Then with approx. 95% confidence, error_D(h) ∈ [0.158, 0.442]

Confidence Intervals (cont'd)

If

V contains N examples, drawn independently of h and each other
N ≥ 30

Then with approximately c% probability, error_D(h) lies in

error_V(h) ± z_c √( error_V(h) (1 − error_V(h)) / N )

Confidence level c%:  50%   68%   80%   90%   95%   98%   99%
z_c:                  0.67  1.00  1.28  1.64  1.96  2.33  2.58

Why?

error_V(h) is a Random Variable

Repeatedly run the experiment, each with different randomly drawn V (each of size N)

Probability of observing r misclassified examples:

P(r) = (N choose r) error_D(h)^r (1 − error_D(h))^(N−r)

[Figure: Binomial distribution for n = 40, p = 0.3; P(r) plotted against r from 0 to 40]

I.e., let error_D(h) be probability of heads in biased coin, then P(r) = prob. of getting r heads out of N flips

Binomial Probability Distribution

P(r) = (N choose r) p^r (1 − p)^(N−r) = N! / (r! (N − r)!) p^r (1 − p)^(N−r)

Probability P(r) of r heads in N coin flips, if p = Pr(heads)

Expected, or mean value of X, E[X] (= # heads on N flips = # mistakes on N test exs), is

E[X] ≡ Σ_{i=0}^{N} i P(i) = Np = N · error_D(h)

Variance of X is

Var(X) ≡ E[(X − E[X])^2] = Np(1 − p)

Standard deviation of X, σ_X, is

σ_X ≡ √( E[(X − E[X])^2] ) = √( Np(1 − p) )
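
The interval formula and the binomial mean and variance can be checked numerically on the lecture's own example (12 mistakes in 40 test examples, and a binomial with n = 40, p = 0.3). This is a sketch under the stated conditions (V drawn independently of h, N ≥ 30); the function names confidence_interval and binomial_pmf are hypothetical, and math.comb needs Python 3.8 or later.

```python
import math

def confidence_interval(mistakes, n, z=1.96):
    """Approximate interval for error_D(h) from `mistakes` errors on n
    independent test examples; z is the z_c value for the chosen level."""
    error_v = mistakes / n
    half_width = z * math.sqrt(error_v * (1 - error_v) / n)
    return error_v - half_width, error_v + half_width

def binomial_pmf(r, n, p):
    """P(r): probability of exactly r mistakes in n trials when each
    trial is a mistake with probability p."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)

# Worked example from the slides: 12 of 40 misclassified, 95% level.
low, high = confidence_interval(12, 40)

# Mean and variance of the binomial with n = 40, p = 0.3, computed from
# the definitions E[X] = sum of i*P(i) and Var(X) = E[(X - E[X])^2].
n, p = 40, 0.3
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]
mean = sum(r * pr for r, pr in enumerate(pmf))
variance = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))
```

Rounded to three places, low and high reproduce the interval [0.158, 0.442], while mean and variance agree with Np = 12 and Np(1 − p) = 8.4 up to floating-point error.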


