Learning from Examples
Steven J Zeil
Old Dominion Univ.
Fall 2010

Outline
1 Learning from Examples
  - Motivation
  - Supervised Learning
  - Aspects of Supervised Learning
2 Classification
  - Hypotheses
  - Noise
  - Multiple Classes
3 Regression
4 Model Selection & Generalization
5 Recap

Motivation
- We have a collection of data.
- Some instances are known to be examples of an interesting class.
- We wish to learn a rule by which we can determine membership in that class.

Example: Spam Detection
- Training set: a collection of email messages. Humans have flagged some as spam.
- Knowledge extraction:
  - We scan the messages for "typical" spam vocabulary: names of drugs, "mortgage", Nigeria, ...
  - We also scan for irregularities in the formal content: obfuscated URLs, routing via open SMTP relays, mismatches between the "From:" domain and the IP address of origin, etc.
- Input representation: a vector $\vec{x}$ where
  - $x_i$, $0 \le i < k$: count of the number of occurrences of spam term $w_i$
  - $x_i$, $k \le i < n$: 1 if irregularity $z_{i-k}$ is present, 0 otherwise
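To make the input representation concrete, here is a minimal Python sketch of the feature vector described above. The spam terms, the irregularity checks, and the message format are all invented for illustration; a real filter would use far larger term lists and genuine header analysis.

```python
import re

# Hypothetical spam vocabulary (w_0 .. w_{k-1}) and irregularity checks
# (z_0 .. z_{n-k-1}); both lists are made up for this example.
SPAM_TERMS = ["viagra", "mortgage", "nigeria"]
IRREGULARITY_CHECKS = [
    lambda m: bool(re.search(r"%[0-9a-fA-F]{2}", m["body"])),   # obfuscated URL escapes
    lambda m: m.get("open_relay", False),                       # routed via an open SMTP relay
    lambda m: m.get("from_domain") != m.get("origin_domain"),   # "From:" vs. origin mismatch
]

def to_feature_vector(msg):
    """Map a message to x: k term counts followed by n-k binary irregularity flags."""
    body = msg["body"].lower()
    counts = [body.count(term) for term in SPAM_TERMS]                 # x_i, 0 <= i < k
    flags = [1 if check(msg) else 0 for check in IRREGULARITY_CHECKS]  # x_i, k <= i < n
    return counts + flags

msg = {"body": "Cheap mortgage offers from Nigeria: http://x.example/%68%69",
       "open_relay": True, "from_domain": "bank.com", "origin_domain": "example.net"}
print(to_feature_vector(msg))   # [0, 1, 1, 1, 1, 1]
```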
Example: "Family" Cars
- Is car x a "family car"?
- Knowledge extraction: What features define a "family" car?
- Training set: positive (+) and negative (−) examples
- Input representation: $x_1$: price, $x_2$: engine power

Aspects of Supervised Learning
- We have a sample $\mathcal{X} = \{\vec{x}^t, r^t\}_{t=1}^N$ that is independent and identically distributed:
  - ordering is not important
  - all instances are drawn from the same joint distribution $p(\vec{x}, r)$
- A model $g(\vec{x} \mid \theta)$, where $\theta$ are the parameters
- A procedure to find the "best" parameter values $\theta^*$

Training Set X: Family Cars
- $\mathcal{X} = \{\vec{x}^t, r^t\}$
- Input: $\vec{x} = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$
- Label: $r = 1$ if $\vec{x}$ is a family car, $0$ otherwise

Hypothesis Class
- We have a hypothesis class $H$ parameterized by $(p_1, p_2, e_1, e_2)$:
  $(p_1 \le \text{price} \le p_2) \wedge (e_1 \le \text{engine power} \le e_2)$
- We want to "learn" a hypothesis $h \in H$ that approximates the true class as closely as possible.
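As an illustration of this hypothesis class, the sketch below implements a single rectangle hypothesis $h$ parameterized by $(p_1, p_2, e_1, e_2)$. The parameter values and units are invented for the example.

```python
def h(x, theta):
    """One member of H: predict 1 (family car) iff x falls inside the rectangle.

    x     = (price, engine_power)
    theta = (p1, p2, e1, e2)
    """
    price, engine_power = x
    p1, p2, e1, e2 = theta
    return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0

# Invented parameter values: price in $1000s, engine power in hp.
theta = (15, 30, 90, 180)
print(h((22, 140), theta))   # 1: inside the rectangle
print(h((45, 300), theta))   # 0: outside on both axes
```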
Empirical Error
- Given a hypothesis $h$ that returns 1 for a positive prediction and 0 for a negative prediction, the error of the hypothesis is
  $E(h \mid \mathcal{X}) = \sum_{t=1}^{N} 1\big(h(\vec{x}^t) \ne r^t\big)$

Version Space
- In this case, there are an infinite number of hypotheses (the version space) for which $E(h \mid \mathcal{X}) = 0$.
- Suppose that the blue area is the "true" class. If we choose the yellow region as our $h$, the pure yellow bars represent false positives, and the pure blue bars represent false negatives.

Doubt
- The most specific hypothesis $S$ is the tightest rectangle enclosing the positive examples.
  - prone to false negatives
- The most general hypothesis $G$ is the largest rectangle enclosing the positive examples but containing no negative examples.
  - prone to false positives
- Perhaps we should choose something in between?
- $S \subseteq G$; $G - S$ is the region of doubt.

Model Complexity
- What if not every point can be accounted for by $H$?
- Use a more complicated hypothesis class, or stay with the simpler one and tolerate a non-zero error?
- Depends partly on whether we believe this is noise or a failure of modeling.
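Tying these slides together, here is a minimal sketch that computes the empirical error $E(h \mid \mathcal{X})$ of a rectangle hypothesis and constructs the most specific hypothesis $S$ as the tightest box around the positive examples. The toy training set is made up; the most general hypothesis $G$ is not computed here.

```python
def in_box(x, theta):
    """1 iff x = (price, power) lies inside the rectangle theta = (p1, p2, e1, e2)."""
    p1, p2, e1, e2 = theta
    return 1 if p1 <= x[0] <= p2 and e1 <= x[1] <= e2 else 0

def empirical_error(theta, X, r):
    """E(h|X): number of training examples the rectangle hypothesis misclassifies."""
    return sum(1 for x_t, r_t in zip(X, r) if in_box(x_t, theta) != r_t)

def most_specific_S(X, r):
    """S: the tightest axis-aligned rectangle enclosing the positive examples."""
    prices = [x[0] for x, lbl in zip(X, r) if lbl == 1]
    powers = [x[1] for x, lbl in zip(X, r) if lbl == 1]
    return (min(prices), max(prices), min(powers), max(powers))

# Toy data: (price, engine power) pairs and labels.
X = [(18, 110), (22, 140), (27, 160), (10, 60), (40, 250)]
r = [1, 1, 1, 0, 0]

theta_S = most_specific_S(X, r)
print(theta_S)                         # (18, 27, 110, 160)
print(empirical_error(theta_S, X, r))  # 0: S lies in the version space for this sample
```

On this sample $S$ makes no training errors, but on new data it remains prone to false negatives, exactly as the slide notes.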
Coping with Noise
- Arguments in favor of staying with the simpler model:
  - Easier/faster to work with
  - Easier to train (fewer examples needed)
  - Easier to explain
  - May generalize better (Occam's razor)

Learning Multiple Classes
- Training set is $\mathcal{X} = \{\vec{x}^t, \vec{r}^t\}_{t=1}^N$ where
  $r_i^t = 1$ if $\vec{x}^t \in C_i$, and $r_i^t = 0$ if $\vec{x}^t \in C_j,\ j \ne i$
- Notice that $\vec{r}^t$ partitions the input space (even if the $\{C_i\}$ do not).

Multiple Class Hypotheses
- Train hypotheses $h_i(\vec{x})$, $i = 1, \ldots, k$:
  $h_i(\vec{x}^t) = 1$ if $\vec{x}^t \in C_i$, and $h_i(\vec{x}^t) = 0$ if $\vec{x}^t \in C_j,\ j \ne i$
- We train $k$ hypotheses to recognize $k$ classes, each class against the world.

Regression
- $\mathcal{X} = \{\vec{x}^t, r^t\}$, $r^t \in \mathbb{R}$, $r^t = f(x^t) + \epsilon$
- $E(g \mid \mathcal{X}) = \frac{1}{N} \sum_{t=1}^{N} \big(r^t - g(x^t)\big)^2$
- Example hypotheses: $g(x) = w_1 x + w_0$ and $g(x) = w_2 x^2 + w_1 x + w_0$
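The two regression hypotheses above (linear and quadratic $g$) can be fit by least squares. The sketch below uses NumPy's polyfit on synthetic noisy data invented for illustration and reports the average squared training error $E(g \mid \mathcal{X})$ for each degree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic sample: r^t = f(x^t) + noise, with f an arbitrary quadratic.
x = np.linspace(0.0, 1.0, 20)
r = 2.0 * x**2 - x + 0.5 + rng.normal(scale=0.05, size=x.shape)

def fit_and_error(degree):
    """Least-squares polynomial fit g of the given degree and its training error E(g|X)."""
    w = np.polyfit(x, r, degree)        # coefficients w_degree, ..., w_1, w_0
    g = np.polyval(w, x)                # g(x^t) for every training point
    return w, np.mean((r - g) ** 2)     # (1/N) * sum (r^t - g(x^t))^2

for d in (1, 2):
    w, err = fit_and_error(d)
    print(f"degree {d}: w = {np.round(w, 2)}, E(g|X) = {err:.4f}")
```

The quadratic fit has the lower training error here, but lower training error alone does not guarantee better generalization, which is exactly the issue the next slides take up.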
Regression (continued)
- Learning is an ill-posed problem; data is usually not sufficient to find a unique solution.
- Assumptions about $H$ form our inductive bias.
- Generalization: how well does a model perform on new data?
- Overfitting: $H$ is more complex than $C$ (or $f$).
- Underfitting: $H$ is less complex than $C$ (or $f$).
- $g(x) = w_1 x + w_0$ versus $g(x) = w_2 x^2 + w_1 x + w_0$

Triple Trade-Off
- There is a trade-off between three factors:
  1 Complexity of $H$, $c(H)$
  2 Training set size, $N$
  3 Generalization error on new data, $E$
- As $N$ increases, $E$ decreases.
- As $c(H)$ increases, $E$ initially decreases and then increases.

Recap: Aspects of Supervised Learning
- We have a sample $\mathcal{X} = \{\vec{x}^t, r^t\}_{t=1}^N$ that is independent and identically distributed:
  - ordering is not important
  - all instances are drawn from the same joint distribution $p(\vec{x}, r)$
- A model $g(\vec{x} \mid \theta)$, where $\theta$ are the parameters
- A loss function $L(\cdot)$ computing the difference between the desired output $r^t$ and our approximation to it, $g(\vec{x}^t \mid \theta)$
- Approximation error (or loss): $E(\theta \mid \mathcal{X}) = \sum_t L\big(r^t, g(\vec{x}^t \mid \theta)\big)$
- An optimization procedure to find $\theta^* = \arg\min_\theta E(\theta \mid \mathcal{X})$
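The recap amounts to a small recipe: pick a model $g(\vec{x} \mid \theta)$, pick a loss $L$, sum the loss over the sample to get $E(\theta \mid \mathcal{X})$, and hand that to an optimizer. A minimal sketch, assuming a linear model, squared loss, made-up data, and SciPy's general-purpose minimizer standing in for the optimization procedure:

```python
import numpy as np
from scipy.optimize import minimize

# Made-up sample X = {(x^t, r^t)}; r is roughly linear in x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
r = np.array([0.9, 3.1, 5.0, 6.9, 9.2])

def g(x_t, theta):
    """Model g(x|theta): a linear hypothesis with theta = (w1, w0)."""
    w1, w0 = theta
    return w1 * x_t + w0

def L(r_t, g_t):
    """Loss: squared difference between desired output and our approximation."""
    return (r_t - g_t) ** 2

def E(theta):
    """E(theta|X) = sum_t L(r^t, g(x^t|theta))."""
    return sum(L(r_t, g(x_t, theta)) for x_t, r_t in zip(x, r))

# Optimization procedure: theta* = argmin_theta E(theta|X).
theta_star = minimize(E, x0=np.zeros(2)).x
print(np.round(theta_star, 2))   # roughly [2.04, 0.94] for this data
```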