CMU-Q 15-381 Lecture 23: Supervised Learning 1 Teacher: Gianni A. Di Caro
MACHINE LEARNING
[Figure: data, targets, and a model obtained through inductive learning; side label: Data Science]
GENERAL ML SCHEME
§ Data in the domain is described in the language of the selected features.
§ Task: define an appropriate mapping from data to outputs.
§ Learning problem: obtain such a mapping from training data.
§ ML design: use the right features (description language) to build the right model, one that achieves the task according to the desired performance.
§ Learning by examples: look at some data, guess at a general scientific hypothesis, and make statements (predictions on test data) based on this hypothesis.
§ Inductive learning (from evidence) ≠ deductive learning (logical, from facts).
GENERAL ML SCHEME
[Diagram: training data (labeled / unlabeled; given / not given), hypotheses space, hypothesis function, performance criteria (errors / rewards)]
SUPERVISED LEARNING
[Diagram: labeled training data (given), hypotheses space, hypothesis function, performance criteria (errors)]
§ Supervised (inductive) learning (labeled data)
§ A training data set is given.
§ Training data include target outputs (labels assigned by a teacher / supervisor).
§ Using the labels, a precise error measure for a prediction can be derived.
§ Aims to find models that explain and generalize the observed data.
EXAMPLE OF SUPERVISED LEARNING TASK: CLASSIFICATION
§ Classification, categorical target: given $K$ possible classes/categories, to which class does each (new) data item belong?
§ Task mapping is $f: \mathcal{F}^d \to \{0,1\}^K$
§ Binary ($K = 2$): Dog or cat? Rich or poor? Hot or cold?
§ Multi-class ($K > 2$): Cloudy, or snowing, or mostly clear? Dog, or cat, or fox, or …?
EXAMPLE OF SUPERVISED LEARNING TASK: REGRESSION
§ Regression, numerical target: which function best describes the (conditional expectation) relation between $k$ dependent variables (outputs) and $d$ independent variables (predictors)?
§ Task mapping is $f: \mathcal{F}^d \to \mathbb{R}^k$
§ Univariate ($d = 1$): What is the expected relation between temperature (predictor) and peak electricity usage in Doha (target output)? What is the expected relation between age and diabetes in Qatar?
§ Multivariate ($d > 1$): What is the expected relation between (temperature, hour of the day, day of the week) and peak electricity usage in Doha? What is the expected relation between advertising on (TV, radio) and sales of a product?
UNSUPERVISED LEARNING
[Diagram: unlabeled training data (given), hypotheses space, hypothesis function, performance criteria (similarity measures)]
§ Unsupervised (associative) learning (unlabeled data)
§ A (training) data set is given.
§ Training data do not include target outputs / labels.
§ Aims to find hidden structure and association relationships in the data.
EXAMPLE OF UNSUPERVISED LEARNING TASK: CLUSTERING
§ Clustering, hidden target: based on some measure of similarity/dissimilarity, group data items into $K$ clusters, where $K$ is (usually) not known in advance.
§ Task mapping is $f: \mathcal{F}^d \to \{0,1\}^K$
§ Given a set of photos and similarity features (e.g., colors, geometric objects), how can the photos be clustered?
§ Group these people into different clusters.
EXAMPLE OF UNSUPERVISED LEARNING TASK: DISCOVER RELATIONS
§ Finding underlying structure, hidden target: discover relations and correlations among data.
§ Given a set of photos, find possible relations among sub-groups of them.
§ Given shopping data of customers at gas stations spread across the country, discover relations that could suggest what to stock in the shops and where to place the items on display.
REINFORCEMENT LEARNING
[Diagram: unlabeled training data (not given), hypotheses space, hypothesis function, performance criteria (rewards)]
RL is hard(er)!
§ Reinforcement (direct interaction) learning (reward model)
§ Explore to gather the training data on your own.
§ Gathered data do not include target outputs, but are associated with (possibly sparse) rewards/costs (advisory signals).
§ Aims to learn an optimal action policy or make optimal predictions.
§ Sequential decision-making vs. one-shot decision-making
SUPERVISED LEARNING: CLASSIFICATION
Task: learn class $C$ ≡ Family car
Data set: cars
Labeling:
+ Family car (positive example)
− Not a family car (negative example)
Features? Color, engine power, shape, traction, consumption, price, …
CLASSIFICATION EXAMPLE
§ A car is represented as a numeric vector of two features: $x = (x_1, x_2)^T \in \mathbb{R}^2$, where $x_1$ is the price and $x_2$ is the engine power (cc).
§ The label of a car denotes its type: $r = 1$ if $x$ is a positive example (family car), $r = 0$ if $x$ is a negative example.
§ In the data set $\mathcal{D}$, each car example $i$ is represented by an ordered pair $(x^{(i)}, r^{(i)})$, and there are $m$ examples in the data set: $\mathcal{D} = \{(x^{(i)}, r^{(i)})\}_{i=1}^{m}$
CLASSIFICATION EXAMPLE: HYPOTHESIS
§ Plot of the dataset in the two-dimensional feature space.
What is the relationship between (price, power) and the class $C$?
Hypothesis about the form of the mapping we are looking for: to which class of functions does it belong?
We have to make a choice: the inductive bias. We will explain the data according to the hypothesis class $\mathcal{H}$ that we choose (which sets a bias).
CLASSIFICATION EXAMPLE: HYPOTHESIS CLASS
Hypothesis class from which we believe $C$ is drawn: $\mathcal{H}$ = set of axis-aligned rectangles
$(p_1 \le \text{price} \le p_2) \wedge (e_1 \le \text{engine power} \le e_2)$
Target of learning: find a particular hypothesis $h \in \mathcal{H}$ that approximates the (true) class $C$ as closely as possible.
Problem: we don't know $C$; we only have a set of examples drawn from $C$.
Learning a vector $\theta$ of four parameters: $h = h_\theta(x)$
How do we evaluate how good $h = h_\theta(x)$ is?
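The rectangle hypothesis can be sketched in a few lines of Python. This is only an illustration: the `Rectangle` type, the function name `h`, and all numeric values are assumptions made here, not part of the lecture.

```python
from typing import NamedTuple, Tuple

class Rectangle(NamedTuple):
    """Parameter vector theta = (p1, p2, e1, e2) of an axis-aligned rectangle."""
    p1: float  # lower bound on price
    p2: float  # upper bound on price
    e1: float  # lower bound on engine power
    e2: float  # upper bound on engine power

def h(theta: Rectangle, x: Tuple[float, float]) -> int:
    """Return 1 (family car) if x = (price, engine power) lies inside the rectangle, else 0."""
    price, power = x
    inside = theta.p1 <= price <= theta.p2 and theta.e1 <= power <= theta.e2
    return 1 if inside else 0

# Hypothetical parameter values and cars, chosen only for illustration.
theta = Rectangle(p1=15_000, p2=30_000, e1=1_200, e2=2_000)
print(h(theta, (22_000, 1_600)))  # car inside the rectangle
print(h(theta, (60_000, 3_500)))  # car outside the rectangle
```

Learning then amounts to choosing the four numbers in `theta`; the classifier itself is fixed by the choice of hypothesis class.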
EMPIRICAL AND GENERALIZATION ERRORS
§ Loss function: quantifies the prediction/classification error made by our hypothesis function $h_\theta$ on a training or test example: $\ell: \mathcal{F}^d \times \mathbb{R}^k \to \mathbb{R}$
§ In the example: $\ell(h_\theta(x), y) = \mathbf{1}(h_\theta(x) \neq y)$, with $\ell: \mathbb{R} \times \mathbb{R} \times \{0,1\} \to \{0,1\}$
§ The empirical error on the training dataset of $m$ labeled pairs is: $E_{emp} = \sum_{i=1}^{m} \ell(h_\theta(x^{(i)}), y^{(i)})$
§ Do we aim to minimize the empirical error? To a certain extent, yes, but what we really aim to minimize is the generalization error: the loss on new examples that are not in the training set.
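A minimal sketch of the 0-1 loss and the empirical error for the rectangle classifier follows. The data points and the parameter values are made up for illustration, and the helper names (`h`, `zero_one_loss`, `empirical_error`) are assumptions rather than names from the lecture.

```python
def h(theta, x):
    """Rectangle hypothesis: 1 if x = (price, power) is inside the rectangle, else 0."""
    p1, p2, e1, e2 = theta
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

def zero_one_loss(prediction, y):
    """Indicator loss: 1 when the prediction differs from the label, else 0."""
    return int(prediction != y)

def empirical_error(theta, data):
    """Sum of losses over the m labeled pairs in the training set."""
    return sum(zero_one_loss(h(theta, x), y) for x, y in data)

# Hypothetical training set of (x, y) pairs: x = (price in thousands, engine size in liters).
data = [((20, 1.5), 1), ((25, 1.8), 1), ((50, 3.0), 0), ((10, 1.0), 0), ((28, 1.9), 0)]
theta = (15, 30, 1.2, 2.0)  # assumed rectangle (p1, p2, e1, e2)

errors = empirical_error(theta, data)
print(errors)
```

With these values only the last car, a negative example that falls inside the rectangle, is misclassified, so the empirical error is 1.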
EMPIRICAL AND GENERALIZATION ERRORS
§ Fundamental problem: we are looking for the parameter values $\theta$ that minimize the prediction error resulting from the hypothesis function $h_\theta$.
§ This "seems" to be equivalent to finding: $\theta^* = \arg\min_\theta \sum_{i=1}^{m} \ell(h_\theta(x^{(i)}), y^{(i)})$
§ … but actually, what we really care about is the loss of prediction on new examples $(x', y')$ → generalization error.
§ Expected loss over all $N$ input-output pairs the learning machine may see.
§ To quantify this expectation, we need to define a prior probability distribution $P(X, Y)$ over the examples, which we assume to be stationary ($P$ doesn't change).
§ The expected generalization loss is: $E_{gen} = \sum_{i=1}^{N} \ell(h_\theta(x^{(i)}), y^{(i)}) \, P(x^{(i)}, y^{(i)})$
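When the distribution over examples is known, the expected generalization loss is just a probability-weighted sum. The sketch below assumes a toy population of four pairs with made-up probabilities and a hypothetical one-dimensional threshold classifier; none of these come from the lecture.

```python
import math

def h(x):
    """Hypothetical 1-D threshold classifier, used only for illustration."""
    return int(x >= 2)

def zero_one_loss(prediction, y):
    return int(prediction != y)

# Toy population of N = 4 input-output pairs (x, y) with assumed probabilities P(x, y).
population = [
    ((1, 0), 0.4),
    ((3, 1), 0.3),
    ((2, 0), 0.2),
    ((0, 1), 0.1),
]
assert math.isclose(sum(p for _, p in population), 1.0)  # P must be a distribution

def generalization_error(h, population):
    """Expected loss: each pair's loss weighted by its probability."""
    return sum(zero_one_loss(h(x), y) * p for (x, y), p in population)

e_gen = generalization_error(h, population)
print(e_gen)
```

The two misclassified pairs carry probabilities 0.2 and 0.1, so the expected loss is about 0.3. In practice the distribution is unknown, which is why this quantity can only be estimated.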
HOW TO ASSESS GENERALIZATION ERROR?
§ But $P(X, Y)$ is not known! Therefore it is only possible to estimate the generalization error, which is the true error of the chosen hypothesis over the considered population of data examples.
§ How can we make a sound estimate?
§ Two general ways:
§ Theoretical: derive statistical bounds on the difference between the true error and the expected empirical error (PAC learning, VC dimension).
§ Empirical (practical): compute the expected empirical error on the training dataset as a local indicator of performance, then test the learned model on a separate data set and use the expected empirical error on that test set to estimate the generalization error.
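The empirical route can be sketched with a held-out test set. Everything here is an assumption made for illustration: the one-dimensional threshold classifier, the fixed train/test lists, and the helper names.

```python
def predict(t, x):
    """Hypothetical threshold hypothesis: predict 1 when x >= t."""
    return int(x >= t)

def mean_loss(t, data):
    """Average 0-1 loss of threshold t on a list of (x, y) pairs."""
    return sum(int(predict(t, x) != y) for x, y in data) / len(data)

# Made-up labeled examples, already split into a training set and a separate test set.
train = [(1.0, 0), (1.4, 0), (1.8, 1), (2.2, 1), (2.6, 1), (1.6, 0)]
test = [(1.2, 0), (1.7, 1), (2.4, 1), (1.5, 0)]

# Learn on the training set only: pick the candidate threshold with the lowest training error.
candidates = sorted(x for x, _ in train)
best_t = min(candidates, key=lambda t: mean_loss(t, train))

train_error = mean_loss(best_t, train)  # local indicator of performance
test_error = mean_loss(best_t, test)    # estimate of the generalization error
print(best_t, train_error, test_error)
```

In this toy case the learned threshold fits the training set perfectly but misclassifies one of the four test examples, which shows why the training error alone tends to be an optimistic estimate.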
EMPIRICAL RISK MINIMIZATION
§ In any case, we need to estimate the generalization loss with the expected empirical loss on a set of $m \ll N$ examples, which can be estimated as the sample average of losses: $\hat{E}_{emp} = \frac{1}{m} \sum_{i=1}^{m} \ell(h_\theta(x^{(i)}), y^{(i)})$
§ $\hat{E}_{emp}$ is an approximation of the risk associated with the use of the hypothesis $h_\theta$ for the learning task (i.e., the risk of incurring prediction losses when classifying samples that are not in the training set).
§ The empirical risk minimization principle states that the learning algorithm should choose the hypothesis $h_\theta$ that minimizes the empirical risk: $\theta^* = \arg\min_\theta \frac{1}{m} \sum_{i=1}^{m} \ell(h_\theta(x^{(i)}), y^{(i)})$
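As a rough sketch of empirical risk minimization, the code below searches a small, hand-picked grid of rectangles for the one with the lowest average 0-1 loss. The grid values, the data, and the exhaustive search itself are assumptions for illustration; the lecture does not prescribe a particular search procedure.

```python
from itertools import combinations, product

def h(theta, x):
    """Rectangle hypothesis: 1 if x = (price, power) is inside the rectangle, else 0."""
    p1, p2, e1, e2 = theta
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

def empirical_risk(theta, data):
    """Sample average of the 0-1 loss over the training set."""
    return sum(int(h(theta, x) != y) for x, y in data) / len(data)

# Hypothetical training set: x = (price in thousands, engine size in liters), y = label.
data = [((20, 1.5), 1), ((25, 1.8), 1), ((22, 1.6), 1),
        ((50, 3.0), 0), ((10, 1.0), 0), ((40, 1.7), 0)]

# Assumed finite candidate set: every rectangle whose bounds come from these grids.
price_bounds = [10, 15, 30, 45]
power_bounds = [1.0, 1.2, 2.0, 3.0]
candidates = [(p1, p2, e1, e2)
              for (p1, p2), (e1, e2) in product(combinations(price_bounds, 2),
                                                combinations(power_bounds, 2))]

# ERM: pick a candidate that minimizes the empirical risk.
best = min(candidates, key=lambda theta: empirical_risk(theta, data))
print(best, empirical_risk(best, data))
```

Several candidates in this grid reach zero empirical risk, so the minimizer is not unique; `min` simply returns the first one it encounters.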
HOW TO CHOOSE THE HYPOTHESIS?
[Figure labels: the true class $C$; for a choice of $h$; consistent hypotheses; $S$: most specific; $G$: most general]
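One way to make the most specific hypothesis concrete is to take it as the tightest axis-aligned rectangle that encloses all positive examples. That reading is an assumption here, and so are the data and the helper names in the sketch.

```python
def h(theta, x):
    """Rectangle hypothesis: 1 if x = (price, power) is inside the rectangle, else 0."""
    p1, p2, e1, e2 = theta
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

def most_specific(data):
    """Tightest rectangle (p1, p2, e1, e2) around the positive examples."""
    positives = [x for x, y in data if y == 1]
    prices = [x[0] for x in positives]
    powers = [x[1] for x in positives]
    return (min(prices), max(prices), min(powers), max(powers))

def is_consistent(theta, data):
    """A hypothesis is consistent when it labels every training example correctly."""
    return all(h(theta, x) == y for x, y in data)

# Hypothetical training set: x = (price in thousands, engine size in liters), y = label.
data = [((20, 1.5), 1), ((25, 1.8), 1), ((22, 1.6), 1),
        ((50, 3.0), 0), ((10, 1.0), 0), ((40, 1.7), 0)]

S = most_specific(data)
print(S, is_consistent(S, data))
```

Any rectangle that contains `S` and still excludes every negative example is also consistent with this data, so the training set alone does not single out one hypothesis.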
THE CANONICAL SUPERVISED ML PROBLEM
§ Given a collection of input features and outputs $(x^{(i)}, y^{(i)})$, $i = 1, \dots, m$, and a hypothesis function $h_\theta$, find parameter values $\theta$ that minimize the average empirical error: $\underset{\theta}{\text{minimize}} \; \frac{1}{m} \sum_{i=1}^{m} \ell(h_\theta(x^{(i)}), y^{(i)})$
§ Since $\frac{1}{m}$ is a constant that depends only on the size of the dataset, it may be omitted, making the problem equivalent to minimizing the sum of prediction losses: $\underset{\theta}{\text{minimize}} \; \sum_{i=1}^{m} \ell(h_\theta(x^{(i)}), y^{(i)})$
In some cases (e.g., when using quadratic losses), it can be convenient to use $\frac{1}{2m}$.
§ Virtually all supervised learning algorithms can be described in this form, where we need to specify:
1. The hypothesis class $\mathcal{H}$, $h_\theta \in \mathcal{H}$
2. The loss function $\ell$
3. The algorithm for solving the optimization problem (often approximately)
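The claim that a positive constant factor does not change the minimizer can be checked numerically. The sketch below assumes a one-parameter linear hypothesis, a quadratic loss, made-up data, and a coarse grid search as the optimization algorithm; all of these are illustrative choices, not the lecture's.

```python
def h(theta, x):
    """Hypothetical one-parameter linear hypothesis h_theta(x) = theta * x."""
    return theta * x

def quadratic_loss(prediction, y):
    return (prediction - y) ** 2

def total_loss(theta, data):
    """Sum of prediction losses over the dataset."""
    return sum(quadratic_loss(h(theta, x), y) for x, y in data)

# Made-up (x, y) pairs lying roughly on the line y = 2x.
data = [(1, 2.0), (2, 4.1), (3, 5.9)]
m = len(data)

# Assumed optimization algorithm: exhaustive search over a grid of candidate parameters.
grid = [i / 10 for i in range(0, 41)]

arg_sum = min(grid, key=lambda t: total_loss(t, data))                   # sum of losses
arg_mean = min(grid, key=lambda t: total_loss(t, data) / m)              # scaled by 1/m
arg_half_mean = min(grid, key=lambda t: total_loss(t, data) / (2 * m))   # scaled by 1/(2m)
print(arg_sum, arg_mean, arg_half_mean)
```

All three objectives select the same grid point; the scaling changes the value of the objective but not where its minimum lies. The three ingredients named in the list above appear here as `h` (hypothesis class), `quadratic_loss` (loss function), and the grid search (optimization algorithm).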
BIG PICTURE (PATTERN RECOGNITION)