
CSCE 970 Lecture 3: Regularization
Stephen Scott and Vinod Variyam
sscott@cse.unl.edu


Introduction

Machine learning can generally be distilled to an optimization problem: choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function. Clearly we want part of this objective to measure performance on the training set, but this alone is insufficient.

Outline

- Types of machine learning problems
- Loss functions
- Generalization performance vs. training set performance
- Overfitting
- Regularization
- Estimating generalization performance
- Comparing learning algorithms
- Other performance measures

Machine Learning Problems

- Supervised learning: the algorithm is given labeled training data and is asked to infer a function (hypothesis) from a family of functions (e.g., the set of all ANNs) that is able to predict well on new, unseen examples
  - Classification: labels come from a finite, discrete set
  - Regression: labels are real-valued
- Unsupervised learning: the algorithm is given data without labels and is asked to model its structure
  - Clustering, density estimation
- Reinforcement learning: the algorithm controls an agent that interacts with its environment and learns good actions in various situations

Measuring Performance: Loss

In any learning problem, we need to be able to quantify the performance of an algorithm. In supervised learning, we often use a loss function (or error function) J for this task. Given an instance x with true label y, if the learner's prediction on x is ŷ, then J(y, ŷ) is the loss on that instance.

Measuring Performance: Examples of Loss Functions

- 0-1 loss: J(y, ŷ) = 1 if y ≠ ŷ, 0 otherwise
- Square loss: J(y, ŷ) = (y − ŷ)²
- Cross-entropy: J(y, ŷ) = −y ln ŷ − (1 − y) ln(1 − ŷ)
  (y and ŷ are considered probabilities of a '1' label; generalizes to multi-class)
- Hinge loss: J(y, ŷ) = max(0, 1 − y ŷ)
  (used sometimes for large-margin classifiers like SVMs)

All of these are non-negative.
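As a minimal sketch, the four losses above can be computed directly; the function names below are illustrative, not from the slides:

```python
import math

def zero_one_loss(y, y_hat):
    # 0-1 loss: 1 if the prediction disagrees with the label, else 0
    return 1 if y != y_hat else 0

def square_loss(y, y_hat):
    # Square loss: (y - y_hat)^2
    return (y - y_hat) ** 2

def cross_entropy_loss(y, y_hat):
    # Cross-entropy: y and y_hat are treated as probabilities of a '1' label
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def hinge_loss(y, y_hat):
    # Hinge loss: here y is assumed to be in {-1, +1} and y_hat a real score
    return max(0.0, 1 - y * y_hat)

print(zero_one_loss(1, 0))         # -> 1
print(square_loss(1.0, 0.8))       # approx. 0.04
print(cross_entropy_loss(1, 0.9))  # -ln(0.9), approx. 0.105
print(hinge_loss(-1, 0.5))         # -> 1.5
```

Note that all four return values are non-negative for any inputs in their assumed domains, matching the observation above.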

Measuring Performance: Training Loss

Given a loss function J and a training set X, the total loss of the classifier h on X is

    error_X(h) = Σ_{x ∈ X} J(y_x, ŷ_x),

where y_x is x's label and ŷ_x is h's prediction.

Measuring Performance: Expected Loss

More importantly, the learner needs to generalize well: given a new example drawn i.i.d. according to an unknown probability distribution D, we want to minimize h's expected loss:

    error_D(h) = E_{x ∼ D}[J(y_x, ŷ_x)]

Is minimizing training loss the same as minimizing expected loss?

Measuring Performance: Expected vs. Training Loss

[Figure: two hypotheses h_1 and h_2 fit to training points in the (x_1, x_2) plane.]

Measuring Performance: Overfitting

Sufficiently sophisticated learners (decision trees, multi-layer ANNs) can often achieve arbitrarily small (or zero) loss on a training set. A hypothesis (e.g., an ANN with specific parameters) h overfits the training data X if there is an alternative hypothesis h′ such that

    error_X(h) < error_X(h′)  and  error_D(h) > error_D(h′)
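The distinction between error_X(h) and error_D(h) can be sketched numerically: the expected loss has to be approximated by averaging over a large fresh sample, since D is unknown in practice. The hypothesis h and the data-generating process below are hypothetical choices for illustration, not from the slides:

```python
import random

random.seed(0)

def h(x):
    # hypothetical fixed hypothesis: a simple linear predictor
    return 0.5 * x

def true_label(x):
    # assumed data-generating process: y = x plus Gaussian noise
    return x + random.gauss(0, 0.1)

def square_loss(y, y_hat):
    return (y - y_hat) ** 2

# Total training loss error_X(h): a sum over the training set X
train = [(x, true_label(x)) for x in [random.uniform(0, 1) for _ in range(20)]]
error_X = sum(square_loss(y, h(x)) for x, y in train)

# error_D(h) is approximated by the average loss on a large fresh sample
fresh = [(x, true_label(x)) for x in [random.uniform(0, 1) for _ in range(10000)]]
error_D_estimate = sum(square_loss(y, h(x)) for x, y in fresh) / len(fresh)

print(error_X, error_D_estimate)
```

The training loss is a sum over a particular finite sample, while the expected loss is a property of h under the whole distribution D; the two need not be minimized by the same hypothesis.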
Measuring Performance

[Figure: price (y) vs. mileage (x) for a set of data points. To generalize well, we need to balance training accuracy with simplicity.]

Regularization: Causes of Overfitting

Generally, if the set of functions H the learner has to choose from is complex relative to what is required for correctly predicting the labels of X, there is a larger chance of overfitting due to the large number of "wrong" choices in H.

- Overfitting could be due to an overly sophisticated set of functions
  - E.g., we can fit any set of n real-valued points with an (n − 1)-degree polynomial, but perhaps only degree 2 is needed
  - E.g., using an ANN with 5 hidden layers to solve the logical AND problem
- It could also be due to training an ANN too long
  - Over-training an ANN often leads to weights deviating far from zero
  - This makes the function more non-linear, and more complex
- Often, a larger data set mitigates the problem

Regularization techniques covered in this lecture: early stopping, parameter norm penalties, data augmentation, multitask learning, dropout, and others.
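The polynomial example above can be sketched concretely. Below, 7 noisy samples of an underlying quadratic are fit exactly (zero training loss) by a degree-6 interpolating polynomial h, while the simpler degree-2 hypothesis h2 has nonzero training loss but lower estimated expected loss. The data-generating process and hypothesis names are illustrative assumptions:

```python
import random

random.seed(1)
NOISE = 0.3

def label(x):
    # assumed data-generating process: y = x^2 plus Gaussian noise
    return x * x + random.gauss(0, NOISE)

# n = 7 equally spaced training points on [0, 2] -> degree-6 interpolant
train = [(i / 3.0, label(i / 3.0)) for i in range(7)]

def h(x):
    # Lagrange interpolation: passes through every training point exactly,
    # so its training loss is zero
    total = 0.0
    for i, (xi, yi) in enumerate(train):
        term = yi
        for j, (xj, _) in enumerate(train):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def h2(x):
    # the "right-sized" degree-2 hypothesis
    return x * x

def total_loss(points, f):
    # total square loss of f on a set of (x, y) pairs
    return sum((y - f(x)) ** 2 for x, y in points)

# On the training set, h achieves exactly zero loss; h2 pays for the noise
print(total_loss(train, h), total_loss(train, h2))

# On a large fresh sample (approximating error_D), h2 does better:
# h fits the noise in the 7 training labels, so h overfits
fresh = [(x, label(x)) for x in [random.uniform(0, 2) for _ in range(5000)]]
print(total_loss(fresh, h) / len(fresh), total_loss(fresh, h2) / len(fresh))
```

This is exactly the overfitting pattern defined earlier: error_X(h) < error_X(h2) while error_D(h) > error_D(h2).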
