Learning as Loss Minimization
Machine Learning
Learning as loss minimization

• The setup
  – Examples x are drawn from a fixed, unknown distribution D
  – A hidden oracle classifier f labels the examples
  – We wish to find a hypothesis h that mimics f

• The ideal situation
  – Define a function L that penalizes bad hypotheses
  – Learning: pick a function h ∈ H to minimize the expected loss
  – But the distribution D is unknown

• Instead, minimize the empirical loss on the training set
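Written out, the contrast on this slide is between two objectives (the notation here is mine, not from the slides): the ideal learner minimizes the expected loss under D, while in practice we minimize the average loss over a training set S = {(x_1, y_1), ..., (x_m, y_m)}:

$\min_{h \in H} \ \mathbb{E}_{x \sim D}\big[\, L(h(x), f(x)) \,\big] \qquad \text{vs.} \qquad \min_{h \in H} \ \frac{1}{m} \sum_{i=1}^{m} L\big(h(x_i), y_i\big)$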
Empirical loss minimization

Learning = minimize the empirical loss on the training set

Is there a problem here? Overfitting!

• We need something that biases the learner towards simpler hypotheses
• Achieved using a regularizer, which penalizes complex hypotheses
Regularized loss minimization

• Learning: minimize the empirical loss plus a regularization term (the generic form is sketched below)
• With linear classifiers: the hypothesis is a weight vector w, and the loss depends on w through w^T x
• What is a loss function?
  – Loss functions should penalize mistakes
  – We are minimizing the average loss over the training data
• What is the ideal loss function for classification?
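One common way to write the regularized objective (a sketch of the generic form; the specific symbols are my notation, not from the slides):

$\min_{w} \ \lambda\, R(w) \;+\; \frac{1}{m} \sum_{i=1}^{m} L\big(y_i,\ w^T x_i\big)$

Here R(w) is the regularizer (for example, ||w||^2), λ is a hyper-parameter trading off the two terms, and for linear classifiers the loss on each example depends on w only through w^T x_i.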
The 0-1 loss

• Penalize classification mistakes between the true label y and the prediction y'
• For linear classifiers, the prediction is y' = sgn(w^T x)
  – Mistake if y w^T x ≤ 0
• Minimizing the 0-1 loss is intractable. We need surrogates.
The 0-1 loss

[Figure: the 0-1 loss plotted against y w^T x. The loss is 1 when y w^T x < 0 (misclassification) and 0 when y w^T x > 0 (no misclassification).]
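As a formula in terms of the margin y w^T x (my notation for the same picture):

$L_{0\text{-}1}\big(y,\, w^T x\big) = \begin{cases} 1 & \text{if } y\, w^T x \le 0 \\ 0 & \text{otherwise} \end{cases}$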
Compare to the hinge loss

[Figure: the hinge loss plotted against y w^T x. The penalty grows as w^T x moves farther from the separator on the wrong side (y w^T x < 0, misclassification), and predictions are penalized even when correct (y w^T x > 0) if they are too close to the margin.]
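The hinge loss sketched above is usually written as follows (my notation; the margin threshold of 1 is the standard convention):

$L_{\text{hinge}}\big(y,\, w^T x\big) = \max\big(0,\ 1 - y\, w^T x\big)$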
Support Vector Machines

• SVM = a linear classifier combined with regularization
• Ideally, we would like to minimize the 0-1 loss
  – But we can't, for computational reasons
• SVM minimizes the hinge loss instead
  – Variants exist
SVM objective function

• Regularization term: maximize the margin
  – Imposes a preference over the hypothesis space and pushes for better generalization
  – Can be replaced with other regularization terms which impose other preferences

• Empirical loss: hinge loss
  – Penalizes weight vectors that make mistakes
  – Can be replaced with other loss functions which impose other preferences

• A hyper-parameter controls the tradeoff between a large margin and a small hinge loss
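Putting the two terms together, the standard soft-margin SVM objective has the form below (the exact placement of the trade-off constant C is a convention, not something taken from these slides):

$\min_{w} \ \frac{1}{2} \|w\|^2 \;+\; C \sum_{i=1}^{m} \max\big(0,\ 1 - y_i\, w^T x_i\big)$

A large C puts more weight on a small hinge loss; a small C puts more weight on a large margin.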
The loss function zoo

Many loss functions exist:
– Perceptron loss
– Hinge loss (SVM)
– Exponential loss (AdaBoost)
– Logistic loss (logistic regression)
[Figure: the losses in the zoo plotted as functions of y w^T x: zero-one, perceptron, hinge (SVM), exponential (AdaBoost), and logistic regression.]
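A minimal sketch of these surrogate losses as functions of the margin m = y w^T x (illustrative code, not from the slides; it assumes NumPy is available):

import numpy as np

def zero_one(margin):
    # 1 for a misclassification (margin <= 0), 0 otherwise
    return (np.asarray(margin) <= 0).astype(float)

def perceptron_loss(margin):
    # penalizes only misclassified points, linearly in the margin
    return np.maximum(0.0, -np.asarray(margin))

def hinge(margin):
    # SVM: penalizes mistakes and correct predictions inside the margin
    return np.maximum(0.0, 1.0 - np.asarray(margin))

def exponential(margin):
    # AdaBoost
    return np.exp(-np.asarray(margin))

def logistic(margin):
    # logistic regression: log loss written in terms of the margin
    return np.log1p(np.exp(-np.asarray(margin)))

margins = np.linspace(-2.0, 2.0, 5)
print(hinge(margins))   # [3. 2. 1. 0. 0.]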
Learning via Loss Minimization: Summary

• Learning via loss minimization
  – Write down a loss function
  – Minimize the empirical loss
• Regularize to avoid overfitting
  – Neural networks use other strategies as well, such as dropout
• Widely applicable, with different loss functions and regularizers
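To make the recipe concrete, here is a minimal sketch (not from the slides) of loss minimization with a linear classifier: sub-gradient descent on the L2-regularized hinge loss, i.e. a bare-bones linear SVM. The function names and constants are illustrative choices.

import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on  lam/2 * ||w||^2 + mean(hinge(y * Xw))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)
        active = margins < 1                              # misclassified or inside the margin
        grad = lam * w - (X[active].T @ y[active]) / n    # sub-gradient of the objective
        w -= lr * grad
    return w

def predict(w, X):
    return np.sign(X @ w)

# toy usage: two separable clusters labeled +1 / -1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 0.5, (20, 2)), rng.normal(-1.0, 0.5, (20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])
w = train_linear_svm(X, y)
print((predict(w, X) == y).mean())   # training accuracy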