Squared error loss function for classification
- Squared error loss is not suitable for classification:
  - Least squares loss penalizes "too correct" predictions (points that lie a long way on the correct side of the decision boundary)
  - Least squares loss also lacks robustness to noise
- For $K = 2$: $J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathbf{w}^T\mathbf{x}^{(i)} + w_0 - y^{(i)}\right)^2$
Notation
- $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
- $\mathbf{x} = [1, x_1, \ldots, x_d]^T$
- $w_0 + w_1 x_1 + \cdots + w_d x_d = \mathbf{w}^T\mathbf{x}$
- We denote the input by $\mathbf{x}$ or $\boldsymbol{\phi}(\mathbf{x})$
SSE cost function for classification ($K = 2$)
- Is it more suitable if we set $f(\mathbf{x}; \mathbf{w}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$?
  $J(\mathbf{w}) = \sum_{i=1}^{N} \left(\mathrm{sign}(\mathbf{w}^T\mathbf{x}^{(i)}) - y^{(i)}\right)^2$, where $\mathrm{sign}(z) = \begin{cases} -1, & z < 0 \\ 1, & z \geq 0 \end{cases}$
- $J(\mathbf{w})$ is a piecewise constant function that counts the misclassifications, i.e., the training error incurred in classifying the training samples
Perceptron algorithm
- Linear classifier
- Two-class: $y \in \{-1, 1\}$
  - $y = -1$ for $C_2$, $y = 1$ for $C_1$
- Goal: $\forall i,\ \mathbf{x}^{(i)} \in C_1 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} > 0$ and $\forall i,\ \mathbf{x}^{(i)} \in C_2 \Rightarrow \mathbf{w}^T\mathbf{x}^{(i)} < 0$
- $f(\mathbf{x}; \mathbf{w}) = \mathrm{sign}(\mathbf{w}^T\mathbf{x})$
Perceptron criterion
- $J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{w}^T\mathbf{x}^{(i)} y^{(i)}$
  - $\mathcal{M}$: subset of training data that are misclassified
- Many solutions? Which solution among them?
Cost functions [Duda, Hart, and Stork, 2002]
- Figure: the number of misclassifications $J(\mathbf{w})$ and the perceptron criterion $J_P(\mathbf{w})$ plotted over the weight space $(w_0, w_1)$
- There may be many solutions in these cost functions
Batch Perceptron
- "Gradient descent" to solve the optimization problem:
  $\mathbf{w}^{t+1} = \mathbf{w}^t - \eta\, \nabla_{\mathbf{w}} J_P(\mathbf{w}^t)$
  $\nabla_{\mathbf{w}} J_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
- Batch Perceptron converges in a finite number of steps for linearly separable data:
  Initialize $\mathbf{w}$
  Repeat
    $\mathbf{w} = \mathbf{w} + \eta \sum_{i \in \mathcal{M}} \mathbf{x}^{(i)} y^{(i)}$
  Until convergence
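A minimal NumPy sketch of the batch update above (not from the slides): it assumes each row of X carries a leading 1 for the bias term and that the labels y are in {-1, +1}; the learning rate and iteration cap are illustrative.

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, max_iters=1000):
    """Batch perceptron. X is N x (d+1) with a leading 1 per row (bias),
    y holds labels in {-1, +1}. Returns a weight vector w with w[0] = w_0."""
    w = np.zeros(X.shape[1])                      # initialize w
    for _ in range(max_iters):
        scores = X @ w
        misclassified = y * scores <= 0           # the set M of misclassified samples
        if not misclassified.any():               # convergence: M is empty
            break
        # w <- w + eta * sum_{i in M} x^(i) y^(i)
        w += eta * (X[misclassified] * y[misclassified, None]).sum(axis=0)
    return w
```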
Stochastic gradient descent for Perceptron
- Single-sample perceptron:
  - If $\mathbf{x}^{(i)}$ is misclassified: $\mathbf{w}^{t+1} = \mathbf{w}^t + \eta\, \mathbf{x}^{(i)} y^{(i)}$
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps
- Fixed-increment single-sample Perceptron ($\eta$ can be set to 1 and the proof still works):
  Initialize $\mathbf{w}$, $t \leftarrow 0$
  Repeat
    $t \leftarrow t + 1$
    $i \leftarrow t \bmod N$
    if $\mathbf{x}^{(i)}$ is misclassified then $\mathbf{w} = \mathbf{w} + \mathbf{x}^{(i)} y^{(i)}$
  Until all patterns are properly classified
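A corresponding sketch of the fixed-increment single-sample rule, again assuming a leading 1 in each row of X and labels in {-1, +1}; $\eta$ is fixed at 1 as on the slide, and the pass cap is illustrative.

```python
import numpy as np

def single_sample_perceptron(X, y, max_passes=100):
    """Fixed-increment single-sample perceptron (eta = 1).
    Cycles through the samples, updating on each mistake, until a full pass is clean."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        errors = 0
        for i in range(N):
            if y[i] * (w @ X[i]) <= 0:    # x^(i) is misclassified
                w += X[i] * y[i]          # w <- w + x^(i) y^(i)
                errors += 1
        if errors == 0:                   # all patterns properly classified
            return w
    return w                              # may not terminate cleanly if the data are not separable
```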
Weight Updates
Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., $\hat{y} = y^*$), no change!
  - If wrong: adjust the weight vector: $\mathbf{w}^{t+1} = \mathbf{w}^t + \eta\, \mathbf{x}^{(i)} y^{(i)}$
Example
Perceptron: Example [Bishop]
- Change $\mathbf{w}$ in a direction that corrects the error
Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., $\hat{y} = y^*$), no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if $y^*$ is $-1$.
Examples: Perceptron
- Separable case
Convergence of Perceptron [Duda, Hart & Stork, 2002]
- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge
Multiclass Decision Rule
- If we have multiple classes:
  - A weight vector for each class: $\mathbf{w}_y$
  - Score (activation) of a class $y$: $\mathbf{w}_y^T \mathbf{x}$
  - Prediction: the highest score wins, $\hat{y} = \arg\max_y \mathbf{w}_y^T \mathbf{x}$
- Binary = multiclass where the negative class has weight zero
Learning: Multiclass Perceptron
- Start with all weights = 0
- Pick up training examples one by one
- Predict with current weights: $\hat{y} = \arg\max_y \mathbf{w}_y^T \mathbf{x}$
- If correct, no change!
- If wrong: lower the score of the wrong answer, raise the score of the right answer:
  $\mathbf{w}_{\hat{y}} = \mathbf{w}_{\hat{y}} - \mathbf{x}$, $\quad \mathbf{w}_{y^*} = \mathbf{w}_{y^*} + \mathbf{x}$
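A small sketch of this update rule (not from the slides), assuming integer class labels 0, ..., K-1 and feature vectors that already include the bias feature; the number of passes is illustrative.

```python
import numpy as np

def multiclass_perceptron(X, y, K, passes=10):
    """Multiclass perceptron: one weight vector per class (rows of W).
    On a mistake, lower the score of the predicted class and raise
    the score of the true class."""
    N, d = X.shape
    W = np.zeros((K, d))
    for _ in range(passes):
        for i in range(N):
            pred = np.argmax(W @ X[i])     # highest score wins
            if pred != y[i]:
                W[pred] -= X[i]            # lower the score of the wrong answer
                W[y[i]] += X[i]            # raise the score of the right answer
    return W
```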
Example: Multiclass Perceptron
- Training sentences: "win the vote", "win the election", "win the game"
- Initial weight vectors (one per class), word-count features:

  Feature | $\mathbf{w}_1$ | $\mathbf{w}_2$ | $\mathbf{w}_3$
  BIAS    | 1 | 0 | 0
  win     | 0 | 0 | 0
  game    | 0 | 0 | 0
  vote    | 0 | 0 | 0
  the     | 0 | 0 | 0
  ...     | ... | ... | ...
Properties of Perceptrons
- Separability: holds if some parameter setting classifies the training set perfectly (separable vs. non-separable case)
- Convergence: if the training data are separable, the perceptron will eventually converge (binary case)
- Mistake bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
Examples: Perceptron
- Non-separable case
Discriminative approach: logistic regression ($K = 2$)
- $f(\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x})$
  - $\mathbf{x} = [1, x_1, \ldots, x_d]^T$, $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$
  - $\sigma(\cdot)$ is an activation function
- Sigmoid (logistic) activation function: $\sigma(z) = \dfrac{1}{1 + e^{-z}}$
Logistic regression: cost function
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} J(\mathbf{w})$
$J(\mathbf{w}) = \sum_{i=1}^{N} \left[ -y^{(i)} \log \sigma(\mathbf{w}^T\mathbf{x}^{(i)}) - (1 - y^{(i)}) \log\left(1 - \sigma(\mathbf{w}^T\mathbf{x}^{(i)})\right) \right]$
- $J(\mathbf{w})$ is convex w.r.t. the parameters.
Logistic regression: loss function
$\mathrm{Loss}\big(y, f(\mathbf{x}; \mathbf{w})\big) = -y \log f(\mathbf{x}; \mathbf{w}) - (1 - y) \log\big(1 - f(\mathbf{x}; \mathbf{w})\big)$
- Since $y = 1$ or $y = 0$:
  $\mathrm{Loss}\big(y, f(\mathbf{x}; \mathbf{w})\big) = \begin{cases} -\log f(\mathbf{x}; \mathbf{w}), & y = 1 \\ -\log\big(1 - f(\mathbf{x}; \mathbf{w})\big), & y = 0 \end{cases}$
  where $f(\mathbf{x}; \mathbf{w}) = \dfrac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})}$
- How is it related to the zero-one loss? $\mathrm{Loss}(y, \hat{y}) = \begin{cases} 1, & y \neq \hat{y} \\ 0, & y = \hat{y} \end{cases}$
Logistic regression: gradient descent
$\mathbf{w}^{t+1} = \mathbf{w}^t - \eta\, \nabla_{\mathbf{w}} J(\mathbf{w}^t)$
$\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{N} \big(f(\mathbf{x}^{(i)}; \mathbf{w}) - y^{(i)}\big)\, \mathbf{x}^{(i)}$
- Is it similar to the gradient of SSE for linear regression?
  $\nabla_{\mathbf{w}} J(\mathbf{w}) = \sum_{i=1}^{N} \big(\mathbf{w}^T\mathbf{x}^{(i)} - y^{(i)}\big)\, \mathbf{x}^{(i)}$
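A sketch that puts the cross-entropy cost of the previous slides together with the gradient update above; it assumes X carries a leading 1 per row and y holds labels in {0, 1}, and the step size and iteration count are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, X, y):
    """J(w) = sum_i [ -y log f(x;w) - (1-y) log(1 - f(x;w)) ]."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def logistic_regression_gd(X, y, eta=0.1, iters=1000):
    """Gradient descent on the (convex) cross-entropy cost.
    X is N x (d+1) with a leading 1 per row, y holds labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)               # f(x^(i); w) for all i
        grad = X.T @ (p - y)             # sum_i (sigma(w^T x) - y) x
        w -= eta * grad
    return w
```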
Multi-class logistic regression
- $f(\mathbf{x}; \mathbf{W}) = \big[f_1(\mathbf{x}, \mathbf{W}), \ldots, f_K(\mathbf{x}, \mathbf{W})\big]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- $f_k(\mathbf{x}; \mathbf{W}) = \dfrac{\exp(\mathbf{w}_k^T\mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^T\mathbf{x})}$
Logistic regression: multi-class
$\hat{\mathbf{W}} = \arg\min_{\mathbf{W}} J(\mathbf{W})$
$J(\mathbf{W}) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(i)} \log f_k(\mathbf{x}^{(i)}; \mathbf{W})$
- $\mathbf{y}$ is a vector of length $K$ (1-of-K coding), e.g., $\mathbf{y} = [0, 0, 1, 0]^T$ when the target class is $C_3$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$
Logistic regression: multi-class
$\mathbf{w}_k^{t+1} = \mathbf{w}_k^t - \eta\, \nabla_{\mathbf{w}_k} J(\mathbf{W}^t)$
$\nabla_{\mathbf{w}_k} J(\mathbf{W}) = \sum_{i=1}^{N} \big(f_k(\mathbf{x}^{(i)}; \mathbf{W}) - y_k^{(i)}\big)\, \mathbf{x}^{(i)}$
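A sketch of the softmax model and this per-class gradient (not from the slides), assuming Y is the N x K one-hot (1-of-K) target matrix and the bias is folded into X; the hyperparameters are illustrative.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def softmax_regression_gd(X, Y, eta=0.1, iters=1000):
    """X is N x d (bias column folded into X), Y is N x K one-hot (1-of-K coding).
    The columns of W are the per-class weight vectors w_k."""
    N, d = X.shape
    K = Y.shape[1]
    W = np.zeros((d, K))
    for _ in range(iters):
        F = softmax(X @ W)          # f_k(x^(i); W) for all i, k
        grad = X.T @ (F - Y)        # column k: sum_i (f_k - y_k) x^(i)
        W -= eta * grad
    return W
```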
Multi-class classifier
- $f(\mathbf{x}; \mathbf{W}) = \big[f_1(\mathbf{x}, \mathbf{W}), \ldots, f_K(\mathbf{x}, \mathbf{W})\big]^T$
- $\mathbf{W} = [\mathbf{w}_1 \cdots \mathbf{w}_K]$ contains one vector of parameters for each class
- In linear classifiers, $\mathbf{W}$ is $d \times K$, where $d$ is the number of features
- $\mathbf{W}^T\mathbf{x}$ gives us a vector
- $f(\mathbf{x}; \mathbf{W})$ contains $K$ numbers giving class scores for the input $\mathbf{x}$
Example
- Output obtained from $\mathbf{W}^T\mathbf{x} + \mathbf{b}$
- $\mathbf{x} = [x_1, \ldots, x_{784}]^T$: a $28 \times 28$ image stretched into a vector
- $\mathbf{W}^T = [\mathbf{w}_1 \cdots \mathbf{w}_{10}]^T$ is $10 \times 784$, and $\mathbf{b} = [b_1, \ldots, b_{10}]^T$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
Example
- How can we tell whether this $\mathbf{W}$ and $\mathbf{b}$ are good or bad?
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
The bias can also be included in the $\mathbf{W}$ matrix
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
Softmax classifier loss: example
$L^{(i)} = -\log \dfrac{\exp(s_{y^{(i)}})}{\sum_{k=1}^{K} \exp(s_k)}$
$L^{(1)} = -\log(0.13) = 0.89$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
Support Vector Machines
- Maximizing the margin: good according to intuition, theory, and practice
- Support vector machines (SVMs) find the separator with the maximum margin
Hard-margin SVM: Optimization problem
$\max_{\mathbf{w}, w_0} \dfrac{2}{\|\mathbf{w}\|}$
s.t. $\mathbf{w}^T\mathbf{x}^{(n)} + w_0 \geq 1 \quad \forall\, y^{(n)} = 1$
     $\mathbf{w}^T\mathbf{x}^{(n)} + w_0 \leq -1 \quad \forall\, y^{(n)} = -1$
- Decision boundary: $\mathbf{w}^T\mathbf{x} + w_0 = 0$; margin boundaries: $\mathbf{w}^T\mathbf{x} + w_0 = \pm 1$; margin width: $\dfrac{2}{\|\mathbf{w}\|}$
Distance between an $\mathbf{x}^{(n)}$ and the plane
$\text{distance} = \dfrac{\big|\mathbf{w}^T\mathbf{x}^{(n)} + w_0\big|}{\|\mathbf{w}\|}$
Hard-margin SVM: Optimization problem
- We can equivalently optimize:
  $\min_{\mathbf{w}, w_0} \dfrac{1}{2}\mathbf{w}^T\mathbf{w}$
  s.t. $y^{(n)}\big(\mathbf{w}^T\mathbf{x}^{(n)} + w_0\big) \geq 1, \quad n = 1, \ldots, N$
- It is a convex Quadratic Programming (QP) problem
  - There are computationally efficient packages to solve it.
  - It has a global minimum (if any).
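The slides only note that efficient QP packages exist; as one illustration (not the specialized solvers an SVM library would actually use), here is a sketch that hands the problem to SciPy's general-purpose SLSQP solver. It assumes small, linearly separable toy data with labels in {-1, +1}.

```python
import numpy as np
from scipy.optimize import minimize

def hard_margin_svm(X, y):
    """Solve  min 1/2 ||w||^2  s.t.  y_n (w^T x_n + w_0) >= 1
    with a general-purpose constrained solver (illustrative only)."""
    N, d = X.shape
    v0 = np.zeros(d + 1)                              # v = [w, w_0]
    objective = lambda v: 0.5 * v[:d] @ v[:d]
    constraints = [{'type': 'ineq',                   # each constraint must be >= 0
                    'fun': (lambda v, i=i: y[i] * (v[:d] @ X[i] + v[d]) - 1.0)}
                   for i in range(N)]
    res = minimize(objective, v0, method='SLSQP', constraints=constraints)
    return res.x[:d], res.x[d]                        # w, w_0
```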
Error measure
- Margin violation amount $\xi_n$ ($\xi_n \geq 0$):
  $y^{(n)}\big(\mathbf{w}^T\mathbf{x}^{(n)} + w_0\big) \geq 1 - \xi_n$
- Total violation: $\sum_{n=1}^{N} \xi_n$
Soft-margin SVM: Optimization problem
- SVM with slack variables: allows samples to fall within the margin, but penalizes them
  $\min_{\mathbf{w}, w_0, \{\xi_n\}} \dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n$
  s.t. $y^{(n)}\big(\mathbf{w}^T\mathbf{x}^{(n)} + w_0\big) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, \ldots, N$
- $\xi_n$: slack variables
  - $0 < \xi_n < 1$: $\mathbf{x}^{(n)}$ is correctly classified but lies inside the margin
  - $\xi_n > 1$: $\mathbf{x}^{(n)}$ is misclassified
Soft-margin SVM: Cost function
$\min_{\mathbf{w}, w_0, \{\xi_n\}} \dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \xi_n$
s.t. $y^{(n)}\big(\mathbf{w}^T\mathbf{x}^{(n)} + w_0\big) \geq 1 - \xi_n, \quad \xi_n \geq 0, \quad n = 1, \ldots, N$
- It is equivalent to the unconstrained optimization problem:
  $\min_{\mathbf{w}, w_0} \dfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n=1}^{N} \max\big(0,\, 1 - y^{(n)}(\mathbf{w}^T\mathbf{x}^{(n)} + w_0)\big)$
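Because the form above is unconstrained, it can be minimized directly with (sub)gradient descent; a minimal sketch, where C, the step size, and the choice not to regularize the bias are assumptions.

```python
import numpy as np

def soft_margin_svm_sgd(X, y, C=1.0, eta=0.01, iters=1000):
    """(Sub)gradient descent on  1/2 ||w||^2 + C * sum_n max(0, 1 - y_n (w^T x_n + w_0)).
    y holds labels in {-1, +1}; the bias w_0 is kept separate and not regularized."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + w0)
        viol = margins < 1                                   # samples with a margin violation
        grad_w = w - C * (X[viol] * y[viol, None]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= eta * grad_w
        w0 -= eta * grad_w0
    return w, w0
```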
Multi-class SVM
$J(\mathbf{W}) = \dfrac{1}{N} \sum_{i=1}^{N} L^{(i)} + \lambda R(\mathbf{W})$
- Hinge loss (with scores $s_k = f_k(\mathbf{x}^{(i)}; \mathbf{W}) = \mathbf{w}_k^T\mathbf{x}^{(i)}$):
  $L^{(i)} = \sum_{k \neq y^{(i)}} \max\big(0,\, 1 + s_k - s_{y^{(i)}}\big) = \sum_{k \neq y^{(i)}} \max\big(0,\, 1 + \mathbf{w}_k^T\mathbf{x}^{(i)} - \mathbf{w}_{y^{(i)}}^T\mathbf{x}^{(i)}\big)$
- L2 regularization: $R(\mathbf{W}) = \sum_{k=1}^{K} \sum_{j=1}^{d} w_{j,k}^2$
Multi-class SVM loss: Example
- 3 training examples, 3 classes. With some $\mathbf{W}$ the scores $f(\mathbf{x}; \mathbf{W}) = \mathbf{W}^T\mathbf{x}$ are computed, and
  $L^{(i)} = \sum_{k \neq y^{(i)}} \max\big(0,\, 1 + s_k - s_{y^{(i)}}\big)$ with $s_k = \mathbf{w}_k^T\mathbf{x}^{(i)}$:
  $L^{(1)} = \max(0,\, 1 + 5.1 - 3.2) + \max(0,\, 1 - 1.7 - 3.2) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$
  $L^{(2)} = \max(0,\, 1 + 1.3 - 4.9) + \max(0,\, 1 + 2.0 - 4.9) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$
  $L^{(3)} = \max(0,\, 2.2 - (-3.1) + 1) + \max(0,\, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$
- $\dfrac{1}{N}\sum_{i=1}^{N} L^{(i)} = \dfrac{1}{3}(2.9 + 0 + 12.9) \approx 5.27$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
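A short check of the arithmetic above; the score vectors and correct-class indices are read off the example computations (the ordering of the classes within each vector is an assumption).

```python
import numpy as np

def multiclass_hinge_loss(scores, correct):
    """L^(i) = sum_{k != y^(i)} max(0, 1 + s_k - s_{y^(i)})."""
    margins = np.maximum(0, 1 + scores - scores[correct])
    margins[correct] = 0                            # skip the correct class
    return margins.sum()

# Scores of the three training examples; second entry = index of the correct class
examples = [(np.array([3.2, 5.1, -1.7]), 0),        # L^(1) = 2.9
            (np.array([1.3, 4.9,  2.0]), 1),        # L^(2) = 0
            (np.array([2.2, 2.5, -3.1]), 2)]        # L^(3) = 12.9
losses = [multiclass_hinge_loss(s, c) for s, c in examples]
print(losses, sum(losses) / len(losses))            # [2.9, 0.0, 12.9]  ->  ~5.27
```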
Recap
- We need $\nabla_{\mathbf{W}} L$ to update the weights
- L2 regularization: $R(\mathbf{W}) = \sum_{k=1}^{K} \sum_{j=1}^{d} w_{j,k}^2$
- L1 regularization: $R(\mathbf{W}) = \sum_{k=1}^{K} \sum_{j=1}^{d} |w_{j,k}|$
This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
Generalized linear
- Linear combination of fixed non-linear functions of the input vector:
  $f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \ldots + w_m \phi_m(\mathbf{x})$
- $\{\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_i: \mathbb{R}^d \to \mathbb{R}$
Basis functions: examples
- Linear
- Polynomial (univariate)
Polynomial regression: example
- Figure: fits with $m = 1$, $m = 3$, $m = 5$, and $m = 7$
Generalized linear classifier
- Assume a transformation $\boldsymbol{\phi}: \mathbb{R}^d \to \mathbb{R}^m$ on the feature space:
  $\mathbf{x} \to \boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})]^T$
  - $\{\phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})\}$: set of basis functions (or features), $\phi_i: \mathbb{R}^d \to \mathbb{R}$
- Find a hyper-plane in the transformed feature space: $\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}) + w_0 = 0$
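A minimal sketch of a univariate polynomial basis $\phi(x) = [x, x^2, \ldots, x^m]$, so that a model linear in the transformed features is a degree-$m$ polynomial in $x$; the least-squares fit and the noisy toy data below are illustrative.

```python
import numpy as np

def poly_features(x, m):
    """Map a 1-D input array x to the basis [phi_1(x), ..., phi_m(x)] = [x, x^2, ..., x^m].
    A linear model w_0 + w^T phi(x) is then a degree-m polynomial in x."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x ** j for j in range(1, m + 1)])

# Example: fit y = w_0 + w^T phi(x) by least squares on the transformed features
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)      # illustrative noisy data
Phi = np.hstack([np.ones((10, 1)), poly_features(x, 3)])   # prepend the bias column
w = np.linalg.lstsq(Phi, y, rcond=None)[0]
```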
Model complexity and overfitting
- With limited training data, models may achieve zero training error but a large test error.
  - Training (empirical) loss: $\dfrac{1}{N}\sum_{i=1}^{N} \big(y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta})\big)^2 \to 0$
  - Expected (true) loss: $\mathbb{E}_{\mathbf{x}, y}\big[(y - f(\mathbf{x}; \boldsymbol{\theta}))^2\big] \gg 0$
- Over-fitting: the training loss no longer bears any relation to the test (generalization) loss.
  - The model fails to generalize to unseen examples.
Polynomial regression
- Figure: fits with $m = 0$, $m = 1$, $m = 3$, and $m = 9$ [Bishop]
Over-fitting causes
- Model complexity
  - E.g., a model with a large number of parameters (degrees of freedom)
- Low number of training data
  - Small data size compared to the complexity of the model
Model complexity
- Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
- Figure: fits with $m = 0$, $m = 1$, $m = 3$, and $m = 9$ [Bishop]
Number of training data & overfitting
- The over-fitting problem becomes less severe as the size of the training data increases.
- Figure: $m = 9$ fits with $N = 15$ and $N = 100$ data points [Bishop]
How to evaluate the learner's performance?
- Generalization error: the true (or expected) error that we would like to optimize
- Two ways to assess the generalization error:
  - Practical: use a separate data set to test the model
  - Theoretical: law of large numbers
    - Statistical bounds on the difference between training and expected errors
Avoiding over-fitting
- Determine a suitable value for model complexity (model selection)
  - Simple hold-out method
  - Cross-validation
- Regularization (Occam's razor)
  - Explicit preference towards simple models
  - Penalize model complexity in the objective function
Model Selection
- The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters)
- Hyperparameters are the tunable aspects of the model that the learning algorithm does not select
This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Model Selection
- Model selection is the process by which we choose the "best" model from among a set of candidates
  - Assumes access to a function capable of measuring the quality of a model
  - Typically done "outside" the main training algorithm
- Model selection / hyperparameter optimization is just another form of learning
This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/
Simple hold-out: model selection
- Steps:
  - Divide the training data into a training set and a validation set $\mathcal{V}$
  - Use only the training set to train a set of models
  - Evaluate each learned model on the validation set:
    $J_v(\mathbf{w}) = \dfrac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \big(y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\big)^2$
  - Choose the best model based on the validation set error
- Usually too wasteful of valuable training data:
  - Training data may be limited.
  - On the other hand, a small validation set gives a relatively noisy estimate of performance.
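A minimal sketch of hold-out model selection, assuming each candidate model exposes fit/predict methods (e.g., wrappers around the polynomial fit above); the split fraction and seed are illustrative.

```python
import numpy as np

def holdout_select(X, y, candidate_models, val_fraction=0.2, seed=0):
    """Split the training data once, train every candidate on the training part,
    and pick the one with the lowest validation error J_v."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val, train = idx[:n_val], idx[n_val:]
    best_model, best_err = None, np.inf
    for model in candidate_models:
        model.fit(X[train], y[train])                            # train only on the training split
        err = np.mean((y[val] - model.predict(X[val])) ** 2)     # J_v on the validation split
        if err < best_err:
            best_model, best_err = model, err
    return best_model, best_err
```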
Simple hold-out: training, validation, and test sets
- Simple hold-out chooses the model that minimizes the error on the validation set.
- $J_v(\hat{\mathbf{w}})$ is likely to be an optimistic estimate of the generalization error:
  - An extra parameter (e.g., the degree of the polynomial) is fit to this set.
- Estimate the generalization error on the test set:
  - The performance of the selected model is finally evaluated on the test set.
- Data split: Training | Validation | Test
Cross-Validation (CV): Evaluation
- $k$-fold cross-validation steps:
  - Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
  - For $i = 1$ to $k$:
    - Choose the $i$-th group as the held-out validation group
    - Train the model on all but the $i$-th group of data
    - Evaluate the model on the held-out group
  - Performance scores of the model from the $k$ runs are averaged.
  - The average error rate can be considered an estimate of the true performance.
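A matching sketch of the $k$-fold loop, under the same assumed fit/predict interface as the hold-out sketch above.

```python
import numpy as np

def k_fold_cv_error(model, X, y, k=5, seed=0):
    """Average validation MSE of `model` over k folds: shuffle once, then rotate the held-out group."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                           # i-th group is held out
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train], y[train])
        errors.append(np.mean((y[val] - model.predict(X[val])) ** 2))
    return np.mean(errors)                                       # average over the k runs
```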
Cross-Validation (CV): Model Selection
- For each model, first find the average error by CV.
- The model with the best average performance is selected.
Cross-validation: polynomial regression example
- 5-fold CV, 100 runs, averaged
- $m = 1$: CV MSE = 0.30; $m = 3$: CV MSE = 1.45; $m = 5$: CV MSE = 45.44; $m = 7$: CV MSE = 31759
Regularization
- Add a penalty term to the cost function to discourage the coefficients from reaching large values.
- Ridge regression (weight decay):
  $J(\mathbf{w}) = \sum_{i=1}^{N} \big(y^{(i)} - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}^{(i)})\big)^2 + \lambda\, \mathbf{w}^T\mathbf{w}$
  $\hat{\mathbf{w}} = \big(\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda \mathbf{I}\big)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$
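A sketch of the closed-form ridge solution above; Phi is the design matrix of basis-function values (e.g., built with the polynomial basis earlier), and the value of lambda is illustrative. Following the formula as written, all weights are regularized, including any bias column in Phi.

```python
import numpy as np

def ridge_fit(Phi, y, lam=1e-3):
    """Closed-form ridge regression: w = (Phi^T Phi + lambda * I)^(-1) Phi^T y.
    Phi is the N x M design matrix of basis-function values."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
```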
Polynomial order
- Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
- The magnitude of the coefficients typically gets larger as $m$ increases.
[Bishop]
Regularization parameter ($m = 9$)
- Table: coefficients $w_0^*, w_1^*, \ldots, w_9^*$ of the $m = 9$ polynomial for $\ln\lambda = -\infty$ and $\ln\lambda = -18$ [Bishop]
Regularization parameter
- Generalization: $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting
[Bishop]