
Learning: Linear Methods
CE417: Introduction to Artificial Intelligence, Sharif University of Technology, Spring 2019, Soleymani.
Some slides are based on Klein and Abbeel, CS188, UC Berkeley.


1. Square error loss function for classification!
- Square error loss is not suitable for classification:
  - Least-squares loss penalizes 'too correct' predictions (those that lie a long way on the correct side of the decision boundary).
  - Least-squares loss also lacks robustness to noise.
- For $K = 2$ classes: $J(w) = \sum_{i=1}^{N} \left( w^T x^{(i)} + w_0 - y^{(i)} \right)^2$

2. Notation
- $w = [w_0, w_1, \dots, w_d]^T$
- $x = [1, x_1, \dots, x_d]^T$
- $w_0 + w_1 x_1 + \dots + w_d x_d = w^T x$
- We denote the input by $x$ (or by its feature mapping $\phi(x)$).

3. SSE cost function for classification ($K = 2$)
- Is it more suitable if we set $f(x; w) = \mathrm{sign}(w^T x)$?
- $J(w) = \sum_{i=1}^{N} \left( \mathrm{sign}(w^T x^{(i)}) - y^{(i)} \right)^2$, with $y \in \{-1, 1\}$
- $\mathrm{sign}(z) = -1$ if $z < 0$, $+1$ if $z \ge 0$
- $J(w)$ is a piecewise-constant function: it counts the number of misclassifications, i.e., the training error incurred in classifying the training samples.

4. Perceptron algorithm
- Linear classifier
- Two-class: $y \in \{-1, 1\}$; $y = -1$ for $C_2$, $y = 1$ for $C_1$
- Goal: $\forall i,\; x^{(i)} \in C_1 \Rightarrow w^T x^{(i)} > 0$ and $x^{(i)} \in C_2 \Rightarrow w^T x^{(i)} < 0$
- $h(x; w) = \mathrm{sign}(w^T x)$

5. Perceptron criterion
- $J_P(w) = -\sum_{i \in \mathcal{M}} w^T x^{(i)} y^{(i)}$
- $\mathcal{M}$: subset of the training data that are misclassified
- Many solutions? Which solution among them?

6. Cost function
- Figure: the number of misclassifications $J(w)$ versus the perceptron criterion $J_P(w)$, plotted as cost functions over the weight space $(w_0, w_1)$.
- There may be many solutions for these cost functions. [Duda, Hart, and Stork, 2002]

7. Batch Perceptron
- Gradient descent to solve the optimization problem:
  $w^{t+1} = w^t - \eta \nabla_w J_P(w^t)$
  $\nabla_w J_P(w) = -\sum_{i \in \mathcal{M}} x^{(i)} y^{(i)}$
- Batch Perceptron converges in a finite number of steps for linearly separable data:
  Initialize $w$
  Repeat
    $w \leftarrow w + \eta \sum_{i \in \mathcal{M}} x^{(i)} y^{(i)}$
  Until convergence
  (A NumPy sketch of this loop follows below.)
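A minimal NumPy sketch of the batch update above, not taken from the slides: it assumes `X` is an N×(d+1) array whose first column is 1 (so the bias is folded into `w`), `y` holds labels in {-1, +1}, and the step size and iteration cap are illustrative choices.

```python
import numpy as np

def batch_perceptron(X, y, eta=1.0, max_iters=1000):
    """Batch perceptron: w <- w + eta * sum over misclassified i of x^(i) y^(i)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        mis = y * (X @ w) <= 0                      # misclassified (or on the boundary)
        if not mis.any():                           # converged on separable data
            break
        w += eta * (X[mis] * y[mis][:, None]).sum(axis=0)
    return w
```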

8. Stochastic gradient descent for Perceptron
- Single-sample perceptron: if $x^{(i)}$ is misclassified, $w^{t+1} = w^t + \eta x^{(i)} y^{(i)}$
- Perceptron convergence theorem: if the training data are linearly separable, the single-sample perceptron is also guaranteed to find a solution in a finite number of steps.
- Fixed-increment single-sample Perceptron:
  Initialize $w$, $t \leftarrow 0$
  repeat
    $t \leftarrow t + 1$
    $i \leftarrow t \bmod N$
    if $x^{(i)}$ is misclassified then $w \leftarrow w + x^{(i)} y^{(i)}$
  until all patterns are properly classified
  ($\eta$ can be set to 1 and the proof still works.)
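A hedged NumPy version of the fixed-increment rule with η = 1 (a sketch, not the slides' pseudocode verbatim); `max_passes` is an added safeguard, since the loop would never terminate on non-separable data.

```python
import numpy as np

def single_sample_perceptron(X, y, max_passes=100):
    """Fixed-increment single-sample perceptron (eta = 1), labels in {-1, +1}."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(N):                  # i <- t mod N: cycle through the samples
            if y[i] * (w @ X[i]) <= 0:      # x^(i) is misclassified
                w += y[i] * X[i]            # w <- w + x^(i) y^(i)
                mistakes += 1
        if mistakes == 0:                   # all patterns properly classified
            break
    return w
```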

9. Weight Updates

10. Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector: $w^{t+1} = w^t + \eta x^{(i)} y^{(i)}$

11. Example

12. Perceptron: Example
- Change $w$ in a direction that corrects the error. [Bishop]

13. Learning: Binary Perceptron
- Start with weights = 0
- For each training instance:
  - Classify with current weights
  - If correct (i.e., y = y*), no change!
  - If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1.

14. Examples: Perceptron
- Separable case

15. Convergence of Perceptron [Duda, Hart & Stork, 2002]
- For data sets that are not linearly separable, the single-sample perceptron learning algorithm will never converge.

16. Multiclass Decision Rule
- If we have multiple classes:
  - A weight vector $w_k$ for each class
  - Score (activation) of a class $k$: $w_k^T x$
  - Prediction: the highest score wins, $\hat{y} = \arg\max_k w_k^T x$
- Binary = multiclass where the negative class has weight zero

17. Learning: Multiclass Perceptron
- Start with all weights = 0
- Pick up training examples one by one
- Predict with current weights
- If correct, no change!
- If wrong: lower the score of the wrong answer and raise the score of the right answer (a worked example and a code sketch follow below)

18. Example: Multiclass Perceptron
- Training sentences: "win the vote", "win the election", "win the game"
- Initial weight vectors, one per class, over the features BIAS, win, game, vote, the, ...:
  - $w_1$: BIAS: 1, win: 0, game: 0, vote: 0, the: 0, ...
  - $w_2$: BIAS: 0, win: 0, game: 0, vote: 0, the: 0, ...
  - $w_3$: BIAS: 0, win: 0, game: 0, vote: 0, the: 0, ...
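A sketch of the multiclass update on sparse bag-of-words features like those in the example; the class names (`politics`, `sports`) and the training pairs below are hypothetical, not from the slides.

```python
from collections import defaultdict

def score(w_c, features):
    """Dot product between one class's weight vector and a sparse feature dict."""
    return sum(w_c[f] * v for f, v in features.items())

def perceptron_step(weights, features, y_true):
    """If the highest-scoring class is wrong, lower its score and raise the true class's."""
    y_pred = max(weights, key=lambda c: score(weights[c], features))
    if y_pred != y_true:
        for f, v in features.items():
            weights[y_pred][f] -= v
            weights[y_true][f] += v

# Hypothetical usage with the slide's vocabulary:
weights = {c: defaultdict(float) for c in ("politics", "sports")}
perceptron_step(weights, {"BIAS": 1, "win": 1, "the": 1, "vote": 1}, "politics")
perceptron_step(weights, {"BIAS": 1, "win": 1, "the": 1, "game": 1}, "sports")
```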

19. Properties of Perceptrons
- Separability: true if some parameters classify the training set perfectly (separable case)
- Convergence: if the training data are separable, the perceptron will eventually converge (binary case)
- Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability (non-separable case shown for contrast)

20. Examples: Perceptron
- Non-separable case

21. Examples: Perceptron
- Non-separable case

22. Discriminative approach: logistic regression ($K = 2$)
- $f(x; w) = \sigma(w^T x)$
- $x = [1, x_1, \dots, x_d]^T$, $w = [w_0, w_1, \dots, w_d]^T$
- $\sigma(\cdot)$ is an activation function: the sigmoid (logistic) function
  $\sigma(z) = \frac{1}{1 + e^{-z}}$

23. Logistic regression: cost function
- $\hat{w} = \arg\min_w J(w)$
- $J(w) = \sum_{i=1}^{N} \left[ -y^{(i)} \log \sigma(w^T x^{(i)}) - (1 - y^{(i)}) \log\left(1 - \sigma(w^T x^{(i)})\right) \right]$
- $J(w)$ is convex w.r.t. the parameters.

24. Logistic regression: loss function
- $\mathrm{Loss}(y, f(x; w)) = -y \log \sigma(w^T x) - (1 - y) \log\left(1 - \sigma(w^T x)\right)$
- Since $y = 1$ or $y = 0$:
  $\mathrm{Loss}(y, \sigma(w^T x)) = -\log \sigma(w^T x)$ if $y = 1$, and $-\log\left(1 - \sigma(w^T x)\right)$ if $y = 0$
- How is it related to the zero-one loss?
  $\mathrm{Loss}(y, \hat{y}) = 1$ if $\hat{y} \ne y$, and $0$ if $\hat{y} = y$
- $\sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$

25. Logistic regression: gradient descent
- $w^{t+1} = w^t - \eta \nabla_w J(w^t)$
- $\nabla_w J(w) = \sum_{i=1}^{N} \left( \sigma(w^T x^{(i)}) - y^{(i)} \right) x^{(i)}$
- Is it similar to the gradient of SSE for linear regression?
  $\nabla_w J(w) = \sum_{i=1}^{N} \left( w^T x^{(i)} - y^{(i)} \right) x^{(i)}$
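A compact NumPy sketch of batch gradient descent on this cost; it assumes labels in {0, 1}, a bias column already appended to `X`, and illustrative values for the step size and iteration count.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_gd(X, y, eta=0.1, iters=1000):
    """Minimize the cross-entropy cost J(w) by batch gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # sum_i (sigma(w^T x^(i)) - y^(i)) x^(i)
        w -= eta * grad
    return w
```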

26. Multi-class logistic regression
- $h(x; W) = [h_1(x; W), \dots, h_K(x; W)]^T$
- $W = [w_1 \cdots w_K]$ contains one vector of parameters for each class
- $h_k(x; W) = \frac{\exp(w_k^T x)}{\sum_{j=1}^{K} \exp(w_j^T x)}$

27. Logistic regression: multi-class
- $\hat{W} = \arg\min_W J(W)$
- $J(W) = -\sum_{i=1}^{N} \sum_{k=1}^{K} y_k^{(i)} \log h_k(x^{(i)}; W)$
- $y$ is a vector of length $K$ (1-of-K coding), e.g., $y = [0, 0, 1, 0]^T$ when the target class is $C_3$
- $W = [w_1 \cdots w_K]$

28. Logistic regression: multi-class
- $w_j^{t+1} = w_j^t - \eta \nabla_{w_j} J(W^t)$
- $\nabla_{w_j} J(W) = \sum_{i=1}^{N} \left( h_j(x^{(i)}; W) - y_j^{(i)} \right) x^{(i)}$
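A sketch of the corresponding update for all K weight vectors at once, assuming `Y` is an N×K one-hot (1-of-K) target matrix; the stability shift inside the softmax is a standard implementation detail, not something stated on the slide.

```python
import numpy as np

def softmax(S):
    """Row-wise softmax of an N x K score matrix."""
    E = np.exp(S - S.max(axis=1, keepdims=True))   # shift rows for numerical stability
    return E / E.sum(axis=1, keepdims=True)

def softmax_regression_gd(X, Y, eta=0.1, iters=1000):
    """W is d x K, one parameter vector per class (column)."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(iters):
        H = softmax(X @ W)                  # h_k(x^(i); W) for every sample and class
        W -= eta * (X.T @ (H - Y))          # column j: sum_i (h_j - y_j) x^(i)
    return W
```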

29. Multi-class classifier
- $h(x; W) = [h_1(x; W), \dots, h_K(x; W)]^T$
- $W = [w_1 \cdots w_K]$ contains one vector of parameters for each class
- In linear classifiers, $W$ is $d \times K$, where $d$ is the number of features
- $W^T x$ gives us a vector of scores
- $h(x; W)$ contains $K$ numbers giving class scores for the input $x$

30. Example
- Output obtained from $W^T x + b$
- $x = [x_1, \dots, x_{784}]^T$ (a 28×28 image stretched into a 784-dimensional vector)
- $W^T = \begin{bmatrix} w_1^T \\ \vdots \\ w_{10}^T \end{bmatrix}$ is $10 \times 784$, and $b = [b_1, \dots, b_{10}]^T$
- This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017

31. Example
- Figure: $W^T$ applied to example inputs.
- How can we tell whether this $W$ and $b$ are good or bad?
- This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017

32. Bias can also be included in the $W$ matrix
- This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017

33. Softmax classifier loss: example
- $L^{(i)} = -\log \frac{e^{s_{y^{(i)}}}}{\sum_{k} e^{s_k}}$
- $L^{(1)} = -\log(0.13) = 0.89$
- This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017

34. Support Vector Machines
- Maximizing the margin: good according to intuition, theory, and practice
- Support vector machines (SVMs) find the separator with the maximum margin

35. Hard-margin SVM: optimization problem
- $\max_{w, w_0} \frac{2}{\|w\|}$
  s.t. $w^T x^{(n)} + w_0 \ge 1$ for all $y^{(n)} = 1$
       $w^T x^{(n)} + w_0 \le -1$ for all $y^{(n)} = -1$
- Margin: $\frac{2}{\|w\|}$, between the hyperplanes $w^T x + w_0 = 1$ and $w^T x + w_0 = -1$, with decision boundary $w^T x + w_0 = 0$ (figure).

36. Distance between a point $x^{(n)}$ and the plane
- $\mathrm{distance} = \frac{|w^T x^{(n)} + w_0|}{\|w\|}$

37. Hard-margin SVM: optimization problem
- We can equivalently optimize:
  $\min_{w, w_0} \frac{1}{2} w^T w$
  s.t. $y^{(n)} \left( w^T x^{(n)} + w_0 \right) \ge 1$, $n = 1, \dots, N$
- It is a convex Quadratic Programming (QP) problem.
- There are computationally efficient packages to solve it (a hedged sketch with one such solver follows below).
- It has a global minimum (if any).
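One way to hand this QP to an off-the-shelf convex solver is sketched below with CVXPY (one such package; the slides do not name a specific one). It assumes a linearly separable dataset, labels in {-1, +1}, and that `cvxpy` is installed.

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve min (1/2) w^T w  s.t.  y^(n) (w^T x^(n) + w0) >= 1 for all n."""
    N, d = X.shape
    w = cp.Variable(d)
    w0 = cp.Variable()
    constraints = [cp.multiply(y, X @ w + w0) >= 1]
    problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), constraints)
    problem.solve()                 # infeasible if the data are not separable
    return w.value, w0.value
```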

38. Error measure
- Margin violation amount $\xi_n$ ($\xi_n \ge 0$): $y^{(n)} \left( w^T x^{(n)} + w_0 \right) \ge 1 - \xi_n$
- Total violation: $\sum_{n=1}^{N} \xi_n$

39. Soft-margin SVM: optimization problem
- SVM with slack variables: allows samples to fall within the margin, but penalizes them
  $\min_{w, w_0, \{\xi_n\}} \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n$
  s.t. $y^{(n)} \left( w^T x^{(n)} + w_0 \right) \ge 1 - \xi_n$, $\xi_n \ge 0$, $n = 1, \dots, N$
- $\xi_n$: slack variables
  - $0 < \xi_n < 1$: $x^{(n)}$ is correctly classified but inside the margin
  - $\xi_n > 1$: $x^{(n)}$ is misclassified

40. Soft-margin SVM: cost function
- $\min_{w, w_0, \{\xi_n\}} \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \xi_n$
  s.t. $y^{(n)} \left( w^T x^{(n)} + w_0 \right) \ge 1 - \xi_n$, $\xi_n \ge 0$
- It is equivalent to the unconstrained optimization problem:
  $\min_{w, w_0} \frac{1}{2} \|w\|^2 + C \sum_{n=1}^{N} \max\left(0,\; 1 - y^{(n)} \left( w^T x^{(n)} + w_0 \right)\right)$
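Because the unconstrained form is just a (sub)differentiable cost, it can also be minimized directly; below is a hedged NumPy sketch of subgradient descent on the hinge-loss objective, with illustrative values for `C`, the step size, and the iteration count.

```python
import numpy as np

def soft_margin_svm_subgrad(X, y, C=1.0, eta=0.01, iters=1000):
    """Subgradient descent on (1/2)||w||^2 + C sum_n max(0, 1 - y^(n)(w^T x^(n) + w0))."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(iters):
        viol = y * (X @ w + w0) < 1               # samples violating the margin
        grad_w = w - C * (X[viol] * y[viol][:, None]).sum(axis=0)
        grad_w0 = -C * y[viol].sum()
        w -= eta * grad_w
        w0 -= eta * grad_w0
    return w, w0
```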

41. Multi-class SVM
- $J(W) = \frac{1}{N} \sum_{i=1}^{N} L^{(i)} + \lambda R(W)$
- Hinge loss, with $s_k \equiv h_k(x^{(i)}; W) = w_k^T x^{(i)}$:
  $L^{(i)} = \sum_{k \ne y^{(i)}} \max\left(0,\; 1 + s_k - s_{y^{(i)}}\right) = \sum_{k \ne y^{(i)}} \max\left(0,\; 1 + w_k^T x^{(i)} - w_{y^{(i)}}^T x^{(i)}\right)$
- L2 regularization: $R(W) = \sum_{k=1}^{K} \sum_{j=1}^{d} w_{j,k}^2$

42. Multi-class SVM loss: example
- 3 training examples, 3 classes. With some $W$ the scores are $s = W^T x$, i.e., $s_k = w_k^T x^{(i)}$.
- $L^{(i)} = \sum_{k \ne y^{(i)}} \max\left(0,\; 1 + s_k - s_{y^{(i)}}\right)$
- $L^{(1)} = \max(0, 1 + 5.1 - 3.2) + \max(0, 1 - 1.7 - 3.2) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$
- $L^{(2)} = \max(0, 1 + 1.3 - 4.9) + \max(0, 1 + 2 - 4.9) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$
- $L^{(3)} = \max(0, 2.2 - (-3.1) + 1) + \max(0, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$
- $J = \frac{1}{3}(2.9 + 0 + 12.9) \approx 5.27$
- This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017
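The arithmetic on this slide can be checked with a few lines of NumPy; the score matrix below is copied from the example (rows are the three training examples, columns the three class scores, correct classes 0, 1, 2).

```python
import numpy as np

# Scores s = W^T x for the slide's 3 examples and 3 classes.
scores = np.array([[3.2, 5.1, -1.7],
                   [1.3, 4.9,  2.0],
                   [2.2, 2.5, -3.1]])
correct = np.array([0, 1, 2])

def multiclass_hinge(s, y):
    """L^(i) = sum over k != y of max(0, 1 + s_k - s_y)."""
    margins = np.maximum(0.0, 1.0 + s - s[y])
    margins[y] = 0.0                        # the correct class does not contribute
    return margins.sum()

losses = [multiclass_hinge(s, y) for s, y in zip(scores, correct)]
print(losses, np.mean(losses))              # [2.9, 0.0, 12.9] and about 5.27
```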

43. Recap
- We need $\nabla_W L$ to update the weights
- L2 regularization: $R(W) = \sum_{k=1}^{K} \sum_{j=1}^{d} w_{j,k}^2$
- L1 regularization: $R(W) = \sum_{k=1}^{K} \sum_{j=1}^{d} |w_{j,k}|$
- This slide has been adopted from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2017

44. Generalized linear
- Linear combination of fixed non-linear functions of the input vector:
  $f(x; w) = w_0 + w_1 \phi_1(x) + \dots + w_m \phi_m(x)$
- $\{\phi_1(x), \dots, \phi_m(x)\}$: set of basis functions (or features), $\phi_i: \mathbb{R}^d \to \mathbb{R}$

45. Basis functions: examples
- Linear
- Polynomial (univariate)

46. Polynomial regression: example
- Figure: fits of degree $m = 1, 3, 5, 7$.

47. Generalized linear classifier
- Assume a transformation $\phi: \mathbb{R}^d \to \mathbb{R}^m$ of the feature space: $x \to \phi(x) = [\phi_1(x), \dots, \phi_m(x)]^T$
- $\{\phi_1(x), \dots, \phi_m(x)\}$: set of basis functions (or features), $\phi_i: \mathbb{R}^d \to \mathbb{R}$
- Find a hyperplane in the transformed feature space: $w^T \phi(x) + w_0 = 0$
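A small sketch of the idea: expand the input with hand-chosen basis functions and then run any linear model on φ(x). The particular polynomial features used here are an illustrative choice, not the slides'.

```python
import numpy as np

def poly_features(X, degree=3):
    """Map each scalar input x to phi(x) = [x, x^2, ..., x^degree]."""
    X = np.asarray(X, dtype=float).reshape(-1, 1)
    return np.hstack([X ** p for p in range(1, degree + 1)])

# A linear model trained on poly_features(X) (plus a bias column) is linear in w
# but non-linear in the original input x.
```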

48. Model complexity and overfitting
- With limited training data, models may achieve zero training error but a large test error.
  - Training (empirical) loss: $\frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; \theta) \right)^2 \approx 0$
  - Expected (true) loss: $E_{x,y}\left[ \left( y - f(x; \theta) \right)^2 \right] \gg 0$
- Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
  - Fails to generalize to unseen examples.

49. Polynomial regression
- Figure: fits of degree $m = 0, 1, 3, 9$ to the same data. [Bishop]

50. Over-fitting causes
- Model complexity
  - E.g., a model with a large number of parameters (degrees of freedom)
- Low number of training data
  - Small data size compared to the complexity of the model

51. Model complexity
- Example: polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
- Figure: fits of degree $m = 0, 1, 3, 9$. [Bishop]

52. Number of training data & overfitting
- The over-fitting problem becomes less severe as the size of the training data increases.
- Figure: degree $m = 9$ fits with $N = 15$ versus $N = 100$ samples. [Bishop]

53. How to evaluate the learner's performance?
- Generalization error: the true (or expected) error that we would like to optimize
- Two ways to assess the generalization error are:
  - Practical: use a separate data set to test the model
  - Theoretical: law of large numbers
    - statistical bounds on the difference between training and expected errors

54. Avoiding over-fitting
- Determine a suitable value for model complexity (model selection)
  - Simple hold-out method
  - Cross-validation
- Regularization (Occam's razor)
  - Explicit preference towards simple models
  - Penalize the model complexity in the objective function

55. Model Selection
- The learning algorithm defines the data-driven search over the hypothesis space (i.e., the search for good parameters).
- Hyperparameters are the tunable aspects of the model that the learning algorithm does not select.
- This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

56. Model Selection
- Model selection is the process by which we choose the "best" model from among a set of candidates.
  - Assume access to a function capable of measuring the quality of a model.
  - Typically done "outside" the main training algorithm.
- Model selection / hyperparameter optimization is just another form of learning.
- This slide has been adopted from the CMU ML course: http://www.cs.cmu.edu/~mgormley/courses/10601-s18/

57. Simple hold-out: model selection
- Steps:
  - Divide the training data into a training set and a validation set
  - Use only the training set to train a set of models
  - Evaluate each learned model on the validation set:
    $J_v(w) = \frac{1}{N_{valid}} \sum_{i \in \text{valid}} \left( y^{(i)} - f(x^{(i)}; w) \right)^2$
  - Choose the best model based on the validation-set error
- Usually too wasteful of valuable training data:
  - Training data may be limited.
  - On the other hand, a small validation set gives a relatively noisy estimate of performance.
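A sketch of the hold-out split itself, assuming NumPy arrays and an 80/20 split; the split ratio and the random seed are illustrative choices, not from the slides.

```python
import numpy as np

def train_validation_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the data and split it into a training part and a validation part."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val, train = idx[:n_val], idx[n_val:]
    return X[train], y[train], X[val], y[val]
```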

58. Simple hold-out: training, validation, and test sets
- Simple hold-out chooses the model that minimizes the error on the validation set.
- $J_v(\hat{w})$ is likely to be an optimistic estimate of the generalization error:
  - an extra parameter (e.g., the degree of the polynomial) has been fit to this set.
- Estimate the generalization error on the test set:
  - the performance of the selected model is finally evaluated on the test set.
- Figure: the data split into training, validation, and test portions.

59. Cross-Validation (CV): Evaluation
- $k$-fold cross-validation steps:
  - Shuffle the dataset and randomly partition the training data into $k$ groups of approximately equal size
  - for $i = 1$ to $k$:
    - Choose the $i$-th group as the held-out validation group
    - Train the model on all but the $i$-th group of data
    - Evaluate the model on the held-out group
- The performance scores of the model from the $k$ runs are averaged.
- The average error rate can be considered an estimate of the true performance.
- Figure: which fold is held out in the first, second, ..., $k$-th run (a sketch follows below).
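A plain NumPy sketch of the k-fold loop described above; `fit` and `evaluate` stand in for whatever training and scoring routines are being compared, so they are placeholders rather than functions defined on the slides.

```python
import numpy as np

def k_fold_cv(X, y, fit, evaluate, k=5, seed=0):
    """Average validation score of a model over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)               # k groups of roughly equal size
    scores = []
    for i in range(k):
        val = folds[i]                            # i-th group held out
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])           # train on all but the i-th group
        scores.append(evaluate(model, X[val], y[val]))
    return float(np.mean(scores))
```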

60. Cross-Validation (CV): Model Selection
- For each model, we first find the average error by CV.
- The model with the best average performance is selected.

61. Cross-validation: polynomial regression example
- 5-fold CV, 100 runs, averaged
- Degree $m = 1$: CV MSE = 0.30; $m = 3$: CV MSE = 1.45; $m = 5$: CV MSE = 45.44; $m = 7$: CV MSE = 31759

62. Regularization
- Add a penalty term to the cost function to discourage the coefficients from reaching large values.
- Ridge regression (weight decay):
  $J(w) = \sum_{i=1}^{N} \left( y^{(i)} - w^T \phi(x^{(i)}) \right)^2 + \lambda w^T w$
  $\hat{w} = \left( \Phi^T \Phi + \lambda I \right)^{-1} \Phi^T y$
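The closed-form solution above translates almost directly into NumPy; a sketch, assuming `Phi` is the N×M design matrix of basis-function values and using a linear solve instead of an explicit matrix inverse.

```python
import numpy as np

def ridge_fit(Phi, y, lam=1e-2):
    """Solve (Phi^T Phi + lam * I) w = Phi^T y for the ridge-regression weights."""
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
```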

63. Polynomial order
- Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
- The magnitude of the coefficients typically gets larger as $m$ increases. [Bishop]

64. Regularization parameter
- Figure (for $m = 9$): table of the fitted coefficients $\hat{w}_0, \dots, \hat{w}_9$ for $\ln\lambda = -\infty$ and $\ln\lambda = -18$. [Bishop]

65. Regularization parameter
- Generalization
- $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting. [Bishop]
