  1. NPFL129, Lecture 3: Perceptron and Logistic Regression. Milan Straka, October 19, 2020. Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics.

  2. Cross-Validation
  We already talked about a train set and a test set. Given that the main goal of machine learning is to perform well on unseen data, the test set must not be used during training nor hyperparameter selection; ideally, it is hidden from us altogether. Therefore, to evaluate a machine learning model (for example to select the model architecture, features, or a hyperparameter value), we normally need a validation or development set. However, using a single development set might give us noisy results. To obtain less noisy results (i.e., with smaller variance), we can use cross-validation. In cross-validation, we choose multiple validation sets from the training data, and for every one, we train a model on the rest of the training data and evaluate on the chosen validation set. A commonly used strategy to choose the validation sets is called k-fold cross-validation. Here the training set is partitioned into $k$ subsets of approximately the same size, and each subset in turn plays the role of the validation set.
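  As a hedged illustration of k-fold cross-validation, the following sketch uses scikit-learn's KFold on a synthetic regression problem; the data, the Ridge model, and all parameter values are illustrative choices, not part of the lecture.

  import numpy as np
  from sklearn.model_selection import KFold
  from sklearn.linear_model import Ridge
  from sklearn.metrics import mean_squared_error

  rng = np.random.RandomState(42)
  X = rng.randn(100, 5)                          # synthetic features
  t = X @ rng.randn(5) + 0.1 * rng.randn(100)    # synthetic targets

  scores = []
  for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
      model = Ridge(alpha=1.0).fit(X[train_idx], t[train_idx])                    # train on the rest of the data
      scores.append(mean_squared_error(t[val_idx], model.predict(X[val_idx])))    # evaluate on the held-out fold

  print("per-fold MSE:", scores)
  print("mean MSE:", np.mean(scores))            # averaging over folds reduces the variance of the estimate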

  3. Cross-Validation
  An extreme case of the k-fold cross-validation is leave-one-out cross-validation, where every element is considered a separate validation set. Computing leave-one-out cross-validation is usually extremely inefficient for larger training sets, but in the case of linear regression with L2 regularization, it can be evaluated efficiently. If you are interested, see: Ryan M. Rifkin and Ross A. Lippert: Notes on Regularized Least Squares, http://cbcl.mit.edu/publications/ps/MIT-CSAIL-TR-2007-025.pdf. Implemented by sklearn.linear_model.RidgeCV.
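  A minimal sketch of using RidgeCV, which with its default settings selects the regularization strength via the efficient leave-one-out procedure; the dataset and the candidate alphas below are illustrative.

  import numpy as np
  from sklearn.linear_model import RidgeCV

  rng = np.random.RandomState(0)
  X = rng.randn(200, 10)
  t = X @ rng.randn(10) + 0.1 * rng.randn(200)

  # With cv=None (the default), RidgeCV performs efficient leave-one-out cross-validation
  # over the given candidate regularization strengths.
  model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, t)
  print("selected alpha:", model.alpha_)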

  4. Binary Classification
  Binary classification is classification into two classes. To extend linear regression to binary classification, we might seek a threshold and classify an input as negative/positive depending on whether $y(x; w) = x^T w + b$ is smaller/larger than that threshold. Zero is usually used as the threshold, both because of symmetry and also because the bias parameter acts as a trainable threshold anyway.
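  With a zero threshold, the prediction is simply the sign of the linear model output; a tiny sketch with made-up weights and inputs:

  import numpy as np

  w, b = np.array([2.0, -1.0]), 0.5        # illustrative weights and bias
  X = np.array([[1.0, 3.0], [2.0, 0.5]])   # two illustrative inputs
  y = X @ w + b                            # linear model outputs
  print(y, np.where(y > 0, +1, -1))        # classify by thresholding at zero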

  5. Binary Classification
  Consider two points $x_1$ and $x_2$ on the decision boundary. Because $y(x_1; w) = y(x_2; w)$, we have $(x_1 - x_2)^T w = 0$, and so $w$ is orthogonal to every vector on the decision surface: $w$ is a normal of the boundary.
  Consider a point $x$ and let $x_\perp$ be the orthogonal projection of $x$ onto the boundary, so we can write $x = x_\perp + r \frac{w}{\|w\|}$. Multiplying both sides by $w^T$ and adding $b$ gives $y(x) = y(x_\perp) + r\|w\| = r\|w\|$, so the distance of $x$ to the boundary is $r = \frac{y(x)}{\|w\|}$. The distance of the decision boundary from the origin is therefore $\frac{|b|}{\|w\|}$.
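  A quick numerical check of these distance formulas; the weights, bias, and point below are arbitrary illustrative values.

  import numpy as np

  w, b = np.array([3.0, 4.0]), 2.0
  x = np.array([1.0, 1.0])

  y = w @ x + b
  r = y / np.linalg.norm(w)                 # signed distance of x to the boundary
  x_perp = x - r * w / np.linalg.norm(w)    # orthogonal projection of x onto the boundary
  print(r)                                  # 9/5 = 1.8
  print(w @ x_perp + b)                     # ≈ 0, so x_perp indeed lies on the boundary
  print(abs(b) / np.linalg.norm(w))         # distance of the boundary from the origin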

  6. Perceptron
  The perceptron algorithm is probably the oldest algorithm for training the weights of a binary classifier. Assuming the target value $t \in \{-1, +1\}$, the goal is to find weights $w$ such that for all training data
  $$\operatorname{sign}(y(x_i; w)) = \operatorname{sign}(x_i^T w) = t_i,$$
  or equivalently,
  $$t_i y(x_i; w) = t_i x_i^T w > 0.$$
  Note that a set is called linearly separable if there exists a weight vector $w$ such that the above equation holds.

  7. Perceptron
  The perceptron algorithm was invented by Rosenblatt in 1958.
  Input: a linearly separable dataset ($X \in \mathbb{R}^{N \times D}$, $t \in \{-1, +1\}^N$).
  Output: weights $w \in \mathbb{R}^D$ such that $t_i x_i^T w > 0$ for all $i$.
  $$w \leftarrow 0$$
  until all examples are classified correctly, process example $i$:
  $$y \leftarrow x_i^T w$$
  if $t_i y \leq 0$ (incorrectly classified example): $w \leftarrow w + t_i x_i$
  We will prove that the algorithm always arrives at some correct set of weights if the training set is linearly separable.
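  A direct sketch of the algorithm in NumPy; the toy dataset is made up, and the loop terminates only because the data are linearly separable.

  import numpy as np

  def perceptron(X, t):
      # X: N x D data matrix, t: targets in {-1, +1}; assumes linear separability.
      w = np.zeros(X.shape[1])
      while True:
          updated = False
          for x_i, t_i in zip(X, t):
              if t_i * (x_i @ w) <= 0:   # incorrectly classified example
                  w += t_i * x_i         # perceptron update
                  updated = True
          if not updated:                # all examples classified correctly
              return w

  # Toy linearly separable dataset; the constant last feature plays the role of the bias.
  X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
  t = np.array([1, 1, -1, -1])
  w = perceptron(X, t)
  print(w, np.sign(X @ w))               # the predicted signs match the targets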

  8. Perceptron as SGD
  Consider the main part of the perceptron algorithm: compute $y \leftarrow x_i^T w$, and if $t_i y \leq 0$ (incorrectly classified example), update $w \leftarrow w + t_i x_i$.
  We can derive the algorithm using on-line gradient descent with the following loss function:
  $$L(y(x; w), t) \stackrel{\text{def}}{=} \begin{cases} -t x^T w & \text{if } t x^T w \leq 0, \\ 0 & \text{otherwise} \end{cases} = \max(0, -t x^T w) = \operatorname{ReLU}(-t x^T w).$$
  In this specific case, the value of the learning rate does not actually matter, because multiplying $w$ by a positive constant does not change the predictions.
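  A small sketch checking that one SGD step on this loss reproduces the perceptron update; the example values and the learning rate are arbitrary.

  import numpy as np

  def perceptron_loss(w, x, t):
      return max(0.0, -t * (x @ w))       # ReLU(-t x^T w)

  def perceptron_loss_grad(w, x, t):
      # (Sub)gradient of the loss w.r.t. w: -t*x on a misclassified example, zero otherwise.
      return -t * x if t * (x @ w) <= 0 else np.zeros_like(w)

  w = np.zeros(2)
  x, t = np.array([1.0, 2.0]), -1
  learning_rate = 1.0                      # only rescales w, so it does not affect predictions
  w = w - learning_rate * perceptron_loss_grad(w, x, t)
  print(w)                                 # equals 0 + t*x, i.e., exactly the perceptron update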

  9. Perceptron Example
  (The slide contains only an illustrative figure, which is not reproduced in this transcript.)

  10. Proof of Perceptron Convergence
  Let $w_*$ be some weights separating the training data and let $w_k$ be the weights after $k$ non-trivial updates of the perceptron algorithm, with $w_0$ being 0.
  We will prove that the angle $\alpha$ between $w_*$ and $w_k$ decreases at each step. Note that
  $$\cos(\alpha) = \frac{w_*^T w_k}{\|w_*\| \cdot \|w_k\|}.$$

  11. Proof of Perceptron Convergence
  Assume that the norm of every training example is bounded by $R$, i.e., $\|x\| \leq R$, and that $\gamma$ is the minimum margin of $w_*$, so $t\, x^T w_* \geq \gamma$.
  First consider the dot product of $w_*$ and $w_k$:
  $$w_*^T w_k = w_*^T (w_{k-1} + t_k x_k) \geq w_*^T w_{k-1} + \gamma.$$
  By iteratively applying this equation, we get
  $$w_*^T w_k \geq k\gamma.$$
  Now consider the length of $w_k$:
  $$\|w_k\|^2 = \|w_{k-1} + t_k x_k\|^2 = \|w_{k-1}\|^2 + 2 t_k x_k^T w_{k-1} + \|x_k\|^2.$$
  Because $x_k$ was misclassified, we know that $t_k x_k^T w_{k-1} \leq 0$, so
  $$\|w_k\|^2 \leq \|w_{k-1}\|^2 + R^2.$$
  When applied iteratively, we get $\|w_k\|^2 \leq k \cdot R^2$.

  12. Proof of Perceptron Convergence
  Putting everything together, we get
  $$\cos(\alpha) = \frac{w_*^T w_k}{\|w_*\| \cdot \|w_k\|} \geq \frac{k\gamma}{\sqrt{k R^2}\, \|w_*\|} = \frac{\sqrt{k}\,\gamma}{R\, \|w_*\|}.$$
  Therefore, $\cos(\alpha)$ increases with every update. Because the value of $\cos(\alpha)$ is at most one, we obtain an upper bound on the number of updates before the algorithm converges:
  $$\frac{\sqrt{k}\,\gamma}{R\, \|w_*\|} \leq 1, \quad\text{i.e.,}\quad k \leq \frac{R^2 \|w_*\|^2}{\gamma^2}.$$
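  As an empirical sanity check of this bound, the sketch below counts perceptron updates on the toy dataset used earlier and compares the count to $R^2\|w_*\|^2/\gamma^2$; the separating weights $w_*$ are chosen by inspection and are only illustrative.

  import numpy as np

  X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
  t = np.array([1, 1, -1, -1])
  w_star = np.array([1.0, 1.0, 0.0])          # some separating weights, found by inspection

  R = np.max(np.linalg.norm(X, axis=1))       # bound on the example norms
  gamma = np.min(t * (X @ w_star))            # minimum margin of w_star

  w, updates = np.zeros(X.shape[1]), 0        # run the perceptron, counting non-trivial updates
  changed = True
  while changed:
      changed = False
      for x_i, t_i in zip(X, t):
          if t_i * (x_i @ w) <= 0:
              w += t_i * x_i
              updates += 1
              changed = True

  bound = R**2 * np.linalg.norm(w_star)**2 / gamma**2
  print(updates, "<=", bound)                 # the number of updates respects the bound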

  13. Perceptron Issues
  The perceptron has several drawbacks:
  - If the input set is not linearly separable, the algorithm never finishes.
  - The algorithm cannot be easily extended to classification into more than two classes.
  - The algorithm performs only prediction; it is not able to return the probabilities of predictions.
  - Most importantly, the perceptron algorithm finds some solution, not necessarily a good one, because once it finds some solution, it cannot perform any more updates.

  14. Common Probability Distributions
  Bernoulli Distribution
  The Bernoulli distribution is a distribution over a binary random variable. It has a single parameter $\varphi \in [0, 1]$, which specifies the probability of the random variable being equal to 1.
  $$P(x) = \varphi^x (1 - \varphi)^{1-x}$$
  $$\mathbb{E}[x] = \varphi, \quad \operatorname{Var}(x) = \varphi(1 - \varphi)$$
  Categorical Distribution
  An extension of the Bernoulli distribution to random variables taking one of $k$ different discrete outcomes. It is parametrized by $p \in [0, 1]^k$ such that $\sum_{i=1}^{k} p_i = 1$.
  $$P(x) = \prod_{i=1}^{k} p_i^{x_i}$$
  $$\mathbb{E}[x_i] = p_i, \quad \operatorname{Var}(x_i) = p_i(1 - p_i)$$
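  A quick sampling check of the stated means and variances; the parameter values and sample size are arbitrary.

  import numpy as np

  rng = np.random.RandomState(0)

  phi = 0.3                                   # Bernoulli parameter
  x = rng.binomial(1, phi, size=100_000)      # Bernoulli(phi) samples
  print(x.mean(), x.var())                    # ≈ phi and phi * (1 - phi)

  p = np.array([0.2, 0.5, 0.3])               # categorical parameters
  x = rng.multinomial(1, p, size=100_000)     # one-hot encoded categorical samples
  print(x.mean(axis=0), x.var(axis=0))        # ≈ p_i and p_i * (1 - p_i)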

  15. Information Theory
  Self Information
  Amount of surprise when a random variable is sampled.
  - Should be zero for events with probability 1.
  - Less likely events are more surprising.
  - Independent events should have additive information.
  $$I(x) \stackrel{\text{def}}{=} -\log P(x) = \log \frac{1}{P(x)}$$
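  These three properties can be verified directly from the definition; the probabilities below are arbitrary examples.

  import numpy as np

  def self_information(p):
      return np.log(1.0 / p)                               # I(x) = -log P(x)

  print(self_information(1.0))                             # 0: a certain event is not surprising
  print(self_information(0.5), self_information(0.1))      # less likely events are more surprising
  p_a, p_b = 0.5, 0.1                                      # two independent events
  print(self_information(p_a * p_b),
        self_information(p_a) + self_information(p_b))     # the two values agree: additivity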

  16. Information Theory
  Entropy
  Amount of surprise in the whole distribution:
  $$H(P) \stackrel{\text{def}}{=} \mathbb{E}_{x \sim P}[I(x)] = -\mathbb{E}_{x \sim P}[\log P(x)],$$
  for discrete $P$: $H(P) = -\sum_x P(x) \log P(x)$,
  for continuous $P$: $H(P) = -\int P(x) \log P(x) \,\mathrm{d}x$.
  Note that in the continuous case, the continuous entropy (also called differential entropy) has slightly different semantics; for example, it can be negative. From now on, all logarithms are natural logarithms with base $e$.
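  A short sketch computing the discrete entropy of example distributions, using the natural logarithm as stated above; the distributions are arbitrary.

  import numpy as np

  def entropy(p):
      p = np.asarray(p, dtype=float)
      p = p[p > 0]                              # zero-probability events contribute nothing
      return -np.sum(p * np.log(p))

  print(entropy([0.25, 0.25, 0.25, 0.25]))      # uniform over 4 outcomes: log(4) ≈ 1.386 nats
  print(entropy([1.0, 0.0, 0.0, 0.0]))          # deterministic distribution: 0 entropy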
