

  1. Statistical Machine Learning Lecture 09: Classification Kristian Kersting TU Darmstadt Summer Term 2020 (based on slides from J. Peters)

  2. Today’s Objectives Make you understand how to build a discriminative classifier! Covered Topics: Discriminant Functions, Multi-Class Classification, Fisher Discriminant Analysis, Perceptrons, Logistic Regression

  3. Outline 1. Discriminant Functions 2. Fisher Discriminant Analysis 3. Perceptron Algorithm 4. Logistic Regression 5. Wrap-Up

  4. 1. Discriminant Functions Outline 1. Discriminant Functions 2. Fisher Discriminant Analysis 3. Perceptron Algorithm 4. Logistic Regression 5. Wrap-Up

  5. 1. Discriminant Functions Reminder of Bayesian Decision Theory We want to find the a-posteriori probability (posterior) of the class $C_k$ given the observation (feature) $x$: $p(C_k|x) = \frac{p(x|C_k)\,p(C_k)}{p(x)} = \frac{p(x|C_k)\,p(C_k)}{\sum_j p(x|C_j)\,p(C_j)}$. Here $p(C_k|x)$ is the class posterior, $p(x|C_k)$ the class-conditional probability (likelihood), $p(C_k)$ the class prior, and $p(x)$ the normalization term.

  6. 1. Discriminant Functions Reminder of Bayesian Decision Theory Decision rule: decide $C_1$ if $p(C_1|x) > p(C_2|x)$. Using the definition of conditional distributions, this is equivalent to $p(x|C_1)\,p(C_1) > p(x|C_2)\,p(C_2)$, i.e., $\frac{p(x|C_1)}{p(x|C_2)} > \frac{p(C_2)}{p(C_1)}$. A classifier obeying this rule is called a Bayes optimal classifier.
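
To make the decision rule concrete, here is a minimal Python sketch (the 1D Gaussian class-conditionals and the priors are made-up values for illustration, not from the lecture):

    import math

    def gauss_pdf(x, mean, std):
        # Density of a normal distribution N(mean, std^2)
        return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

    # Hypothetical class-conditionals p(x|C1), p(x|C2) and priors p(C1), p(C2)
    def p_x_given_c1(x): return gauss_pdf(x, 0.0, 1.0)
    def p_x_given_c2(x): return gauss_pdf(x, 2.0, 1.0)
    prior_c1, prior_c2 = 0.7, 0.3

    def bayes_decide(x):
        # Decide C1 iff p(x|C1) / p(x|C2) > p(C2) / p(C1)
        return 1 if p_x_given_c1(x) / p_x_given_c2(x) > prior_c2 / prior_c1 else 2

    print(bayes_decide(0.5), bayes_decide(2.5))  # -> 1 2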

  7. 1. Discriminant Functions Reminder of Bayesian Decision Theory Current approach: $p(C_k|x) = p(x|C_k)\,p(C_k)/p(x)$ (Bayes' rule). Model and estimate the class-conditional density $p(x|C_k)$ and the class prior $p(C_k)$, compute the posterior $p(C_k|x)$, and minimize the error probability by maximizing $p(C_k|x)$. New approach: directly encode the decision boundary, without modeling the densities directly, while still minimizing the error probability.

  8. 1. Discriminant Functions Discriminant Functions Formulate classification using comparisons. Discriminant functions: $y_1(x), \ldots, y_K(x)$. Classify $x$ as class $C_k$ iff $y_k(x) > y_j(x)\ \forall j \neq k$. More formally, a discriminant maps a vector $x$ to one of the $K$ available classes.

  9. 1. Discriminant Functions Discriminant Functions Examples of discriminant functions from the Bayes classifier: $y_k(x) = p(C_k|x)$, $y_k(x) = p(x|C_k)\,p(C_k)$, $y_k(x) = \log p(x|C_k) + \log p(C_k)$.

  10. 1. Discriminant Functions Discriminant Functions Base case with 2 classes: $y_1(x) > y_2(x) \Leftrightarrow y_1(x) - y_2(x) > 0 \Leftrightarrow y(x) > 0$. Examples from the Bayes classifier: $y(x) = p(C_1|x) - p(C_2|x)$ and $y(x) = \log\frac{p(x|C_1)}{p(x|C_2)} + \log\frac{p(C_1)}{p(C_2)}$.

  11. 1. Discriminant Functions Example - Bayes Classifier Base case with 2 classes and Gaussian class-conditionals.
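
A minimal sketch of the pictured setup, assuming 1D features and maximum-likelihood Gaussian fits per class (the synthetic data and class means are illustrative assumptions):

    import numpy as np

    def fit_gaussian_bayes(x, labels):
        # Fit a 1D Gaussian class-conditional per class, plus empirical class priors
        params = {}
        for c in np.unique(labels):
            xc = x[labels == c]
            params[c] = (xc.mean(), xc.std(), len(xc) / len(x))  # mean, std, prior
        return params

    def log_discriminant(x, mean, std, prior):
        # y_k(x) = log p(x|C_k) + log p(C_k), dropping constants shared by all classes
        return -0.5 * ((x - mean) / std) ** 2 - np.log(std) + np.log(prior)

    def predict(x, params):
        scores = {c: log_discriminant(x, *p) for c, p in params.items()}
        return max(scores, key=scores.get)

    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
    labels = np.array([1] * 100 + [2] * 100)
    params = fit_gaussian_bayes(x, labels)
    print(predict(0.2, params), predict(2.8, params))  # -> 1 2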

  12. 1. Discriminant Functions Linear Discriminant Functions Base case with 2 classes: if $y(x) > 0$ decide class 1, otherwise class 2. Simplest case: linear decision boundary. In linear discriminants, the decision surfaces are (hyper)planes. Linear Discriminant Function: $y(x) = w^\top x + w_0$, where $w$ is the normal vector and $w_0$ the offset.

  13. 1. Discriminant Functions Linear Discriminant Functions Illustration of the 2D case: $y(x) = w^\top x + w_0$ with $x = (x_1, x_2)^\top$. [Figure: the decision boundary $y = 0$ separates region $R_1$ ($y > 0$) from $R_2$ ($y < 0$); $w$ is normal to the boundary, $y(x)/\|w\|$ is the signed distance of $x$ from the boundary, and $-w_0/\|w\|$ is the distance of the boundary from the origin.]
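
A small numeric sketch of this geometry; the weight vector, offset, and test point are arbitrary illustrative values:

    import numpy as np

    w = np.array([2.0, 1.0])   # normal vector of the decision surface (illustrative)
    w0 = -4.0                  # offset (illustrative)

    def y(x):
        # Linear discriminant y(x) = w^T x + w_0
        return w @ x + w0

    x = np.array([3.0, 1.0])
    print(y(x))                         # 3.0 > 0, so x lies in region R1
    print(y(x) / np.linalg.norm(w))     # signed distance of x from the boundary
    print(-w0 / np.linalg.norm(w))      # distance of the boundary from the origin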

  14. 1. Discriminant Functions Linear Discriminant Functions

  15. 1. Discriminant Functions Discriminant Functions Why might we want to use discriminant functions? We could easily fit the class-conditionals using Gaussians and use a Bayes classifier.

  16. 1. Discriminant Functions Discriminant Functions How about now? Do these points matter for making the decision between the two classes?

  17. 1. Discriminant Functions Distribution-free Classifiers We do not necessarily need to model all the details of the class-conditional distributions to come up with a good decision boundary. (The class-conditionals may have many intricacies that do not matter at the end of the day.) If we can learn where to place the decision boundary directly, we can avoid some of that complexity. It would be unwise, however, to believe that such classifiers are inherently superior to probabilistic ones. We shall see why later...

  18. 1. Discriminant Functions Multi-Class Case What if we constructed a multi-class classifier from several 2-class classifiers? If we base our decision rule on binary decisions, this may lead to ambiguities, where a point receives votes for several classes, e.g., for $C_1$ and $C_2$, or for $C_1$, $C_2$, and $C_3$.

  19. 1. Discriminant Functions Multi-Class Case - Better Solution Use a discriminant function to encode how strongly we believe in each class: $y_1(x), \ldots, y_K(x)$. Decision rule: decide $C_k$ if $y_k(x) > y_j(x)\ \forall j \neq k$. If the discriminant functions are linear, the decision regions are connected and convex.
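
A minimal sketch of this decision rule with linear discriminants; the weight matrix and offsets are illustrative placeholders, not learned values:

    import numpy as np

    # One linear discriminant y_k(x) = w_k^T x + w_k0 per class (K = 3, d = 2);
    # the rows of W and the entries of w0 are illustrative placeholders
    W = np.array([[ 1.0,  0.0],
                  [ 0.0,  1.0],
                  [-1.0, -1.0]])
    w0 = np.array([0.0, 0.0, 1.0])

    def decide(x):
        scores = W @ x + w0             # y_1(x), ..., y_K(x)
        return int(np.argmax(scores))   # pick the class with the largest y_k(x)

    print(decide(np.array([2.0, 0.5])))    # -> 0 (class C1)
    print(decide(np.array([-1.0, -1.0])))  # -> 2 (class C3)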

  20. 2. Fisher Discriminant Analysis Outline 1. Discriminant Functions 2. Fisher Discriminant Analysis 3. Perceptron Algorithm 4. Logistic Regression 5. Wrap-Up

  21. 2. Fisher Discriminant Analysis Linear Discriminant Functions Illustration of the 2D case: $y(x) = w^\top x + w_0$ with $x = (x_1, x_2)^\top$. [Figure: the decision boundary $y = 0$ separates region $R_1$ ($y > 0$) from $R_2$ ($y < 0$); $w$ is normal to the boundary, $y(x)/\|w\|$ is the signed distance of $x$ from the boundary, and $-w_0/\|w\|$ is the distance of the boundary from the origin.]

  22. 2. Fisher Discriminant Analysis First Attempt: Least Squares Try to achieve a certain value of the discriminant function: $y(x) = +1 \Leftrightarrow x \in C_1$ and $y(x) = -1 \Leftrightarrow x \in C_2$. Training data inputs: $X = \{x_1 \in \mathbb{R}^d, \ldots, x_n\}$. Training data labels: $Y = \{y_1 \in \{-1, +1\}, \ldots, y_n\}$. Linear Discriminant Function: try to enforce $x_i^\top w + w_0 = y_i$ for all $i = 1, \ldots, n$. There is one linear equation for each training data point/label pair.

  23. 2. Fisher Discriminant Analysis First Attempt: Least Squares Linear system of equations: $x_i^\top w + w_0 = y_i$ for all $i = 1, \ldots, n$. Define $\hat{x}_i = [x_i^\top\ 1]^\top \in \mathbb{R}^{(d+1) \times 1}$ and $\hat{w} = [w^\top\ w_0]^\top \in \mathbb{R}^{(d+1) \times 1}$. Rewrite the equation system as $\hat{x}_i^\top \hat{w} = y_i$ for all $i = 1, \ldots, n$. In matrix-vector notation we have $\hat{X}^\top \hat{w} = y$ with $\hat{X} = [\hat{x}_1, \ldots, \hat{x}_n] \in \mathbb{R}^{(d+1) \times n}$ and $y = [y_1, \ldots, y_n]^\top$.

  24. 2. Fisher Discriminant Analysis First Attempt: Least Squares $\hat{X}^\top \hat{w} = y$ is an overdetermined system of equations: there are $n$ equations and only $d + 1$ unknowns.

  25. 2. Fisher Discriminant Analysis First Attempt: Least Squares Look for the least squares solution: $\hat{w}^* = \arg\min_{\hat{w}} \|\hat{X}^\top \hat{w} - y\|^2 = \arg\min_{\hat{w}} (\hat{X}^\top \hat{w} - y)^\top (\hat{X}^\top \hat{w} - y) = \arg\min_{\hat{w}} \hat{w}^\top \hat{X} \hat{X}^\top \hat{w} - 2 y^\top \hat{X}^\top \hat{w} + y^\top y$. Setting the gradient to zero, $\nabla_{\hat{w}} \left( \hat{w}^\top \hat{X} \hat{X}^\top \hat{w} - 2 y^\top \hat{X}^\top \hat{w} + y^\top y \right) = 0$, yields $\hat{w} = (\hat{X} \hat{X}^\top)^{-1} \hat{X}\, y$, where $(\hat{X} \hat{X}^\top)^{-1} \hat{X}$ is the pseudo-inverse.
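
A runnable sketch of this least-squares fit on synthetic 2D data (the data generation is an illustrative assumption; numpy's lstsq solves the same problem more stably than forming the pseudo-inverse explicitly):

    import numpy as np

    # Synthetic 2-class training data (illustrative, not from the lecture)
    rng = np.random.default_rng(0)
    x1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class C1, target +1
    x2 = rng.normal([3.0, 3.0], 1.0, size=(50, 2))   # class C2, target -1
    X = np.vstack([x1, x2])
    y = np.concatenate([np.ones(50), -np.ones(50)])

    # Rows of X_hat^T are the augmented inputs [x_i^T, 1]
    X_hat_T = np.hstack([X, np.ones((len(X), 1))])   # n x (d+1)

    # Solve X_hat^T w_hat = y in the least-squares sense,
    # i.e. w_hat = (X_hat X_hat^T)^{-1} X_hat y
    w_hat, *_ = np.linalg.lstsq(X_hat_T, y, rcond=None)
    w, w0 = w_hat[:-1], w_hat[-1]

    def classify(x):
        return 1 if x @ w + w0 > 0 else 2   # y(x) > 0 -> C1, otherwise C2

    print(classify(np.array([0.5, 0.5])), classify(np.array([3.0, 2.5])))  # -> 1 2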
