BAYES AND NEAREST NEIGHBOR CLASSIFIERS
Matthieu R. Bloch
Tuesday, January 21, 2020
LOGISTICS

TAs and office hours:
- Monday: Mehrdad (TSRB 523a), 2pm-3:15pm
- Tuesday: TJ (VL C449 Cubicle D), 1:30pm-2:45pm
- Wednesday: Matthieu (TSRB 423), 12:00pm-1:15pm
- Thursday: Hossein (VL C449 Cubicle B), 10:45am-12:00pm
- Friday: Brighton (TSRB 523a), 12pm-1:15pm

Homework 1 posted on Canvas
- Due Wednesday January 29, 2020 (11:59PM EST) (Wednesday February 5, 2020 for DL)
RECAP: BAYES CLASSIFIER

What is the best (smallest) risk that we can achieve?
- Assume that we actually know $P_X$ and $P_{Y|X}$
- Denote the a posteriori class probabilities of $x \in \mathcal{X}$ by $\eta_k(x) \triangleq \mathbb{P}(Y = k | X = x)$
- Denote the a priori class probabilities by $\pi_k \triangleq \mathbb{P}(Y = k)$

Lemma (Bayes classifier). The classifier $h_B(x) \triangleq \operatorname*{argmax}_{k \in [0;K-1]} \eta_k(x)$ is optimal, i.e., for any classifier $h$, we have $R(h_B) \leq R(h)$. Moreover,
$$R(h_B) = \mathbb{E}_X\left[1 - \max_k \eta_k(X)\right].$$

Terminology:
- $h_B$ is called the Bayes classifier
- $R_B \triangleq R(h_B)$ is called the Bayes risk
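Not part of the original slides: a minimal numerical sketch of the lemma, assuming the posteriors $\eta_k(x)$ are known at a few points. The toy values and function names are made up for illustration.

```python
# Illustrative sketch: the Bayes classifier as an argmax over posterior class
# probabilities, and its conditional error 1 - max_k eta_k(x).
import numpy as np

def bayes_classifier(eta):
    """eta: array of shape (n_points, K) with eta[i, k] = P(Y = k | X = x_i)."""
    return np.argmax(eta, axis=1)

def conditional_bayes_risk(eta):
    """P(error | X = x_i) for the Bayes classifier: 1 - max_k eta_k(x_i)."""
    return 1.0 - np.max(eta, axis=1)

# Toy posteriors at three points (K = 3 classes); each row sums to one.
eta = np.array([[0.7, 0.2, 0.1],
                [0.3, 0.4, 0.3],
                [0.1, 0.1, 0.8]])
print(bayes_classifier(eta))        # [0 1 2]
print(conditional_bayes_risk(eta))  # [0.3 0.6 0.2]
```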
OTHER FORMS OF THE BAYES CLASSIFIER

$$h_B(x) \triangleq \operatorname*{argmax}_{k \in [0;K-1]} \eta_k(x) = \operatorname*{argmax}_{k \in [0;K-1]} \pi_k \, p_{X|Y}(x|k)$$

- For $K = 2$ (binary classification): log-likelihood ratio test
  $$\log \frac{p_{X|Y}(x|1)}{p_{X|Y}(x|0)} \gtrless \log \frac{\pi_0}{\pi_1}$$
- If all classes are equally likely, $\pi_0 = \pi_1 = \cdots = \pi_{K-1}$:
  $$h_B(x) \triangleq \operatorname*{argmax}_{k \in [0;K-1]} p_{X|Y}(x|k)$$

Example (Bayes classifier). Assume $X|Y=0 \sim \mathcal{N}(0,1)$ and $X|Y=1 \sim \mathcal{N}(1,1)$ with $\pi_0 = \pi_1$. The Bayes risk is $R(h_B) = \Phi\left(-\tfrac{1}{2}\right)$, with $\Phi \triangleq$ the standard Normal CDF.

In practice we do not know $P_X$ and $P_{Y|X}$.
- Plugin methods: use the data to learn the distributions and plug the result into the Bayes classifier.
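A hedged sketch of the Gaussian example above. With equal priors, the Bayes rule reduces to thresholding $x$ at $1/2$; the script checks the exact risk $\Phi(-1/2)$ against a Monte Carlo estimate. The sample size and seed are arbitrary choices.

```python
# Gaussian example: X|Y=0 ~ N(0,1), X|Y=1 ~ N(1,1), pi_0 = pi_1 = 1/2.
# The Bayes classifier decides 1 iff x > 1/2, and the Bayes risk is Phi(-1/2).
import numpy as np
from scipy.stats import norm

exact_risk = norm.cdf(-0.5)          # Phi(-1/2) ~ 0.3085

rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(0, 2, size=n)       # equally likely classes
x = rng.normal(loc=y, scale=1.0)     # X | Y = y ~ N(y, 1)
y_hat = (x > 0.5).astype(int)        # Bayes rule: decide 1 iff x > 1/2
mc_risk = np.mean(y_hat != y)

print(exact_risk, mc_risk)           # both close to 0.3085
```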
OTHER LOSS FUNCTIONS

We have focused on the risk $\mathbb{P}(h(X) \neq Y)$, obtained for the binary loss function $\mathbb{1}\{h(X) \neq Y\}$.

There are many situations in which this is not appropriate:
- Cost-sensitive classification: false alarm and missed detection may not be equivalent
  $$c_0 \, \mathbb{1}\{h(X) \neq 0 \text{ and } Y = 0\} + c_1 \, \mathbb{1}\{h(X) \neq 1 \text{ and } Y = 1\}$$
- Unbalanced data set: the probability of the largest class will dominate

More to explore in the next homework!
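As an illustration (not stated on the slide), the standard cost-weighted extension of the Bayes rule decides $h(x) = 1$ iff $c_1 \eta_1(x) > c_0 \eta_0(x)$. The sketch below compares it to the 0-1 rule on the Gaussian example from the previous slide; the costs $c_0, c_1$ are arbitrary.

```python
# Cost-sensitive classification on the Gaussian example: unequal costs move
# the decision threshold away from the 0-1 (posterior > 1/2) rule.
import numpy as np
from scipy.stats import norm

c0, c1 = 1.0, 5.0                    # missing class 1 is 5x more costly
rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=y, scale=1.0)

# Posterior of class 1 for pi_0 = pi_1 = 1/2
eta1 = norm.pdf(x, 1, 1) / (norm.pdf(x, 0, 1) + norm.pdf(x, 1, 1))
y_01   = (eta1 > 0.5).astype(int)                    # 0-1 loss rule
y_cost = (c1 * eta1 > c0 * (1 - eta1)).astype(int)   # cost-weighted rule

def cost_risk(y_hat):
    """Empirical expected cost of the decisions y_hat."""
    return np.mean(c0 * ((y_hat != 0) & (y == 0)) + c1 * ((y_hat != 1) & (y == 1)))

print(cost_risk(y_01), cost_risk(y_cost))  # cost-weighted rule has lower expected cost
```

Raising $c_1$ pushes the threshold toward class 0's region, trading more false alarms for fewer costly missed detections.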
NEAREST NEIGHBOR CLASSIFIER

Back to our training dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$.

The nearest-neighbor (NN) classifier is $h_{NN}(x) \triangleq y_{NN(x)}$, where $NN(x) \triangleq \operatorname*{argmin}_i \|x_i - x\|$.

Risk of the NN classifier conditioned on $x$ and $x_{NN(x)}$:
$$R_{NN}(x, x_{NN(x)}) = \sum_k \eta_k(x_{NN(x)})(1 - \eta_k(x)) = \sum_k \eta_k(x)(1 - \eta_k(x_{NN(x)})).$$

How well does the average risk $R_{NN} = R(h_{NN})$ compare to the Bayes risk for large $N$?

Lemma. Let $x$, $\{x_i\}_{i=1}^N \sim P_X$ be i.i.d. in a separable metric space $\mathcal{X}$. Let $x_{NN(x)}$ be the nearest neighbor of $x$. Then $x_{NN(x)} \to x$ with probability one as $N \to \infty$.

Theorem (Binary NN classifier). Let $\mathcal{X}$ be a separable metric space. Let $p(x|y=0)$, $p(x|y=1)$ be such that, with probability one, $x$ is either a continuity point of $p(x|y=0)$ and $p(x|y=1)$ or a point of non-zero probability measure. Then, as $N \to \infty$,
$$R(h_B) \leq R(h_{NN}) \leq 2 R(h_B)(1 - R(h_B)).$$
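A minimal NumPy sketch of the nearest-neighbor rule defined above; the data and the helper name `nn_classify` are illustrative, not from the slides.

```python
# 1-NN rule: predict the label of the closest training point (Euclidean norm).
import numpy as np

def nn_classify(x_train, y_train, x_query):
    """1-NN prediction for each row of x_query."""
    # Pairwise distances, shape (n_query, n_train)
    dists = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=2)
    nn_idx = np.argmin(dists, axis=1)   # NN(x) = argmin_i ||x_i - x||
    return y_train[nn_idx]

# Tiny illustration: two classes with means (0,0) and (2,2)
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=100)
x_train = rng.normal(size=(100, 2)) + 2.0 * y_train[:, None]
x_query = np.array([[0.1, -0.2], [2.3, 1.8]])
print(nn_classify(x_train, y_train, x_query))   # most likely [0 1]
```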
K NEAREST NEIGHBOR CLASSIFIER

Can drive the risk of the NN classifier to the Bayes risk by increasing the size of the neighborhood.

$h_{K\text{-NN}}$: assign a label to $x$ by taking a majority vote among the $K$ nearest neighbors.
$$\lim_{N \to \infty} \mathbb{E}\left[R(h_{K\text{-NN}})\right] \leq \left(1 + \sqrt{\frac{2}{K}}\right) R(h_B)$$

Definition. Let $\hat{h}_N$ be a classifier learned from a set of $N$ data points. The classifier is consistent if $\mathbb{E}[R(\hat{h}_N)] \to R_B$ as $N \to \infty$.

Theorem (Stone's Theorem). If $N \to \infty$, $K \to \infty$, and $K/N \to 0$, then $h_{K\text{-NN}}$ is consistent.

Choosing $K$ is a problem of model selection.
- Do not choose $K$ by minimizing the empirical risk on the training set: for $K = 1$,
  $$\hat{R}_N(h_{1\text{-NN}}) = \frac{1}{N} \sum_{i=1}^N \mathbb{1}\{h_{1\text{-NN}}(x_i) \neq y_i\} = 0.$$
- Need to rely on estimates from model selection techniques (more later!)
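To make the last point concrete, here is a sketch of choosing $K$ on held-out data rather than on the training set, using scikit-learn's `KNeighborsClassifier`; the synthetic data, split, and candidate values of $K$ are arbitrary assumptions.

```python
# Training risk of 1-NN is always 0, so compare candidate K on a validation set.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 2)) + 1.5 * y[:, None]      # two overlapping Gaussian classes

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for K in [1, 3, 11, 51]:
    clf = KNeighborsClassifier(n_neighbors=K).fit(X_tr, y_tr)
    # Training error is misleading (exactly 0 for K = 1); validation error is not.
    print(K, 1.0 - clf.score(X_tr, y_tr), 1.0 - clf.score(X_val, y_val))
```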
K NEAREST NEIGHBOR CLASSIFIER

Given enough data, a $K$-NN classifier will do just as well as pretty much any other method.
- The number of samples $N$ can be huge (especially in high dimension).
- The choice of $K$ matters a lot; model selection is important.
- Finding the nearest neighbors out of millions of datapoints is still computationally hard.
  - $k$-d trees help, but are still expensive in high dimension, when $N \approx d$.

We will discuss other classifiers that make more assumptions about the underlying data.
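As a sketch of the computational point, SciPy's `cKDTree` can be used for fast neighbor queries on large, low-dimensional data; the sizes and data below are illustrative only.

```python
# K-NN with a k-d tree: build once, then query K neighbors per test point.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200_000, 3))        # large, low-dimensional training set
y_train = rng.integers(0, 2, size=200_000)

tree = cKDTree(X_train)                        # build the k-d tree once
x_query = rng.normal(size=(5, 3))
dists, idx = tree.query(x_query, k=11)         # 11 nearest neighbors per query point

# Majority vote among the K neighbors (binary labels)
y_hat = (y_train[idx].mean(axis=1) > 0.5).astype(int)
print(y_hat)
```

For high-dimensional features, tree-based search degrades toward brute force, which is one reason the slide points ahead to classifiers that make stronger assumptions about the data.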