  1. AdaBoost and RankBoost. Josiah Yoder, School of Electrical and Computer Engineering. RVL Seminar, 4 Feb 2011. 1/31

  2. Overview. ROC & Precision-Recall curves; AdaBoost; RankBoost. (Not really any of my research.) 2/31

  3. ROC & Precision-Recall. ROC (Receiver Operating Characteristic) curve: consistent for comparing performance; good with unequal costs; good for visualizing the maximum performance attainable at any given threshold. Precision-Recall curve: good for evaluating ranked search results; good with a skewed class distribution. 3/31

  4. Performance Metrics. The confusion matrix, adapted from Fawcett (2003):

                      Actual P          Actual N
     Estimated P      True Positive     False Positive
     Estimated N      False Negative    True Negative

     4/31

  5. The Axes. ROC curve: tp rate = TP / P, fp rate = FP / N. P-R (Precision-Recall) curve: precision = TP / (TP + FP), recall = TP / P. Both are computed from the same confusion matrix of actual class (P, N) vs. estimated class. 5/31

  6. Discrete Classifiers. [Figure: ROC points and P-R points for five discrete classifiers A-E, adapted from Fawcett (2003). ROC axes: tp rate = TP / P vs. fp rate = FP / N; P-R axes: precision vs. recall. Class balance: 50% P, 50% N.] A discrete classifier gives a single point on each plot. 6/31

  7. Numeric Classifiers. Output: a numeric value (score, probability, rank, ...). Thresholding the output gives a discrete classifier, i.e., a single point on the ROC or P-R curve. "Sweeping" the threshold from ∞ to −∞ traces out a curve. (There is a more efficient way to do it.) Note that ROC and P-R curves are parametric in the threshold. 7/31
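The "more efficient way" alluded to above is to sort the examples by score once and walk down the list, so each prefix corresponds to one threshold setting. A minimal sketch (function and variable names are mine, not from the slides):

```python
def roc_points(labels, scores):
    """Trace ROC points by sweeping the threshold from +inf to -inf.

    Sorting by descending score means each prefix of the sorted list is
    the predicted-positive set for one threshold; accumulate TP/FP
    counts as we go.  Ties are not merged here, for simplicity.
    """
    P = sum(1 for y in labels if y == 1)
    N = len(labels) - P
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]                 # threshold above every score
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))   # (fp rate, tp rate)
    return points
```

With the data of slide 8 (GT 1 0 1 0 1, scores 0.9 0.7 0.6 0.5 0.2) this yields the staircase from (0, 0) to (1, 1), one point per example, without re-thresholding.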

  8. Discrete Classifiers. [Figure: ROC curve and P-R curve traced by sweeping a threshold over the ranked examples below, adapted from Fawcett (2003). Class balance: 50% P, 50% N.]

     GT:    1    0    1    0    1
     score: 0.9  0.7  0.6  0.5  0.2

     8/31

  9. Discrete Classifiers. [Figure: the same ROC and P-R curves with a skewed class balance (20% P, 80% N), adapted from Fawcett (2003).]

     GT:    1    0    0    0    0    1    0    0    0    0    1
     score: 0.9  0.7  0.7  0.7  0.7  0.6  0.5  0.5  0.5  0.5  0.2

     9/31
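The same threshold sweep yields P-R points, and running it on both slide datasets shows how skew hits precision while leaving the ROC picture unchanged. A sketch (helper name is mine):

```python
def pr_points(labels, scores):
    """Trace (recall, precision) points by sweeping the threshold,
    visiting examples in order of descending score (ties unmerged)."""
    P = sum(1 for y in labels if y == 1)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    pts = []
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        pts.append((tp / P, tp / (tp + fp)))
    return pts

# Balanced data (slide 8) vs. skewed data (slide 9):
balanced = pr_points([1, 0, 1, 0, 1], [0.9, 0.7, 0.6, 0.5, 0.2])
skewed = pr_points([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
                   [0.9, 0.7, 0.7, 0.7, 0.7, 0.6, 0.5, 0.5, 0.5, 0.5, 0.2])
# At full recall, precision is 3/5 on the balanced data but only 3/11
# on the skewed data: the extra negatives all become false positives.
```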

  10. Which Is Better? ROC doesn't change if the ratio P/N changes, which makes it easy to visualize maximum performance as the ratio of true to false instances changes. P-R illustrates search results, and clearly shows the effect of false positives on performance. 10/31

  11. AdaBoost and RankBoost. AdaBoost: classification, evaluated with ROC. RankBoost: ranking, evaluated with P-R. 11/31

  12. AdaBoost. What is boosting? A linear combination of weak classifiers, with proven bounds on performance. A weak classifier is one that does better than random guessing. In AdaBoost, h(x) is a weak binary classifier, h : X → {−1, 1}, and H(x) is the strong classifier, H(x) = sign(∑_{t=1}^{T} α_t h_t(x)). 12/31

  13. Nonlinear Separation. The combination of weak classifiers is linear, but the weak classifiers themselves are non-linear. Example: for x = (x_1, x_2), H(x) = sign(0.5·[[x_1 > 1]] + 0.5·[[x_2 > 2]]), where [[·]] is the 0/1 indicator. 13/31
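The slide's example can be checked directly. A tiny sketch (treating sign(0) as −1, an assumption the slide leaves open):

```python
def H(x1, x2):
    """The slide's two-stump vote: sign(0.5*[[x1 > 1]] + 0.5*[[x2 > 2]]),
    with [[.]] the 0/1 indicator.  The positive region is the union of
    two half-planes, which no single line can separate from the rest,
    even though H is a linear combination of the indicator stumps."""
    s = 0.5 * (1 if x1 > 1 else 0) + 0.5 * (1 if x2 > 2 else 0)
    return 1 if s > 0 else -1   # sign(0) taken as -1 here
```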

  14. How to Pick Weak Classifiers and Their Weights. Greedy: repeatedly pick the best weak classifier, focusing on the points misclassified by the previous classifier (not on overall performance). Following frames from http://cseweb.ucsd.edu/~yfreund/adaboost/index.html 14/31

  15. How to Reweight the Training Data. D_1 = 1/m, where m is the number of training points. D_{t+1}(i) = D_t(i) · ♠ / Z_t. What to use for ♠? (Dropping the t subscript for a while...) Idea #1: ♠ = [[y_i ≠ h(x_i)]]. Problem: correctly classified data is totally forgotten. Idea #2: ♠ = a if y_i ≠ h(x_i), and ♠ = 1/a if y_i = h(x_i). Ta-da! This is what AdaBoost does! 20/31
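Idea #2 in code, on a toy weight vector (a sketch; the weights and the choice a = 2 are illustrative):

```python
def reweight(D, correct, a):
    """Multiply a point's weight by a if it was misclassified and by
    1/a if it was classified correctly (a > 1), then divide by Z so
    the weights sum to 1 again.  Hard examples gain mass, but easy
    ones are only shrunk, never forgotten -- fixing Idea #1's flaw."""
    scaled = [d * (1.0 / a if ok else a) for d, ok in zip(D, correct)]
    Z = sum(scaled)
    return [w / Z for w in scaled]

# Four equally weighted points; only the last one was misclassified.
D2 = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False], a=2.0)
```

After the update the misclassified point carries 4/7 of the mass and each correct point 1/7, so the next weak classifier is pushed toward the hard example.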

  16. AdaBoost's ♠. Idea #2: ♠ = a if y_i ≠ h(x_i), and ♠ = 1/a if y_i = h(x_i). AdaBoost: ♠ = e^α if y_i ≠ h(x_i), and ♠ = e^{−α} if y_i = h(x_i); or equivalently, ♠ = exp[−α y_i h(x_i)]. What is α? How do you choose it? We will come back to this... We shall see that the classifier has nice properties no matter how we choose α. 21/31

  17. The Basic AdaBoost Algorithm. (t subscripts are back.) D_1 = 1/m. For t = 1, ..., T: choose h_t based on the data weighted by D_t; reweight

     D_{t+1}(i) = D_t(i) · exp[−α_t y_i h_t(x_i)] / Z_t,  where  Z_t = ∑_i D_t(i) · exp[−α_t y_i h_t(x_i)].

     Output H(x) = sign(f(x)), where f(x) = ∑_t α_t h_t(x). 22/31
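A compact end-to-end sketch of the loop, using decision stumps as the weak learners (the stump pool and the ε clamp are my assumptions; the slides leave the weak learner open here, and α_t uses the closed form given on slide 20):

```python
import math

def stump_predict(x, dim, thresh, sign):
    """A decision stump: returns sign if x[dim] > thresh, else -sign."""
    return sign if x[dim] > thresh else -sign

def best_stump(X, y, D):
    """Exhaustively pick the stump with the lowest D-weighted error."""
    best = None
    for dim in range(len(X[0])):
        for thresh in sorted({x[dim] for x in X}):
            for sign in (1, -1):
                err = sum(d for x, yi, d in zip(X, y, D)
                          if stump_predict(x, dim, thresh, sign) != yi)
                if best is None or err < best[0]:
                    best = (err, dim, thresh, sign)
    return best

def adaboost(X, y, T):
    m = len(X)
    D = [1.0 / m] * m                      # D_1 = 1/m
    ensemble = []
    for _ in range(T):
        eps, dim, thresh, sign = best_stump(X, y, D)
        eps = max(eps, 1e-12)              # avoid log of 0 on a perfect stump
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, dim, thresh, sign))
        # Reweight: D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h_t(x_i)) / Z_t
        D = [d * math.exp(-alpha * yi * stump_predict(x, dim, thresh, sign))
             for d, x, yi in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]

    def H(x):
        f = sum(a * stump_predict(x, dim, th, s) for a, dim, th, s in ensemble)
        return 1 if f >= 0 else -1
    return H
```

On the slide-13-style data (positive iff x_1 > 1 or x_2 > 2), no single stump is perfect, but three rounds of boosting classify all four corner points correctly.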

  18. Why Can We Just Use α as the Weight? It allows a nice bound on the error:

     (1/m) ∑_i [[H(x_i) ≠ y_i]] ≤ (1/m) ∑_i exp[−y_i f(x_i)] = ∏_t Z_t

     which holds pointwise because [[H(x_i) ≠ y_i]] = [[y_i f(x_i) ≤ 0]] ≤ exp[−y_i f(x_i)]. That looks pretty, but what does it mean? Z_t is the normalizing constant for the weights on each point. Roughly, Z_t ≈ ∑_i D_t(i)·[[y_i ≠ h_t(x_i)]], the cost of misclassifying the weighted points in the t-th round. 23/31
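The pointwise step of the bound is easy to sanity-check numerically; a sketch where the margin z = y_i f(x_i) is the only input:

```python
import math

def bound_holds(z):
    """Check [[z <= 0]] <= exp(-z) for the margin z = y*f(x): a mistake
    (z <= 0) scores 1 on the left while exp(-z) >= 1; a correct
    positive margin scores 0 on the left while exp(-z) > 0."""
    lhs = 1 if z <= 0 else 0
    return lhs <= math.exp(-z)

# The inequality holds for any margin, negative, zero, or positive.
all_hold = all(bound_holds(z) for z in [-3.0, -0.5, 0.0, 0.5, 3.0])
```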

  19. That's a Pretty Nice Bound. How Do We Know It's True? Unroll the weight update:

     D_{T+1}(i) = D_1(i) · ∏_t exp[−α_t y_i h_t(x_i)] / ∏_t Z_t
                = (1/m) exp[−∑_t α_t y_i h_t(x_i)] / ∏_t Z_t
                = (1/m) exp[−y_i f(x_i)] / ∏_t Z_t

     Since the weights sum to one, 1 = ∑_i D_{T+1}(i) = (1/m) ∑_i exp[−y_i f(x_i)] / ∏_t Z_t, so (1/m) ∑_i exp[−y_i f(x_i)] = ∏_t Z_t, and therefore err_train ≤ ∏_t Z_t. 24/31

  20. We Can Almost Implement This. Now What About α? Choose α_t to minimize Z_t: setting ∂Z_t/∂α_t = 0 gives

     α_t = (1/2) ln((1 − ε_t) / ε_t)

     where ε_t is the weighted error of the classifier at the t-th stage. To compute the weak learners, create an ROC curve using each dimension of the data alone; take the best point on the ROC curve and use that as your classifier. This completes the algorithm. Thinking back to ROC curves: the goal of AdaBoost is to minimize classifier error, which is the same as pushing the best operating point of the ROC curve as close to the top-left corner as possible. How can we maximize the area under the P-R curve (MAP, mean average precision)? 25/31
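The minimization can be verified numerically. Writing Z as a function of α when the weighted error is ε (correct mass (1 − ε) scaled by e^{−α}, misclassified mass ε by e^{α}), the closed form above is its minimizer, with minimum value 2√(ε(1 − ε)):

```python
import math

def Z(alpha, eps):
    """Z_t as a function of alpha, for a round with weighted error eps."""
    return (1 - eps) * math.exp(-alpha) + eps * math.exp(alpha)

def best_alpha(eps):
    """alpha_t = (1/2) ln((1 - eps) / eps), from setting dZ/dalpha = 0."""
    return 0.5 * math.log((1 - eps) / eps)

eps = 0.2
a_star = best_alpha(eps)
# Nudging alpha away from the closed form in either direction raises Z.
assert Z(a_star, eps) < Z(a_star + 0.1, eps)
assert Z(a_star, eps) < Z(a_star - 0.1, eps)
```

Since each round multiplies the bound by Z_t = 2√(ε_t(1 − ε_t)) < 1 whenever ε_t < 1/2, the training-error bound shrinks geometrically as long as every weak learner beats random guessing.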

  21. RankBoost. The goal is to find an ordering, not the "quality" of each point. RankBoost does not attempt to directly maximize MAP, but is successful at doing so anyhow. Error is defined in terms of the number of data pairs which are out of order: D(x_0, x_1) = c if x_0 should be ranked below x_1, and D(x_0, x_1) = 0 otherwise, with ∑_{x_0, x_1} D(x_0, x_1) = 1. A more complicated D can be used to emphasize really important pairs. 26/31

  22. Example. E.g. 1: true rank should be rank(x_1) < rank(x_2) < rank(x_3). E.g. 2: true rank should be rank(x_1) < rank(x_2) = rank(x_3).

     E.g. 1:  D    x_1  x_2  x_3        E.g. 2:  D    x_1  x_2  x_3
              x_1   0   1/3  1/3                 x_1   0   1/2  1/2
              x_2   0    0   1/3                 x_2   0    0    0
              x_3   0    0    0                  x_3   0    0    0

     (In e.g. 2 the tied pair contributes nothing, so only two pairs are crucial; uniform weights summing to 1 give 1/2 each.) 27/31
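Building D from true ranks can be sketched as below. The uniform weight c = 1/(number of crucial pairs) follows from the normalization ∑ D = 1 on the previous slide, and the orientation matches the example matrices, with mass at D(x_i, x_j) whenever rank(x_i) < rank(x_j); the function name is mine:

```python
def rank_matrix(true_ranks):
    """Pair-weight matrix for RankBoost: D[i][j] = c for every crucial
    pair with true_ranks[i] < true_ranks[j], zero elsewhere (tied items
    contribute no pairs), with c chosen so the entries sum to 1."""
    n = len(true_ranks)
    pairs = [(i, j) for i in range(n) for j in range(n)
             if true_ranks[i] < true_ranks[j]]
    c = 1.0 / len(pairs)
    D = [[0.0] * n for _ in range(n)]
    for i, j in pairs:
        D[i][j] = c
    return D

D1 = rank_matrix([1, 2, 3])   # e.g. 1: three crucial pairs, 1/3 each
D2 = rank_matrix([1, 2, 2])   # e.g. 2: two crucial pairs, 1/2 each
```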

  23. Basic RankBoost Algorithm. Given: rank matrix D with the true ranks of the training data. D_1 = D. For t = 1, ..., T: train h_t : X → {−1, 1}; choose α_t (more suspense...); update

     D_{t+1}(x_0, x_1) = D_t(x_0, x_1) · exp[α_t (h_t(x_0) − h_t(x_1))] / Z_t

     Final ranking: H(x) = ∑_t α_t h_t(x). 28/31
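A runnable sketch of the loop. The pool of weak rankers is my assumption: simple threshold functions valued in {0, 1} rather than the slide's {−1, 1}, which keeps r within (−1, 1) so the α formula of the next slide applies directly; α_t uses that formula, and the weak ranker is chosen by maximizing r:

```python
import math

def rankboost(items, D, weak_rankers, T):
    """Minimal RankBoost sketch following the slide's loop.  D[i][j]
    carries the weight of the pair (items[i], items[j]); reading the
    update below, a ranker is rewarded when h(items[j]) > h(items[i])
    on weighted pairs.  weak_rankers is a pool of h: item -> {0, 1}."""
    n = len(items)
    Dt = [row[:] for row in D]

    def r(h):
        # r = sum over pairs of D(x0, x1) * (h(x1) - h(x0))
        return sum(Dt[i][j] * (h(items[j]) - h(items[i]))
                   for i in range(n) for j in range(n))

    ensemble = []
    for _ in range(T):
        ht = max(weak_rankers, key=r)              # maximize r
        rv = max(min(r(ht), 1 - 1e-12), -1 + 1e-12)  # guard the log
        alpha = 0.5 * math.log((1 + rv) / (1 - rv))
        ensemble.append((alpha, ht))
        # D_{t+1}(x0, x1) = D_t(x0, x1) * exp[alpha (h(x0) - h(x1))] / Z_t
        Dt = [[Dt[i][j] * math.exp(alpha * (ht(items[i]) - ht(items[j])))
               for j in range(n)] for i in range(n)]
        Z = sum(map(sum, Dt))
        Dt = [[w / Z for w in row] for row in Dt]

    return lambda x: sum(a * h(x) for a, h in ensemble)   # H(x)

# Rank three numbers by value; D says larger items should outrank smaller.
items = [0, 1, 2]
D = [[0, 1/3, 1/3], [0, 0, 1/3], [0, 0, 0]]
pool = [lambda x, c=c: 1 if x > c else 0 for c in (-0.5, 0.5, 1.5)]
H = rankboost(items, D, pool, T=2)
```

After two rounds H orders the items correctly; notice that the correctly ordered pairs are downweighted each round, steering later rankers toward the pairs still out of order.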

  24. Bounds and α. As before, err_train ≤ ∏_t Z_t. Selection of α is different: now

     α = (1/2) ln((1 + r) / (1 − r)),  where  r = W_− − W_+ = ∑ D(x_0, x_1)(h(x_1) − h(x_0))

     and W_+ is the weight of the pairs for which h(x_0) > h(x_1) and W_− is the weight of the pairs for which h(x_0) < h(x_1) (the two expressions for r agree for h valued in {0, 1}). Selection of the weak classifier also needs to be redone: it turns out we need to maximize r, which can be written as

     r = ∑_x h(x) s(x) v(x) · ∑_{x′ : s(x′) ≠ s(x)} v(x′)

     where v(x) should have been given above, and s(x) = +1 if x ∈ X_0, −1 if x ∈ X_1. 29/31

  25. Sorry! I ran out of time! 30/31

  26. Acknowledgements. I wish to thank my advisor, Prof. Avi Kak, for the helpful observation that ROC curves and PR curves are parametric. His presentation on retrieval in the summer of 2011 also presents an entertaining comparison of P-R curves for random and ideal retrieval. I also wish to thank my lab members for pointing out errors in this presentation, which have hopefully all been corrected in this version. Note that we covered RankBoost in 5 or 10 minutes, so errors probably remain on slides 11 through 30. 31/31
