  1. Selective sampling algorithms for cost-sensitive multiclass prediction
     Alekh Agarwal (Microsoft Research)

  4. Why active learning?
     Standard setting: receive randomly sampled examples
     Not all data points are equally informative!
     Labelled data points are expensive, unlabelled cheap:
       Object recognition: images need human labelling
       Protein interaction prediction: lab test for each protein pair
       Web ranking: human editors to label relevant pages

  7. What is active learning?
     Sequentially query points with label uncertainty
     Like random search vs. binary search
     Example: sampling near the decision boundary for linear separators

  8. Online selective sampling paradigm
     Filter examples online, querying only a subset of labels; examples are not revisited.
     [Diagram: for each x_t the algorithm predicts ŷ_t; if Z_t = 1 it queries and observes y_t, if Z_t = 0 it does not observe y_t.]

  10. Prior work
      Bulk of work in the binary setting
      Agnostic active learning: Atlas, Balcan, Beygelzimer, Cohn, Dasgupta, Hanneke, Hsu, Ladner, Langford, . . .
      Linear selective sampling: Cesa-Bianchi, Gentile and co-authors
      Empirical work in the multiclass setting: Jain and Kapoor (2009), Joshi et al. (2012), . . .
      Relatively little theoretical work

  12. This talk
      Efficient algorithm in a multiclass GLM setting
      Analysis of regret and label complexity
      Sharp rates under a Tsybakov-type noise condition: regret ranges between Õ(1/√N_T) (noisy) and Õ(exp(−c_0 N_T)) (hard-margin)
      Safety guarantee under model mismatch
      Numerical simulations

  14. Multiclass prediction
      x ∈ R^d, y ∈ {1, 2, . . . , K}; only one label per example
      Cost matrix C ∈ R^{K×K}: penalty C_ij for predicting label j when the true label is i

          C      Cat  Dog  Horse
          Cat     0    1    10
          Dog     1    0    10
          Horse  10   10     0

  15. Structured cost matrices
      Often have block- or tree-structured cost matrices in applications
      [Figure: heatmaps of a 0/1 cost matrix, a block-structured cost matrix, and a tree-structured cost matrix]

  16. Multiclass GLM
      Weight matrix W* ∈ R^{K×d}, convex function Φ : R^K → R

      Definition (Multiclass GLM). For every x ∈ R^d, the class-conditional probabilities follow the model
          P(Y = i | W*, x) = (∇Φ(W* x))_i

      [Diagram: x ∈ R^d is mapped to W* x ∈ R^K, and ∇Φ maps W* x to P(Y | W*, x).]

  17. Multiclass GLM intuition
      Binary case (K = 2): Φ is convex ⇐⇒ the link function P(y = 1 | w, x) is monotone increasing in w^T x. E.g.: logistic, linear, . . .
      [Figure: sigmoid plot of P(y = 1 | w, x) against w^T x]

  18. Example: multiclass logistic
      Define Φ(v) = log(Σ_{i=1}^K exp(v_i))
      Obtain (∇Φ(v))_i = exp(v_i) / Σ_{j=1}^K exp(v_j)
      Yields the multinomial logit noise model
          P(Y = i | W, x) = exp(x^T W_i) / Σ_{j=1}^K exp(x^T W_j)
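The link function on this slide is just the softmax map. A minimal sketch in numpy (the weight matrix and input below are hypothetical, chosen only for illustration):

```python
import numpy as np

def softmax_link(v):
    """Gradient of Phi(v) = log(sum_i exp(v_i)): the softmax map.

    Returns (grad Phi(v))_i = exp(v_i) / sum_j exp(v_j), computed
    stably by shifting v by its maximum before exponentiating.
    """
    v = np.asarray(v, dtype=float)
    z = v - v.max()            # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def glm_probs(W, x):
    """Class probabilities P(Y = i | W, x) under the multinomial logit GLM."""
    return softmax_link(W @ x)

# Hypothetical weights: K = 3 classes, d = 2 features
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
x = np.array([2.0, -1.0])
p = glm_probs(W, x)           # a probability vector summing to 1
```

Here W @ x gives the scores (2, −1, 0.5), so the first class receives the largest probability.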

  20. Loss function
      Given Φ, define the loss ℓ(W x, y) = Φ(W x) − y^T W x
      Convex since Φ is convex
      Fisher consistent: W* minimizes E[ℓ(W x, y) | W*, x] for each x, since (with y one-hot, so E[y | W*, x] = ∇Φ(W* x))
          E[∇ℓ(W x, y) | W*, x] = E[∇Φ(W x) | W*, x] − E[∇(y^T W x) | W*, x]
                                = ∇Φ(W x) x^T − E[y | W*, x] x^T
                                = ∇Φ(W x) x^T − ∇Φ(W* x) x^T,
      which vanishes at W = W*
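The Fisher-consistency identity above can be checked numerically for the multinomial logit model: averaging the loss gradient over y ~ P(· | W*, x) at W = W* gives zero. A small sketch with hypothetical random W* and x:

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

# E[grad l(Wx, y) | W*, x] = (softmax(Wx) - softmax(W*x)) x^T,
# since E[y | W*, x] = softmax(W*x) for one-hot y. At W = W* this is 0.
rng = np.random.default_rng(1)
W_star = rng.normal(size=(3, 4))   # hypothetical true weights, K=3, d=4
x = rng.normal(size=4)
p_star = softmax(W_star @ x)

# Expected gradient at W = W*: average (softmax(W*x) - e_y) x^T over y ~ p*
expected_grad = sum(p_star[y] * np.outer(p_star - np.eye(3)[y], x)
                    for y in range(3))
# sum_y p*[y] e_y = p*, so the two terms cancel (up to float rounding)
```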

  22. Score function
      Given a cost matrix C and Φ, define
          S_W^x(i) = − Σ_{j=1}^K C(j, i) (∇Φ(W x))_j,
      summing the cost of predicting i when the true label is j against the probability of j
      This is the negative expected cost of predicting i, when W is the true weight matrix
      Maximum score ⇐⇒ minimum expected cost
      The Bayes predictor outputs arg max_{i=1,...,K} S_{W*}^x(i)
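The score reduces to a matrix-vector product: S(i) = −(C^T p)_i for probability vector p. A sketch using the cat/dog/horse cost matrix from the earlier slide, with a hypothetical probability vector chosen so that the cost-sensitive prediction differs from the most probable class:

```python
import numpy as np

def scores(C, probs):
    """Score S(i) = -sum_j C[j, i] * probs[j]: the negative expected
    cost of predicting label i under class probabilities `probs`."""
    return -C.T @ probs

# Cost matrix from the cat/dog/horse slide (rows: true label, cols: prediction)
C = np.array([[ 0.0,  1.0, 10.0],
              [ 1.0,  0.0, 10.0],
              [10.0, 10.0,  0.0]])

probs = np.array([0.30, 0.32, 0.38])   # hypothetical model probabilities
S = scores(C, probs)
y_hat = int(np.argmax(S))              # cost-sensitive (Bayes) prediction
```

Here "horse" is the most probable class, but mistaking cat/dog for horse costs 10, so the maximum-score prediction is "dog": the score trades probability against cost rather than picking the mode.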

  28. CS-Selectron algorithm with general query function
      Input: query function Q, cost matrix C, parameter γ > 0
      Initialize: W_1 = 0, M_1 = γI/γ_ℓ
      For t = 1, 2, . . . , T:
          Observe x_t, set H_{t+1} = H_t ∪ {x_t}
          Predict ŷ_t = arg max_{i=1,...,K} S_{W_t}^{x_t}(i)
          If Q(x_t, H_t) = 1, then
              Query label y_t; set Z_t = 1, H_{t+1} = H_t ∪ {y_t}, M_{t+1} = M_t + x_t x_t^T
              Update W_{t+1} = arg min_{W ∈ 𝒲} { Σ_{s=1}^t Z_s ℓ(W x_s, y_s) + γ ‖W‖_F^2 }
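The loop above can be sketched in a few dozen lines. This is only a schematic: the regularized ERM update over the queried examples is approximated here by a few averaged gradient steps rather than solved exactly, the query function is left abstract, and the data and oracle in the usage snippet are hypothetical:

```python
import numpy as np

def softmax(v):
    z = np.exp(v - v.max())
    return z / z.sum()

def scores(C, W, x):
    # S(i) = -sum_j C[j, i] * (softmax(Wx))_j
    return -C.T @ softmax(W @ x)

def cs_selectron(xs, label_oracle, C, K, d, query_fn,
                 gamma=1.0, n_gd_steps=50, lr=0.1):
    """Schematic CS-Selectron loop for the multiclass logit GLM.

    The update W_{t+1} = argmin_W sum_s Z_s l(W x_s, y_s) + gamma ||W||_F^2
    is approximated by averaged gradient steps on the queried examples.
    """
    W = np.zeros((K, d))
    M = gamma * np.eye(d)              # confidence matrix, M_1 proportional to I
    queried = []                       # (x_s, y_s) pairs with Z_s = 1
    preds, n_queries = [], 0
    for x in xs:
        y_hat = int(np.argmax(scores(C, W, x)))   # cost-sensitive prediction
        preds.append(y_hat)
        if query_fn(x, M):                        # Q(x_t, H_t) = 1 ?
            y = label_oracle(x)                   # pay for and observe y_t
            n_queries += 1
            M = M + np.outer(x, x)                # M_{t+1} = M_t + x_t x_t^T
            queried.append((x, y))
            for _ in range(n_gd_steps):           # approximate ERM update
                G = 2 * gamma * W                 # gradient of the regularizer
                for xq, yq in queried:
                    p = softmax(W @ xq)
                    G += np.outer(p - np.eye(K)[yq], xq)  # grad of l(Wx, y)
                W = W - lr * G / len(queried)
    return preds, n_queries

# Tiny synthetic run (hypothetical data) with a BBQ-style query rule
rng = np.random.default_rng(0)
xs = rng.normal(size=(30, 2))
C01 = np.array([[0.0, 1.0], [1.0, 0.0]])          # 0/1 cost matrix
preds, n_queries = cs_selectron(
    xs, label_oracle=lambda x: int(x[0] > 0), C=C01, K=2, d=2,
    query_fn=lambda x, M: float(x @ np.linalg.solve(M, x)) >= 0.001)
```

The point of the structure is visible even in this sketch: predictions are made on every round, but the model and the confidence matrix M change only on queried rounds.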

  29. Algorithm intuition
      Low-regret algorithm on queried examples
      Update ensures ‖W_t − W*‖_{M_t} is small
      Query function ensures low regret on rounds with no queries

  31. Query function: the BBQ_ε rule
      Variant of Cesa-Bianchi et al. (2009):
          Q(x_t, H_t) = 𝟙{ η_ε ‖x_t‖²_{M_t^{-1}} ≥ ε² }
      Note: ‖W* x_t − W_t x_t‖_2 ≤ ‖W* − W_t‖_{M_t} ‖x_t‖_{M_t^{-1}}
      Queries points with large confidence intervals on the predictions
      [Figure: confidence ellipsoids of M_t; Q(x_t, H_t) = 1 when x_t lies outside the well-explored ellipsoid, Q(x_t, H_t) = 0 when it lies inside.]
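The rule compares the squared Mahalanobis norm of x_t in the M_t^{-1} metric against a threshold. A minimal sketch (ε and η values are hypothetical; the exact threshold constants come from the analysis in the talk):

```python
import numpy as np

def bbq_query(x, M, eps, eta):
    """BBQ-style rule: query iff eta * ||x||^2_{M^{-1}} >= eps^2,
    i.e. when the confidence interval for the prediction at x is wide."""
    mahal = float(x @ np.linalg.solve(M, x))   # ||x||^2 in the M^{-1} norm
    return eta * mahal >= eps ** 2

# A direction already well covered by M is not queried again
M = np.eye(2)
x = np.array([1.0, 0.0])
q1 = bbq_query(x, M, eps=0.5, eta=1.0)   # ||x||^2_{M^-1} = 1 >= 0.25: query
M2 = M + 100.0 * np.outer(x, x)          # after many queries along x
q2 = bbq_query(x, M2, eps=0.5, eta=1.0)  # shrinks to 1/101 < 0.25: skip
```

Adding x x^T to M shrinks the M^{-1} norm along x, which is exactly why repeated queries in the same direction stop once the confidence interval there is small.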
