

  1. Active Learning with Disagreement Graphs
  Corinna Cortes¹, Giulia DeSalvo¹, Claudio Gentile¹, Mehryar Mohri¹,², Ningshan Zhang³
  ¹ Google Research  ² Courant Institute, NYU  ³ NYU
  ICML, June 12, 2019

  2. On-line Active Learning Setup
  ◮ At each round t ∈ [T], receive an unlabeled point x_t ∼ D_X, drawn i.i.d.
  ◮ Decide whether to request the label; if requested, receive y_t.
  ◮ After T rounds, return a hypothesis h_T ∈ H.
  Objective:
  ◮ Generalization error: an accurate predictor h_T, i.e. small expected loss R(h_T) = E_{(x,y)}[ℓ(h_T(x), y)], close to the best-in-class h* = argmin_{h∈H} R(h).
  ◮ Label complexity: few label requests.
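A minimal sketch of this protocol in Python; the learner interface (decide_query, update, final_hypothesis) and the helpers sample_unlabeled and query_oracle are hypothetical placeholders, not names from the paper:

```python
def online_active_learning(T, learner, sample_unlabeled, query_oracle):
    """Run the on-line active learning protocol for T rounds (sketch)."""
    for t in range(1, T + 1):
        x_t = sample_unlabeled()          # x_t ~ D_X, i.i.d.
        if learner.decide_query(x_t):     # decide whether to request the label
            y_t = query_oracle(x_t)       # if requested, receive y_t
            learner.update(x_t, y_t)
    return learner.final_hypothesis()     # h_T in H
```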

  3. Disagreement-based Active Learning
  Key idea: request the label when there is some disagreement among hypotheses.
  Examples:
  ◮ Separable case: CAL (Cohn et al., 1994).
  ◮ Non-separable case: A² (Balcan et al., 2006), DHM (Dasgupta et al., 2008).
  ◮ IWAL (Beygelzimer et al., 2009).
  Can we improve upon existing disagreement-based algorithms, such as IWAL?
  ◮ Better guarantees?
  ◮ Leverage average disagreements?

  4. This talk
  ◮ IWAL-D algorithm: IWAL enhanced with a disagreement graph.
  ◮ IZOOM algorithm: IWAL-D enhanced with zooming-in.
  ◮ Better generalization and label complexity guarantees.
  ◮ Experimental results.

  5. Disagreement Graph (D-Graph)
  ◮ Vertices: the hypotheses in H (a finite hypothesis set).
  ◮ Edges: fully connected. The edge between h, h' ∈ H is weighted by their expected disagreement:
     L(h, h') = max_{y∈Y} E_{x∼D_X} [ |ℓ(h(x), y) − ℓ(h'(x), y)| ].
     L is symmetric, and ℓ ≤ 1 ⇒ L ≤ 1.
  ◮ The D-Graph can be accurately estimated using unlabeled data.
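Because the label y is maximized over rather than drawn, each edge weight depends only on x ∼ D_X and can therefore be estimated from an unlabeled sample. A minimal Monte Carlo sketch, assuming hypotheses are callables, labels is the finite label set Y, and loss maps to [0, 1] (all names are illustrative):

```python
import numpy as np

def estimate_disagreement(h, h2, unlabeled_xs, labels, loss):
    """Estimate L(h, h') = max_{y in Y} E_{x ~ D_X} |loss(h(x), y) - loss(h'(x), y)|
    by averaging over an unlabeled sample and maximizing over candidate labels."""
    return max(
        np.mean([abs(loss(h(x), y) - loss(h2(x), y)) for x in unlabeled_xs])
        for y in labels
    )
```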

  6. Disagreement Graph (D-Graph)
  One favorable scenario:
  ◮ the best-in-class h* lies within an isolated cluster;
  ◮ L(h, h*) is small within the cluster.

  7. IWAL-D Algorithm: IWAL with D-Graph
  At each round t ∈ [T], receive x_t:
  1. Flip a coin Q_t ∼ Ber(p_t), with the disagreement-based bias
     p_t = max_{h,h'∈H_t} max_{y∈Y} |ℓ(h(x_t), y) − ℓ(h'(x_t), y)|.
  2. If Q_t = 1, request the label y_t.
  3. Trim the version space:
     H_{t+1} = { h ∈ H_t : L̂_t(h) ≤ L̂_t(ĥ_t) + (1 + L(h, ĥ_t)) Δ_t },
     which uses the importance-weighted empirical risk
     L̂_t(h) = (1/t) Σ_{s=1}^t (Q_s / p_s) ℓ(h(x_s), y_s),
     with ĥ_t = argmin_{h∈H_t} L̂_t(h) and Δ_t = O( √(log(T|H|) / t) ).
  ◮ After T rounds, return ĥ_T.
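A minimal sketch of one IWAL-D round, under illustrative assumptions not spelled out on the slide: hypotheses are callables, loss(prediction, y) is bounded in [0, 1], D(h, h') returns the (estimated) disagreement weight, and history accumulates the queried triples (x_s, y_s, p_s):

```python
import numpy as np

rng = np.random.default_rng(0)

def iwal_d_round(t, x_t, H_t, D, query_oracle, loss, labels, delta_t, history):
    """One round of IWAL-D (sketch): query with disagreement-based probability,
    then trim the version space with a disagreement-weighted slack."""
    # Disagreement-based query probability p_t on the current point.
    p_t = max(
        abs(loss(h(x_t), y) - loss(h2(x_t), y))
        for h in H_t for h2 in H_t for y in labels
    )
    if rng.random() < p_t:                    # Q_t ~ Ber(p_t)
        history.append((x_t, query_oracle(x_t), p_t))

    def iw_risk(h):
        # Importance-weighted empirical risk: rounds with Q_s = 0 contribute 0.
        return sum(loss(h(x), y) / p for x, y, p in history) / t

    h_hat = min(H_t, key=iw_risk)
    best = iw_risk(h_hat)
    # Keep h whose risk exceeds the minimum by at most (1 + L(h, h_hat)) * delta_t.
    H_next = [h for h in H_t if iw_risk(h) <= best + (1 + D(h, h_hat)) * delta_t]
    return H_next, h_hat
```

The unlabeled-data estimator sketched earlier could serve as D here.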

  8. IWAL-D vs. IWAL: Quantifying the Improvement
  Theorem (IWAL-D). With high probability,
     R(ĥ_T) ≤ R* + (1 + L(ĥ_T, h*)) Δ_T,
     E_{x∼D_X}[ p_t | F_{t−1} ] ≤ 2θ [ 2R* + max_{h∈H_t} (2 + L(h, ĥ_{t−1}) + L(h, h*)) Δ_{t−1} ].
  ◮ θ: disagreement coefficient (Hanneke, 2007).
  ◮ Since L ≤ 1, the factor 1 + L(ĥ_T, h*) is at most 2: the bound matches IWAL's in the worst case and improves on it whenever ĥ_T and h* disagree less than maximally.
  ◮ More aggressive trimming of the version space.
  ◮ Slightly better generalization guarantee and label complexity.

  9. IWAL and IWAL-D
  Problem:
  ◮ The theoretical guarantees only hold for finite hypothesis sets.
  ◮ An ε-cover is needed to extend them to infinite hypothesis sets.
  ◮ Constructing an ε-cover is expensive in practice.
  Can we adaptively enrich the hypothesis set, with theoretical guarantees?

  10.–13. IZOOM: IWAL-D with Zooming-in
  At round t:
  ◮ Request the label based on the disagreement of H_t.
  ◮ Trim: H'_{t+1} ← Trim(H_t).
  ◮ Resample: H''_{t+1} ← Resample(H'_{t+1}).
  ◮ Combine: H_{t+1} ← H'_{t+1} ∪ H''_{t+1}.
  Resample(H'_{t+1}): sample new h ∈ ConvexHull(H'_{t+1}), e.g. a random convex combination of ĥ_t and some h ∈ H'_{t+1} (see the sketch below).
  [Figure: trim-and-resample illustration, built up across the original slides 10–13.]
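A minimal sketch of the Resample step, assuming hypotheses are represented as NumPy weight vectors (e.g., linear models) so that convex combinations are well defined; the names and the number of draws are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def resample(H_trimmed, h_hat, n_new):
    """Draw n_new hypotheses from ConvexHull(H_trimmed) as random convex
    combinations of the current minimizer h_hat and a surviving hypothesis."""
    new_hs = []
    for _ in range(n_new):
        h = H_trimmed[rng.integers(len(H_trimmed))]  # pick a surviving hypothesis
        alpha = rng.random()                          # mixing weight in [0, 1]
        new_hs.append(alpha * h_hat + (1.0 - alpha) * h)
    return new_hs
```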

  14. IZOOM vs. IWAL-D
  Let H̄_t = ∪_{s=1}^t H_s, i.e. all the hypotheses ever considered up to time t, and let h*_t = argmin_{h∈H̄_t} R(h).
  Theorem (IZOOM). With high probability,
     R(ĥ_T) ≤ R*_T + (1 + L(ĥ_T, h*_T)) Δ_T + O(1/T),
     E_{x∼D_X}[ p_{t+1} | F_t ] ≤ 2θ_t [ 2R*_t + max_{h∈H_{t+1}} (2 + L(h, ĥ_t) + L(h, h*_t)) Δ_t ] + O(1/T).
  ◮ R*_t = min_{h∈H̄_t} R(h) is smaller than R* = min_{h∈H_0} R(h).
  ◮ More accurate ĥ_T, with fewer label requests.

  15. Experiments
  Tasks: 8 binary classification datasets from the UCI repository.
  ◮ ℓ: logistic loss rescaled to [0, 1].
  Methods compared:
  ◮ IWAL with 3,000 hypotheses.
  ◮ IWAL with 12,000 hypotheses.
  ◮ IZOOM with 3,000 hypotheses.
  Performance measure:
  ◮ 0–1 loss on test data vs. number of label requests.

  16. Experiments
  [Figure: misclassification loss vs. log₂(number of labels) on six of the datasets (nomao, codrna, skin, covtype, magic04, a9a), comparing IWAL 3000, IWAL 12000, and IZOOM 3000.]

  17. Conclusion
  ◮ Introduced the disagreement graph and its role in active learning.
  ◮ More favorable generalization and label complexity guarantees.
  ◮ Substantial empirical performance improvements.
  ◮ Effective solutions for active learning.
  Poster: Pacific Ballroom #265.
  KDD workshop (Alaska, August 2019) on Active Learning: Data Collection, Curation, and Labeling for Mining and Learning.
