Active Learning for Classification with Abstention

Shubhanshu Shekhar (University of California, San Diego; shshekha@eng.ucsd.edu)
Mohammad Ghavamzadeh (Facebook AI Research)
Tara Javidi (University of California, San Diego)

ISIT 2020
Introduction: Classification with Abstention

Feature space $\mathcal{X}$, a joint distribution $P_{XY}$ on $\mathcal{X} \times \{0, 1\}$, and the augmented label set $\bar{\mathcal{Y}} = \{0, 1, \Delta\}$, where $\Delta$ is the 'abstain' option.

Risk of an abstaining classifier $g : \mathcal{X} \mapsto \bar{\mathcal{Y}}$:
$$R_\lambda(g) := \underbrace{P_{XY}\big(g(X) \neq Y,\ g(X) \neq \Delta\big)}_{\text{risk from mis-classification}} + \lambda\, \underbrace{P_X\big(g(X) = \Delta\big)}_{\text{risk from abstention}}.$$

The cost of mis-classification is 1, while the abstention cost is $\lambda \in (0, 1/2)$.

Goal: Given access only to a training sample $D_n = \{(X_i, Y_i) : 1 \leq i \leq n\}$, construct a classifier $g_n$ with small risk.
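As a concrete reading of this risk, here is a minimal sketch of its empirical counterpart on a labelled sample; the sentinel value used to encode $\Delta$ is a hypothetical choice, not from the paper.

```python
import numpy as np

ABSTAIN = -1  # hypothetical encoding of the abstain option Delta

def empirical_abstention_risk(preds, labels, lam):
    """Empirical version of R_lambda(g): misclassification rate on the
    non-abstained points plus lam times the abstention rate."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    abstained = preds == ABSTAIN
    misclassified = (~abstained) & (preds != labels)
    return misclassified.mean() + lam * abstained.mean()

# toy usage: four points, one error, one abstention
print(empirical_abstention_risk([1, ABSTAIN, 0, 1], [1, 0, 1, 1], lam=0.2))
# -> 0.25 (error rate) + 0.2 * 0.25 (abstention rate) = 0.3
```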
Introduction: Active and Passive Learning

Passive learning: $D_n := \{(X_i, Y_i) \mid 1 \leq i \leq n\}$ is generated in an i.i.d. manner.

Active learning: the learner constructs $D_n$ by sequentially querying a labelling oracle. Three commonly used active learning models:
- Membership query: request the label at any $x \in \mathcal{X}$.
- Stream based: query/discard from a stream of inputs $X_t \sim P_X$.
- Pool based: query points from a given pool of inputs.

This talk: membership query. Paper: all three models.
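For illustration only (not from the paper), a membership-query oracle can be modelled as a function the learner may call at any point of $\mathcal{X}$; the regression function below is purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    """Hypothetical regression function eta(x) = P(Y = 1 | X = x) on X = [0, 1]."""
    return 0.5 * (1 + np.sin(3 * x))  # values in [0, 1]

def membership_query(x):
    """Membership-query oracle: the learner may request a label at ANY x in X;
    the oracle answers with a Bernoulli(eta(x)) label."""
    return int(rng.random() < eta(x))

# the learner is free to pick informative query locations, e.g. near eta = 0.5
labels = [membership_query(0.5) for _ in range(5)]
```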
Prior Work

Prior work has focused mainly on the passive case:
- Chow (1957): derived the Bayes optimal classifier.
- Chow (1970): trade-off between the error and abstention rates.
- Herbei & Wegkamp (2006): plug-in and ERM classifiers.
- Bartlett & Wegkamp (2008), Yuan (2010), Cortes et al. (2016): calibrated convex surrogates.
- Geifman & El-Yaniv (2017): incorporated abstention into neural networks.

Active learning for classification with abstention is largely unexplored.

Questions
1. Performance limits of learning algorithms in both settings.
2. Design active algorithms that match the performance limit.
3. Characterize the performance gain of active over passive learning.
4. Computationally efficient active algorithms via convex surrogates.
5. Active learning for neural networks with abstention.
Introduction: Bayes Optimal Classifier

Define the regression function $\eta(x) := P_{XY}(Y = 1 \mid X = x)$.

For a fixed cost of abstention $\lambda \in (0, 1/2)$, the Bayes optimal classifier $g^*_\lambda$ is defined as:
$$g^*_\lambda(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 1 - \lambda, \\ 0 & \text{if } \eta(x) \leq \lambda, \\ \Delta & \text{if } \eta(x) \in (\lambda, 1 - \lambda). \end{cases}$$

The optimal classifier can equivalently be represented by the triplet $(G^*_0, G^*_1, G^*_\Delta)$, where $G^*_v := \{x \in \mathcal{X} \mid g^*_\lambda(x) = v\}$ for $v \in \{0, 1, \Delta\}$.
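A minimal plug-in sketch of this rule, assuming the regression-function values are available and using a hypothetical sentinel value to encode $\Delta$:

```python
import numpy as np

ABSTAIN = -1  # hypothetical encoding of the abstain option Delta

def bayes_abstaining_classifier(eta_x, lam):
    """Plug-in form of the Bayes optimal rule g*_lambda: predict 1 when
    eta(x) >= 1 - lam, predict 0 when eta(x) <= lam, abstain otherwise."""
    eta_x = np.asarray(eta_x, dtype=float)
    out = np.full(eta_x.shape, ABSTAIN)
    out[eta_x >= 1 - lam] = 1
    out[eta_x <= lam] = 0
    return out

# with lam = 0.2, the rule abstains exactly when eta(x) lies in (0.2, 0.8)
print(bayes_abstaining_classifier([0.05, 0.5, 0.95], lam=0.2))  # [0, -1, 1]
```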
Introduction: Bayes Optimal Classifier

[Figure: illustration of the Bayes optimal classifier $g^*_\lambda$ and its regions $G^*_0$, $G^*_1$, $G^*_\Delta$; graphic not preserved in the extracted text.]
Assumptions

We make the following two assumptions on the joint distribution $P_{XY}$.

(HÖ) The regression function $\eta$ is Hölder continuous with parameters $L > 0$ and $0 < \beta \leq 1$, i.e.,
$$|\eta(x_1) - \eta(x_2)| \leq L\, d(x_1, x_2)^\beta, \quad \forall x_1, x_2 \in (\mathcal{X}, d).$$

(MA) The joint distribution $P_{XY}$ of the input-label pair satisfies the margin assumption with parameters $C_0 > 0$ and $\alpha_0 \geq 0$, which means that for any $\gamma \in \{\lambda, 1 - \lambda\}$,
$$P_X\big(|\eta(X) - \gamma| \leq t\big) \leq C_0\, t^{\alpha_0}, \quad \forall\, 0 < t \leq 1.$$

We will use $\mathcal{P}(\alpha_0, \beta)$ to denote the class of distributions $P_{XY}$ satisfying these conditions.
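To make the margin assumption concrete, here is a small Monte Carlo check (not from the paper) for a hypothetical linear $\eta$ on $\mathcal{X} = [0, 1]$ with uniform $P_X$, for which (MA) holds with $\alpha_0 = 1$ and $C_0 = 2$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.2

def eta(x):
    """Hypothetical regression function on X = [0, 1] with uniform P_X."""
    return x  # linear eta: P_X(|eta(X) - gamma| <= t) = 2t near gamma in {0.2, 0.8}

# Monte Carlo estimate of P_X(|eta(X) - gamma| <= t) for gamma in {lam, 1 - lam}
X = rng.uniform(0.0, 1.0, size=200_000)
for t in [0.05, 0.1, 0.2]:
    for gamma in (lam, 1 - lam):
        prob = np.mean(np.abs(eta(X) - gamma) <= t)
        print(f"t={t:.2f}, gamma={gamma:.1f}: P = {prob:.3f} <= C0 * t^alpha0 with C0=2, alpha0=1")
```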
Proposed Algorithm: Outline

Repeat for t = 1, 2, ...:

- $\mathcal{Q}_t^{(c)} \cup \mathcal{Q}_t^{(u)}$: a partition of $\mathcal{X}$ into classified and unclassified cells.
- For each set $E \in \mathcal{Q}_t^{(u)}$, compute
  $\hat{\eta}_t(E) = \frac{1}{n_t(E)} \sum_{X_i \in E} Y_i$,
  $u_t(E) = \hat{\eta}_t(E) + \text{term1} + \text{term2}$,
  $\ell_t(E) = \hat{\eta}_t(E) - \text{term1} - \text{term2}$.
- If $u_t(E) < \lambda$ or $\ell_t(E) > 1 - \lambda$, add $E$ to $\mathcal{Q}_t^{(c)}$; otherwise, keep it in $\mathcal{Q}_t^{(u)}$.
- $E_t = \arg\max_{E \in \mathcal{Q}_t^{(u)}} u_t(E) - \ell_t(E)$.
- Refine $E_t$ if term1 < term2; else query at $x_t \sim \mathrm{Unif}(E_t)$.

[Figure: example partition of $\mathcal{X}$ with cells $E_1$, $E_2$.]

Classifier definition:
$$\hat{g}_n(x) = \begin{cases} 1 & \text{if } \ell_{t_n}(x) > 1 - \lambda, \\ 0 & \text{if } u_{t_n}(x) < \lambda, \\ \Delta & \text{otherwise}, \end{cases} \tag{1}$$
where $t_n$ is the final round and $\ell_{t_n}(x)$, $u_{t_n}(x)$ denote the bounds of the cell containing $x$.
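Below is a minimal one-dimensional sketch of this loop. It assumes a hypothetical $\eta$ on $[0, 1]$, takes term1 to be a Hoeffding-style confidence width and term2 the Hölder bias $L \cdot \mathrm{diam}(E)^\beta$, splits a cell in half when refining, and discards previously queried labels on refinement; the constants, label reuse, and stopping rule in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
ABSTAIN = -1

def eta(x):
    """Hypothetical regression function on X = [0, 1]."""
    return 0.5 * (1 + np.sin(3 * x))

def query(x):
    """Membership-query oracle: returns Y ~ Bernoulli(eta(x))."""
    return float(rng.random() < eta(x))

def run_active(budget, lam=0.2, L=1.5, beta=1.0, delta=0.05):
    """Partition-based active learner: refine or query the most uncertain cell."""
    unclassified = {(0.0, 1.0): []}   # cell -> labels queried inside it
    classified = {}                   # cell -> assigned label (0, 1, or ABSTAIN)
    for _ in range(budget):           # budget counts iterations (queries + refinements)
        # confidence interval [l_t(E), u_t(E)] for every unclassified cell
        bounds = {}
        for (lo, hi), ys in unclassified.items():
            n = len(ys)
            eta_hat = float(np.mean(ys)) if n > 0 else 0.5
            term1 = np.sqrt(np.log(2.0 / delta) / (2 * n)) if n > 0 else 1.0  # confidence width
            term2 = L * (hi - lo) ** beta                                     # Holder bias over the cell
            bounds[(lo, hi)] = (eta_hat - term1 - term2, eta_hat + term1 + term2)
        # move confidently resolved cells into the classified partition
        for E, (l, u) in list(bounds.items()):
            if u < lam:
                classified[E] = 0; del unclassified[E]; del bounds[E]
            elif l > 1 - lam:
                classified[E] = 1; del unclassified[E]; del bounds[E]
        if not bounds:
            break
        # most uncertain remaining cell
        E_t = max(bounds, key=lambda E: bounds[E][1] - bounds[E][0])
        lo, hi = E_t
        ys = unclassified.pop(E_t)
        n = len(ys)
        term1 = np.sqrt(np.log(2.0 / delta) / (2 * n)) if n > 0 else 1.0
        term2 = L * (hi - lo) ** beta
        if term1 < term2:
            # refine: split the cell in half (labels are discarded for simplicity)
            mid = (lo + hi) / 2
            unclassified[(lo, mid)] = []
            unclassified[(mid, hi)] = []
        else:
            # query the oracle at a uniformly random point of the cell
            ys.append(query(rng.uniform(lo, hi)))
            unclassified[E_t] = ys
    # cells still unresolved when the budget runs out become abstention regions, as in (1)
    for E in unclassified:
        classified[E] = ABSTAIN
    return classified

cells = run_active(budget=2000)
```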
Bound on Excess Risk

Theorem. Under the margin and smoothness assumptions and for n large enough, for the classifier $\hat{g}_n$ defined by (1), we have
$$\mathbb{E}\big[R_\lambda(\hat{g}_n) - R_\lambda(g^*_\lambda)\big] = \tilde{O}\left(n^{-\beta(\alpha_0 + 1)/(2\beta + \tilde{D})}\right).$$
Conversely, we have the following lower bound for any algorithm:
$$\sup_{P \in \mathcal{P}(\alpha_0, \beta)} \mathbb{E}\big[R_\lambda(g) - R_\lambda(g^*_\lambda)\big] = \Omega\left(n^{-\beta(1 + \alpha_0)/(2\beta + D)}\right).$$

$\tilde{D} \leq D$ is a measure of the dimensionality of $\mathcal{X}$ near the classifier boundaries $\{x : \eta(x) \in \{\lambda, 1 - \lambda\}\}$.

$\tilde{D}$ depends on the parameters of both the (MA) and (HÖ) assumptions. We always have $\tilde{D} \leq D$, and there exist cases with $\tilde{D} < D$ as well as cases with $\tilde{D} = D$.
Lower Bound: Key Inequality

Lemma. Suppose $g = (G_0, G_1, G_\Delta)$ is a classifier constructed by an algorithm $\mathcal{A}$, and $g^*_\lambda = (G^*_0, G^*_1, G^*_\Delta)$. Then, for any $\lambda \in (0, 1/2)$, we have
$$\underbrace{R_\lambda(g) - R_\lambda(g^*_\lambda)}_{:=\ \mathcal{E}(\mathcal{A}, P_{XY}, n)\ (\text{Excess Risk})} \geq c\, \Big(\underbrace{P_X\big((G^*_\Delta \setminus G_\Delta) \cup (G_\Delta \setminus G^*_\Delta)\big)}_{:=\ \tilde{\mathcal{E}}(\mathcal{A}, P_{XY}, n)}\Big)^{(1+\alpha_0)/\alpha_0}.$$

The rest of the proof relies on a reduction to multiple hypothesis testing. This lemma suggests an appropriate pseudo-metric between classifiers (the $P_X$-measure of the symmetric difference of their abstention regions) and motivates the construction of the hypothesis set.
Improvement Over Passive

Corollary. For passive algorithms $\mathcal{A}_p$, we can lower bound the excess risk $\mathcal{E}$ as:
$$\inf_{\mathcal{A}_p} \sup_{P \in \mathcal{P}(\alpha_0, \beta)} \mathbb{E}\big[\mathcal{E}(\mathcal{A}_p, P, n)\big] = \Omega\left(n^{-\beta(1+\alpha_0)/(D + 2\beta + \alpha_0\beta)}\right).$$

This rate is slower than the rate achieved by our active algorithm, which is roughly $\tilde{O}\left(n^{-\beta(1+\alpha_0)/(D + 2\beta)}\right)$, for all $D$, $\alpha_0$, $\beta$. In particular, the gain due to active learning increases with (i) decreasing $D$, (ii) increasing $\alpha_0$, and (iii) increasing $\beta$.

Under an additional strong density assumption, further improvements are possible.
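To make the comparison concrete, the following sketch evaluates the two rate exponents above for a few illustrative parameter values (using $D$ in place of $\tilde{D}$ for the active rate, i.e., the worst case); it prints exponents of $n$, not actual risks, and mirrors the trend shown in the figures that follow.

```python
def active_exponent(D, alpha0, beta):
    """Exponent e in the active rate n^{-e} = n^{-beta(1+alpha0)/(D+2beta)}."""
    return beta * (1 + alpha0) / (D + 2 * beta)

def passive_exponent(D, alpha0, beta):
    """Exponent e in the passive rate n^{-e} = n^{-beta(1+alpha0)/(D+2beta+alpha0*beta)}."""
    return beta * (1 + alpha0) / (D + 2 * beta + alpha0 * beta)

alpha0, beta = 1.0, 0.5
for D in [1, 2, 5, 10, 50]:
    ea, ep = active_exponent(D, alpha0, beta), passive_exponent(D, alpha0, beta)
    print(f"D={D:2d}: active n^-{ea:.3f} vs passive n^-{ep:.3f} (ratio {ea/ep:.2f})")
# the ratio shrinks toward 1 as D grows: the active-learning gain is largest in low dimension
```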
Improvement over Passive: $\alpha_0 = 1$, $\beta = 0.5$, $D \in \{1, 2, \ldots, 50\}$

[Figure: $\lim_{n \to \infty} \frac{\log \mathcal{E}}{\log n}$ plotted against the dimension $D$.]
Improvement over Passive: $\alpha_0 \in [1, 10]$, $\beta = 0.5$, $D = 5$

[Figure: $\lim_{n \to \infty} \frac{\log \mathcal{E}}{\log n}$ plotted against the margin parameter $\alpha_0$.]