Active Learning for Classification with Abstention


  1. Active Learning for Classification with Abstention
Shubhanshu Shekhar (University of California, San Diego; shshekha@eng.ucsd.edu), Mohammad Ghavamzadeh (Facebook AI Research), Tara Javidi (University of California, San Diego). ISIT 2020.

  2. Introduction: Classification with Abstention
Feature space $\mathcal{X}$, a joint distribution $P_{XY}$ on $\mathcal{X} \times \{0,1\}$, and the augmented label set $\bar{\mathcal{Y}} = \{0, 1, \Delta\}$, where $\Delta$ is the 'abstain' option.
Risk of an abstaining classifier $g : \mathcal{X} \to \bar{\mathcal{Y}}$:
$$R_\lambda(g) := \underbrace{P_{XY}\big(g(X) \neq Y,\ g(X) \neq \Delta\big)}_{\text{risk from mis-classification}} + \lambda\, \underbrace{P_X\big(g(X) = \Delta\big)}_{\text{risk from abstention}}.$$
The cost of a mis-classification is 1, while the cost of abstention is $\lambda \in (0, 1/2)$.
Goal: given access only to a training sample $D_n = \{(X_i, Y_i) : 1 \le i \le n\}$, construct a classifier $g_n$ with small risk.
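To make the risk functional concrete, here is a minimal sketch (not from the talk) of its empirical estimate on a labeled sample; the function name and the integer encoding of the abstain option are assumptions made for this illustration.

```python
import numpy as np

ABSTAIN = -1  # assumed integer encoding of the abstain option Δ

def empirical_abstention_risk(predictions, labels, lam):
    """Estimate R_λ(g): probability of a wrong, non-abstained prediction
    plus λ times the abstention probability, both over the whole sample."""
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    abstained = predictions == ABSTAIN
    misclassified = (~abstained) & (predictions != labels)
    return misclassified.mean() + lam * abstained.mean()

# Two extreme baselines on a toy sample: never abstaining pays the full error
# rate, while always abstaining pays exactly λ.
labels = np.array([0, 1, 1, 0, 1])
print(empirical_abstention_risk([1, 1, 1, 1, 1], labels, lam=0.2))  # 0.4
print(empirical_abstention_risk([ABSTAIN] * 5, labels, lam=0.2))    # 0.2
```

The two baselines show why $\lambda < 1/2$ matters: abstaining is only worth its cost when the classifier would otherwise be wrong more than a λ-fraction of the time on those points.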

  3. Introduction: Active and Passive Learning
Passive learning: $D_n := \{(X_i, Y_i) : 1 \le i \le n\}$ is generated in an i.i.d. manner.
Active learning: the learner constructs $D_n$ by sequentially querying a labelling oracle.
Three commonly used active learning models:
- Membership query: request the label at any $x \in \mathcal{X}$.
- Stream-based: query or discard points from a stream of inputs $X_t \sim P_X$.
- Pool-based: query points from a given pool of inputs.
This talk: the membership-query model. Paper: all three models.

  4–6. Prior Work
Prior work has focused mainly on the passive case:
- Chow (1957): derived the Bayes optimal classifier.
- Chow (1970): trade-off between the error and abstention rates.
- Herbei & Wegkamp (2006): plug-in and ERM classifiers.
- Bartlett & Wegkamp (2008), Yuan (2010), Cortes et al. (2016): calibrated convex surrogates.
- Geifman & El-Yaniv (2017): incorporated abstention into neural networks.
Active learning for classification with abstention is largely unexplored.
Questions:
1. Performance limits of learning algorithms in both settings.
2. Design active algorithms that match the performance limit.
3. Characterize the performance gain of active over passive learning.
4. Computationally efficient active algorithms via convex surrogates.
5. Active learning for neural networks with abstention.

  7. Introduction: Bayes Optimal Classifier
Define the regression function $\eta(x) := P_{XY}(Y = 1 \mid X = x)$.
For a fixed cost of abstention $\lambda \in (0, 1/2)$, the Bayes optimal classifier $g^*_\lambda$ is defined as
$$g^*_\lambda(x) = \begin{cases} 1 & \text{if } \eta(x) \ge 1 - \lambda, \\ 0 & \text{if } \eta(x) \le \lambda, \\ \Delta & \text{if } \eta(x) \in (\lambda, 1 - \lambda). \end{cases}$$
The optimal classifier can equivalently be represented by the triplet $(G^*_0, G^*_1, G^*_\Delta)$, where $G^*_v := \{x \in \mathcal{X} : g^*_\lambda(x) = v\}$ for $v \in \{0, 1, \Delta\}$.
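A minimal sketch of the thresholding rule above applied to an estimate of the regression function; the helper name plug_in_abstaining_classifier and the abstain sentinel are assumptions for illustration, not the authors' implementation.

```python
ABSTAIN = -1  # assumed sentinel value for the abstain option Δ

def plug_in_abstaining_classifier(eta_hat, lam):
    """Threshold an estimate eta_hat of the regression function η at λ and 1-λ."""
    assert 0.0 < lam < 0.5, "the abstention cost λ must lie in (0, 1/2)"

    def g(x):
        p = eta_hat(x)
        if p >= 1.0 - lam:
            return 1        # confident the label is 1
        if p <= lam:
            return 0        # confident the label is 0
        return ABSTAIN      # η(x) in (λ, 1-λ): abstaining (cost λ) beats guessing

    return g

# With the true η this is exactly the Bayes rule g*_λ; with an estimate it is a
# plug-in classifier.
g_star = plug_in_abstaining_classifier(lambda x: x, lam=0.2)  # toy η(x) = x on [0, 1]
print([g_star(x) for x in (0.1, 0.5, 0.9)])  # [0, -1, 1]
```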

  8–9. Introduction: Bayes Optimal Classifier (figure-only slides).

  10. Assumptions
We make the following two assumptions on the joint distribution $P_{XY}$.
(HÖ) The regression function $\eta$ is Hölder continuous with parameters $L > 0$ and $0 < \beta \le 1$, i.e.,
$$|\eta(x_1) - \eta(x_2)| \le L\, d(x_1, x_2)^\beta \quad \text{for all } x_1, x_2 \in (\mathcal{X}, d).$$
(MA) The joint distribution $P_{XY}$ of the input–label pair satisfies the margin assumption with parameters $C_0 > 0$ and $\alpha_0 \ge 0$: for each $\gamma \in \{\lambda, 1 - \lambda\}$,
$$P_X\big(|\eta(X) - \gamma| \le t\big) \le C_0\, t^{\alpha_0} \quad \text{for all } 0 < t \le 1.$$
We use $\mathcal{P}(\alpha_0, \beta)$ to denote the class of distributions $P_{XY}$ satisfying these conditions.
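A quick worked example (added here, not from the slides) of when these assumptions hold: if $P_X$ is uniform on $[0,1]$ and $\eta(x) = x$, then for either threshold $\gamma \in \{\lambda, 1 - \lambda\}$,
$$P_X\big(|\eta(X) - \gamma| \le t\big) = P_X\big(X \in [\gamma - t, \gamma + t] \cap [0,1]\big) \le 2t,$$
so (MA) holds with $\alpha_0 = 1$ and $C_0 = 2$; and since $|\eta(x_1) - \eta(x_2)| = |x_1 - x_2|$, (HÖ) holds with $L = 1$ and $\beta = 1$.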

  11–16. Proposed Algorithm: Outline
Repeat for $t = 1, 2, \ldots$:
- Maintain a partition of $\mathcal{X}$ into classified and unclassified cells, $\mathcal{Q}_t^{(c)} \cup \mathcal{Q}_t^{(u)}$.
- For each set $E \in \mathcal{Q}_t^{(u)}$, compute
  - $\hat\eta_t(E) = \frac{1}{n_t(E)} \sum_{X_i \in E} Y_i$,
  - $u_t(E) = \hat\eta_t(E) + \text{term1} + \text{term2}$,
  - $\ell_t(E) = \hat\eta_t(E) - \text{term1} - \text{term2}$.
- If $u_t(E) < \lambda$ or $\ell_t(E) > 1 - \lambda$, move $E$ to $\mathcal{Q}_t^{(c)}$; otherwise keep it in $\mathcal{Q}_t^{(u)}$.
- Select the most uncertain cell $E_t = \arg\max_{E \in \mathcal{Q}_t^{(u)}} u_t(E) - \ell_t(E)$.
- Refine $E_t$ if term1 < term2; else query a label at $x_t \sim \mathrm{Unif}(E_t)$.
Classifier definition:
$$\hat g_n(x) = \begin{cases} 1 & \text{if } \ell_{t_n}(x) > 1 - \lambda, \\ 0 & \text{if } u_{t_n}(x) < \lambda, \\ \Delta & \text{otherwise}. \end{cases} \qquad (1)$$
A code sketch of this loop is given below.
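The sketch below instantiates the loop above on $\mathcal{X} = [0,1]$ with cells as intervals. It is illustrative only, not the authors' implementation: the concrete forms of term1 (taken here as a Hoeffding confidence width) and term2 (taken here as a Hölder bias term $L\,\mathrm{diam}(E)^\beta$), and the handling of labels when a cell is split, are assumptions.

```python
import math
import random

def active_abstention_sketch(label_oracle, lam, n_rounds, L=1.0, beta=1.0, delta=0.01):
    """Cell-based active loop on X = [0, 1]; cells are (left, right, labels)."""
    unclassified = [(0.0, 1.0, [])]
    classified = []

    def bounds(cell):
        a, b, ys = cell
        n = len(ys)
        eta_hat = sum(ys) / n if n else 0.5
        term1 = math.sqrt(math.log(1.0 / delta) / (2 * n)) if n else 1.0  # assumed form
        term2 = L * (b - a) ** beta                                       # assumed form
        return eta_hat - term1 - term2, eta_hat + term1 + term2, term1, term2

    for _ in range(n_rounds):
        # Move cells whose label is settled into the classified partition.
        still_open = []
        for cell in unclassified:
            lo, hi, _, _ = bounds(cell)
            (classified if hi < lam or lo > 1.0 - lam else still_open).append(cell)
        unclassified = still_open
        if not unclassified:
            break

        # Most uncertain cell: largest confidence-interval width u_t(E) - l_t(E).
        cell = max(unclassified, key=lambda c: bounds(c)[1] - bounds(c)[0])
        a, b, ys = cell
        _, _, term1, term2 = bounds(cell)
        if term1 < term2:
            # Bias term dominates: refine the cell by splitting it
            # (parent labels discarded for simplicity in this sketch).
            unclassified.remove(cell)
            mid = (a + b) / 2.0
            unclassified += [(a, mid, []), (mid, b, [])]
        else:
            # Statistical term dominates: query a label uniformly inside the cell.
            ys.append(label_oracle(random.uniform(a, b)))

    def g_hat(x):
        # Classify x using the confidence bounds of the cell containing it.
        for a, b, ys in classified + unclassified:
            if a <= x <= b:
                lo, hi, _, _ = bounds((a, b, ys))
                return 1 if lo > 1.0 - lam else (0 if hi < lam else -1)  # -1 = abstain
        return -1

    return g_hat
```

For example, with `label_oracle = lambda x: int(random.random() < x)` (the toy $\eta(x) = x$) and `lam=0.2`, the returned `g_hat` abstains exactly on cells whose confidence interval still overlaps $[\lambda, 1-\lambda]$.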

  17. (Figure-only slide.)

  18. Bound on Excess Risk
Theorem. Under the margin and smoothness assumptions and for $n$ large enough, the classifier $\hat g_n$ defined by (1) satisfies
$$\mathbb{E}\big[R_\lambda(\hat g_n) - R_\lambda(g^*_\lambda)\big] = \tilde{O}\big(n^{-\beta(\alpha_0 + 1)/(2\beta + \tilde D)}\big).$$
Conversely, for any algorithm we have the lower bound
$$\sup_{P \in \mathcal{P}(\alpha_0, \beta)} \mathbb{E}\big[R_\lambda(g) - R_\lambda(g^*_\lambda)\big] = \Omega\big(n^{-\beta(1 + \alpha_0)/(2\beta + D)}\big).$$
Here $\tilde D$ is a measure of the dimensionality of $\mathcal{X}$ near the classifier boundaries $\{x : \eta(x) \in \{\lambda, 1 - \lambda\}\}$, and it depends on the parameters of both the (MA) and (HÖ) assumptions. We always have $\tilde D \le D$, and there exist cases with $\tilde D < D$ as well as $\tilde D = D$.
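To read the upper bound concretely, here is a hypothetical instantiation (parameter values chosen for illustration, not taken from the talk): with $\beta = 1$, $\alpha_0 = 1$, and $\tilde D = 1$,
$$\frac{\beta(\alpha_0 + 1)}{2\beta + \tilde D} = \frac{1 \cdot 2}{2 + 1} = \frac{2}{3},$$
so the active algorithm's excess risk decays as $\tilde O(n^{-2/3})$, whereas the passive lower-bound exponent from the corollary below with $D = 1$ gives $\beta(1+\alpha_0)/(D + 2\beta + \alpha_0\beta) = 2/4 = 1/2$, i.e. a rate of $n^{-1/2}$.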

  19. Lower Bound: Key Inequality
Lemma. Suppose $g = (G_0, G_1, G_\Delta)$ is a classifier constructed by an algorithm $\mathcal{A}$, and $g^*_\lambda = (G^*_0, G^*_1, G^*_\Delta)$. Then, for any $\lambda \in (0, 1/2)$ we have
$$R_\lambda(g) - R_\lambda(g^*_\lambda) \ \ge\ c\, P_X\big((G^*_\Delta \setminus G_\Delta) \cup (G_\Delta \setminus G^*_\Delta)\big)^{(1+\alpha_0)/\alpha_0},$$
where the left-hand side (the excess risk) is denoted $\mathcal{E}(\mathcal{A}, P_{XY}, n)$ and the probability term on the right-hand side defines $\tilde{\mathcal{E}}(\mathcal{A}, P_{XY}, n)$.
The rest of the proof relies on a reduction to multiple hypothesis testing. This lemma suggests an appropriate pseudo-metric and motivates the construction of the hypothesis set.

  20. Improvement Over Passive
Corollary. For passive algorithms $\mathcal{A}_p$, we can obtain the lower bound
$$\inf_{\mathcal{A}_p} \sup_{P \in \mathcal{P}(\alpha_0, \beta)} \mathbb{E}\big[\mathcal{E}(\mathcal{A}_p, P, n)\big] = \Omega\big(n^{-\beta(1+\alpha_0)/(D + 2\beta + \alpha_0\beta)}\big).$$
This rate is slower than the rate achieved by our active algorithm, approximately $\tilde O\big(n^{-\beta(1+\alpha_0)/(D + 2\beta)}\big)$, for all $D$, $\alpha_0$, $\beta$. In particular, the gain due to active learning increases with (i) decreasing $D$, (ii) increasing $\alpha_0$, and (iii) increasing $\beta$. Under an additional strong density assumption, further improvements are possible.
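A small sketch (illustrative only) that computes the two rate exponents from the corollary and their gap, which is what the plots on the next two slides visualize; the function name and the parameter grid are assumptions.

```python
def rate_exponents(alpha0, beta, D):
    """Asymptotic excess-risk exponents: E ~ n^{-exponent}, so larger is better."""
    active = beta * (1 + alpha0) / (D + 2 * beta)
    passive = beta * (1 + alpha0) / (D + 2 * beta + alpha0 * beta)
    return active, passive

# Example: the setting of the next slide (alpha0 = 1, beta = 0.5, varying D).
for D in (1, 5, 50):
    a, p = rate_exponents(alpha0=1.0, beta=0.5, D=D)
    print(f"D={D:2d}  active={a:.3f}  passive={p:.3f}  gain={a - p:.3f}")
```

The printed gap shrinks as $D$ grows, matching the claim that the gain from active learning is largest in low dimension.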

  21. Improvement over passive: $\alpha_0 = 1$, $\beta = 0.5$, $D \in \{1, 2, \ldots, 50\}$. (Plot of the asymptotic exponent $\lim_{n \to \infty} \log \mathcal{E} / \log n$ versus the dimension $D$.)

  22. Improvement over passive: $\alpha_0 \in [1, 10]$, $\beta = 0.5$, $D = 5$. (Plot of the asymptotic exponent $\lim_{n \to \infty} \log \mathcal{E} / \log n$ versus the margin parameter $\alpha_0$.)
