Noise-adaptive Margin-based Active Learning, and Lower Bounds
Yining Wang, Aarti Singh
Carnegie Mellon University
Machine Learning: the setup
❖ The machine learning problem
❖ Each data point consists of data $x_i$ and label $y_i$: $(x_i, y_i)$
❖ Access to training data $(x_1, y_1), \cdots, (x_n, y_n)$
❖ Goal: train a classifier $\hat f$ to predict $y$ based on $x$
❖ Example: classification, $x_i \in \mathbb{R}^d$, $y_i \in \{+1, -1\}$
Machine learning: passive vs. active
❖ Classical framework: passive learning
❖ I.i.d. training data $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} D$
❖ Evaluation: generalization error $\Pr_D\!\left[y \ne \hat f(x)\right]$
❖ An active learning framework
❖ Data are cheap, but labels are expensive!
❖ Example: medical data (labels require domain knowledge)
❖ Active learning: minimize label requests
Active Learning
❖ Pool-based active learning
❖ The learner $A$ has access to an unlabeled data stream $x_1, x_2, \cdots \overset{\text{i.i.d.}}{\sim} D$
❖ For each $x_i$, the learner decides whether to query; if a label is requested, $A$ obtains $y_i$
❖ Minimize the number of label requests while scanning through a polynomial number of unlabeled data points
Active Learning
❖ Example: learning a homogeneous linear classifier $y_i = \mathrm{sgn}(w^\top x_i) + \text{noise}$
❖ Basic (passive) approach: empirical risk minimization (ERM); a code sketch of this baseline follows below
$$\hat w \in \operatorname*{argmin}_{\|w\|_2 = 1} \sum_{i=1}^n \mathbb{I}\left[y_i \ne \mathrm{sgn}(w^\top x_i)\right]$$
❖ How about active learning?
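A minimal sketch of what the passive ERM baseline might look like in code, assuming a logistic surrogate in place of the intractable 0-1 objective; the function names (`fit_passive_erm`, `empirical_error`) are illustrative, not from the paper.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_passive_erm(X, y, lr=0.1, epochs=200):
    """Approximate the 0-1 ERM over unit-norm w with a logistic surrogate,
    then project onto the unit sphere.  The slide's ERM uses the 0-1 loss
    directly; the surrogate is only for illustration."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)                       # y_i * <w, x_i>
        grad = -(X * (y * _sigmoid(-margins))[:, None]).mean(axis=0)
        w -= lr * grad
    return w / np.linalg.norm(w)                    # homogeneous classifier, ||w||_2 = 1

def empirical_error(w, X, y):
    """Empirical 0-1 risk of the classifier sgn(<w, x>)."""
    return float(np.mean(np.sign(X @ w) != y))
```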
Margin-based Active Learning
BALCAN, BRODER and ZHANG, COLT'07
❖ Data dimension $d$, query budget $T$, number of iterations $E$
❖ At each iteration $k \in \{1, \cdots, E\}$:
❖ Determine parameters $b_{k-1}, \beta_{k-1}$
❖ Find $n = T/E$ samples in $\{x \in \mathbb{R}^d : |\hat w_{k-1} \cdot x| \le b_{k-1}\}$
❖ Constrained ERM: $\hat w_k = \operatorname*{argmin}_{\theta(w, \hat w_{k-1}) \le \beta_{k-1}} L(\{x_i, y_i\}_{i=1}^n; w)$
❖ Final output: $\hat w_E$ (a schematic of the loop follows below)
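A schematic of the loop above, with the sampling oracle, label oracle, parameter schedules, and constrained ERM subroutine left as abstract callables; all names are hypothetical stand-ins for the components on the slide, not the authors' implementation.

```python
import numpy as np

def margin_based_active_learning(sample_unlabeled, query_label, d, T, E,
                                 b_schedule, beta_schedule, constrained_erm):
    """Schematic of the margin-based loop (Balcan, Broder & Zhang, COLT'07).
    `sample_unlabeled`, `query_label`, `b_schedule`, `beta_schedule` and
    `constrained_erm` are hypothetical callables standing in for the
    components described on the slide."""
    n = T // E                                      # label budget per iteration
    w_hat = np.random.randn(d)
    w_hat /= np.linalg.norm(w_hat)                  # arbitrary unit-norm start
    for k in range(1, E + 1):
        b, beta = b_schedule(k - 1), beta_schedule(k - 1)
        # Collect n points that fall inside the margin band around w_hat.
        X_band, y_band = [], []
        while len(X_band) < n:
            x = sample_unlabeled()
            if abs(w_hat @ x) <= b:                 # within-margin region S_1
                X_band.append(x)
                y_band.append(query_label(x))       # spend one label query
        # ERM restricted to the cone {w : angle(w, w_hat) <= beta}.
        w_hat = constrained_erm(np.array(X_band), np.array(y_band),
                                center=w_hat, max_angle=beta)
    return w_hat
```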
Tsybakov Noise Condition
❖ There exist constants $\mu > 0$, $\alpha \in (0, 1)$ such that
$$\mu \cdot \theta(w, w^*)^{1/(1-\alpha)} \le \mathrm{err}(w) - \mathrm{err}(w^*)$$
❖ $\alpha \in (0, 1)$: key noise magnitude parameter in the TNC
❖ Which one is harder: small $\alpha$ or large $\alpha$? (see the rearrangement below)
[Figure: excess error $\mathrm{err}(w) - \mathrm{err}(w^*)$ plotted against $\theta(w, w^*)$, with curves for small $\alpha$ and large $\alpha$]
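To see why large $\alpha$ is the harder regime, rearrange the condition above (a direct algebraic step, nothing beyond the stated inequality):
$$\theta(w, w^*) \le \left(\frac{\mathrm{err}(w) - \mathrm{err}(w^*)}{\mu}\right)^{1-\alpha}.$$
As $\alpha \to 1$ the exponent $1-\alpha \to 0$, so a small excess error barely constrains the angle: the excess risk is very flat around $w^*$, which is the hard case. As $\alpha \to 0$ the relation is nearly linear and easy.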
Margin-based Active Learning
❖ Main Theorem [BBZ07]: when $D$ is the uniform distribution, the margin-based algorithm achieves
$$\mathrm{err}(\hat w) - \mathrm{err}(w^*) = \tilde O_P\!\left(\left(\frac{d}{T}\right)^{1/2\alpha}\right)$$
❖ Passive learning: $O\!\left((d/T)^{\frac{1-\alpha}{2\alpha}}\right)$
Proof outline
BALCAN, BRODER and ZHANG, COLT'07
❖ At each iteration $k$, perform restricted ERM over within-margin data:
$$\hat w_k = \operatorname*{argmin}_{\theta(w, \hat w_{k-1}) \le \beta_{k-1}} \mathrm{err}(w \mid S_1), \qquad S_1 = \{x : |x^\top \hat w_{k-1}| \le b_{k-1}\}$$
Proof outline
❖ Key fact: if $\theta(\hat w_{k-1}, w^*) \le \beta_{k-1}$ and $b_k = \tilde\Theta(\beta_k / \sqrt d)$, then
$$\mathrm{err}(\hat w_k) - \mathrm{err}(w^*) = \tilde O\!\left(\sqrt{d/T}\,\beta_{k-1}\right)$$
❖ Proof idea: decompose the excess error into two terms (combined in the calculation below):
$$\underbrace{[\mathrm{err}(\hat w_k \mid S_1) - \mathrm{err}(w^* \mid S_1)]}_{\tilde O(\sqrt{d/T})}\;\underbrace{\Pr[x \in S_1]}_{\tilde O(b_{k-1}\sqrt d)}$$
$$[\mathrm{err}(\hat w_k \mid S_1^c) - \mathrm{err}(w^* \mid S_1^c)]\,\Pr[x \in S_1^c] = \tilde O(\tan \beta_{k-1})$$
❖ Must ensure $w^*$ is always within reach! $\beta_k = 2^{\alpha-1}\beta_{k-1}$
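Combining the two annotated factors of the within-margin term, and using $b_{k-1} = \tilde\Theta(\beta_{k-1}/\sqrt d)$ from the key fact above:
$$[\mathrm{err}(\hat w_k \mid S_1) - \mathrm{err}(w^* \mid S_1)]\,\Pr[x \in S_1] = \tilde O\!\left(\sqrt{d/T}\right) \cdot \tilde O\!\left(b_{k-1}\sqrt d\right) = \tilde O\!\left(\sqrt{d/T}\,\beta_{k-1}\right),$$
which is exactly the claimed per-iteration excess error.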
Problem
❖ What if $\alpha$ is not known? How should the key parameters $b_k, \beta_k$ be set?
❖ If the true parameter is $\alpha$ but the algorithm is run with $\alpha' > \alpha$...
❖ ...the convergence rate is governed by $\alpha'$ instead of $\alpha$!
Noise-adaptive Algorithm
❖ Agnostic parameter settings (sketched in code below):
$$E = \tfrac{1}{2}\log T, \qquad \beta_k = 2^{-k}\pi, \qquad b_k = \frac{2\beta_k\sqrt{2E}}{\sqrt d}$$
❖ Main analysis: two-phase behavior
❖ "Tipping point" $k^* \in \{1, \cdots, E\}$, depending on $\alpha$
❖ Phase I ($k \le k^*$): we have that $\theta(\hat w_k, w^*) \le \beta_k$
❖ Phase II ($k > k^*$): we have that $\mathrm{err}(\hat w_{k+1}) - \mathrm{err}(\hat w_k) \le \beta_k \cdot \tilde O\!\left(\sqrt{d/T}\right)$
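A sketch of the agnostic schedule in code, under the assumption that the settings read $E = \tfrac12\log T$, $\beta_k = 2^{-k}\pi$, $b_k = 2\beta_k\sqrt{2E}/\sqrt d$; the constants are as reconstructed from the slide and may differ slightly in the paper.

```python
import numpy as np

def agnostic_schedule(T, d):
    """Noise-adaptive parameter settings as read off the slide:
    E = (1/2) log T, beta_k = 2^{-k} * pi, b_k = 2 * beta_k * sqrt(2E) / sqrt(d).
    The exact constants are a reconstruction; the point is that nothing
    here depends on the unknown TNC parameter alpha."""
    E = max(1, int(round(0.5 * np.log(T))))
    betas = [np.pi * 2.0 ** (-k) for k in range(1, E + 1)]
    bs = [2.0 * beta * np.sqrt(2.0 * E) / np.sqrt(d) for beta in betas]
    return E, betas, bs

# Example: a budget of T = 10**4 label queries in d = 50 dimensions.
E, betas, bs = agnostic_schedule(10**4, 50)
```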
Noise-Adaptive Analysis
❖ Main theorem: for all $\alpha \in (0, 1/2)$,
$$\mathrm{err}(\hat w) - \mathrm{err}(w^*) = \tilde O_P\!\left(\left(\frac{d}{T}\right)^{1/2\alpha}\right)$$
❖ Matching the upper bound in [BBZ07]
❖ ... and also a lower bound (this paper)
Lower Bound
❖ Is there any active learning algorithm that can beat the $\tilde O_P\!\left((d/T)^{1/2\alpha}\right)$ rate?
❖ In general, no [Hanneke, 2015]. But the data distribution $D$ in that negative example is quite contrived.
❖ We show that $\tilde O_P\!\left((d/T)^{1/2\alpha}\right)$ is tight even if $D$ is as simple as the uniform distribution over the unit sphere.
Lower Bound
❖ The "Membership Query Synthesis" (QS) setting:
❖ The algorithm $A$ picks an arbitrary data point $x_i$
❖ The algorithm receives its label $y_i$
❖ Repeat the procedure $T$ times, with $T$ the budget
❖ QS is more powerful than the pool-based setting when $D$ has density bounded away from zero.
❖ We prove lower bounds for the QS setting, which imply lower bounds for the pool-based setting.
Tsybakov's Main Theorem
TSYBAKOV and ZAIATS, Introduction to Nonparametric Estimation
❖ Let $F_0 = \{f_0, \cdots, f_M\}$ be a set of models. Suppose
❖ Separation: $D(f_j, f_k) \ge 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \ne k$
❖ Closeness: $\frac{1}{M}\sum_{j=1}^M \mathrm{KL}(P_{f_j} \,\|\, P_{f_0}) \le \gamma \log M$
❖ Regularity: $P_{f_j} \ll P_{f_0}$ for all $j \in \{1, \cdots, M\}$
❖ Then the following bound holds:
$$\inf_{\hat f} \sup_{f \in F_0} \Pr\!\left[D(\hat f, f) \ge \rho\right] \ge \frac{\sqrt M}{1 + \sqrt M}\left(1 - 2\gamma - \sqrt{\frac{2\gamma}{\log M}}\right)$$
Negative Example Construction
❖ Separation: $D(f_j, f_k) \ge 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \ne k$
❖ Find a hypothesis class $W = \{w_1, \cdots, w_M\}$ such that $t \le \theta(w_i, w_j) \le 6.5\,t$ for all $i \ne j$
❖ ... can be done for all $t \in (0, 1/4)$, using constant-weight coding (see the sketch below)
❖ ... can guarantee that $\log |W| = \Omega(d)$
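An illustrative sketch of the packing geometry: tilt constant-weight directions by a small angle around a common axis so that pairwise angles land in a band of width $\Theta(t)$. The paper's construction uses explicit constant-weight codes with guaranteed pairwise overlap; the random greedy version below only conveys the idea, and every name and constant in it is hypothetical.

```python
import numpy as np

def tilted_packing(d, t, M, max_tries=100000, seed=0):
    """Greedy random sketch of a packing {w_1, ..., w_M} on the unit sphere
    with pairwise angles in [t, 6.5 t]: tilt random constant-weight
    directions by an angle ~t around a common axis e_1."""
    rng = np.random.default_rng(seed)
    weight = (d - 1) // 2                       # constant Hamming weight
    axis = np.zeros(d)
    axis[0] = 1.0
    delta = 2.0 * t                             # tilt angle (heuristic constant)
    vectors = []
    for _ in range(max_tries):
        if len(vectors) == M:
            break
        c = np.zeros(d)
        support = rng.choice(np.arange(1, d), size=weight, replace=False)
        c[support] = 1.0 / np.sqrt(weight)      # unit vector orthogonal to e_1
        w = np.cos(delta) * axis + np.sin(delta) * c
        angles = [np.arccos(np.clip(w @ v, -1.0, 1.0)) for v in vectors]
        if all(t <= a <= 6.5 * t for a in angles):
            vectors.append(w)
    return np.array(vectors)
```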
Negative Example Construction
Negative Example Construction M 1 ❖ Closeness: X KL( P f j k P f 0 ) γ log M M j =1 P ( i ) " # X 1 ,Y 1 , ··· ,X T ,Y T ( x 1 , y 1 , · · · , x T , y T ) KL( P i,T k P j,T ) = log E i P ( j ) X 1 ,Y 1 , ··· ,X T ,Y T ( x 1 , y 1 , · · · , x T , y T ) 2 3 t =1 P ( i ) Q T Y t | X t ( y t | x t ) P X t | X 1 ,Y 1 , ··· ,X t − 1 ,Y t − 1 ( x t | x 1 , y 1 , · · · , x t � 1 , y t � 1 ) = 4 log E i 5 Q T t =1 P ( j ) Y t | X t ( y t | x t ) P X t | X 1 ,Y 1 , ··· ,X t − 1 ,Y t − 1 ( x t | x 1 , y 1 , · · · , x t � 1 , y t � 1 ) 2 3 Q T t =1 P ( i ) Y t | X t ( y t | x t ) = 4 log E i 5 t =1 P ( j ) Q T Y t | X t ( y t | x t ) 2 2 3 3 P ( i ) T � Y | X ( y t | x t ) � X = 4 log � X 1 = x 1 , · · · , X T = x T E i 4 E i � 5 5 P ( j ) � Y | X ( y t | x t ) t =1 KL( P ( i ) Y | X ( ·| x ) k P ( j ) T · sup Y | X ( ·| x )) . 2 X
Lower Bound
TSYBAKOV and ZAIATS, Introduction to Nonparametric Estimation
❖ Let $F_0 = \{f_0, \cdots, f_M\}$ be a set of models. Suppose
❖ Separation: $D(f_j, f_k) \ge 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \ne k$
❖ Closeness: $\frac{1}{M}\sum_{j=1}^M \mathrm{KL}(P_{f_j} \,\|\, P_{f_0}) \le \gamma \log M$
❖ Regularity: $P_{f_j} \ll P_{f_0}$ for all $j \in \{1, \cdots, M\}$
❖ Take $\rho = \Theta(t) = \Theta\!\left((d/T)^{(1-\alpha)/2\alpha}\right)$ and $\log M = \Theta(d)$
❖ We have that (converted to an excess-risk bound below)
$$\inf_{\hat w} \sup_{w^*} \Pr\!\left[\theta(\hat w, w^*) \ge \frac{t}{2}\right] = \Omega(1)$$
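Plugging the angle lower bound back into the TNC gives the excess-risk lower bound of the next slide; with $t = \Theta\!\left((d/T)^{(1-\alpha)/2\alpha}\right)$,
$$\mathrm{err}(\hat w) - \mathrm{err}(w^*) \ge \mu\,\theta(\hat w, w^*)^{1/(1-\alpha)} \ge \mu\,(t/2)^{1/(1-\alpha)} = \Omega\!\left(\left(\frac{d}{T}\right)^{1/2\alpha}\right)$$
with probability $\Omega(1)$.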
Lower Bound
❖ Suppose $D$ has density bounded away from zero and fix $\mu > 0$, $\alpha \in (0, 1)$. Let $\mathcal P_{Y|X}(\mu, \alpha)$ be the class of conditional label distributions satisfying the $(\mu, \alpha)$-TNC. Then we have that
$$\inf_A \sup_{P \in \mathcal P_{Y|X}} \mathbb E_P\!\left[\mathrm{err}(\hat w) - \mathrm{err}(w^*)\right] \ge \Omega\!\left[\left(\frac{d}{T}\right)^{1/2\alpha}\right].$$
Extension: "Proactive" learning
❖ Suppose there are $m$ different users (labelers) who share the same classifier $w^*$ but have different TNC parameters $\alpha_1, \cdots, \alpha_m$
❖ The TNC parameters are not known.
❖ At each iteration, the algorithm picks a data point $x$ and also a user $j$, and observes the label $f(x; j)$ returned by labeler $j$
❖ The goal is to estimate the Bayes classifier $w^*$
Extension: "Proactive" learning
❖ Algorithm framework:
❖ Operate in $E = O(\log T)$ iterations.
❖ At each iteration, use conventional bandit algorithms to address the exploration-exploitation tradeoff (see the sketch below)
❖ Key property: the search spaces $\{\beta_k\}$ and margins $\{b_k\}$ do not depend on the unknown TNC parameters.
❖ Many interesting extensions: what if multiple labelers can be involved each time?
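One way the bandit step could look is a UCB1-style rule over labelers inside each iteration; the reward statistics and all names below are hypothetical modelling choices, not the estimator used in the paper.

```python
import numpy as np

def pick_labeler_ucb(reward_sums, pull_counts, step, c=1.0):
    """UCB1-style choice among m labelers: prefer labelers whose past
    queries looked informative, plus an exploration bonus.
    `reward_sums` and `pull_counts` are NumPy arrays of per-labeler
    statistics kept by the caller; how "reward" is defined for a label
    query is a modelling choice, not specified here."""
    untried = np.where(pull_counts == 0)[0]
    if untried.size > 0:                      # try every labeler once first
        return int(untried[0])
    means = reward_sums / pull_counts
    bonus = c * np.sqrt(np.log(step) / pull_counts)
    return int(np.argmax(means + bonus))

# Usage inside one iteration of the proactive loop (schematic):
# j = pick_labeler_ucb(reward_sums, pull_counts, step)
# y = query_label_from(j, x)                 # hypothetical oracle call
# pull_counts[j] += 1; reward_sums[j] += observed_reward(y)
```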
Thanks! Questions?