Noise-adaptive Margin-based Active Learning, and Lower Bounds
Yining Wang, Aarti Singh
Carnegie Mellon University
Machine Learning: the setup
❖ The machine learning problem
❖ Each data point consists of data $x_i$ and label $y_i$: $(x_i, y_i)$
❖ Access to training data $(x_1, y_1), \cdots, (x_n, y_n)$
❖ Goal: train a classifier $\hat f$ to predict $y$ based on $x$
❖ Example: classification, $x_i \in \mathbb{R}^d$, $y_i \in \{+1, -1\}$
Machine learning: passive vs. active
❖ Classical framework: passive learning
❖ I.i.d. training data $(x_i, y_i) \overset{\text{i.i.d.}}{\sim} D$
❖ Evaluation: generalization error $\Pr_D\!\left[y \ne \hat f(x)\right]$
❖ An active learning framework
❖ Data are cheap, but labels are expensive!
❖ Example: medical data (labels require domain knowledge)
❖ Active learning: minimize label requests
Active Learning
❖ Pool-based active learning
❖ The learner $A$ has access to an unlabeled data stream $x_1, x_2, \cdots \overset{\text{i.i.d.}}{\sim} D$
❖ For each $x_i$, the learner decides whether to query; if a label is requested, $A$ obtains $y_i$
❖ Minimize the number of label requests while scanning through a polynomial number of unlabeled data points
Active Learning
❖ Example: learning a homogeneous linear classifier $y_i = \mathrm{sgn}(w^\top x_i) + \text{noise}$
❖ Basic (passive) approach: empirical risk minimization (ERM); a code sketch of this baseline follows below
$$\hat w \in \operatorname*{argmin}_{\|w\|_2 = 1} \sum_{i=1}^n \mathbb{I}\left[y_i \ne \mathrm{sgn}(w^\top x_i)\right]$$
❖ How about active learning?
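A minimal sketch of what the passive ERM baseline might look like in code, assuming a logistic surrogate in place of the intractable 0-1 objective; the function names (`fit_passive_erm`, `empirical_error`) are illustrative, not from the paper.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_passive_erm(X, y, lr=0.1, epochs=200):
    """Approximate the 0-1 ERM over unit-norm w with a logistic surrogate,
    then project onto the unit sphere.  The slide's ERM uses the 0-1 loss
    directly; the surrogate is only for illustration."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)                       # y_i * <w, x_i>
        grad = -(X * (y * _sigmoid(-margins))[:, None]).mean(axis=0)
        w -= lr * grad
    return w / np.linalg.norm(w)                    # homogeneous classifier, ||w||_2 = 1

def empirical_error(w, X, y):
    """Empirical 0-1 risk of the classifier sgn(<w, x>)."""
    return float(np.mean(np.sign(X @ w) != y))
```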
Margin-based Active Learning
BALCAN, BRODER and ZHANG, COLT'07
❖ Data dimension $d$, query budget $T$, number of iterations $E$
❖ At each iteration $k \in \{1, \cdots, E\}$:
❖ Determine parameters $b_{k-1}, \beta_{k-1}$
❖ Find $n = T/E$ samples in $\{x \in \mathbb{R}^d : |\hat w_{k-1} \cdot x| \le b_{k-1}\}$
❖ Constrained ERM: $\hat w_k = \operatorname*{argmin}_{\theta(w, \hat w_{k-1}) \le \beta_{k-1}} L(\{x_i, y_i\}_{i=1}^n; w)$
❖ Final output: $\hat w_E$ (a schematic of the loop follows below)
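A schematic of the loop above, with the sampling oracle, label oracle, parameter schedules, and constrained ERM subroutine left as abstract callables; all names are hypothetical stand-ins for the components on the slide, not the authors' implementation.

```python
import numpy as np

def margin_based_active_learning(sample_unlabeled, query_label, d, T, E,
                                 b_schedule, beta_schedule, constrained_erm):
    """Schematic of the margin-based loop (Balcan, Broder & Zhang, COLT'07).
    `sample_unlabeled`, `query_label`, `b_schedule`, `beta_schedule` and
    `constrained_erm` are hypothetical callables standing in for the
    components described on the slide."""
    n = T // E                                      # label budget per iteration
    w_hat = np.random.randn(d)
    w_hat /= np.linalg.norm(w_hat)                  # arbitrary unit-norm start
    for k in range(1, E + 1):
        b, beta = b_schedule(k - 1), beta_schedule(k - 1)
        # Collect n points that fall inside the margin band around w_hat.
        X_band, y_band = [], []
        while len(X_band) < n:
            x = sample_unlabeled()
            if abs(w_hat @ x) <= b:                 # within-margin region S_1
                X_band.append(x)
                y_band.append(query_label(x))       # spend one label query
        # ERM restricted to the cone {w : angle(w, w_hat) <= beta}.
        w_hat = constrained_erm(np.array(X_band), np.array(y_band),
                                center=w_hat, max_angle=beta)
    return w_hat
```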
Tsybakov Noise Condition
❖ There exist constants $\mu > 0$, $\alpha \in (0, 1)$ such that
$$\mu \cdot \theta(w, w^*)^{1/(1-\alpha)} \le \mathrm{err}(w) - \mathrm{err}(w^*)$$
❖ $\alpha \in (0, 1)$: key noise magnitude parameter in the TNC
❖ Which one is harder: small $\alpha$ or large $\alpha$? (see the rearrangement below)
[Figure: excess error $\mathrm{err}(w) - \mathrm{err}(w^*)$ plotted against $\theta(w, w^*)$, with curves for small $\alpha$ and large $\alpha$]
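To see why large $\alpha$ is the harder regime, rearrange the condition above (a direct algebraic step, nothing beyond the stated inequality):
$$\theta(w, w^*) \le \left(\frac{\mathrm{err}(w) - \mathrm{err}(w^*)}{\mu}\right)^{1-\alpha}.$$
As $\alpha \to 1$ the exponent $1-\alpha \to 0$, so a small excess error barely constrains the angle: the excess risk is very flat around $w^*$, which is the hard case. As $\alpha \to 0$ the relation is nearly linear and easy.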
Margin-based Active Learning
❖ Main Theorem [BBZ07]: when $D$ is the uniform distribution, the margin-based algorithm achieves
$$\mathrm{err}(\hat w) - \mathrm{err}(w^*) = \tilde O_P\!\left(\left(\frac{d}{T}\right)^{1/2\alpha}\right)$$
❖ Passive learning: $O\!\left((d/T)^{\frac{1-\alpha}{2\alpha}}\right)$
Proof outline
BALCAN, BRODER and ZHANG, COLT'07
❖ At each iteration $k$, perform restricted ERM over within-margin data:
$$\hat w_k = \operatorname*{argmin}_{\theta(w, \hat w_{k-1}) \le \beta_{k-1}} \mathrm{err}(w \mid S_1), \qquad S_1 = \{x : |x^\top \hat w_{k-1}| \le b_{k-1}\}$$
Proof outline
❖ Key fact: if $\theta(\hat w_{k-1}, w^*) \le \beta_{k-1}$ and $b_k = \tilde\Theta(\beta_k / \sqrt d)$, then
$$\mathrm{err}(\hat w_k) - \mathrm{err}(w^*) = \tilde O\!\left(\sqrt{d/T}\,\beta_{k-1}\right)$$
❖ Proof idea: decompose the excess error into two terms (combined in the calculation below):
$$\underbrace{[\mathrm{err}(\hat w_k \mid S_1) - \mathrm{err}(w^* \mid S_1)]}_{\tilde O(\sqrt{d/T})}\;\underbrace{\Pr[x \in S_1]}_{\tilde O(b_{k-1}\sqrt d)}$$
$$[\mathrm{err}(\hat w_k \mid S_1^c) - \mathrm{err}(w^* \mid S_1^c)]\,\Pr[x \in S_1^c] = \tilde O(\tan \beta_{k-1})$$
❖ Must ensure $w^*$ is always within reach! $\beta_k = 2^{\alpha-1}\beta_{k-1}$
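Combining the two annotated factors of the within-margin term, and using $b_{k-1} = \tilde\Theta(\beta_{k-1}/\sqrt d)$ from the key fact above:
$$[\mathrm{err}(\hat w_k \mid S_1) - \mathrm{err}(w^* \mid S_1)]\,\Pr[x \in S_1] = \tilde O\!\left(\sqrt{d/T}\right) \cdot \tilde O\!\left(b_{k-1}\sqrt d\right) = \tilde O\!\left(\sqrt{d/T}\,\beta_{k-1}\right),$$
which is exactly the claimed per-iteration excess error.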
Problem
❖ What if $\alpha$ is not known? How should the key parameters $b_k, \beta_k$ be set?
❖ If the true parameter is $\alpha$ but the algorithm is run with $\alpha' > \alpha$...
❖ ...the convergence rate is governed by $\alpha'$ instead of $\alpha$!
Noise-adaptive Algorithm
❖ Agnostic parameter settings (sketched in code below):
$$E = \tfrac{1}{2}\log T, \qquad \beta_k = 2^{-k}\pi, \qquad b_k = \frac{2\beta_k\sqrt{2E}}{\sqrt d}$$
❖ Main analysis: two-phase behavior
❖ "Tipping point" $k^* \in \{1, \cdots, E\}$, depending on $\alpha$
❖ Phase I ($k \le k^*$): we have that $\theta(\hat w_k, w^*) \le \beta_k$
❖ Phase II ($k > k^*$): we have that $\mathrm{err}(\hat w_{k+1}) - \mathrm{err}(\hat w_k) \le \beta_k \cdot \tilde O\!\left(\sqrt{d/T}\right)$
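A sketch of the agnostic schedule in code, under the assumption that the settings read $E = \tfrac12\log T$, $\beta_k = 2^{-k}\pi$, $b_k = 2\beta_k\sqrt{2E}/\sqrt d$; the constants are as reconstructed from the slide and may differ slightly in the paper.

```python
import numpy as np

def agnostic_schedule(T, d):
    """Noise-adaptive parameter settings as read off the slide:
    E = (1/2) log T, beta_k = 2^{-k} * pi, b_k = 2 * beta_k * sqrt(2E) / sqrt(d).
    The exact constants are a reconstruction; the point is that nothing
    here depends on the unknown TNC parameter alpha."""
    E = max(1, int(round(0.5 * np.log(T))))
    betas = [np.pi * 2.0 ** (-k) for k in range(1, E + 1)]
    bs = [2.0 * beta * np.sqrt(2.0 * E) / np.sqrt(d) for beta in betas]
    return E, betas, bs

# Example: a budget of T = 10**4 label queries in d = 50 dimensions.
E, betas, bs = agnostic_schedule(10**4, 50)
```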
Noise-Adaptive Analysis
❖ Main theorem: for all $\alpha \in (0, 1/2)$,
$$\mathrm{err}(\hat w) - \mathrm{err}(w^*) = \tilde O_P\!\left(\left(\frac{d}{T}\right)^{1/2\alpha}\right)$$
❖ Matching the upper bound in [BBZ07]
❖ ... and also a lower bound (this paper)
Lower Bound
❖ Is there any active learning algorithm that can beat the $\tilde O_P\!\left((d/T)^{1/2\alpha}\right)$ rate?
❖ In general, no [Hanneke, 2015]. But the data distribution $D$ in that negative example is quite contrived.
❖ We show that $\tilde O_P\!\left((d/T)^{1/2\alpha}\right)$ is tight even if $D$ is as simple as the uniform distribution over the unit sphere.
Lower Bound
❖ The "Membership Query Synthesis" (QS) setting:
❖ The algorithm $A$ picks an arbitrary data point $x_i$
❖ The algorithm receives its label $y_i$
❖ Repeat the procedure $T$ times, with $T$ the budget
❖ QS is more powerful than the pool-based setting when $D$ has density bounded away from zero.
❖ We prove lower bounds for the QS setting, which imply lower bounds for the pool-based setting.
Tsybakov's Main Theorem
TSYBAKOV and ZAIATS, Introduction to Nonparametric Estimation
❖ Let $F_0 = \{f_0, \cdots, f_M\}$ be a set of models. Suppose
❖ Separation: $D(f_j, f_k) \ge 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \ne k$
❖ Closeness: $\frac{1}{M}\sum_{j=1}^M \mathrm{KL}(P_{f_j} \,\|\, P_{f_0}) \le \gamma \log M$
❖ Regularity: $P_{f_j} \ll P_{f_0}$ for all $j \in \{1, \cdots, M\}$
❖ Then the following bound holds:
$$\inf_{\hat f} \sup_{f \in F_0} \Pr\!\left[D(\hat f, f) \ge \rho\right] \ge \frac{\sqrt M}{1 + \sqrt M}\left(1 - 2\gamma - \sqrt{\frac{2\gamma}{\log M}}\right)$$
Negative Example Construction
❖ Separation: $D(f_j, f_k) \ge 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \ne k$
❖ Find a hypothesis class $W = \{w_1, \cdots, w_M\}$ such that $t \le \theta(w_i, w_j) \le 6.5\,t$ for all $i \ne j$
❖ ... can be done for all $t \in (0, 1/4)$, using constant-weight coding (see the sketch below)
❖ ... can guarantee that $\log |W| = \Omega(d)$
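An illustrative sketch of the packing geometry: tilt constant-weight directions by a small angle around a common axis so that pairwise angles land in a band of width $\Theta(t)$. The paper's construction uses explicit constant-weight codes with guaranteed pairwise overlap; the random greedy version below only conveys the idea, and every name and constant in it is hypothetical.

```python
import numpy as np

def tilted_packing(d, t, M, max_tries=100000, seed=0):
    """Greedy random sketch of a packing {w_1, ..., w_M} on the unit sphere
    with pairwise angles in [t, 6.5 t]: tilt random constant-weight
    directions by an angle ~t around a common axis e_1."""
    rng = np.random.default_rng(seed)
    weight = (d - 1) // 2                       # constant Hamming weight
    axis = np.zeros(d)
    axis[0] = 1.0
    delta = 2.0 * t                             # tilt angle (heuristic constant)
    vectors = []
    for _ in range(max_tries):
        if len(vectors) == M:
            break
        c = np.zeros(d)
        support = rng.choice(np.arange(1, d), size=weight, replace=False)
        c[support] = 1.0 / np.sqrt(weight)      # unit vector orthogonal to e_1
        w = np.cos(delta) * axis + np.sin(delta) * c
        angles = [np.arccos(np.clip(w @ v, -1.0, 1.0)) for v in vectors]
        if all(t <= a <= 6.5 * t for a in angles):
            vectors.append(w)
    return np.array(vectors)
```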
Negative Example Construction
Negative Example Construction M 1 ❖ Closeness: X KL( P f j k P f 0 ) γ log M M j =1 P ( i ) " # X 1 ,Y 1 , ··· ,X T ,Y T ( x 1 , y 1 , · · · , x T , y T ) KL( P i,T k P j,T ) = log E i P ( j ) X 1 ,Y 1 , ··· ,X T ,Y T ( x 1 , y 1 , · · · , x T , y T ) 2 3 t =1 P ( i ) Q T Y t | X t ( y t | x t ) P X t | X 1 ,Y 1 , ··· ,X t − 1 ,Y t − 1 ( x t | x 1 , y 1 , · · · , x t � 1 , y t � 1 ) = 4 log E i 5 Q T t =1 P ( j ) Y t | X t ( y t | x t ) P X t | X 1 ,Y 1 , ··· ,X t − 1 ,Y t − 1 ( x t | x 1 , y 1 , · · · , x t � 1 , y t � 1 ) 2 3 Q T t =1 P ( i ) Y t | X t ( y t | x t ) = 4 log E i 5 t =1 P ( j ) Q T Y t | X t ( y t | x t ) 2 2 3 3 P ( i ) T � Y | X ( y t | x t ) � X = 4 log � X 1 = x 1 , · · · , X T = x T E i 4 E i � 5 5 P ( j ) � Y | X ( y t | x t ) t =1 KL( P ( i ) Y | X ( ·| x ) k P ( j ) T · sup Y | X ( ·| x )) . 2 X
Lower Bound
TSYBAKOV and ZAIATS, Introduction to Nonparametric Estimation
❖ Let $F_0 = \{f_0, \cdots, f_M\}$ be a set of models. Suppose
❖ Separation: $D(f_j, f_k) \ge 2\rho$ for all $j, k \in \{1, \cdots, M\}$, $j \ne k$
❖ Closeness: $\frac{1}{M}\sum_{j=1}^M \mathrm{KL}(P_{f_j} \,\|\, P_{f_0}) \le \gamma \log M$
❖ Regularity: $P_{f_j} \ll P_{f_0}$ for all $j \in \{1, \cdots, M\}$
❖ Take $\rho = \Theta(t) = \Theta\!\left((d/T)^{(1-\alpha)/2\alpha}\right)$ and $\log M = \Theta(d)$
❖ We have that (converted to an excess-risk bound below)
$$\inf_{\hat w} \sup_{w^*} \Pr\!\left[\theta(\hat w, w^*) \ge \frac{t}{2}\right] = \Omega(1)$$
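Plugging the angle lower bound back into the TNC gives the excess-risk lower bound of the next slide; with $t = \Theta\!\left((d/T)^{(1-\alpha)/2\alpha}\right)$,
$$\mathrm{err}(\hat w) - \mathrm{err}(w^*) \ge \mu\,\theta(\hat w, w^*)^{1/(1-\alpha)} \ge \mu\,(t/2)^{1/(1-\alpha)} = \Omega\!\left(\left(\frac{d}{T}\right)^{1/2\alpha}\right)$$
with probability $\Omega(1)$.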
Lower Bound
❖ Suppose $D$ has density bounded away from zero and fix $\mu > 0$, $\alpha \in (0, 1)$. Let $\mathcal P_{Y|X}(\mu, \alpha)$ be the class of conditional label distributions satisfying the $(\mu, \alpha)$-TNC. Then we have that
$$\inf_A \sup_{P \in \mathcal P_{Y|X}} \mathbb E_P\!\left[\mathrm{err}(\hat w) - \mathrm{err}(w^*)\right] \ge \Omega\!\left[\left(\frac{d}{T}\right)^{1/2\alpha}\right].$$
Extension: "Proactive" learning
❖ Suppose there are $m$ different users (labelers) who share the same classifier $w^*$ but have different TNC parameters $\alpha_1, \cdots, \alpha_m$
❖ The TNC parameters are not known.
❖ At each iteration, the algorithm picks a data point $x$ and also a user $j$, and observes the label $f(x; j)$ returned by labeler $j$
❖ The goal is to estimate the Bayes classifier $w^*$
Extension: "Proactive" learning
❖ Algorithm framework:
❖ Operate in $E = O(\log T)$ iterations.
❖ At each iteration, use conventional bandit algorithms to address the exploration-exploitation tradeoff (see the sketch below)
❖ Key property: the search spaces $\{\beta_k\}$ and margins $\{b_k\}$ do not depend on the unknown TNC parameters.
❖ Many interesting extensions: what if multiple labelers can be involved each time?
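One way the bandit step could look is a UCB1-style rule over labelers inside each iteration; the reward statistics and all names below are hypothetical modelling choices, not the estimator used in the paper.

```python
import numpy as np

def pick_labeler_ucb(reward_sums, pull_counts, step, c=1.0):
    """UCB1-style choice among m labelers: prefer labelers whose past
    queries looked informative, plus an exploration bonus.
    `reward_sums` and `pull_counts` are NumPy arrays of per-labeler
    statistics kept by the caller; how "reward" is defined for a label
    query is a modelling choice, not specified here."""
    untried = np.where(pull_counts == 0)[0]
    if untried.size > 0:                      # try every labeler once first
        return int(untried[0])
    means = reward_sums / pull_counts
    bonus = c * np.sqrt(np.log(step) / pull_counts)
    return int(np.argmax(means + bonus))

# Usage inside one iteration of the proactive loop (schematic):
# j = pick_labeler_ucb(reward_sums, pull_counts, step)
# y = query_label_from(j, x)                 # hypothetical oracle call
# pull_counts[j] += 1; reward_sums[j] += observed_reward(y)
```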
Thanks! Questions?