Selective Sampling (Realizable)
Ji Xu
October 2nd, 2017
Basic Settings

Model:
◮ D: a distribution over X × Y, where X is the input space and Y = {±1} is the set of possible labels.
◮ (X, Y) ∈ X × Y is a pair of random variables with joint distribution D.
◮ H is a set of hypotheses mapping X to Y. The error of a hypothesis h : X → Y is err(h) := Pr(h(X) ≠ Y).
◮ Let h* := argmin{ err(h) : h ∈ H } be a hypothesis with minimum error in H.
Basic Settings

Goal: with high probability, return ĥ ∈ H such that

    err(ĥ) ≤ err(h*) + ε.

In the realizable case we have err(h*) = 0; hence we want err(ĥ) ≤ ε.
Basic Settings

Passive vs. Active:
◮ Passive setting:
  ◮ At time t, observe X_t and choose h_t ∈ H.
  ◮ Make the prediction h_t(X_t) and then observe the feedback Y_t.
  ◮ Minimize the total number of mistakes, i.e., rounds with h_t(X_t) ≠ Y_t.
Basic Settings

Passive vs. Active:
◮ Active setting:
  ◮ At time t, observe X_t.
  ◮ Choose whether to request the feedback Y_t.
  ◮ Minimize both the number of mistakes of ĥ and the total number of queries for the correct label Y_t.

Hence, intuitively, (X_t, Y_t) provides no information if h(X_t) is the same for every hypothesis still under consideration at time t, and thus we should not query such X_t.
Concepts

Definition
For a set of hypotheses V, the region of disagreement R(V) is

    R(V) := { x ∈ X : ∃ h, h' ∈ V such that h(x) ≠ h'(x) }.

Definition
For a hypothesis set H and a sample set Z_T = { (X_t, Y_t) : t = 1, …, T }, the uncertainty region U(H, Z_T) is

    U(H, Z_T) := { x ∈ X : ∃ h, h' ∈ H such that h(x) ≠ h'(x) and h(X_t) = h'(X_t) = Y_t for all t ∈ [T] }.
Remarks
◮ Let C = { h ∈ H : h(X_t) = Y_t for all t ∈ [T] }. Then U(H, Z_T) = R(C).
◮ Ideally, the probability mass of the uncertainty region shrinks quickly as training samples accumulate (it is always non-increasing, since adding samples only shrinks C).
◮ If we can control the sampling procedure over X_t, it is better to sample only from U(H, Z_t) (selective sampling, or approximate selective sampling).
◮ In the realizable case, the labels inferred for points X_t we do not query are guaranteed correct; we only need to query X_{t+1} if X_{t+1} ∈ U(H, Z_t).
◮ The complexity of finding a good set Ĥ with h* ∈ Ĥ ⊆ H can intuitively be measured by the ratio of Pr(X ∈ R(Ĥ)) to sup_{h ∈ Ĥ} err(h); this motivates the disagreement coefficient defined below.
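To make these definitions concrete, here is a minimal runnable sketch (my own illustration, not from the slides) for 1-D threshold classifiers h_c(x) = +1 if x ≥ c else −1. For this class the consistent set C is an interval of thresholds, and U(H, Z_T) = R(C) is exactly the gap between the rightmost negative example and the leftmost positive example. The helper names are my choices.

```python
# Sketch for 1-D thresholds h_c(x) = +1 if x >= c else -1 (realizable).
# Consistency requires c > every negative example and c <= every positive
# example, so C corresponds to thresholds in the interval (lo, hi].

def version_space_interval(samples):
    """Return (lo, hi): thresholds c in (lo, hi] are consistent with samples."""
    lo = max((x for x, y in samples if y == -1), default=float("-inf"))
    hi = min((x for x, y in samples if y == +1), default=float("inf"))
    return lo, hi

def in_uncertainty_region(x, samples):
    """x is in U(H, Z_T) = R(C) iff two consistent thresholds disagree at x."""
    lo, hi = version_space_interval(samples)
    return lo < x < hi  # some consistent c <= x and some consistent c > x

Z = [(0.2, -1), (0.9, +1), (0.1, -1)]
print(version_space_interval(Z))      # (0.2, 0.9)
print(in_uncertainty_region(0.5, Z))  # True: consistent hypotheses disagree
print(in_uncertainty_region(0.05, Z)) # False: every consistent h says -1
```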
Concepts

Definition
We redefine the region of disagreement: R(h, r), of radius r around a hypothesis h ∈ H in the disagreement metric space (H, ρ), is

    R(h, r) := { x ∈ X : ∃ h' ∈ B(h, r) such that h(x) ≠ h'(x) },

where the disagreement (pseudo)metric ρ on H is defined by ρ(h, h') := Pr(h(X) ≠ h'(X)) and B(h, r) := { h' ∈ H : ρ(h, h') ≤ r }. Hence, in the realizable case, err(h) = ρ(h, h*).

Remark: We have R(h*, r) ⊆ R(B(h*, r)), but the reverse inclusion may fail in general.
Concepts

Definition
The disagreement coefficient θ(h, H, D) with respect to a hypothesis h ∈ H in the disagreement metric space (H, ρ) is

    θ(h, H, D) := sup_{r > 0} Pr(X ∈ R(h, r)) / r.

Examples:
◮ X is uniform on [0, 1] and H = { h = I{X ≥ r} : r > 0 }. Then θ(h, H, D) = 2 for all h ∈ H.
◮ Replace H by H = { h = I{X ∈ [a, b]} : 0 < a < b < 1 }. Then θ(h, H, D) = max(4, 1/Pr(h(X) = 1)) for all h ∈ H.
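As a numerical sanity check on the first example, here is a hedged sketch (mine, not from the slides) that estimates θ by Monte Carlo. It uses the fact that for thresholds under uniform X, ρ(h_c, h_{c'}) = |c − c'|, so R(h_c, r) is (up to endpoints) the set { x : |x − c| < r }; the function names and the grid of radii are assumptions of this illustration.

```python
import numpy as np

def mass_of_disagreement_region(c, r, n=200_000):
    """Monte Carlo estimate of Pr(X in R(h_c, r)) for X ~ Uniform[0, 1]."""
    x = np.random.default_rng(0).uniform(0.0, 1.0, size=n)
    # x lies in R(h_c, r) iff some threshold within rho-distance r of c
    # labels x differently from h_c, i.e. iff |x - c| < r.
    return np.mean(np.abs(x - c) < r)

def disagreement_coefficient(c, radii):
    """Approximate sup_{r > 0} Pr(X in R(h_c, r)) / r over a finite grid."""
    return max(mass_of_disagreement_region(c, r) / r for r in radii)

radii = np.linspace(0.01, 0.4, 40)
print(disagreement_coefficient(0.5, radii))  # ~ 2, matching theta = 2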
Examples

Proposition
Let P_X be the uniform distribution on the unit sphere S^{d−1} := { x ∈ R^d : ‖x‖_2 = 1 } ⊂ R^d, and let H be the class of homogeneous linear threshold functions in R^d, i.e.,

    H = { h_w : h_w(x) = sign(⟨w, x⟩), w ∈ S^{d−1} }.

Then there is an absolute constant C > 0 such that θ(h, H, P_X) ≤ C·√d.
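A hedged Monte Carlo sketch (mine, not from the slides) of the √d scaling. It relies on a geometric identity I am assuming here: for this class ρ(h_w, h_{w'}) = angle(w, w')/π, so x ∈ R(h_w, r) iff the angular distance from x to the boundary {⟨w, x⟩ = 0} is at most πr, i.e. iff |⟨w, x⟩| ≤ sin(πr) (valid for πr ≤ π/2).

```python
import numpy as np

def theta_estimate(d, radii, n=200_000, seed=0):
    """Approximate theta(h_w, H, P_X) for uniform X on the sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # uniform on S^{d-1}
    w = np.zeros(d); w[0] = 1.0                    # by symmetry, any w works
    margins = np.abs(x @ w)
    # Pr(X in R(h_w, r)) = Pr(|<w, X>| <= sin(pi r)); take sup of mass/r.
    return max(np.mean(margins <= np.sin(np.pi * r)) / r for r in radii)

for d in (4, 16, 64):
    est = theta_estimate(d, radii=np.linspace(0.01, 0.25, 25))
    print(d, est, est / np.sqrt(d))  # est grows like sqrt(d): ratio ~ constant
```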
Algorithm (CAL)
◮ Initialize: Z_0 := ∅, V_0 := H.
◮ For t = 1, 2, …, n:
  ◮ Obtain unlabeled data point X_t.
  ◮ If X_t ∈ R(V_{t−1}):
    (a) Then: query Y_t, and set Z_t := Z_{t−1} ∪ {(X_t, Y_t)}.
    (b) Else: set Ỹ_t := h(X_t) for any h ∈ V_{t−1}, and set Z_t := Z_{t−1} ∪ {(X_t, Ỹ_t)}; OR simply set Z_t := Z_{t−1}.
  ◮ Set V_t := { h ∈ H : h(X_i) = Y_i for all (X_i, Y_i) ∈ Z_t }.
◮ Return: any h ∈ V_n.
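A runnable instantiation (my choice of class, not from the slides) of CAL for 1-D thresholds with true threshold c* (realizable). The version space is tracked exactly as an interval of consistent thresholds, and the second option of step (b) is used: unqueried points leave the version space unchanged.

```python
import numpy as np

def cal_thresholds(xs, c_star):
    """CAL for h_c(x) = +1 if x >= c else -1; V_t is the interval (lo, hi]."""
    lo, hi = 0.0, 1.0          # V_0: all thresholds in (0, 1]
    n_queries = 0
    for x in xs:
        if lo < x < hi:        # x in R(V_{t-1}): consistent hypotheses disagree
            y = 1 if x >= c_star else -1   # query the oracle
            n_queries += 1
            if y == 1:
                hi = min(hi, x)  # consistent thresholds must satisfy c <= x
            else:
                lo = max(lo, x)  # consistent thresholds must satisfy c > x
        # else: the label is determined; version space is unchanged
    return (lo + hi) / 2, n_queries      # return any h in V_n

xs = np.random.default_rng(0).uniform(0, 1, size=10_000)
c_hat, q = cal_thresholds(xs, c_star=0.37)
print(c_hat, q)  # error shrinks like O(1/n) using only O(log n) label queries
```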
Algorithm (Reduction-based CAL)
◮ Initialize: Z_0 := ∅.
◮ For t = 1, 2, …, n:
  ◮ Obtain unlabeled data point X_t.
  ◮ If there exist both:
    • h_+ ∈ H consistent with Z_{t−1} ∪ {(X_t, +1)}, and
    • h_− ∈ H consistent with Z_{t−1} ∪ {(X_t, −1)}:
    (a) Then: query Y_t, and set Z_t := Z_{t−1} ∪ {(X_t, Y_t)}.
    (b) Else: only h_y exists for some y ∈ {±1}; set Ỹ_t := y and Z_t := Z_{t−1} ∪ {(X_t, Ỹ_t)}.
◮ Return: any h ∈ H consistent with Z_n.

Remark: Reduction-based CAL is equivalent to CAL; it replaces the explicit version space V_t with two consistency checks per round.
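A hedged sketch (the oracle interface is my assumption; the slides do not specify one) showing the reduction: each round makes two calls to a consistency oracle "does some h ∈ H fit this labeled set?", instantiated below for the same 1-D threshold class.

```python
def reduction_based_cal(xs, query_label, consistent):
    """consistent(Z) -> bool: is some h in H consistent with every pair in Z?"""
    Z = []
    for x in xs:
        can_pos = consistent(Z + [(x, +1)])
        can_neg = consistent(Z + [(x, -1)])
        if can_pos and can_neg:          # x is in the disagreement region
            Z.append((x, query_label(x)))
        else:                            # the label is forced: infer it for free
            Z.append((x, +1 if can_pos else -1))
    return Z

def threshold_consistent(Z):
    """Oracle for h_c(x) = +1 if x >= c else -1: C is nonempty iff lo < hi."""
    lo = max((x for x, y in Z if y == -1), default=float("-inf"))
    hi = min((x for x, y in Z if y == +1), default=float("inf"))
    return lo < hi

c_star = 0.37
Z = reduction_based_cal([0.9, 0.1, 0.5, 0.3, 0.4],
                        lambda x: 1 if x >= c_star else -1,
                        threshold_consistent)
print(Z)  # every pair carries the correct label (realizable case)
```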
Label Complexity Analysis

Theorem
The expected number of labels queried by reduction-based CAL after n iterations is at most

    O( θ(h*, H, D) · d · log² n ),

where d is the VC dimension of the class H. Moreover, for any ε > 0 and δ > 0, if

    n = O( (1/ε)(d log(1/ε) + log(1/δ)) ),

then with probability 1 − δ the hypothesis ĥ returned by reduction-based CAL satisfies err(ĥ) ≤ ε.
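To get a feel for the gap between passive and active label costs, here is a rough plug-in with illustrative numbers of my choosing (constants suppressed): d = 10, ε = 0.01, δ = 0.05, and θ(h*, H, D) = O(1).

```latex
\[
  n = O\!\Big(\tfrac{1}{\epsilon}\big(d\log\tfrac1\epsilon + \log\tfrac1\delta\big)\Big)
    \approx 100 \cdot (10 \cdot 4.6 + 3) \approx 5000
  \quad\text{unlabeled points},
\]
\[
  \mathbb{E}[\#\text{queries}] = O\big(\theta\, d \log^2 n\big)
    \approx 10 \cdot (\log 5000)^2 \approx 730,
\]
% versus roughly all 5000 labels in the passive setting: the label cost
% drops from O(1/\epsilon) to O(\mathrm{polylog}(1/\epsilon)).
```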
Proof
Note that, with probability 1 − δ_t, any h ∈ H consistent with Z_t has error at most

    err(h) ≤ O( (1/t)(d log t + log(1/δ_t)) ) =: r_t,

where δ_t > 0 will be chosen later. (This is the uniform convergence bound specialized to the realizable case, i.e., the case P_n f_n = 0, P f = 0.) Inverting this bound also gives the sample size n = O((1/ε)(d log(1/ε) + log(1/δ))) claimed in the theorem.

Let G_t be the event that the bound above holds. Then, conditioned on G_t, we have

    { h ∈ H : h is consistent with Z_t } ⊆ B(h*, r_t).
Proof
Note that we query Y_{t+1} if and only if some h ∈ H is consistent with Z_t ∪ {(X_{t+1}, −h*(X_{t+1}))} (i.e., some consistent h disagrees with h* at X_{t+1}). Hence, conditioned on G_t, if we query Y_{t+1} then X_{t+1} ∈ R(h*, r_t). Therefore,

    Pr(Y_{t+1} is queried | G_t) ≤ Pr(X_{t+1} ∈ R(h*, r_t) | G_t).
Proof
Let Q_t := I{Y_t is queried}. The expected total number of queries is

    Σ_{t=1}^{n} E[Q_t]
        ≤ 1 + Σ_{t=1}^{n−1} Pr(Q_{t+1} = 1)
        = 1 + Σ_{t=1}^{n−1} [ Pr(Q_{t+1} = 1 | G_t) Pr(G_t) + Pr(Q_{t+1} = 1 | ¬G_t)(1 − Pr(G_t)) ]
        ≤ 1 + Σ_{t=1}^{n−1} [ Pr(Q_{t+1} = 1 | G_t) Pr(G_t) + δ_t ]
        ≤ 1 + Σ_{t=1}^{n−1} [ Pr(X_{t+1} ∈ R(h*, r_t) | G_t) Pr(G_t) + δ_t ],

using Pr(¬G_t) ≤ δ_t in the third line.
Proof
By the definition of the disagreement coefficient, we have

    Pr(X_{t+1} ∈ R(h*, r_t) | G_t) Pr(G_t) ≤ Pr(X_{t+1} ∈ R(h*, r_t)) ≤ r_t · θ(h*, H, D).

Hence,

    Σ_{t=1}^{n} E[Q_t] ≤ 1 + Σ_{t=1}^{n−1} [ r_t · θ(h*, H, D) + δ_t ]
                       = O( Σ_{t=1}^{n−1} [ (θ(h*, H, D)/t)(d log t + log(1/δ_t)) + δ_t ] ).

Choosing δ_t = 1/t, we obtain

    Σ_{t=1}^{n} E[Q_t] ≤ O( θ(h*, H, D) · d · log² n ).
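The last step hides an elementary estimate; for completeness, here is a sketch (mine) of why the choice δ_t = 1/t yields the log² n bound:

```latex
\[
  \sum_{t=1}^{n-1} \left( \frac{d\log t + \log t}{t} + \frac{1}{t} \right)
  \;\le\; (d+1)\sum_{t=1}^{n-1} \frac{\log t}{t} + \sum_{t=1}^{n-1}\frac{1}{t}
  \;\le\; (d+1)\,\frac{\log^2 n}{2} + O(\log n)
  \;=\; O\!\left(d \log^2 n\right),
\]
% using \sum_{t \le n} (\log t)/t \le \int_1^n (\log s)/s \, ds + O(1)
%     = \tfrac{1}{2}\log^2 n + O(1).
```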