Selective Sampling (Realizable)
Ji Xu
October 2nd, 2017
Basic Settings

Model:
◮ D: a distribution over X × Y, where X is the input space and Y = {±1} is the set of possible labels.
◮ (X, Y) ∈ X × Y is a pair of random variables with joint distribution D.
◮ H is a set of hypotheses mapping X to Y. The error of a hypothesis h : X → Y is err(h) := Pr(h(X) ≠ Y).
◮ Let h* := argmin{ err(h) : h ∈ H } be a hypothesis with minimum error in H.
Basic Settings

Goal: with high probability, return ĥ ∈ H such that

    err(ĥ) ≤ err(h*) + ε.

In the realizable case we have err(h*) = 0; hence we want err(ĥ) ≤ ε.
Basic Settings

Passive vs. Active:
◮ Passive setting:
  ◮ At time t, observe X_t and choose h_t ∈ H.
  ◮ Make the prediction h_t(X_t) and then observe the feedback Y_t.
  ◮ Minimize the total number of mistakes, i.e., rounds with h_t(X_t) ≠ Y_t.
Basic Settings

Passive vs. Active:
◮ Active setting:
  ◮ At time t, observe X_t.
  ◮ Choose whether to request the feedback Y_t.
  ◮ Minimize both the number of mistakes of ĥ and the total number of queries for the correct label Y_t.

Hence, intuitively, (X_t, Y_t) provides no information if h(X_t) is the same for every hypothesis still under consideration at time t, and thus we should not query such X_t.
Concepts

Definition
For a set of hypotheses V, the region of disagreement R(V) is

    R(V) := { x ∈ X : ∃ h, h' ∈ V such that h(x) ≠ h'(x) }.

Definition
For a hypothesis set H and a sample set Z_T = { (X_t, Y_t) : t = 1, …, T }, the uncertainty region U(H, Z_T) is

    U(H, Z_T) := { x ∈ X : ∃ h, h' ∈ H such that h(x) ≠ h'(x) and h(X_t) = h'(X_t) = Y_t for all t ∈ [T] }.
Remarks
◮ Let C = { h ∈ H : h(X_t) = Y_t for all t ∈ [T] }. Then U(H, Z_T) = R(C).
◮ Ideally, the probability mass of the uncertainty region shrinks quickly as training samples accumulate (it is always non-increasing, since adding samples only shrinks C).
◮ If we can control the sampling procedure over X_t, it is better to sample only from U(H, Z_t) (selective sampling, or approximate selective sampling).
◮ In the realizable case, the labels inferred for points X_t we do not query are guaranteed correct; we only need to query X_{t+1} if X_{t+1} ∈ U(H, Z_t).
◮ The complexity of finding a good set Ĥ with h* ∈ Ĥ ⊆ H can intuitively be measured by the ratio of Pr(X ∈ R(Ĥ)) to sup_{h ∈ Ĥ} err(h); this motivates the disagreement coefficient defined below.
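To make these definitions concrete, here is a minimal runnable sketch (my own illustration, not from the slides) for 1-D threshold classifiers h_c(x) = +1 if x ≥ c else −1. For this class the consistent set C is an interval of thresholds, and U(H, Z_T) = R(C) is exactly the gap between the rightmost negative example and the leftmost positive example. The helper names are my choices.

```python
# Sketch for 1-D thresholds h_c(x) = +1 if x >= c else -1 (realizable).
# Consistency requires c > every negative example and c <= every positive
# example, so C corresponds to thresholds in the interval (lo, hi].

def version_space_interval(samples):
    """Return (lo, hi): thresholds c in (lo, hi] are consistent with samples."""
    lo = max((x for x, y in samples if y == -1), default=float("-inf"))
    hi = min((x for x, y in samples if y == +1), default=float("inf"))
    return lo, hi

def in_uncertainty_region(x, samples):
    """x is in U(H, Z_T) = R(C) iff two consistent thresholds disagree at x."""
    lo, hi = version_space_interval(samples)
    return lo < x < hi  # some consistent c <= x and some consistent c > x

Z = [(0.2, -1), (0.9, +1), (0.1, -1)]
print(version_space_interval(Z))      # (0.2, 0.9)
print(in_uncertainty_region(0.5, Z))  # True: consistent hypotheses disagree
print(in_uncertainty_region(0.05, Z)) # False: every consistent h says -1
```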
Concepts

Definition
We redefine the region of disagreement: R(h, r), of radius r around a hypothesis h ∈ H in the disagreement metric space (H, ρ), is

    R(h, r) := { x ∈ X : ∃ h' ∈ B(h, r) such that h(x) ≠ h'(x) },

where the disagreement (pseudo)metric ρ on H is defined by ρ(h, h') := Pr(h(X) ≠ h'(X)) and B(h, r) := { h' ∈ H : ρ(h, h') ≤ r }. Hence, in the realizable case, err(h) = ρ(h, h*).

Remark: We have R(h*, r) ⊆ R(B(h*, r)), but the reverse inclusion may fail in general.
Concepts

Definition
The disagreement coefficient θ(h, H, D) with respect to a hypothesis h ∈ H in the disagreement metric space (H, ρ) is

    θ(h, H, D) := sup_{r > 0} Pr(X ∈ R(h, r)) / r.

Examples:
◮ X is uniform on [0, 1] and H = { h = I{X ≥ r} : r > 0 }. Then θ(h, H, D) = 2 for all h ∈ H.
◮ Replace H by H = { h = I{X ∈ [a, b]} : 0 < a < b < 1 }. Then θ(h, H, D) = max(4, 1/Pr(h(X) = 1)) for all h ∈ H.
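As a numerical sanity check on the first example, here is a hedged sketch (mine, not from the slides) that estimates θ by Monte Carlo. It uses the fact that for thresholds under uniform X, ρ(h_c, h_{c'}) = |c − c'|, so R(h_c, r) is (up to endpoints) the set { x : |x − c| < r }; the function names and the grid of radii are assumptions of this illustration.

```python
import numpy as np

def mass_of_disagreement_region(c, r, n=200_000):
    """Monte Carlo estimate of Pr(X in R(h_c, r)) for X ~ Uniform[0, 1]."""
    x = np.random.default_rng(0).uniform(0.0, 1.0, size=n)
    # x lies in R(h_c, r) iff some threshold within rho-distance r of c
    # labels x differently from h_c, i.e. iff |x - c| < r.
    return np.mean(np.abs(x - c) < r)

def disagreement_coefficient(c, radii):
    """Approximate sup_{r > 0} Pr(X in R(h_c, r)) / r over a finite grid."""
    return max(mass_of_disagreement_region(c, r) / r for r in radii)

radii = np.linspace(0.01, 0.4, 40)
print(disagreement_coefficient(0.5, radii))  # ~ 2, matching theta = 2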
Examples

Proposition
Let P_X be the uniform distribution on the unit sphere S^{d−1} := { x ∈ R^d : ‖x‖_2 = 1 } ⊂ R^d, and let H be the class of homogeneous linear threshold functions in R^d, i.e.,

    H = { h_w : h_w(x) = sign(⟨w, x⟩), w ∈ S^{d−1} }.

Then there is an absolute constant C > 0 such that θ(h, H, P_X) ≤ C·√d.
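A hedged Monte Carlo sketch (mine, not from the slides) of the √d scaling. It relies on a geometric identity I am assuming here: for this class ρ(h_w, h_{w'}) = angle(w, w')/π, so x ∈ R(h_w, r) iff the angular distance from x to the boundary {⟨w, x⟩ = 0} is at most πr, i.e. iff |⟨w, x⟩| ≤ sin(πr) (valid for πr ≤ π/2).

```python
import numpy as np

def theta_estimate(d, radii, n=200_000, seed=0):
    """Approximate theta(h_w, H, P_X) for uniform X on the sphere S^{d-1}."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)  # uniform on S^{d-1}
    w = np.zeros(d); w[0] = 1.0                    # by symmetry, any w works
    margins = np.abs(x @ w)
    # Pr(X in R(h_w, r)) = Pr(|<w, X>| <= sin(pi r)); take sup of mass/r.
    return max(np.mean(margins <= np.sin(np.pi * r)) / r for r in radii)

for d in (4, 16, 64):
    est = theta_estimate(d, radii=np.linspace(0.01, 0.25, 25))
    print(d, est, est / np.sqrt(d))  # est grows like sqrt(d): ratio ~ constant
```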
Algorithm (CAL)
◮ Initialize: Z_0 := ∅, V_0 := H.
◮ For t = 1, 2, …, n:
  ◮ Obtain unlabeled data point X_t.
  ◮ If X_t ∈ R(V_{t−1}):
    (a) Then: query Y_t, and set Z_t := Z_{t−1} ∪ {(X_t, Y_t)}.
    (b) Else: set Ỹ_t := h(X_t) for any h ∈ V_{t−1}, and set Z_t := Z_{t−1} ∪ {(X_t, Ỹ_t)}; OR simply set Z_t := Z_{t−1}.
  ◮ Set V_t := { h ∈ H : h(X_i) = Y_i for all (X_i, Y_i) ∈ Z_t }.
◮ Return: any h ∈ V_n.
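A runnable instantiation (my choice of class, not from the slides) of CAL for 1-D thresholds with true threshold c* (realizable). The version space is tracked exactly as an interval of consistent thresholds, and the second option of step (b) is used: unqueried points leave the version space unchanged.

```python
import numpy as np

def cal_thresholds(xs, c_star):
    """CAL for h_c(x) = +1 if x >= c else -1; V_t is the interval (lo, hi]."""
    lo, hi = 0.0, 1.0          # V_0: all thresholds in (0, 1]
    n_queries = 0
    for x in xs:
        if lo < x < hi:        # x in R(V_{t-1}): consistent hypotheses disagree
            y = 1 if x >= c_star else -1   # query the oracle
            n_queries += 1
            if y == 1:
                hi = min(hi, x)  # consistent thresholds must satisfy c <= x
            else:
                lo = max(lo, x)  # consistent thresholds must satisfy c > x
        # else: the label is determined; version space is unchanged
    return (lo + hi) / 2, n_queries      # return any h in V_n

xs = np.random.default_rng(0).uniform(0, 1, size=10_000)
c_hat, q = cal_thresholds(xs, c_star=0.37)
print(c_hat, q)  # error shrinks like O(1/n) using only O(log n) label queries
```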
Algorithm (Reduction-based CAL)
◮ Initialize: Z_0 := ∅.
◮ For t = 1, 2, …, n:
  ◮ Obtain unlabeled data point X_t.
  ◮ If there exist both:
    • h_+ ∈ H consistent with Z_{t−1} ∪ {(X_t, +1)}, and
    • h_− ∈ H consistent with Z_{t−1} ∪ {(X_t, −1)}:
    (a) Then: query Y_t, and set Z_t := Z_{t−1} ∪ {(X_t, Y_t)}.
    (b) Else: only h_y exists for some y ∈ {±1}; set Ỹ_t := y and Z_t := Z_{t−1} ∪ {(X_t, Ỹ_t)}.
◮ Return: any h ∈ H consistent with Z_n.

Remark: Reduction-based CAL is equivalent to CAL; it replaces the explicit version space V_t with two consistency checks per round.
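A hedged sketch (the oracle interface is my assumption; the slides do not specify one) showing the reduction: each round makes two calls to a consistency oracle "does some h ∈ H fit this labeled set?", instantiated below for the same 1-D threshold class.

```python
def reduction_based_cal(xs, query_label, consistent):
    """consistent(Z) -> bool: is some h in H consistent with every pair in Z?"""
    Z = []
    for x in xs:
        can_pos = consistent(Z + [(x, +1)])
        can_neg = consistent(Z + [(x, -1)])
        if can_pos and can_neg:          # x is in the disagreement region
            Z.append((x, query_label(x)))
        else:                            # the label is forced: infer it for free
            Z.append((x, +1 if can_pos else -1))
    return Z

def threshold_consistent(Z):
    """Oracle for h_c(x) = +1 if x >= c else -1: C is nonempty iff lo < hi."""
    lo = max((x for x, y in Z if y == -1), default=float("-inf"))
    hi = min((x for x, y in Z if y == +1), default=float("inf"))
    return lo < hi

c_star = 0.37
Z = reduction_based_cal([0.9, 0.1, 0.5, 0.3, 0.4],
                        lambda x: 1 if x >= c_star else -1,
                        threshold_consistent)
print(Z)  # every pair carries the correct label (realizable case)
```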
Label Complexity Analysis

Theorem
The expected number of labels queried by reduction-based CAL after n iterations is at most

    O( θ(h*, H, D) · d · log² n ),

where d is the VC dimension of the class H. Moreover, for any ε > 0 and δ > 0, if

    n = O( (1/ε)(d log(1/ε) + log(1/δ)) ),

then with probability 1 − δ the hypothesis ĥ returned by reduction-based CAL satisfies err(ĥ) ≤ ε.
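To get a feel for the gap between passive and active label costs, here is a rough plug-in with illustrative numbers of my choosing (constants suppressed): d = 10, ε = 0.01, δ = 0.05, and θ(h*, H, D) = O(1).

```latex
\[
  n = O\!\Big(\tfrac{1}{\epsilon}\big(d\log\tfrac1\epsilon + \log\tfrac1\delta\big)\Big)
    \approx 100 \cdot (10 \cdot 4.6 + 3) \approx 5000
  \quad\text{unlabeled points},
\]
\[
  \mathbb{E}[\#\text{queries}] = O\big(\theta\, d \log^2 n\big)
    \approx 10 \cdot (\log 5000)^2 \approx 730,
\]
% versus roughly all 5000 labels in the passive setting: the label cost
% drops from O(1/\epsilon) to O(\mathrm{polylog}(1/\epsilon)).
```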
Proof
Note that, with probability 1 − δ_t, any h ∈ H consistent with Z_t has error at most

    err(h) ≤ O( (1/t)(d log t + log(1/δ_t)) ) =: r_t,

where δ_t > 0 will be chosen later. (This is the uniform convergence bound specialized to the realizable case, i.e., the case P_n f_n = 0, P f = 0.) Inverting this bound also gives the sample size n = O((1/ε)(d log(1/ε) + log(1/δ))) claimed in the theorem.

Let G_t be the event that the bound above holds. Then, conditioned on G_t, we have

    { h ∈ H : h is consistent with Z_t } ⊆ B(h*, r_t).
Proof
Note that we query Y_{t+1} if and only if some h ∈ H is consistent with Z_t ∪ {(X_{t+1}, −h*(X_{t+1}))} (i.e., some consistent h disagrees with h* at X_{t+1}). Hence, conditioned on G_t, if we query Y_{t+1} then X_{t+1} ∈ R(h*, r_t). Therefore,

    Pr(Y_{t+1} is queried | G_t) ≤ Pr(X_{t+1} ∈ R(h*, r_t) | G_t).
Proof
Let Q_t := I{Y_t is queried}. The expected total number of queries is

    Σ_{t=1}^{n} E[Q_t]
        ≤ 1 + Σ_{t=1}^{n−1} Pr(Q_{t+1} = 1)
        = 1 + Σ_{t=1}^{n−1} [ Pr(Q_{t+1} = 1 | G_t) Pr(G_t) + Pr(Q_{t+1} = 1 | ¬G_t)(1 − Pr(G_t)) ]
        ≤ 1 + Σ_{t=1}^{n−1} [ Pr(Q_{t+1} = 1 | G_t) Pr(G_t) + δ_t ]
        ≤ 1 + Σ_{t=1}^{n−1} [ Pr(X_{t+1} ∈ R(h*, r_t) | G_t) Pr(G_t) + δ_t ],

using Pr(¬G_t) ≤ δ_t in the third line.
Proof
By the definition of the disagreement coefficient, we have

    Pr(X_{t+1} ∈ R(h*, r_t) | G_t) Pr(G_t) ≤ Pr(X_{t+1} ∈ R(h*, r_t)) ≤ r_t · θ(h*, H, D).

Hence,

    Σ_{t=1}^{n} E[Q_t] ≤ 1 + Σ_{t=1}^{n−1} [ r_t · θ(h*, H, D) + δ_t ]
                       = O( Σ_{t=1}^{n−1} [ (θ(h*, H, D)/t)(d log t + log(1/δ_t)) + δ_t ] ).

Choosing δ_t = 1/t, we obtain

    Σ_{t=1}^{n} E[Q_t] ≤ O( θ(h*, H, D) · d · log² n ).
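The last step hides an elementary estimate; for completeness, here is a sketch (mine) of why the choice δ_t = 1/t yields the log² n bound:

```latex
\[
  \sum_{t=1}^{n-1} \left( \frac{d\log t + \log t}{t} + \frac{1}{t} \right)
  \;\le\; (d+1)\sum_{t=1}^{n-1} \frac{\log t}{t} + \sum_{t=1}^{n-1}\frac{1}{t}
  \;\le\; (d+1)\,\frac{\log^2 n}{2} + O(\log n)
  \;=\; O\!\left(d \log^2 n\right),
\]
% using \sum_{t \le n} (\log t)/t \le \int_1^n (\log s)/s \, ds + O(1)
%     = \tfrac{1}{2}\log^2 n + O(1).
```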