ECE 6254 - Spring 2020 - Lecture 4
v1.3 - revised April 7, 2020

PAC Learnability and Bayes Classifier
Matthieu R. Bloch

1 PAC learnability

The last question to answer is how $R(h^*)$, the true risk of the hypothesis we pick with empirical risk minimization, compares to $R(h^\sharp)$, the true risk of the best hypothesis in the class. Upon inspection of how we derived the sample complexity with Hoeffding's inequality, note that we actually proved something much stronger than what we needed. We actually proved that the sample complexity ensures that
\[
\mathbb{P}_{\{(x_i,y_i)\}}\left( \forall h_j \in \mathcal{H} \quad \left| \widehat{R}_N(h_j) - R(h_j) \right| \leq \epsilon \right) \geq 1 - \delta. \tag{1}
\]
In that case, the following holds.

Lemma 1.1. If $\forall h_j \in \mathcal{H}$ we have $| \widehat{R}_N(h_j) - R(h_j) | \leq \epsilon$, then $| R(h^*) - R(h^\sharp) | \leq 2\epsilon$.

Proof. Note that
\[
| R(h^*) - R(h^\sharp) | = | R(h^*) - \widehat{R}_N(h^*) + \widehat{R}_N(h^*) - R(h^\sharp) | \tag{2}
\]
\[
\leq | R(h^*) - \widehat{R}_N(h^*) | + | \widehat{R}_N(h^*) - R(h^\sharp) |. \tag{3}
\]
By assumption, we have $| R(h^*) - \widehat{R}_N(h^*) | \leq \epsilon$ since $h^* \in \mathcal{H}$. In addition, by definition of $h^\sharp$ as the minimizer of the true risk,
\[
R(h^\sharp) \leq R(h^*) \leq \widehat{R}_N(h^*) + \epsilon. \tag{4}
\]
By definition of $h^*$ as the minimizer of the empirical risk, we also have
\[
\widehat{R}_N(h^*) \leq \widehat{R}_N(h^\sharp) \leq R(h^\sharp) + \epsilon, \tag{5}
\]
so that
\[
| \widehat{R}_N(h^*) - R(h^\sharp) | \leq \epsilon. \tag{6}
\]
Combining this bound and the assumption with (3) gives $| R(h^*) - R(h^\sharp) | \leq 2\epsilon$. ■
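To make the lemma concrete, here is a minimal simulation sketch (an assumed illustration, not from the notes; the data model and the finite class of threshold classifiers are hypothetical choices): we estimate empirical and true risks over a small class and check that the gap $|R(h^*) - R(h^\sharp)|$ stays within twice the largest deviation over the class.

```python
# A minimal simulation sketch of Lemma 1.1 (assumed illustration): for a small
# finite class of 1-D threshold classifiers, the gap |R(h*) - R(h#)| stays
# within twice the largest deviation between empirical and true risks.
import numpy as np

rng = np.random.default_rng(0)
N = 500                                  # training set size
thresholds = np.linspace(-2.0, 2.0, 21)  # finite class H of rules 1{x > t}

# Assumed data model: X ~ N(0,1), Y = 1{X > 0} flipped with probability 0.1.
def sample(n):
    X = rng.standard_normal(n)
    Y = (X > 0).astype(int) ^ (rng.random(n) < 0.1)
    return X, Y

def risks(X, Y):
    # Misclassification rate of each h_t(x) = 1{x > t} on the given sample.
    return np.array([np.mean((X > t).astype(int) != Y) for t in thresholds])

X, Y = sample(N)
emp = risks(X, Y)              # empirical risks R_N(h_t)
tru = risks(*sample(500_000))  # true risks R(h_t), approximated on a large sample

eps = np.max(np.abs(emp - tru))                       # largest deviation over H
gap = abs(tru[np.argmin(emp)] - tru[np.argmin(tru)])  # |R(h*) - R(h#)|
print(f"eps = {eps:.4f}, |R(h*) - R(h#)| = {gap:.4f} <= 2*eps = {2*eps:.4f}")
```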
In learning theory, these ideas are formalized in terms of probably approximately correct (PAC) learnability as follows.

Definition 1.2. A hypothesis set $\mathcal{H}$ is PAC learnable if there exist a function $N_{\mathcal{H}} : ]0;1[^2 \to \mathbb{N}$ and a learning algorithm such that:
• for every $\epsilon, \delta \in ]0;1[$,
• for every $P_x$, $P_{y|x}$,
• when running the algorithm on at least $N_{\mathcal{H}}(\epsilon,\delta)$ i.i.d. examples, the algorithm returns a hypothesis $h \in \mathcal{H}$ such that
\[
\mathbb{P}_{xy}\left( \left| R(h) - R(h^\sharp) \right| \leq \epsilon \right) \geq 1 - \delta.
\]

The function $N_{\mathcal{H}}(\epsilon,\delta)$ is the sample complexity. Note that the definition of sample complexity here is slightly different from what we used earlier: it is defined with respect to the true risk of $h^\sharp$, while we previously only worried about the true risk of $h^*$. The name probably approximately correct comes from the bound $\mathbb{P}_{xy}( | R(h) - R(h^\sharp) | \leq \epsilon ) \geq 1 - \delta$. In words, it says that with probability at least $1-\delta$ (probably), the true risk incurred by $h$ is no more than $\epsilon$ away from the best true risk (approximately correct). Note that the definition of PAC learnability is quite stringent because it requires the bound to hold irrespective of what $P_x$ and $P_{y|x}$ really are. All we should assume is that they exist.

Perhaps surprisingly, if you trace back everything we proved so far (check for yourself!), we have effectively already proved the following result.

Proposition 1.3. A finite hypothesis set $\mathcal{H}$ is PAC learnable with the Empirical Risk Minimization algorithm and with sample complexity
\[
N_{\mathcal{H}}(\epsilon,\delta) = \left\lceil \frac{2 \ln(2|\mathcal{H}|/\delta)}{\epsilon^2} \right\rceil.
\]
Although the caveats regarding the fact that we require $|\mathcal{H}| < \infty$ still apply, it should be comforting that we can make such a fundamental statement about learning.

Remark 1.4. You might note that the sample complexity seems off by a factor of two compared to what we derived earlier. This is because the sample complexity as per Definition 1.2 requires the true risks of $h^*$ and $h^\sharp$ to be close, instead of requiring the empirical risk of $h^*$ to be close to the true risk of $h^*$. Proving the result of Proposition 1.3 requires you to use Lemma 1.1.
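As a quick numerical illustration (an assumed example, not part of the notes), the bound of Proposition 1.3 is easy to evaluate; the sketch below shows how mildly the required sample size grows with $|\mathcal{H}|$.

```python
# A small numerical illustration (assumed example) of the sample complexity
# in Proposition 1.3, showing the logarithmic growth in |H|.
import math

def sample_complexity(H_size: int, eps: float, delta: float) -> int:
    """N_H(eps, delta) = ceil(2 ln(2|H|/delta) / eps^2)."""
    return math.ceil(2 * math.log(2 * H_size / delta) / eps**2)

for H_size in [10, 100, 10_000, 1_000_000]:
    print(f"|H| = {H_size:>9}: N_H = {sample_complexity(H_size, 0.05, 0.01)}")
# Each tenfold increase in |H| adds only about 2 ln(10)/eps^2 ~ 1842 samples here.
```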
Note, however, that this does not address the question of ensuring that the risk of the best hypothesis $h^* = \operatorname{argmin}_{h \in \mathcal{H}} \widehat{R}_N(h)$ we find is actually small. To have a small risk, we must ensure that the hypothesis class $\mathcal{H}$ is somehow "rich enough" to have a good chance of approximating the unknown function $f$ well. With our current analysis, the size $|\mathcal{H}|$ of the class is the proxy for the richness of the class, and although the dependence of the sample complexity on $|\mathcal{H}|$ is only logarithmic, we need many samples if the class size grows large.

In practice, the size of the dataset $N$ is fixed, and three phenomena occur as we increase the richness of the class $\mathcal{H}$. Recall that $h^* \triangleq \operatorname{argmin}_{h \in \mathcal{H}} \widehat{R}_N(h)$ and $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$.

1. The empirical risk of $h^*$ decreases;
2. The true risk of $h^\sharp$ decreases;
3. The true risk of $h^*$ decreases before it increases again (the curve has a U-shape).

In our simple learning model, the last phenomenon happens because, as we increase the size of the class $|\mathcal{H}|$ for a fixed dataset size $N$, it becomes increasingly likely that there are hypotheses whose empirical risk is very different from their true risk. This behavior is representative of most if not all learning problems, and is summarized in Fig. 1.

[Figure 1: Evolution of the risks $R(h^\sharp)$, $R(h^*)$, and $\widehat{R}_N(h^*)$ as the richness $|\mathcal{H}|$ of the class increases.]

One should also realize that it may not be possible to ever achieve zero-risk learning. In fact, our general learning model accounts for the presence of noise through $P_{y|x}$. This naturally prompts the question of what is the smallest risk $R(h^\sharp)$ that one can achieve and how to achieve it.

2 Bayes classifier

For ease of notation, let us revisit our learning model with a slight change of notation to clearly indicate the random variables. Our supervised learning problem consists of the following:

1. A dataset $\mathcal{D} \triangleq \{(X_1,Y_1),\cdots,(X_N,Y_N)\}$, where
   • $\{X_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_X$ on $\mathcal{X}$;
   • $\{Y_i\}_{i=1}^N$ take values in $\mathcal{Y} = \{0, 1, \cdots, K-1\}$.
2. An a priori unknown labeling probability $P_{Y|X}$.
3. A binary loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+ : (y_1,y_2) \mapsto \mathbb{1}\{y_1 \neq y_2\}$.

Since our goal is to characterize the minimum true risk, we need not specify a class of hypotheses $\mathcal{H}$ at this point. Note that the (true) risk of a classifier $h$ is
\[
R(h) \triangleq \mathbb{E}_{XY}\left( \mathbb{1}\{h(X) \neq Y\} \right) = \mathbb{P}_{XY}(h(X) \neq Y). \tag{7}
\]
To estimate the smallest risk that we can ever hope to achieve, we assume for now that we know $P_X$ and $P_{Y|X}$. This is not a realistic assumption, since the whole point of learning is to figure out what $P_{Y|X}$ is, and $P_X$ might never be learned at all; however, the risk of any realistic classifier can certainly be no less than the risk of the best classifier that knows $P_X$ and $P_{Y|X}$, which can therefore serve as the ultimate benchmark of performance. For notational convenience, we introduce the following:
• the a priori class probabilities, denoted $\pi_k \triangleq P_Y(k)$;
• the a posteriori class probabilities, denoted $\eta_k(x) \triangleq P_{Y|X}(k|x)$ for all $x \in \mathcal{X}$.
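To make this notation concrete, here is a hypothetical sketch (the two-class Gaussian model and all its parameters are assumptions for illustration, not from the notes) computing the a posteriori probabilities $\eta_k(x)$ from $\pi_k$ and the class-conditional densities via Bayes' rule.

```python
# A hypothetical sketch (assumed two-class Gaussian model): computing the
# a posteriori class probabilities eta_k(x) from the a priori probabilities
# pi_k and the class-conditional densities p_{X|Y}(x|k).
import numpy as np
from scipy.stats import norm

pi = np.array([0.3, 0.7])    # a priori class probabilities pi_k (assumed)
mu = np.array([-1.0, 1.0])   # assumed model: X | Y=k ~ N(mu_k, 1)

def eta(x):
    """A posteriori probabilities eta_k(x) = pi_k p_{X|Y}(x|k) / p_X(x)."""
    lik = np.array([norm.pdf(x, loc=m, scale=1.0) for m in mu])  # p_{X|Y}(x|k)
    joint = pi[:, None] * lik          # pi_k p_{X|Y}(x|k)
    return joint / joint.sum(axis=0)   # normalize by p_X(x)

for x in [-2.0, 0.0, 2.0]:
    print(f"x = {x:+.1f}: eta = {eta(np.array([x]))[:, 0].round(3)}")
```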
Lemma 2.1. The classifier $h_B(x) \triangleq \operatorname{argmax}_{k \in [0;K-1]} \eta_k(x)$ is optimal, i.e., for any classifier $h$, we have $R(h_B) \leq R(h)$. In addition,
\[
R(h_B) = \mathbb{E}_X\left( 1 - \max_k \eta_k(X) \right).
\]

Proof. For a classifier $h$ and for each $0 \leq k \leq K-1$, let us define the corresponding decision region $\Gamma_k(h) \triangleq \{x : h(x) = k\}$. Then note that
\[
1 - R(h) = \mathbb{P}(h(X) = Y) = \sum_{k=0}^{K-1} \pi_k \mathbb{P}(h(X) = k \,|\, Y = k) = \sum_{k=0}^{K-1} \pi_k \int_{\Gamma_k(h)} p_{X|Y}(x|k)\, dx. \tag{8}
\]
To minimize the risk, we should maximize (8). The expression is maximum when the regions $\Gamma_k(h)$ are such that $\pi_k p_{X|Y}(x|k)$ takes the maximum possible value (over the $K$ possibilities) in the region $\Gamma_k(h)$. Said differently, the region $\Gamma_k(h)$ must be defined as
\[
\Gamma_k(h) = \left\{ x \in \mathcal{X} : \forall \ell \in \llbracket 0, K-1 \rrbracket \quad \pi_\ell p_{X|Y}(x|\ell) \leq \pi_k p_{X|Y}(x|k) \right\}. \tag{9}
\]
The case of equality can be broken arbitrarily. The classifier leading to these decision regions is therefore
\[
h_B(x) = \operatorname{argmax}_k \pi_k p_{X|Y}(x|k) = \operatorname{argmax}_k \eta_k(x) p_X(x) = \operatorname{argmax}_k \eta_k(x), \tag{10}
\]
where the second equality follows from Bayes' rule, $\pi_k p_{X|Y}(x|k) = \eta_k(x) p_X(x)$, and the last one holds because $p_X(x)$ does not depend on $k$. The risk associated with $h_B$ is then
\[
R_B = \mathbb{E}_{XY}\left( \mathbb{1}\{h_B(X) \neq Y\} \right) = 1 - \mathbb{E}_{XY}\left( \mathbb{1}\{h_B(X) = Y\} \right) \tag{11}
\]
\[
= 1 - \mathbb{E}_{XY}\left( \mathbb{1}\{Y = \operatorname{argmax}_k \eta_k(X)\} \right) \tag{12}
\]
\[
= 1 - \mathbb{E}_X\left( \max_k \eta_k(X) \right). \tag{13}
\]
In the last step, we have used that
\[
\mathbb{E}_{XY}\left( \mathbb{1}\{Y = \operatorname{argmax}_k \eta_k(X)\} \right) = \mathbb{E}_X\left( \sum_y P_{Y|X}(y|X)\, \mathbb{1}\{y = \operatorname{argmax}_k \eta_k(X)\} \right) = \mathbb{E}_X\left( P_{Y|X}\left( \operatorname{argmax}_k \eta_k(X) \,\middle|\, X \right) \right) = \mathbb{E}_X\left( \max_k P_{Y|X}(k|X) \right),
\]
which is $\mathbb{E}_X( \max_k \eta_k(X) )$ by definition of $\eta_k$. ■

Note that we are implicitly assuming that ties have been broken with some arbitrary but fixed choice when defining the argmax.

The classifier $h_B$ is called the Bayes classifier and $R_B \triangleq R(h_B)$ is called the Bayes risk.
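As a sanity check (an assumed Monte Carlo illustration, not from the notes; the two-class Gaussian model is hypothetical), the sketch below verifies numerically that $\operatorname{argmax}_k \pi_k p_{X|Y}(x|k)$ and $\operatorname{argmax}_k \eta_k(x)$ define the same classifier, as in (10), and that the error rate of $h_B$ matches $\mathbb{E}_X( 1 - \max_k \eta_k(X) )$, as in (13).

```python
# A Monte Carlo sanity check (assumed illustration) of two facts from the
# proof: (i) argmax_k pi_k p_{X|Y}(x|k) and argmax_k eta_k(x) define the same
# classifier, as in (10); (ii) the error rate of h_B matches
# E_X[1 - max_k eta_k(X)], as in (13).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
pi = np.array([0.3, 0.7])        # a priori class probabilities (assumed)
mu = np.array([-1.0, 1.0])       # assumed model: X | Y=k ~ N(mu_k, 1)
M = 500_000

Y = rng.choice(2, size=M, p=pi)            # Y ~ pi
X = rng.normal(loc=mu[Y], scale=1.0)       # X | Y=k ~ N(mu_k, 1)

lik = np.stack([norm.pdf(X, loc=m, scale=1.0) for m in mu])  # p_{X|Y}(x|k)
scores = pi[:, None] * lik                 # pi_k p_{X|Y}(x|k)
eta = scores / scores.sum(axis=0)          # eta_k(x), by Bayes' rule

# (i) The two argmax rules agree everywhere, as in (10).
assert np.array_equal(scores.argmax(axis=0), eta.argmax(axis=0))

# (ii) Empirical risk of h_B vs. the closed form (13).
err = np.mean(eta.argmax(axis=0) != Y)
rhs = np.mean(1.0 - eta.max(axis=0))
print(f"error rate of h_B: {err:.4f}   E[1 - max_k eta_k]: {rhs:.4f}")
```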