ECE 6254 - Spring 2020 - Lecture 3
v1.2 - revised March 21, 2020

Learning may work

Matthieu R. Bloch

Now that we have introduced a complete model for supervised learning, our objective is to show that some of the questions raised earlier have a chance of being answered. We proceed by analyzing a simplified model, which still captures the essence of the problem but is more easily amenable to analysis. We will talk about the more general setting later in the semester. We consider the supervised learning model that consists of the following.

1. A dataset $\mathcal{D} \triangleq \{(\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)\}$ with
   - $\{\mathbf{x}_i\}_{i=1}^N$ drawn i.i.d. from an unknown probability distribution $P_{\mathbf{x}}$ on $\mathcal{X}$;
   - $\{y_i\}_{i=1}^N$ with $\mathcal{Y} = \{0, 1\}$ (binary classification).
2. An unknown labeling function $f : \mathcal{X} \to \mathcal{Y}$, no noise.
3. A finite set of hypotheses $\mathcal{H}$ with $|\mathcal{H}| = M < \infty$, denoted $\mathcal{H} \triangleq \{h_i\}_{i=1}^M$.
4. A binary loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+ : (y_1, y_2) \mapsto \mathbf{1}\{y_1 \neq y_2\}$.

Note that we do not specify a specific algorithm yet, as we will be focusing on a more abstract learning operation.

For this model and any hypothesis $h \in \mathcal{H}$, the true risk simplifies as
$$R(h) \triangleq \mathbb{E}_{\mathbf{x}y}\left(\mathbf{1}\{h(\mathbf{x}) \neq y\}\right) = \sum_{\mathbf{x}} \sum_{y} p_{\mathbf{x},y}(\mathbf{x}, y) \mathbf{1}\{h(\mathbf{x}) \neq y\} = P_{\mathbf{x}y}(h(\mathbf{x}) \neq y), \qquad (1)$$
and the empirical risk becomes
$$\widehat{R}_N(h) = \frac{1}{N} \sum_{i=1}^N \mathbf{1}\{h(\mathbf{x}_i) \neq y_i\}. \qquad (2)$$

We will discuss this in more detail later, but it is very natural for learning algorithms to attempt to minimize the empirical risk and look for a hypothesis $h^*$ that ensures a minimal risk
$$h^* = \operatorname*{argmin}_{h \in \mathcal{H}} \widehat{R}_N(h). \qquad (3)$$

1 Sample complexity

The first question we raised was the possibility of generalizing a hypothesis. Mathematically, for a specific hypothesis $h_j \in \mathcal{H}$, this means assessing how $\widehat{R}_N(h_j)$ compares to $R(h_j)$.

Observe that the empirical risk in (2) is a random variable, since it is a function of the dataset, which is itself random. More specifically, since every $\mathbf{x}_i$ is generated independent and identically distributed (i.i.d.), the empirical risk is actually the sample average of $N$ i.i.d. random variables $\mathbf{1}\{h(\mathbf{x}_i) \neq y_i\}$. In addition, observe that
$$\mathbb{E}\left[\widehat{R}_N(h_j)\right] = \frac{1}{N} \sum_{i=1}^N \mathbb{E}\left(\mathbf{1}\{h_j(\mathbf{x}_i) \neq y_i\}\right) = \frac{1}{N} \sum_{i=1}^N P_{\mathbf{x},y}(h_j(\mathbf{x}) \neq y) = R(h_j), \qquad (4)$$
so the empirical risk is an unbiased estimate of the true risk.
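To make the setup concrete, here is a minimal numerical sketch of the model and of empirical risk minimization (3). The specific choices below — a uniform $P_{\mathbf{x}}$ on $[0,1]$, a threshold target $f$, and a grid of threshold classifiers as the finite class $\mathcal{H}$ — are illustrative assumptions, not part of the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the model (all specific choices are illustrative):
# X = [0, 1] with uniform P_x, noiseless target f(x) = 1{x >= 0.5}, and a
# finite class H of threshold classifiers h_t(x) = 1{x >= t}.
N = 200
x = rng.uniform(0.0, 1.0, size=N)        # {x_i} drawn i.i.d. from P_x
f = lambda u: (u >= 0.5).astype(int)     # unknown labeling function, no noise
y = f(x)                                 # {y_i}, binary labels

thresholds = np.linspace(0.0, 1.0, 21)   # H = {h_1, ..., h_M} with M = 21

def empirical_risk(t, x, y):
    """Empirical risk (2): fraction of samples misclassified by h_t."""
    return np.mean((x >= t).astype(int) != y)

risks = np.array([empirical_risk(t, x, y) for t in thresholds])
t_star = thresholds[np.argmin(risks)]    # ERM hypothesis h* from (3)
print(f"ERM threshold: {t_star:.2f}, empirical risk: {risks.min():.3f}")
```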
Therefore, the quantity $P\left(\left|\widehat{R}_N(h_j) - R(h_j)\right| > \epsilon\right)$ is the probability that the sample average of i.i.d. random variables differs from their mean by more than $\epsilon$. Such bounds are extremely common in applied probability and are known as concentration inequalities. We will now review some of the fundamental ideas behind these bounds.

The starting point of most, if not all, concentration inequalities is Markov's lemma.

Lemma 1.1. Let $X$ be a non-negative real-valued random variable. Then for all $t > 0$,
$$P(X \geqslant t) \leqslant \frac{\mathbb{E}(X)}{t}. \qquad (5)$$

Proof. For $t > 0$, let $\mathbf{1}\{X \geqslant t\}$ be the indicator function of the event $\{X \geqslant t\}$. Then,
$$\mathbb{E}[X] \geqslant \mathbb{E}[X \mathbf{1}\{X \geqslant t\}] \geqslant t P[X \geqslant t], \qquad (6)$$
where the first inequality follows because the indicator function is $\{0,1\}$-valued and $X$ is non-negative; the second because $X \geqslant t$ whenever $\mathbf{1}\{X \geqslant t\} = 1$ and $0$ else.

That was a clean and fast proof, but you may be more comfortable going back to the definition of $\mathbb{E}(X)$ to prove the result. Note that
$$\mathbb{E}(X) = \int_0^\infty x p_X(x) \, dx = \underbrace{\int_0^t x p_X(x) \, dx}_{\geqslant 0} + \int_t^\infty x p_X(x) \, dx \stackrel{(a)}{\geqslant} t \int_t^\infty p_X(x) \, dx = t P(X \geqslant t), \qquad (7)$$
where $(a)$ follows from the fact that $x \geqslant t$ in the second integral. Note that the non-negative nature of $X$ is crucial to lower bound the first integral. ■

By choosing $t = \epsilon \mathbb{E}(X)$ for $\epsilon > 0$ in (5), we obtain
$$P(X \geqslant \epsilon \mathbb{E}(X)) \leqslant \frac{1}{\epsilon}, \qquad (8)$$
which is consistent with the intuition that it is unlikely that a random variable takes a value very far away from its mean.

In spite of its relative simplicity, Markov's inequality is a powerful tool because it can be "boosted." For $X \in \mathcal{X} \subset \mathbb{R}$, consider $\phi : \mathcal{X} \to \mathbb{R}^+$ non-decreasing on $\mathcal{X}$ such that $\mathbb{E}(|\phi(X)|) < \infty$. Then,
$$P[X \geqslant t] = \mathbb{E}[\mathbf{1}\{X \geqslant t\}] = \mathbb{E}[\mathbf{1}\{X \geqslant t\} \mathbf{1}\{\phi(X) \geqslant \phi(t)\}] \leqslant P[\phi(X) \geqslant \phi(t)], \qquad (9)$$
where we have used the monotonicity of $\phi$ and the fact that an indicator function is upper bounded by one. Applying Markov's inequality, we obtain
$$P[X \geqslant t] \leqslant \frac{\mathbb{E}[\phi(X)]}{\phi(t)}, \qquad (10)$$
which is potentially a better bound than (5). Of course, the difficulty is in choosing the appropriate function $\phi$ to make the result meaningful. The most well-known application of this concept leads to Chebyshev's inequality.

Lemma 1.2 (Chebyshev's inequality). Let $X \in \mathbb{R}$. Then,
$$P[|X - \mathbb{E}(X)| \geqslant t] \leqslant \frac{\operatorname{Var}(X)}{t^2}. \qquad (11)$$

Proof. Define $Y \triangleq |X - \mathbb{E}(X)|$ and $\phi : \mathbb{R}^+ \to \mathbb{R}^+ : t \mapsto t^2$. Then, by the boosted Markov's inequality (10), we obtain
$$P[|X - \mathbb{E}(X)| \geqslant t] = P[Y \geqslant t] \leqslant \frac{\mathbb{E}[Y^2]}{t^2} = \frac{\operatorname{Var}(X)}{t^2}. \qquad (12)$$
■
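Both inequalities are easy to sanity-check by Monte Carlo simulation. The sketch below uses an exponential distribution as an arbitrary non-negative test case; any distribution with finite variance would do.

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo sanity check of Markov (5) and Chebyshev (11); the exponential
# distribution is an arbitrary non-negative test case with E(X) = Var(X) = 1.
samples = rng.exponential(scale=1.0, size=1_000_000)
mean, var = samples.mean(), samples.var()

for t in [1.0, 2.0, 4.0]:
    markov_lhs = np.mean(samples >= t)               # P(X >= t)
    markov_rhs = mean / t                            # Markov bound E(X)/t
    cheb_lhs = np.mean(np.abs(samples - mean) >= t)  # P(|X - E(X)| >= t)
    cheb_rhs = var / t**2                            # Chebyshev bound Var(X)/t^2
    print(f"t={t}: Markov {markov_lhs:.4f} <= {markov_rhs:.4f}, "
          f"Chebyshev {cheb_lhs:.4f} <= {cheb_rhs:.4f}")
```

For large $t$, the quadratic choice of $\phi$ indeed pays off: the Chebyshev bound decays as $1/t^2$ while Markov's only decays as $1/t$.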
As an application of Chebyshev's inequality, we derive the weak law of large numbers.

Lemma 1.3 (Weak law of large numbers). Let $X_i \sim p_{X_i}$ be independent with $\mathbb{E}[|X_i|] < \infty$ and $\operatorname{Var}(X_i) < \sigma^2$ for some $\sigma^2 \in \mathbb{R}^+$. Define $Z = \frac{1}{N} \sum_{i=1}^N X_i$ for $N \in \mathbb{N}^*$. Then $Z$ converges in probability to $\frac{1}{N} \sum_{i=1}^N \mathbb{E}(X_i)$.

Proof. First observe that
$$\mathbb{E}[Z] = \frac{1}{N} \sum_{i=1}^N \mathbb{E}[X_i] \qquad (13)$$
and
$$\operatorname{Var}(Z) = \frac{1}{N^2} \sum_{i=1}^N \operatorname{Var}(X_i). \qquad (14)$$
Therefore, by Chebyshev's inequality,
$$P\left[\left|\frac{1}{N}\sum_{i=1}^N X_i - \frac{1}{N}\sum_{i=1}^N \mathbb{E}[X_i]\right| \geqslant \epsilon\right] = P\left[\left|\frac{1}{N}\sum_{i=1}^N X_i - \frac{1}{N}\sum_{i=1}^N \mathbb{E}[X_i]\right|^2 \geqslant \epsilon^2\right] \leqslant \frac{\sum_{i=1}^N \operatorname{Var}(X_i)}{N^2 \epsilon^2} < \frac{\sigma^2}{N\epsilon^2}. \qquad (15)$$
■

The weak law of large numbers is essentially stating that the sample average $\frac{1}{N}\sum_{i=1}^N X_i$ concentrates around its mean. Note, however, that the convergence we proved in (15) is rather slow, on the order of $1/N$.

Let us now go back to our learning problem. Applying (15), we know that
$$\forall \epsilon > 0 \quad P_{\{(\mathbf{x}_i, y_i)\}}\left[\left|\widehat{R}_N(h_j) - R(h_j)\right| \geqslant \epsilon\right] \leqslant \frac{\operatorname{Var}(\mathbf{1}\{h_j(\mathbf{x}_1) \neq y_1\})}{N\epsilon^2} \leqslant \frac{1}{N\epsilon^2}, \qquad (16)$$
where the last inequality comes from the observation that $\operatorname{Var}(\mathbf{1}\{h_j(\mathbf{x}_1) \neq y_1\}) \leqslant 1$ since the indicator function is $\{0,1\}$-valued. Notice that the bound we obtain is universal, in that it does not depend on $P_{\mathbf{x}}$ anymore. This is particularly pleasing because we introduced $P_{\mathbf{x}}$ in a rather arbitrary way.

We can now compute the sample complexity for generalizing $h_j$, defined as the number of samples $N_{\epsilon,\delta}$ required to achieve $\left|\widehat{R}_N(h_j) - R(h_j)\right| \leqslant \epsilon$ with probability at least $1 - \delta$. From (16), note that we obtain
$$N_{\epsilon,\delta} \geqslant \frac{1}{\delta\epsilon^2}. \qquad (17)$$
The sample complexity behavior with $\delta$ and $\epsilon$ is consistent with our intuition: the more precise and confident we want the empirical risk to be, the more samples we need.
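As a closing illustration, the following sketch checks the concentration bound (16) and the sample complexity (17) numerically. The instance — uniform $P_{\mathbf{x}}$ on $[0,1]$, a threshold target, and a fixed hypothesis $h_j$ with true risk $0.1$ — is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical instance to check (16) and (17): uniform P_x on [0, 1],
# target f(x) = 1{x >= 0.5}, fixed hypothesis h_j(x) = 1{x >= 0.4},
# so that R(h_j) = P(0.4 <= x < 0.5) = 0.1.
eps, delta = 0.05, 0.1
N = int(np.ceil(1 / (delta * eps**2)))   # N_{eps,delta} from (17): here 4000
print(f"Chebyshev-based sample complexity: N = {N}")

f = lambda u: (u >= 0.5).astype(int)
h_j = lambda u: (u >= 0.4).astype(int)
true_risk = 0.1

trials = 2000
deviations = np.empty(trials)
for k in range(trials):
    x = rng.uniform(0.0, 1.0, size=N)
    deviations[k] = abs(np.mean(h_j(x) != f(x)) - true_risk)

# Empirical failure probability vs. the guarantee delta from (16)
print(f"P(|R_N(h_j) - R(h_j)| > {eps}) ~ {np.mean(deviations > eps):.4f} "
      f"(guaranteed <= {delta})")
```

The observed failure probability is typically far below $\delta$, which is consistent with the remark after (15): the Chebyshev-based bound is universal but rather conservative.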