CS485/685 Lecture 16: March 1, 2012
Agnostic Learning
[BDSS] Chapters 2, 3

Agnostic PAC Learning
• Definition: A learner that does not assume that $\mathcal{H}$ contains an error-free hypothesis, and that simply finds the hypothesis with minimum training error, is often called an agnostic learner.
Agnostic PAC Learnability
• Definition: A hypothesis class $\mathcal{H}$ is agnostic PAC learnable if for any $\epsilon > 0$, $\delta \in (0,1)$, there exist a sample size $N_{\mathcal{H}}(\epsilon, \delta)$ and a learning algorithm such that, for any distribution $\mathcal{D}$ and any $N \ge N_{\mathcal{H}}(\epsilon, \delta)$ i.i.d. samples from $\mathcal{D}$, it returns $h \in \mathcal{H}$ such that with probability at least $1 - \delta$
$$L_{\mathcal{D}}(h) \;\le\; \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$$

$\epsilon$-representative
• Definition: A training set $S$ is called $\epsilon$-representative if $\forall h \in \mathcal{H}$, $|L_S(h) - L_{\mathcal{D}}(h)| \le \epsilon$.
• Lemma: Assume that a training set $S$ is $\frac{\epsilon}{2}$-representative. Then any output $h_S$ of an empirical risk minimizing algorithm satisfies
$$L_{\mathcal{D}}(h_S) \;\le\; \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$$
• Proof: For any $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h_S) \;\le\; L_S(h_S) + \tfrac{\epsilon}{2} \;\le\; L_S(h) + \tfrac{\epsilon}{2} \;\le\; L_{\mathcal{D}}(h) + \tfrac{\epsilon}{2} + \tfrac{\epsilon}{2} \;=\; L_{\mathcal{D}}(h) + \epsilon$$
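A minimal Python sketch of empirical risk minimization over a finite hypothesis class may help make the lemma concrete. It is not from the lecture: the erm helper, the class of threshold classifiers, and all parameters (sample size, noise level, threshold grid) are hypothetical choices for illustration.

import numpy as np

def erm(hypotheses, X, y):
    # Return the hypothesis in the finite class with minimum training error L_S(h).
    def empirical_risk(h):
        return np.mean(h(X) != y)  # 0-1 training loss L_S(h)
    return min(hypotheses, key=empirical_risk)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
# Labels from a threshold at 0.6, flipped with probability 0.1 (so no h is error-free).
y = ((X > 0.6) ^ (rng.uniform(size=200) < 0.1)).astype(int)
# Finite class: 21 threshold classifiers h_t(x) = 1[x > t].
H = [lambda X, t=t: (X > t).astype(int) for t in np.linspace(0.0, 1.0, 21)]
h_S = erm(H, X, y)
print("training error of the ERM hypothesis:", np.mean(h_S(X) != y))

Because the labels are noisy, no hypothesis in the class has zero error; the ERM output simply has the smallest training error, which is exactly the agnostic setting above.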
Uniform Convergence
• Definition: A hypothesis class $\mathcal{H}$ has the uniform convergence property if there exists a function $m^{UC}_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}$ such that for every probability distribution $\mathcal{D}$, if $S$ is a sample of $m \ge m^{UC}_{\mathcal{H}}(\epsilon, \delta)$ examples drawn i.i.d. according to $\mathcal{D}$, then with probability at least $1 - \delta$, $S$ is $\epsilon$-representative.

Uniform Convergence
• Corollary 2: If a class $\mathcal{H}$ has the uniform convergence property with a function $m^{UC}_{\mathcal{H}}$, then the class is agnostically PAC learnable with sample complexity $N_{\mathcal{H}}(\epsilon, \delta) \le m^{UC}_{\mathcal{H}}(\frac{\epsilon}{2}, \delta)$. Furthermore, an empirical risk minimization algorithm is a successful agnostic PAC learner for $\mathcal{H}$.
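The slide states Corollary 2 without proof; the short derivation below is my own filling-in, using only the definitions and the lemma above.

% Proof sketch of Corollary 2 (filled in; not from the slides).
% Draw $m \ge m^{UC}_{\mathcal{H}}(\epsilon/2, \delta)$ i.i.d. examples. By the
% uniform convergence property, with probability at least $1 - \delta$ the
% sample $S$ is $\epsilon/2$-representative; on that event the lemma gives,
% for any ERM output $h_S$,
\[
  L_{\mathcal{D}}(h_S) \;\le\; \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon .
\]
% Hence ERM agnostically PAC learns $\mathcal{H}$ with
% $N_{\mathcal{H}}(\epsilon, \delta) \le m^{UC}_{\mathcal{H}}(\epsilon/2, \delta)$.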
Uniform Convergence
• To show that uniform convergence holds, show that:
  1. $|L_S(h) - L_{\mathcal{D}}(h)|$ is likely to be small for any fixed hypothesis (chosen before seeing the data).
  2. Think of $L_S(h)$ as a random variable with mean $L_{\mathcal{D}}(h)$. Then the distribution of $L_S(h)$ is concentrated around its mean for all $h \in \mathcal{H}$.

Measure Concentration
• Let $\theta_i$ be random variables with mean $\mu$. Then as $m \to \infty$, $\frac{1}{m}\sum_{i=1}^{m} \theta_i \to \mu$.
• Use measure concentration inequalities to quantify the deviation of $\frac{1}{m}\sum_{i=1}^{m} \theta_i$ from $\mu$ for finite $m$.
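A quick simulation can illustrate this concentration effect. It is only a demo with assumed parameters (Bernoulli variables with mean 0.3, 1000 repetitions); nothing here comes from the slides.

import numpy as np

rng = np.random.default_rng(1)
mu = 0.3  # assumed true mean of each Bernoulli theta_i
for m in (10, 100, 1000, 10000):
    # 1000 independent repetitions: draw m i.i.d. samples and average them.
    sample_means = rng.binomial(1, mu, size=(1000, m)).mean(axis=1)
    worst = np.max(np.abs(sample_means - mu))
    print(f"m = {m:6d}   largest |sample mean - mu| over 1000 runs = {worst:.4f}")

As m grows, the sample mean stays closer and closer to mu; the inequalities on the next slides quantify how fast.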
Markov's Inequality
• Markov's inequality: for a non-negative random variable $Z$ and any $a > 0$,
$$\Pr[Z \ge a] \;\le\; \frac{E[Z]}{a}$$
• Derivation:
$$E[Z] \;=\; \int_0^{\infty} \Pr[Z \ge x]\,dx \;\ge\; \int_0^{a} \Pr[Z \ge x]\,dx \;\ge\; \int_0^{a} \Pr[Z \ge a]\,dx \;=\; a \Pr[Z \ge a]$$
Dividing both sides by $a$ gives the inequality.

Chebyshev's Inequality
• Bound the deviation from the mean on both sides by applying Markov's inequality to $(Z - E[Z])^2$:
$$\Pr[|Z - E[Z]| \ge a] \;=\; \Pr[(Z - E[Z])^2 \ge a^2] \;\le\; \frac{E[(Z - E[Z])^2]}{a^2} \;=\; \frac{Var[Z]}{a^2}$$
• Since $Var\!\left[\frac{1}{m}\sum_{i=1}^m \theta_i\right] = \frac{Var[\theta]}{m}$ for i.i.d. $\theta_i$'s, then
$$\Pr\!\left[\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| \ge a\right] \;\le\; \frac{Var[\theta]}{m a^2}$$
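The bounds can be checked numerically. The snippet below is an illustrative sanity check with an assumed distribution (exponential with mean 1 and variance 1), not part of the lecture.

import numpy as np

rng = np.random.default_rng(2)
Z = rng.exponential(scale=1.0, size=1_000_000)  # non-negative, E[Z] = 1, Var[Z] = 1
for a in (2.0, 4.0, 8.0):
    empirical = np.mean(Z >= a)       # empirical tail Pr[Z >= a]
    markov = 1.0 / a                  # E[Z] / a
    chebyshev = 1.0 / (a - 1.0) ** 2  # Var[Z] / (a - E[Z])^2; the two-sided bound also covers the upper tail
    print(f"a = {a}: empirical {empirical:.4f}   Markov {markov:.4f}   Chebyshev {chebyshev:.4f}")

Both bounds hold but are loose for this distribution, which is one motivation for the sharper Hoeffding bound on the following slides.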
Chebyshev's Inequality
• Lemma: Let $\theta_1, \dots, \theta_m$ be i.i.d. with $E[\theta_i] = \mu$ and $Var[\theta_i] \le 1$ for all $i$. Then for any $\delta \in (0,1)$, with probability at least $1 - \delta$ we have
$$\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| \;\le\; \sqrt{\frac{1}{\delta m}}$$
• Proof: By Chebyshev's inequality,
$$\Pr\!\left[\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| \ge a\right] \;\le\; \frac{Var[\theta]}{m a^2} \;\le\; \frac{1}{m a^2}$$
Setting $\delta = \frac{1}{m a^2}$ gives $a = \sqrt{\frac{1}{\delta m}}$, hence the deviation exceeds $\sqrt{\frac{1}{\delta m}}$ with probability at most $\delta$.

Hoeffding's Inequality
• Tighter bound than Chebyshev's inequality.
• Let $\theta_1, \dots, \theta_m$ be i.i.d. variables with mean $\mu$.
• Assume that $\Pr[a \le \theta_i \le b] = 1$.
• Then
$$\Pr\!\left[\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| > \epsilon\right] \;\le\; 2 e^{-2 m \epsilon^2 / (b-a)^2}$$
• Hence, when the $\theta_i$ take values in $[0,1]$ (so $b - a = 1$),
$$\Pr\!\left[\left|\frac{1}{m}\sum_{i=1}^m \theta_i - \mu\right| > \epsilon\right] \;\le\; 2 e^{-2 m \epsilon^2}$$
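To see why Hoeffding's bound is tighter, the snippet below compares the two bounds on the deviation of a sample mean of [0,1]-valued variables; the parameters (eps = 0.05, worst-case variance 1/4) are assumed for illustration.

import math

eps = 0.05
var = 0.25  # worst-case variance of a [0, 1]-valued variable
for m in (100, 1000, 10000):
    chebyshev = var / (m * eps ** 2)             # Pr[|mean - mu| >= eps] <= Var[theta] / (m eps^2)
    hoeffding = 2 * math.exp(-2 * m * eps ** 2)  # Pr[|mean - mu| > eps] <= 2 exp(-2 m eps^2), since b - a = 1
    print(f"m = {m:6d}   Chebyshev {chebyshev:.4f}   Hoeffding {hoeffding:.6f}")

Chebyshev's bound shrinks only like 1/m, while Hoeffding's shrinks exponentially in m; this exponential decay is what yields the logarithmic dependence on |H| and delta in the theorem on the next slide.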
Agnostic PAC Learnability
• Theorem: Let $\mathcal{H}$ be finite, $\delta \in (0,1)$, $\epsilon > 0$ and $m \ge \frac{2 \log(2|\mathcal{H}|/\delta)}{\epsilon^2}$. Then with probability at least $1 - \delta$ we have
$$L_{\mathcal{D}}(h_S) \;\le\; \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon$$
• Proof: From Corollary 2, it suffices to show that
$$\Pr\!\left[\exists h \in \mathcal{H},\ |L_S(h) - L_{\mathcal{D}}(h)| > \tfrac{\epsilon}{2}\right] \;\le\; \delta$$
Using the union bound and Hoeffding's inequality (the loss is bounded in $[0,1]$):
$$\Pr\!\left[\exists h \in \mathcal{H},\ |L_S(h) - L_{\mathcal{D}}(h)| > \tfrac{\epsilon}{2}\right] \;\le\; \sum_{h \in \mathcal{H}} \Pr\!\left[|L_S(h) - L_{\mathcal{D}}(h)| > \tfrac{\epsilon}{2}\right] \;\le\; 2|\mathcal{H}|\, e^{-2m(\epsilon/2)^2} \;=\; 2|\mathcal{H}|\, e^{-m\epsilon^2/2} \;\le\; \delta$$
since $m \ge \frac{2 \log(2|\mathcal{H}|/\delta)}{\epsilon^2}$.
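The theorem translates directly into a sample-size calculator. A minimal sketch (the function name and the example numbers are my own, for illustration only):

import math

def sample_complexity(num_hypotheses, eps, delta):
    # m >= 2 * log(2|H|/delta) / eps^2, from the theorem above (natural log).
    return math.ceil(2 * math.log(2 * num_hypotheses / delta) / eps ** 2)

# Example: |H| = 1000 hypotheses, accuracy eps = 0.05, confidence delta = 0.05.
print(sample_complexity(1000, eps=0.05, delta=0.05))  # about 8478 examples

Note the sample size grows only logarithmically in |H| and in 1/delta, but quadratically in 1/eps.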