Foundations of Machine Learning: Learning with Finite Hypothesis Sets
Motivation
Some computational learning questions:
• What can be learned efficiently?
• What is inherently hard to learn?
• A general model of learning?
Complexity:
• Computational complexity: time and space.
• Sample complexity: amount of training data needed to learn successfully.
• Mistake bounds: number of mistakes before learning successfully.
This lecture
• PAC model
• Sample complexity, finite $H$, consistent case
• Sample complexity, finite $H$, inconsistent case
Definitions and Notation
• $X$: set of all possible instances or examples, e.g., the set of all men and women characterized by their height and weight.
• $c \colon X \to \{0,1\}$: the target concept to learn; can be identified with its support $\{x \in X : c(x) = 1\}$.
• $C$: concept class, a set of target concepts $c$.
• $D$: target distribution, a fixed probability distribution over $X$. Training and test examples are drawn according to $D$.
Definitions and Notation
• $S$: training sample.
• $H$: set of concept hypotheses, e.g., the set of all linear classifiers.
• The learning algorithm receives sample $S$ and selects a hypothesis $h_S$ from $H$ approximating $c$.
Errors
• True error or generalization error of $h$ with respect to the target concept $c$ and distribution $D$:
$$R(h) = \Pr_{x \sim D}[h(x) \neq c(x)] = \mathbb{E}_{x \sim D}\big[1_{h(x) \neq c(x)}\big].$$
• Empirical error: average error of $h$ on the training sample $S$ drawn according to distribution $D$:
$$\widehat{R}_S(h) = \Pr_{x \sim \widehat{D}}[h(x) \neq c(x)] = \mathbb{E}_{x \sim \widehat{D}}\big[1_{h(x) \neq c(x)}\big] = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_i) \neq c(x_i)}.$$
• Note: $R(h) = \mathbb{E}_{S \sim D^m}\big[\widehat{R}_S(h)\big]$.
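As a concrete illustration (added here, not part of the original slides), a minimal Python sketch of the two quantities for a hypothetical threshold concept on $X = [0,1]$ with $D$ uniform; the specific $c$, $h$, and sample size are arbitrary choices made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: X = [0, 1], D = uniform, target c(x) = 1{x >= 0.5},
# and a slightly mis-specified hypothesis h(x) = 1{x >= 0.6}.
c = lambda x: (x >= 0.5).astype(int)
h = lambda x: (x >= 0.6).astype(int)

# Empirical error on a training sample S of size m.
m = 100
S = rng.uniform(0.0, 1.0, size=m)
emp_error = np.mean(h(S) != c(S))            # (1/m) * sum of indicator losses

# Monte Carlo estimate of the true error R(h) = Pr_{x~D}[h(x) != c(x)]
# (here the exact value is 0.1, the mass of [0.5, 0.6) under D).
big_sample = rng.uniform(0.0, 1.0, size=1_000_000)
true_error = np.mean(h(big_sample) != c(big_sample))

print(f"empirical error: {emp_error:.3f}, true error (approx.): {true_error:.3f}")
```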
PAC Model (Valiant, 1984)
PAC learning: Probably Approximately Correct learning.
Definition: a concept class $C$ is PAC-learnable if there exists a learning algorithm $L$ such that:
• for all $c \in C$, $\epsilon > 0$, $\delta > 0$, and all distributions $D$,
$$\Pr_{S \sim D^m}[R(h_S) \leq \epsilon] \geq 1 - \delta,$$
• for samples $S$ of size $m = \mathrm{poly}(1/\epsilon, 1/\delta)$ for a fixed polynomial.
Remarks
• Concept class $C$ is known to the algorithm.
• Distribution-free model: no assumption on $D$.
• Both training and test examples are drawn $\sim D$.
• Probably: confidence $1 - \delta$.
• Approximately correct: accuracy $1 - \epsilon$.
• Efficient PAC-learning: $L$ runs in time $\mathrm{poly}(1/\epsilon, 1/\delta)$.
• What about the cost of the representation of $c \in C$?
PAC Model - New Definition
Computational representation:
• cost for $x \in X$ in $O(n)$.
• cost for $c \in C$ in $O(\mathrm{size}(c))$.
Extension: running time
$$O(\mathrm{poly}(1/\epsilon, 1/\delta)) \longrightarrow O(\mathrm{poly}(1/\epsilon, 1/\delta, n, \mathrm{size}(c))).$$
Example - Rectangle Learning
Problem: learn an unknown axis-aligned rectangle $R$ using as small a labeled sample as possible.
Hypothesis: rectangle $R'$. In general, there may be false positive and false negative points.
(Figure: target rectangle $R$ and hypothesis rectangle $R'$.)
Example - Rectangle Learning
Simple method: choose the tightest consistent rectangle $R'$ for a large enough sample.
• How large a sample?
• Is this class PAC-learnable?
• What is the probability that $R(R') > \epsilon$?
Example - Rectangle Learning
Fix $\epsilon > 0$ and assume $\Pr_D[R] > \epsilon$ (otherwise the result is trivial).
Let $r_1, r_2, r_3, r_4$ be the four smallest rectangles along the sides of $R$ such that $\Pr_D[r_i] \geq \frac{\epsilon}{4}$.
Example: for $R = [l, r] \times [b, t]$, take $r_4 = [l, s_4] \times [b, t]$ with
$$s_4 = \inf\Big\{s : \Pr_D\big[[l, s] \times [b, t]\big] \geq \tfrac{\epsilon}{4}\Big\}, \qquad \Pr_D\big[[l, s_4) \times [b, t]\big] < \tfrac{\epsilon}{4}.$$
(Figure: the four boundary regions $r_1, r_2, r_3, r_4$ of $R$ and the hypothesis $R'$.)
Example - Rectangle Learning
Errors can only occur in $R - R'$. Thus (geometry),
$$R(R') > \epsilon \;\Rightarrow\; R' \text{ misses at least one region } r_i.$$
Therefore,
$$\Pr[R(R') > \epsilon] \leq \Pr\Big[\bigcup_{i=1}^{4} \{R' \text{ misses } r_i\}\Big] \leq \sum_{i=1}^{4} \Pr\big[\{R' \text{ misses } r_i\}\big] \leq 4\Big(1 - \frac{\epsilon}{4}\Big)^m \leq 4 e^{-m\epsilon/4}.$$
Example - Rectangle Learning
Set $\delta > 0$ to match the upper bound:
$$4 e^{-m\epsilon/4} \leq \delta \iff m \geq \frac{4}{\epsilon} \log\frac{4}{\delta}.$$
Then, for $m \geq \frac{4}{\epsilon} \log\frac{4}{\delta}$, with probability at least $1 - \delta$, $R(R') \leq \epsilon$.
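To make the bound concrete, here is an added numerical check (not from the lecture): it computes $m \geq \frac{4}{\epsilon}\log\frac{4}{\delta}$ and empirically estimates $\Pr[R(R') > \epsilon]$ for the tightest-rectangle learner; the target rectangle, the uniform distribution on $[0,1]^2$, and the trial counts are all hypothetical choices for the simulation.

```python
import numpy as np

eps, delta = 0.1, 0.05
m = int(np.ceil(4 / eps * np.log(4 / delta)))    # m >= (4/eps) log(4/delta)
print(f"sample size from the bound: m = {m}")     # 176 for eps = 0.1, delta = 0.05

# Empirical check on a hypothetical setup: D = uniform on [0,1]^2,
# target rectangle R = [0.2, 0.7] x [0.3, 0.9].
rng = np.random.default_rng(0)

def in_rect(X, rect):
    """Label points with 1 iff they fall inside the given axis-aligned rectangle."""
    x0, x1, y0, y1 = rect
    return ((X[:, 0] >= x0) & (X[:, 0] <= x1) &
            (X[:, 1] >= y0) & (X[:, 1] <= y1)).astype(int)

R = (0.2, 0.7, 0.3, 0.9)
failures, trials, n_test = 0, 1000, 20000
for _ in range(trials):
    X = rng.uniform(size=(m, 2))
    pos = X[in_rect(X, R) == 1]
    # Tightest consistent rectangle R' (empty rectangle if no positive point was drawn).
    Rp = (pos[:, 0].min(), pos[:, 0].max(), pos[:, 1].min(), pos[:, 1].max()) \
         if len(pos) else (1.0, 0.0, 1.0, 0.0)
    X_test = rng.uniform(size=(n_test, 2))
    err = np.mean(in_rect(X_test, Rp) != in_rect(X_test, R))  # estimate of R(R')
    failures += err > eps

print(f"estimated Pr[R(R') > eps] = {failures / trials:.3f}  (bound guarantees <= {delta})")
```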
Notes
• Infinite hypothesis set, but simple proof.
• Does this proof readily apply to other similar concept classes?
• Geometric properties:
  • key in this proof.
  • in general non-trivial to extend to other classes, e.g., non-concentric circles (see HW2, 2006).
• Need for more general proofs and results.
This lecture
• PAC model
• Sample complexity, finite $H$, consistent case
• Sample complexity, finite $H$, inconsistent case
Learning Bound for Finite H - Consistent Case
Theorem: let $H$ be a finite set of functions from $X$ to $\{0,1\}$ and $L$ an algorithm that for any target concept $c \in H$ and sample $S$ returns a consistent hypothesis $h_S$: $\widehat{R}_S(h_S) = 0$. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$R(h_S) \leq \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big).$$
Learning Bound for Finite H - Consistent Case
Proof: for any $\epsilon > 0$, define $H_\epsilon = \{h \in H : R(h) > \epsilon\}$. Then,
$$\begin{aligned}
\Pr\big[\exists h \in H_\epsilon : \widehat{R}_S(h) = 0\big]
&= \Pr\big[\widehat{R}_S(h_1) = 0 \vee \cdots \vee \widehat{R}_S(h_{|H_\epsilon|}) = 0\big] \\
&\leq \sum_{h \in H_\epsilon} \Pr\big[\widehat{R}_S(h) = 0\big] \qquad \text{(union bound)} \\
&\leq \sum_{h \in H_\epsilon} (1 - \epsilon)^m \leq |H|(1 - \epsilon)^m \leq |H| e^{-m\epsilon}.
\end{aligned}$$
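To make the final step explicit (it is left implicit on the slide), set the right-hand side equal to $\delta$ and solve for $\epsilon$:
$$|H| e^{-m\epsilon} \leq \delta \iff m\epsilon \geq \log\frac{|H|}{\delta} \iff \epsilon \geq \frac{1}{m}\Big(\log|H| + \log\frac{1}{\delta}\Big).$$
Hence, with probability at least $1 - \delta$, no hypothesis with error greater than $\frac{1}{m}\big(\log|H| + \log\frac{1}{\delta}\big)$ is consistent with $S$; in particular the consistent $h_S$ satisfies the bound stated in the theorem.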
Remarks
• The algorithm can be ERM if the problem is realizable.
• Error bound linear in $\frac{1}{m}$ and only logarithmic in $\frac{1}{\delta}$.
• $\log_2|H|$ is the number of bits used for the representation of $H$.
• Bound is loose for large $|H|$.
• Uninformative for infinite $|H|$.
Conjunctions of Boolean Literals
Example for $n = 6$.
Algorithm: start with $x_1 \wedge \bar{x}_1 \wedge \cdots \wedge x_n \wedge \bar{x}_n$ and rule out literals incompatible with positive examples.
0 1 1 0 1 1   +
0 1 1 1 1 1   +
0 0 1 1 0 1   -
0 1 1 1 1 1   +
1 0 0 1 1 0   -
0 1 0 0 1 1   +
Result: 0 1 ? ? 1 1, i.e., $\bar{x}_1 \wedge x_2 \wedge x_5 \wedge x_6$.
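A small Python sketch of this elimination algorithm (my illustration, not code from the lecture), run on the example above; the data representation and function name are assumptions made for the sketch.

```python
def learn_conjunction(examples):
    """Learn a conjunction of Boolean literals from labeled examples.

    examples: list of (bits, label) with bits a tuple of 0/1 values and label in {+1, -1}.
    Start with all 2n literals (x_i and its negation) and drop every literal
    falsified by some positive example.
    Returns, per variable: '1' (x_i kept), '0' (negation kept), or '?' (neither).
    """
    n = len(examples[0][0])
    keep_pos = [True] * n   # x_i still in the conjunction
    keep_neg = [True] * n   # negated x_i still in the conjunction
    for bits, label in examples:
        if label == +1:
            for i, b in enumerate(bits):
                if b == 1:
                    keep_neg[i] = False   # negated literal falsified by this positive example
                else:
                    keep_pos[i] = False   # positive literal falsified by this positive example
    return ''.join('1' if keep_pos[i] else '0' if keep_neg[i] else '?' for i in range(n))

# The six examples from the slide (n = 6).
data = [
    ((0, 1, 1, 0, 1, 1), +1),
    ((0, 1, 1, 1, 1, 1), +1),
    ((0, 0, 1, 1, 0, 1), -1),
    ((0, 1, 1, 1, 1, 1), +1),
    ((1, 0, 0, 1, 1, 0), -1),
    ((0, 1, 0, 0, 1, 1), +1),
]
print(learn_conjunction(data))  # -> "01??11", i.e. the conjunction shown on the slide
```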
Conjunctions of Boolean Literals
Problem: learning the class $C_n$ of conjunctions of boolean literals with at most $n$ variables (e.g., for $n = 3$, $x_1 \wedge x_2 \wedge x_3$).
Algorithm: choose $h$ consistent with $S$.
• Since $|H| = |C_n| = 3^n$, sample complexity:
$$m \geq \frac{1}{\epsilon}\Big((\log 3)\, n + \log\frac{1}{\delta}\Big).$$
For $\delta = .02$, $\epsilon = .1$, $n = 10$: $m \geq 149$.
• Computational complexity: polynomial, since the algorithmic cost per training example is in $O(n)$.
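A quick numerical check of this sample-complexity formula (an added illustration; the helper name is mine):

```python
import math

def conjunction_sample_complexity(n, eps, delta):
    """Smallest integer m with m >= (1/eps) * (n * log 3 + log(1/delta)), for |H| = 3^n."""
    return math.ceil((n * math.log(3) + math.log(1 / delta)) / eps)

print(conjunction_sample_complexity(n=10, eps=0.1, delta=0.02))  # -> 149, as on the slide
```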
This lecture
• PAC model
• Sample complexity, finite $H$, consistent case
• Sample complexity, finite $H$, inconsistent case
Inconsistent Case
• No $h \in H$ is a consistent hypothesis.
• The typical case in practice: difficult problems, complex concept class.
• But inconsistent hypotheses with a small number of errors on the training set can be useful.
• Need a more powerful tool: Hoeffding's inequality.
Hoeffding's Inequality
Corollary: for any $\epsilon > 0$ and any hypothesis $h \colon X \to \{0,1\}$, the following inequalities hold:
$$\Pr\big[R(h) - \widehat{R}_S(h) \geq \epsilon\big] \leq e^{-2m\epsilon^2}, \qquad \Pr\big[\widehat{R}_S(h) - R(h) \geq \epsilon\big] \leq e^{-2m\epsilon^2}.$$
Combining these one-sided inequalities yields
$$\Pr\big[|R(h) - \widehat{R}_S(h)| \geq \epsilon\big] \leq 2 e^{-2m\epsilon^2}.$$
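A minimal simulation (my own, with an arbitrary true error rate of $0.3$) comparing the observed deviation probability to the Hoeffding bound for a single fixed hypothesis:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a fixed hypothesis h, each example contributes an independent Bernoulli
# loss 1_{h(x) != c(x)}; assume (hypothetically) its true error is R(h) = 0.3.
true_error, m, eps, trials = 0.3, 200, 0.05, 100_000

losses = rng.random((trials, m)) < true_error        # Bernoulli(0.3) losses
emp_errors = losses.mean(axis=1)                      # one empirical error per sample S
deviation_freq = np.mean(np.abs(emp_errors - true_error) >= eps)
hoeffding_bound = 2 * np.exp(-2 * m * eps**2)         # 2 exp(-2 m eps^2)

print(f"observed Pr[|R - R_hat| >= eps] ~ {deviation_freq:.3f}, bound = {hoeffding_bound:.3f}")
```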
Application to Learning Algorithm?
• Can we apply that bound to the hypothesis $h_S$ returned by our learning algorithm when training on sample $S$?
• No, because $h_S$ is not a fixed hypothesis: it depends on the training sample.
• Note also that $\mathbb{E}\big[\widehat{R}_S(h_S)\big]$ is not a simple quantity such as $R(h_S)$.
• Instead, we need a bound that holds simultaneously for all hypotheses $h \in H$, a uniform convergence bound.
Generalization Bound - Finite H
Theorem: let $H$ be a finite hypothesis set. Then, for any $\delta > 0$, with probability at least $1 - \delta$,
$$\forall h \in H, \quad R(h) \leq \widehat{R}_S(h) + \sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}.$$
Proof: By the union bound,
$$\begin{aligned}
\Pr\Big[\max_{h \in H} \big|R(h) - \widehat{R}_S(h)\big| > \epsilon\Big]
&= \Pr\Big[\big|R(h_1) - \widehat{R}_S(h_1)\big| > \epsilon \vee \cdots \vee \big|R(h_{|H|}) - \widehat{R}_S(h_{|H|})\big| > \epsilon\Big] \\
&\leq \sum_{h \in H} \Pr\Big[\big|R(h) - \widehat{R}_S(h)\big| > \epsilon\Big] \\
&\leq 2|H| \exp(-2m\epsilon^2).
\end{aligned}$$
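As a numeric illustration (added here, with arbitrary choices of $|H| = 3^{10}$ and $\delta = 0.05$), the complexity term of this bound can be evaluated directly to see its $O(1/\sqrt{m})$ decay:

```python
import math

def finite_H_bound(H_size, m, delta):
    """Complexity term sqrt((log|H| + log(2/delta)) / (2m)) of the generalization bound."""
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * m))

for m in (100, 1_000, 10_000):
    print(m, round(finite_H_bound(H_size=3**10, m=m, delta=0.05), 3))
# The gap between R(h) and R_hat_S(h) shrinks as O(1/sqrt(m)).
```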
Remarks
• Thus, for a finite hypothesis set, with high probability,
$$\forall h \in H, \quad R(h) \leq \widehat{R}_S(h) + O\Bigg(\sqrt{\frac{\log|H|}{m}}\Bigg).$$
• Error bound in $O\big(\tfrac{1}{\sqrt{m}}\big)$ (quadratically worse than the consistent case).
• $\log_2|H|$ can be interpreted as the number of bits needed to encode $H$.
• Occam's Razor principle (theologian William of Occam): "plurality should not be posited without necessity."