
LECTURE 15: LEARNING THEORY - PowerPoint PPT Presentation



CS446 Introduction to Machine Learning (Fall 2013)
University of Illinois at Urbana-Champaign
http://courses.engr.illinois.edu/cs446
Lecture 15: Learning Theory
Prof. Julia Hockenmaier (juliahmr@illinois.edu)

Announcements
– Midterm grades are available on Compass.
– Regrade requests: send us email, and come and see me next Tuesday.

Learning theory questions
– Sample complexity: How many training examples are needed for a learner to converge (with high probability) to a successful hypothesis?
– Computational complexity: How much computational effort is required for a learner to converge (with high probability) to a successful hypothesis?
– Mistake bounds: How many training examples will the learner misclassify before converging to a successful hypothesis?
(PAC learning: Probably Approximately Correct.)

Terminology
– The instance space X is the set of all instances x. Assume each x is of size n.
– Instances are drawn i.i.d. from an unknown probability distribution D over X: x ~ D.
– A concept c: X → {0,1} is a Boolean function (it identifies a subset of X).
– A concept class C is a set of concepts.
– The hypothesis space H is the (sub)set of Boolean functions considered by the learner L.
– We evaluate L by its performance on new instances drawn i.i.d. from D.

What can a learner learn?
– We can't expect to learn concepts exactly: many concepts may be consistent with the data, and unseen examples could have any label.
– We can't expect to always learn close approximations to the target concept: sometimes the data will not be representative.
– We can only expect to learn, with high probability, a close approximation to the target concept.

True error of a hypothesis
The true error error_D(h) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:
error_D(h) = P_{x~D}(c(x) ≠ h(x))

PAC learnability
Consider:
– a concept class C over a set of instances X (each x is of length n), and
– a learner L that uses hypothesis space H.
C is PAC-learnable by L if, for all c ∈ C and any distribution D over X, L will output with probability at least (1 − δ), and in time that is polynomial in 1/ε, 1/δ, n, and size(c), a hypothesis h ∈ H with error_D(h) ≤ ε (for 0 < δ < 0.5 and 0 < ε < 0.5).
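The true-error definition is easy to approximate empirically. Below is a minimal Python sketch (not part of the lecture): it draws samples from an assumed distribution D and counts disagreements between a toy target concept c and hypothesis h. All concrete choices (X = {0,1}^5, uniform D, the particular c and h) are illustrative assumptions.

```python
import random

def true_error_mc(h, c, sample_from_D, num_samples=100_000):
    """Monte Carlo estimate of error_D(h) = P_{x~D}(c(x) != h(x))."""
    errors = 0
    for _ in range(num_samples):
        x = sample_from_D()
        if h(x) != c(x):
            errors += 1
    return errors / num_samples

# Toy setup (illustrative, not from the lecture):
# instance space X = {0,1}^5, D = uniform over X,
# target concept c(x) = x1 AND x2, hypothesis h(x) = x1.
n = 5
sample_from_D = lambda: [random.randint(0, 1) for _ in range(n)]
c = lambda x: x[0] == 1 and x[1] == 1
h = lambda x: x[0] == 1

# h disagrees with c exactly when x1 = 1 and x2 = 0, so error_D(h) = 0.25.
print(true_error_mc(h, c, sample_from_D))
```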

PAC learnability in plain English
– L must, with arbitrarily high probability (1 − δ), output a hypothesis h with arbitrarily low error ε.
– L must learn h efficiently (using a polynomial amount of time per example, and a polynomial number of examples).

Sample complexity (for finite hypothesis spaces and consistent learners)
Consistent learner: returns hypotheses that perfectly fit the training data (whenever possible).

Version space VS_{H,D}
The version space VS_{H,D} is the set of all hypotheses h ∈ H that correctly classify the training data D:
VS_{H,D} = { h ∈ H | ∀ ⟨x, c(x)⟩ ∈ D: h(x) = c(x) }
Every consistent learner outputs a hypothesis belonging to the version space, so we only need to bound the number of examples required to assure that the version space does not contain any unacceptable hypotheses.

Sample complexity (finite H)
– The version space VS_{H,D} is said to be ε-exhausted with respect to concept c and distribution D if every h ∈ VS_{H,D} has true error < ε with respect to c and D.
– If H is finite, and the data D is a sequence of m i.i.d. samples of c, then for any 0 ≤ ε ≤ 1, the probability that VS_{H,D} is not ε-exhausted with respect to c is ≤ |H| e^(−εm).
– Number of training examples required to reduce the probability of failure below δ: find m such that |H| e^(−εm) < δ.
– So a consistent learner needs m ≥ (1/ε)(ln|H| + ln(1/δ)) examples to get error at most ε with probability at least 1 − δ (often an overestimate; |H| can be very large).
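As a quick illustration of the last bullet, here is a minimal Python sketch (not from the slides) that evaluates the bound m ≥ (1/ε)(ln|H| + ln(1/δ)). The example hypothesis space, conjunctions over n = 10 Boolean variables with |H| = 3^n (each variable appears positively, negatively, or not at all), is an assumption chosen only for illustration.

```python
import math

def sample_complexity(epsilon, delta, h_size):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta)).
    For a consistent learner over a finite H, this many i.i.d. examples
    suffice for the version space to be epsilon-exhausted with
    probability at least 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Illustrative numbers: conjunctions over n = 10 Boolean variables,
# |H| = 3**10, target error 0.1, confidence 95%.
print(sample_complexity(epsilon=0.1, delta=0.05, h_size=3**10))  # 140
```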

PAC learning: intuition
– A hypothesis h is bad if its true error > ε: Pr_{x~D}(h(x) ≠ h*(x)) > ε.
– A hypothesis h looks good if it is correct on our training set S: ∀ s ∈ S: h(s) = h*(s), with |S| = m.
– We want the probability that a bad hypothesis looks good to be smaller than δ.

Bounding that probability:
– Probability that one bad h gets one x ~ D correct: P(h(x) = h*(x)) ≤ 1 − ε.
– Probability that one bad h gets m i.i.d. draws x ~ D correct: ≤ (1 − ε)^m.
– Probability that any bad h gets all m draws correct: ≤ |H| (1 − ε)^m (union bound: P(A ∨ B) ≤ P(A) + P(B)).
– Set |H| (1 − ε)^m ≤ δ and solve for m.

Vapnik-Chervonenkis (VC) dimension

VC dimension (basic idea)
The VC dimension of a hypothesis space H measures the complexity of H not by the number of distinct hypotheses (|H|), but by the number of distinct instances from X that can be completely discriminated using H.
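The last step, solving |H|(1 − ε)^m ≤ δ for m, can be done exactly; the simpler (1/ε)(ln|H| + ln(1/δ)) form used earlier follows from the inequality −ln(1 − ε) ≥ ε and is slightly looser. A small Python sketch comparing the two (the numbers are illustrative, not from the lecture):

```python
import math

def m_from_intuition(epsilon, delta, h_size):
    """Smallest m with |H| * (1 - epsilon)**m <= delta,
    i.e. m >= ln(|H|/delta) / -ln(1 - epsilon)."""
    return math.ceil(math.log(h_size / delta) / -math.log(1.0 - epsilon))

def m_from_standard_bound(epsilon, delta, h_size):
    """The looser but simpler form m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# Same illustrative setting as before: |H| = 3**10, epsilon = 0.1, delta = 0.05.
print(m_from_intuition(0.1, 0.05, 3**10))       # 133
print(m_from_standard_bound(0.1, 0.05, 3**10))  # 140
```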

Shattering a set of instances
– A set of instances S is shattered by the hypothesis space H if and only if for every dichotomy of S there is a hypothesis h in H that is consistent with this dichotomy. (Dichotomy: a labeling of the instances in S as + or −.)
– The ability of H to shatter S is a measure of its capacity to represent concepts over S.

VC dimension of H
– The VC dimension of the hypothesis space H, VC(H), is the size of the largest finite subset of the instance space X that can be shattered by H.
– If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.

VC dimension if H is finite
– If H is finite: VC(H) ≤ log2|H|.
– H requires 2^d distinct hypotheses to shatter d instances.
– So if VC(H) = d, then 2^d ≤ |H|, hence d = VC(H) ≤ log2|H|.

VC dimension of linear classifiers in 2 dimensions
– The VC dimension of a 2-d linear classifier is 3: three is the size of the largest set of points that can be labeled arbitrarily and still be realized by some linear classifier.
– Note that |H| is infinite here, yet the expressiveness of the class is quite low.
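To make the shattering definition concrete, here is a small Python sketch (not from the slides) that brute-force checks whether a 2-d linear classifier can realize every dichotomy of a given point set, using an LP feasibility check from scipy; the helper names and point sets are illustrative. It shows that three points in general position can be shattered, while a four-point set with the XOR labeling cannot, consistent with VC dimension 3 for 2-d linear classifiers.

```python
from itertools import product

from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Feasibility LP: is there (w1, w2, b) with y_i*(w.x_i + b) >= 1 for all i?"""
    A_ub, b_ub = [], []
    for (x1, x2), y in zip(points, labels):
        # y*(w1*x1 + w2*x2 + b) >= 1  <=>  -y*x1*w1 - y*x2*w2 - y*b <= -1
        A_ub.append([-y * x1, -y * x2, -y])
        b_ub.append(-1.0)
    res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    """True if every +/- dichotomy of `points` is realized by a linear classifier."""
    return all(linearly_separable(points, labels)
               for labels in product([-1, 1], repeat=len(points)))

three_points = [(0, 0), (1, 0), (0, 1)]         # non-collinear: in general position
four_points = [(0, 0), (1, 0), (0, 1), (1, 1)]  # XOR labeling is not separable
print(shattered(three_points))  # True  -> VC dim of 2-d linear classifiers >= 3
print(shattered(four_points))   # False -> this 4-point set cannot be shattered
```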
