CSCE 478/878 Lecture 3: Computational Learning Theory




1. Introduction

(CSCE 478/878 Lecture 3: Computational Learning Theory — Stephen D. Scott, adapted from Tom Mitchell's slides, September 8, 2003)

• Combines machine learning with:
  – Algorithm design and analysis
  – Computational complexity
• Examines the worst-case minimum and maximum data and time requirements for learning
  – Number of examples needed, number of mistakes made before convergence
• Tries to relate:
  – Probability of successful learning
  – Number of training examples
  – Complexity of the hypothesis space
  – Accuracy to which the target concept is approximated
  – Manner in which training examples are presented
• Some average-case analyses are done as well

Outline

• Probably approximately correct (PAC) learning
• Sample complexity
• Agnostic learning
• Vapnik-Chervonenkis (VC) dimension
• Mistake bound model
• Note: as with the previous lecture, we assume no noise, though most of the results can be made to hold in a noisy setting

PAC Learning: The Problem Setting

Given:
• a set of instances X
• a set of hypotheses H
• a set of possible target concepts C (typically, C ⊆ H)
• training instances independently generated by a fixed, unknown, arbitrary probability distribution D over X

The learner observes a sequence D of training examples of the form ⟨x, c(x)⟩, for some target concept c ∈ C:
• instances x are drawn from distribution D
• the teacher provides the target value c(x) for each

A small sketch of this data-generation process appears below.
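To make the problem setting concrete, here is a minimal Python sketch (not from the slides) of how training data arises in the PAC model: instances are drawn i.i.d. from a fixed but unknown distribution D and labeled by a hidden target concept c. The uniform distribution, the particular conjunction used as c, and all helper names are illustrative assumptions.

import random

n = 5  # number of boolean features per instance (illustrative choice)

def draw_instance():
    """Draw one instance from a fixed but 'unknown' distribution D over {0,1}^n.
    Here D is simply uniform; the learner never gets to see this code."""
    return tuple(random.randint(0, 1) for _ in range(n))

def target_concept(x):
    """A hidden target concept c: here, x1 AND (NOT x3).
    The learner only ever sees its labels, never its definition."""
    return x[0] == 1 and x[2] == 0

def draw_training_sample(m):
    """The learner observes m i.i.d. examples of the form <x, c(x)>."""
    return [(x, target_concept(x)) for x in (draw_instance() for _ in range(m))]

for x, label in draw_training_sample(10):
    print(x, label)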

2. PAC Learning: The Problem Setting (cont'd)

• The learner must output a hypothesis h ∈ H approximating c
• h is evaluated by its performance on subsequent instances drawn according to D
• Note: probabilistic instances, noise-free classifications

True Error of a Hypothesis

[Figure: instance space X showing the regions labeled + and − by c and by h; the shaded region where c and h disagree is c △ h, the symmetric difference between c and h.]

Definition: The true error (denoted error_D(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D:

  error_D(h) ≡ Pr_{x ∈ D}[c(x) ≠ h(x)]

(example x ∈ X drawn randomly according to D)

Two Notions of Error

Training error of hypothesis h with respect to target concept c:
• How often h(x) ≠ c(x) over the training instances

True error of hypothesis h with respect to c:
• How often h(x) ≠ c(x) over future random instances

Our concern:
• Can we bound the true error of h given the training error of h? (A small numerical illustration follows after the PAC-learnability definition below.)
• First consider the case when the training error of h is zero (i.e., h ∈ VS_{H,D})

PAC Learning

Consider a class C of possible target concepts defined over a set of instances X of size n, and a learner L using hypothesis space H.

Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will, with probability at least (1 − δ), output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
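As an illustration of the two notions of error above (not part of the original slides), the sketch below contrasts the training error of a hypothesis, measured on a small sample, with a Monte Carlo estimate of its true error under D. The particular concept, hypothesis, and uniform distribution are assumptions made for the example.

import random

n = 5
random.seed(0)

def draw_instance():
    # The 'unknown' distribution D: uniform over {0,1}^n for this illustration.
    return tuple(random.randint(0, 1) for _ in range(n))

def c(x):   # target concept: x1 AND (NOT x3)
    return x[0] == 1 and x[2] == 0

def h(x):   # a candidate hypothesis that ignores x3: just x1
    return x[0] == 1

# Training error: fraction of disagreements on the observed sample.
train = [draw_instance() for _ in range(20)]
training_error = sum(c(x) != h(x) for x in train) / len(train)

# True error error_D(h): probability of disagreement under D,
# approximated here by a large fresh Monte Carlo sample.
test = [draw_instance() for _ in range(100_000)]
true_error_estimate = sum(c(x) != h(x) for x in test) / len(test)

print(f"training error       = {training_error:.3f}")
print(f"estimated error_D(h) = {true_error_estimate:.3f}")  # ~0.25 for this c, h, D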

3. Exhausting the Version Space

[Figure: hypothesis space H containing the version space VS_{H,D}; each hypothesis is annotated with its training error r and its true error, e.g. error = .1, r = 0 (r = training error, error = true error). The hypotheses inside VS_{H,D} all have r = 0 but can still have nonzero true error.]

Definition: The version space VS_{H,D} is said to be ε-exhausted with respect to c and D if every hypothesis h ∈ VS_{H,D} has error less than ε with respect to c and D:

  (∀h ∈ VS_{H,D}) error_D(h) < ε

How many examples m will ε-exhaust the VS?

• Let h_1, ..., h_k ∈ H be all the hypotheses with true error > ε w.r.t. c and D (i.e., the ε-bad hypotheses)
• The VS is not ε-exhausted iff at least one of these hypotheses is consistent with all m examples
• The probability that an ε-bad hypothesis is consistent with one random example is ≤ (1 − ε)
• Since random draws are independent, the probability that a particular ε-bad hypothesis is consistent with m examples is ≤ (1 − ε)^m
• So the probability that any ε-bad hypothesis is in the VS is ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m
• Given that (1 − ε) ≤ e^{−ε} for ε ∈ [0, 1]:  |H|(1 − ε)^m ≤ |H|e^{−mε}

How many examples m will ε-exhaust the VS? (cont'd)

Theorem [Haussler, 1988]: If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is ≤ |H|e^{−mε}.

This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε.

If we want this probability to be ≤ δ (for PAC):

  |H|e^{−mε} ≤ δ,  or equivalently  m ≥ (1/ε)(ln|H| + ln(1/δ))

Learning Conjunctions of Boolean Literals

How many examples are sufficient to assure with probability at least (1 − δ) that every h in VS_{H,D} satisfies error_D(h) ≤ ε?

Use the theorem:

  m ≥ (1/ε)(ln|H| + ln(1/δ))

Suppose H contains conjunctions of constraints on up to n boolean attributes (i.e., n boolean literals). Then |H| = 3^n (why?), and

  m ≥ (1/ε)(ln 3^n + ln(1/δ)),  i.e.,  m ≥ (1/ε)(n ln 3 + ln(1/δ))

suffices. Still need to find a hypothesis from the VS! (A small calculator for this bound appears below.)
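As a quick check of the bound (an illustrative sketch, not from the slides), the helper below evaluates m ≥ (1/ε)(ln|H| + ln(1/δ)) and applies it to the conjunctions-of-boolean-literals case where |H| = 3^n; the parameter values and the function name sample_complexity are arbitrary choices for the example.

import math

def sample_complexity(ln_H, epsilon, delta):
    """Smallest integer m satisfying m >= (1/epsilon) * (ln|H| + ln(1/delta)).
    ln|H| is passed directly so huge hypothesis spaces don't overflow."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / epsilon)

# Conjunctions of up to n boolean literals: |H| = 3^n, so ln|H| = n * ln 3.
n = 10
epsilon, delta = 0.1, 0.05
m = sample_complexity(n * math.log(3), epsilon, delta)
print(f"n={n}, epsilon={epsilon}, delta={delta}: m >= {m}")  # grows only linearly in n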

4. How About EnjoySport?

  m ≥ (1/ε)(ln|H| + ln(1/δ))

If H is as given in EnjoySport, then |H| = 973, and

  m ≥ (1/ε)(ln 973 + ln(1/δ))

... so if we want to assure with probability 95% that the VS contains only hypotheses with error_D(h) ≤ 0.1, then it is sufficient to have m examples, where

  m ≥ (1/0.1)(ln 973 + ln(1/0.05))
  m ≥ 10(ln 973 + ln 20)
  m ≥ 10(6.88 + 3.00)
  m ≥ 98.8

Again, how to find a consistent hypothesis?

Unbiased Learners

• Recall the unbiased concept class C = 2^X, i.e., the set of all subsets of X
• If each instance x ∈ X is described by n boolean features, then |X| = 2^n, so |C| = 2^{2^n}
• Also, to ensure c ∈ H, we need H = C, so the theorem gives

  m ≥ (1/ε)(2^n ln 2 + ln(1/δ)),

  i.e., exponentially large sample complexity
• Note the above is only sufficient; the theorem does not give the necessary sample complexity
• (The necessary sample complexity is still exponential)

⇒ Further evidence for the need of bias (as if we need more)

Agnostic Learning

So far, we assumed c ∈ H. Agnostic learning setting: don't assume c ∈ H.

• What do we want then?
  – The hypothesis h that makes the fewest errors on the training data (i.e., the one that minimizes disagreements, which can be harder than finding a consistent hypothesis)
• What is the sample complexity in this case?

  m ≥ (1/(2ε²))(ln|H| + ln(1/δ)),

  derived from Hoeffding bounds, which bound the probability of a large deviation of the training error from its expected value:

  Pr[error_D(h) > error_train(h) + ε] ≤ e^{−2mε²}

  where error_train(h) denotes h's training error. (A numerical sketch comparing this bound with the earlier one appears after the definitions below.)

Vapnik-Chervonenkis Dimension: Shattering a Set of Instances

Definition: A dichotomy of a set S is a partition of S into two disjoint subsets, i.e., into a set of + examples and a set of − examples.

Definition: A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
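As promised above, here is a numerical sketch (not from the slides) that reproduces the EnjoySport calculation and contrasts it with the corresponding agnostic (Hoeffding-based) bound for the same |H|, ε, and δ; the helper names are invented for the example.

import math

def consistent_bound(H_size, epsilon, delta):
    # m >= (1/epsilon) * (ln|H| + ln(1/delta)) -- consistent-learner case
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)

def agnostic_bound(H_size, epsilon, delta):
    # m >= (1/(2*epsilon^2)) * (ln|H| + ln(1/delta)) -- agnostic case
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * epsilon ** 2))

# EnjoySport: |H| = 973, epsilon = 0.1, delta = 0.05
print(consistent_bound(973, 0.1, 0.05))  # 99, matching the m >= 98.8 on the slide
print(agnostic_bound(973, 0.1, 0.05))    # 494: dropping the c-in-H assumption costs a factor 1/(2*epsilon)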

5. The Vapnik-Chervonenkis Dimension

Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X, is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) ≡ ∞.

• So to show that VC(H) = d, one must show that there exists some subset X′ ⊂ X of size d that H can shatter, and show that there exists no subset of X of size > d that H can shatter
• Note that VC(H) ≤ log₂|H| (why?)

Example: Three Instances Shattered

[Figure: instance space X with a set of three instances that is shattered by H — every one of the 8 dichotomies is realized by some hypothesis.]

Example: Intervals on ℜ

• Let H be the set of closed intervals on the real line (each hypothesis is a single interval), let X = ℜ, and let a point x ∈ X be positive iff it lies in the target interval

[Figure: two points a and b on the real line, with closed intervals realizing each of the four dichotomies pos/pos, pos/neg, neg/pos, neg/neg.]

• Can shatter 2 points, so what is a lower bound on the VCD? VC(H) ≥ 2
• What about an upper bound? Can't shatter any 3 points (no interval labels the two outer points positive and the middle point negative), so VC(H) < 3
• Thus VC(H) = 2 (also note that |H| is infinite)

VCD of Linear Decision Surfaces (Halfspaces)
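Before moving on to halfspaces, here is a brute-force check of the intervals example above (an illustrative sketch, not from the slides): it enumerates every dichotomy of a finite point set and tests whether some closed interval realizes it.

from itertools import product

def interval_consistent(points, labels):
    """Is there a single closed interval [a, b] labeling exactly the '+' points positive?
    For finitely many points it suffices to test the tightest interval around the positives."""
    pos = [p for p, lab in zip(points, labels) if lab]
    if not pos:                      # all-negative dichotomy: an interval disjoint from the points works
        return True
    a, b = min(pos), max(pos)
    # The tightest interval around the positives must not capture any negative point.
    return all(not (a <= p <= b) for p, lab in zip(points, labels) if not lab)

def shattered_by_intervals(points):
    """True iff every dichotomy of the points is realized by some closed interval."""
    return all(interval_consistent(points, labels)
               for labels in product([False, True], repeat=len(points)))

print(shattered_by_intervals([1.0, 2.0]))        # True:  2 points can be shattered, so VC(H) >= 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))   # False: the +,-,+ dichotomy is impossible, so VC(H) < 3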
