Computational Learning Theory

• For which tasks is successful learning possible?
• Under what conditions is successful learning guaranteed?
• What is successful learning?
• Probably approximately correct (PAC) framework
  – Bounds on number of training examples needed
• Mistake bound framework
  – Bounds on training errors for intermediate hypotheses

Problem

• Given
  – Size or complexity of hypothesis space considered by learner
  – Accuracy to which target concept must be approximated
  – Probability that learner will output successful hypothesis
  – Manner in which training examples are presented to learner
• Find
  – Sample complexity
    ∗ Number of training examples needed for learner to converge (with high probability) to a successful hypothesis
  – Computational complexity
    ∗ Amount of computational effort needed for learner to converge (with high probability) to a successful hypothesis
  – Mistake bound
    ∗ Number of training examples misclassified by learner before converging to a successful hypothesis

Problem Details

• Successful hypothesis
  – Equals the target concept
  – Usually agrees with the target concept
• How training examples are obtained
  – Helpful teacher (near misses)
  – Learner-generated queries
  – Random sample

Probably Learning an Approximately Correct Hypothesis

• Probably approximately correct (PAC) learning model
• E.g., boolean-valued concepts from noise-free training data
• Problem setting
  – X = set of all possible instances
  – C = set of possible target concepts
    ∗ Each c ∈ C corresponds to a boolean-valued function c : X → {0, 1}
    ∗ c(x) = 1 → positive example
    ∗ c(x) = 0 → negative example
  – Instances randomly sampled from X according to probability distribution D
    ∗ D is stationary (does not change over time)
  – Training examples consist of ⟨x, c(x)⟩
    ∗ x randomly drawn from X according to D
  – Learner L considers possible hypotheses from H
  – Learner's output h evaluated on a test set randomly drawn from X according to D
  – Looking for successful combinations of L, H, and C
  – Worst-case analysis over all possible C and D

Error of Hypothesis

[Figure: instance space X, showing the regions where target concept c and hypothesis h disagree]

• True error (error_D(h))
  – Of hypothesis h with respect to target concept c and distribution D: the probability that h will misclassify an instance drawn at random according to D
  – error_D(h) = Pr_{x∈D}[c(x) ≠ h(x)]
• D can be any distribution, not necessarily uniform
• L can only see training examples
• Training error = fraction of training examples misclassified by h
• Analysis centers around how well training error estimates true error

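To make the distinction concrete, here is a minimal Python sketch with a hypothetical target concept c = a1 ∧ a2 and hypothesis h = a1 over boolean instances: the true error is estimated by sampling many instances from D, while the training error is measured on a small fixed sample.

```python
import random

def c(x):             # hypothetical target concept: a1 AND a2
    return x[0] and x[1]

def h(x):             # hypothetical learned hypothesis: a1 alone
    return x[0]

def draw(n_attrs=3):  # D: each boolean attribute independently 1 with prob 0.5
    return tuple(random.random() < 0.5 for _ in range(n_attrs))

# True error error_D(h): probability over D that h and c disagree (Monte Carlo estimate)
true_err = sum(c(x) != h(x) for x in (draw() for _ in range(100_000))) / 100_000

# Training error: fraction of a small random training sample misclassified by h
train = [draw() for _ in range(20)]
train_err = sum(c(x) != h(x) for x in train) / len(train)

print(f"estimated error_D(h) = {true_err:.3f}, training error = {train_err:.3f}")
```

On a small sample the training error can differ noticeably from the true error (here about 0.25), which is exactly the gap the analysis below quantifies.
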
PAC Learnability

• What classes of target concepts can be reliably learned with a reasonable amount of time and training examples?
• Learnability constraints
  – Requiring error_D(h) = 0
    ∗ Impossible unless we see all of X
    ∗ There is always some chance the randomly drawn training sample is misleading
  – Requiring error_D(h) ≤ ε
    ∗ With probability of failure ≤ δ
    ∗ I.e., probably learn an approximately correct hypothesis (PAC)

Definition

• Given a concept class C over instances X of length n and a learner L using hypothesis space H, C is PAC-learnable by L using H if for all c ∈ C, all distributions D over X, all ε with 0 < ε < 1/2, and all δ with 0 < δ < 1/2, learner L will with probability (1 − δ) output a hypothesis h ∈ H such that error_D(h) ≤ ε, in time polynomial in 1/ε, 1/δ, n, and size(c)
• n = size of an instance (e.g., number of boolean attributes)
• size(c) = length of some encoding of elements of C
• The definition implicitly limits the number of training examples to be polynomial too

Sample Complexity for Finite Hypothesis Spaces

• Sample complexity
  – Number of training examples needed for learner to produce a PAC hypothesis
• Sample complexity for a consistent learner
  – Consistent learner
    ∗ Outputs a hypothesis with no errors on training data (when possible)
• Bound on sample complexity of ANY consistent learner
  – Recall the version space VS_{H,D}
    ∗ VS_{H,D} = {h ∈ H | ∀⟨x, c(x)⟩ ∈ D (h(x) = c(x))}
  – Every consistent learner outputs some h ∈ VS_{H,D}, for any X, H, and D
  – So it suffices to bound the number of examples needed before every hypothesis remaining in VS_{H,D} is acceptable

ε-Exhausted Version Space

[Figure: hypothesis space H with version space VS_{H,D}; each hypothesis labeled with r = training error and error = true error]

• Given hypothesis space H, target concept c, instance distribution D, and a set D of training examples of c, the version space VS_{H,D} is ε-exhausted with respect to c and D if every hypothesis h ∈ VS_{H,D} has error less than ε with respect to c and D:
  ∀h ∈ VS_{H,D} (error_D(h) < ε)
• Can bound the probability that VS_{H,D} is ε-exhausted after some number of training examples

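A small simulation can illustrate the definition. The sketch below assumes a toy setup not taken from the slides: X is all 3-bit instances, H is every boolean function over X (represented as truth tables), D is uniform, and the target is drawn from H. It estimates how often the version space is ε-exhausted after m random examples.

```python
import itertools, random

X = list(itertools.product([0, 1], repeat=3))        # 8 possible instances
H = list(itertools.product([0, 1], repeat=len(X)))   # 256 hypotheses as truth tables
c = random.choice(H)                                 # an arbitrary target concept

def value(h, x):                                     # h(x), looked up in the truth table
    return h[X.index(x)]

def true_error(h):                                   # error_D(h) under uniform D
    return sum(value(h, x) != value(c, x) for x in X) / len(X)

def eps_exhausted_after(m, eps):
    sample = [random.choice(X) for _ in range(m)]    # m examples drawn from D
    vs = [h for h in H if all(value(h, x) == value(c, x) for x in sample)]
    return all(true_error(h) < eps for h in vs)

# Fraction of trials in which VS_{H,D} is 0.25-exhausted after m = 10 examples
print(sum(eps_exhausted_after(10, 0.25) for _ in range(200)) / 200)
```
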
Thm. 7.1: ε-Exhausting the Version Space

• If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} is not ε-exhausted (with respect to c) is ≤ |H| e^(−εm)
• Proof:
  – Let h_1, ..., h_k be the hypotheses in H with true error > ε w.r.t. c
  – For VS_{H,D} to fail to be ε-exhausted, some h_i must be in VS_{H,D}
    ∗ I.e., h_i is consistent with all m training examples
    ∗ Probability ≤ (1 − ε)^m
  – Probability that at least one of the h_i is in VS_{H,D} is ≤ k(1 − ε)^m
  – Since k ≤ |H|, k(1 − ε)^m ≤ |H|(1 − ε)^m
  – Since (1 − ε) ≤ e^(−ε), |H|(1 − ε)^m ≤ |H| e^(−εm) ✷
• Result:
  – Want |H| e^(−εm) ≤ δ
    ∗ Sample complexity: m ≥ (1/ε)(ln |H| + ln(1/δ))
  – Given this many training examples, any consistent learner will output a hypothesis that is probably approximately correct
    ∗ Typically overestimates sample complexity due to the |H| term

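The resulting bound is easy to turn into a calculator. The sketch below is a direct transcription of m ≥ (1/ε)(ln |H| + ln(1/δ)); the example values are illustrative, not from the slides.

```python
from math import ceil, log

def sample_complexity(H_size, eps, delta):
    """Smallest integer m satisfying m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return ceil((1 / eps) * (log(H_size) + log(1 / delta)))

# Illustrative values: |H| = 1000, eps = 0.1, delta = 0.05
print(sample_complexity(1000, eps=0.1, delta=0.05))   # 100
```
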
Agnostic Learner

• Finds the hypothesis with minimum training error when c ∉ H
• For a single hypothesis, the probability that the true error exceeds the training error by more than ε (Hoeffding bound):
  Pr[error_D(h) > error_train(h) + ε] ≤ e^(−2mε²)
• Over all hypotheses in H:
  Pr[(∃h ∈ H)(error_D(h) > error_train(h) + ε)] ≤ |H| e^(−2mε²)
• Letting this probability be δ
  – m ≥ (1/(2ε²))(ln |H| + ln(1/δ))
  – m grows with the square of 1/ε instead of linearly as before

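The agnostic bound can be transcribed the same way; again the example values are only illustrative.

```python
from math import ceil, log

def agnostic_sample_complexity(H_size, eps, delta):
    """Smallest integer m satisfying m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))."""
    return ceil((1 / (2 * eps ** 2)) * (log(H_size) + log(1 / delta)))

# Same illustrative values as before; the 1/eps^2 dependence dominates
print(agnostic_sample_complexity(1000, eps=0.1, delta=0.05))   # 496 (vs. 100 above)
```
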
Example: C = conjunctions of boolean literals (a or ¬a)

• Is C PAC-learnable?
  – Show a polynomial number of training examples suffices for any c ∈ C
  – Design a consistent learner using polynomial time per training example
• |H| = 3^n for n boolean attributes (each attribute appears positively, negatively, or not at all)
  – m ≥ (1/ε)(n ln 3 + ln(1/δ))
  – E.g., n = 10, δ = 0.05, ε = 0.1: m ≥ 140
  – E.g., n = 10, δ = 0.01, ε = 0.01: m ≥ 1560
• The Find-S algorithm is a consistent, polynomial-time learner (see the sketch below)
• Thus C is PAC-learnable by Find-S with H = C

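For concreteness, here is a sketch of Find-S specialized to this hypothesis space (the training data and the target a0 ∧ ¬a2 are made up for illustration): start with the most specific hypothesis containing every literal, and on each positive example drop the literals it violates; negative examples are ignored.

```python
def find_s(examples, n):
    """Find-S for conjunctions of boolean literals over n attributes.
    A hypothesis is a set of literals; (i, True) means a_i, (i, False) means not a_i."""
    h = {(i, v) for i in range(n) for v in (True, False)}    # most specific hypothesis
    for x, label in examples:
        if label:                                            # ignore negative examples
            h = {(i, v) for (i, v) in h if bool(x[i]) == v}  # drop violated literals
    return h

def predict(h, x):
    return all(bool(x[i]) == v for (i, v) in h)

# Hypothetical training data for the target concept a0 AND NOT a2, n = 3
train = [((1, 0, 0), True), ((1, 1, 0), True), ((0, 1, 0), False)]
h = find_s(train, 3)
print(sorted(h))                 # [(0, True), (2, False)]
print(predict(h, (1, 0, 0)))     # True
print(predict(h, (1, 0, 1)))     # False
```

Each training example is processed in time linear in n, which gives the polynomial time per example the argument needs.
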
How About EnjoySport?

m ≥ (1/ε)(ln |H| + ln(1/δ))

If H is as given in EnjoySport, then |H| = 973, and

m ≥ (1/ε)(ln 973 + ln(1/δ))

If we want to assure that with probability 95% the version space contains only hypotheses with error_D(h) ≤ 0.1, then it is sufficient to have m examples, where

m ≥ (1/0.1)(ln 973 + ln(1/0.05))
m ≥ 10(ln 973 + ln 20)
m ≥ 10(6.88 + 3.00)
m ≥ 98.8

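A two-line check of the arithmetic above:

```python
from math import log

print(10 * (log(973) + log(20)))   # ≈ 98.76, so m = 99 examples suffice
```
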
PAC-Learnability of Other Concept Classes

• Unbiased concept class: |C| = 2^|X|
  – E.g., for n boolean attributes, |X| = 2^n
  – If H = C, then |H| = 2^(2^n)
  – m ≥ (1/ε)(2^n ln 2 + ln(1/δ))
    ∗ Exponential in n ⇒ not PAC-learnable
    ∗ Can be proven that m = Θ(2^n)
• k-term DNF
  – Concepts of the form T_1 ∨ T_2 ∨ ... ∨ T_k
    ∗ Each T_i is a conjunction of literals over the n boolean attributes
  – |H| = (3^n)^k = 3^(nk)
    ∗ Overestimate: counts cases where T_i = T_j or T_i is more general than T_j
  – m ≥ (1/ε)(nk ln 3 + ln(1/δ))
  – However, learning k-term DNF is NP-hard
  – Thus, not PAC-learnable when H = k-term DNF, but ...
• k-CNF
  – Concepts of the form T_1 ∧ ... ∧ T_j for arbitrarily large j
    ∗ Each T_i is a disjunction of at most k literals
  – k-CNF has a polynomial-time learner and polynomial sample complexity
  – Thus H = k-CNF is PAC-learnable
  – Since any k-term DNF can be written as a k-CNF, k-term DNF concepts are PAC-learnable using H = k-CNF

Sample Complexity for Infinite Hypothesis Spaces

• Weaknesses of the above result
  – Weak bound (typically an overestimate)
  – Inapplicable for infinite H
• Consider a second measure of the complexity of H (other than |H|)
  – Vapnik-Chervonenkis (VC) dimension of H, VC(H)
  – Gives a tighter bound than the one above
  – Finite for some infinite H's

Shattering a Set of Instances

• Measures the number of distinct instances of X that can be completely discriminated using H
• Given a sample S from X
  – There are 2^|S| possible dichotomies of S
  – I.e., 2^|S| different ways of assigning (+, −) classes to the members of S
• H shatters S if every possible dichotomy of S can be expressed by some hypothesis from H
• Definition
  – A set of instances S is shattered by hypothesis space H iff for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy

[Figure: instance space X]

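The definition can be checked by brute force in small cases. The sketch below (using threshold hypotheses h_t(x) = (x ≥ t) on the real line as an illustrative H, not taken from the slides) enumerates the labelings of S that H realizes and compares the count against 2^|S|.

```python
def shatters(H, S):
    """True iff every one of the 2^|S| dichotomies of S is realized by some h in H."""
    realized = {tuple(h(x) for x in S) for h in H}
    return len(realized) == 2 ** len(S)

# Example: threshold hypotheses h_t(x) = (x >= t) on the real line
H = [lambda x, t=t: x >= t for t in (0, 1, 2, 3)]
print(shatters(H, [1.5]))        # True: {1.5} can be labeled either + or -
print(shatters(H, [0.5, 1.5]))   # False: no h labels 0.5 positive and 1.5 negative
```
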
VC Dimension

• The ability to shatter is related to inductive bias
• An unbiased hypothesis space shatters X
• What if H can shatter only some large subset of X?
  – The larger this subset, the more expressive H is
• The VC dimension measures this expressiveness
• Definition
  – VC(H) of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞
• For any finite H, VC(H) ≤ lg |H|
  – Shattering d instances requires 2^d distinct hypotheses, so 2^d ≤ |H|, where d = VC(H)

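Building on the same idea, a brute-force computation of VC(H) over a finite set of candidate instances (again with illustrative threshold hypotheses) is shown below.

```python
from itertools import combinations

def shatters(H, S):
    return len({tuple(h(x) for x in S) for h in H}) == 2 ** len(S)

def vc_dimension(H, X):
    """Size of the largest subset of the finite instance set X shattered by H."""
    d = 0
    for k in range(1, len(X) + 1):
        if any(shatters(H, list(S)) for S in combinations(X, k)):
            d = k
    return d

# Thresholds on the real line shatter any single point but no pair of points
H = [lambda x, t=t: x >= t for t in (0, 1, 2, 3)]
print(vc_dimension(H, [0.5, 1.5, 2.5]))   # 1
# Sanity check against the bound VC(H) <= lg|H|: here lg(4) = 2 >= 1
```
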