Computational Learning Theory: Probably Approximately Correct (PAC) Learning
Machine Learning
Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell, and others
Computational Learning Theory
• The Theory of Generalization
• Probably Approximately Correct (PAC) learning
• Positive and negative learnability results
• Agnostic Learning
• Shattering and the VC dimension
This section
1. Define the PAC model of learning
2. Make formal connections to the principle of Occam’s razor
Recall: The setup
• Instance Space: X, the set of examples
• Concept Space: C, the set of possible target functions; f ∈ C is the hidden target function
  – E.g.: all n-conjunctions; all n-dimensional linear functions, …
• Hypothesis Space: H, the set of possible hypotheses
  – This is the set that the learning algorithm explores
• Training instances: S × {−1, 1}: positive and negative examples of the target concept (S is a finite subset of X)
  – Training instances are generated by a fixed unknown probability distribution D over X
• What we want: A hypothesis h ∈ H such that h(x) = f(x)
  – Evaluate h on subsequent examples x ∈ X drawn according to D
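To make this setup concrete, here is a minimal sketch (not from the slides): it instantiates X as boolean vectors of length n, the hidden target f as a monotone conjunction, D as the uniform distribution, and S as a labeled sample drawn i.i.d. from D. The variable names, the choice of conjunctions, and the specific target are illustrative assumptions.

```python
import random

n = 10                     # instance length: X = {0, 1}^n
target_vars = {0, 3, 7}    # hidden target f: the conjunction x0 AND x3 AND x7 (assumed for illustration)

def f(x):
    """Hidden target function f, drawn from the concept class C of monotone conjunctions."""
    return +1 if all(x[i] == 1 for i in target_vars) else -1

def draw_example(rng):
    """Draw one instance x from D (here: uniform over {0,1}^n) and label it with f."""
    x = tuple(rng.randint(0, 1) for _ in range(n))
    return x, f(x)

# Training set S: m labeled examples drawn i.i.d. from D
rng = random.Random(0)
m = 200
S = [draw_example(rng) for _ in range(m)]
```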
Formulating the theory of prediction
All the notation we have so far on one slide. In the general case, we have
– X: instance space; Y: output space = {+1, −1}
– D: an unknown distribution over X
– f: an unknown target function X → Y, taken from a concept class C
– h: a hypothesis function X → Y that the learning algorithm selects from a hypothesis class H
– S: a set of m training examples drawn from D, labeled with f
– err_D(h): the true error of any hypothesis h
– err_S(h): the empirical error (training error, observed error) of h
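For reference, the two error measures can be written out explicitly (standard definitions, consistent with err_D(h) = Pr_D[f(x) ≠ h(x)] used later in the deck):

```latex
\[
\mathrm{err}_D(h) \;=\; \Pr_{x \sim D}\big[h(x) \neq f(x)\big],
\qquad
\mathrm{err}_S(h) \;=\; \frac{1}{m} \sum_{(x_i, y_i) \in S} \mathbb{1}\big[h(x_i) \neq y_i\big].
\]
```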
Theoretical questions
• Can we describe or bound the true error (err_D) given the empirical error (err_S)?
• Is a concept class C learnable?
• Is it possible to learn C using only the functions in H, under the supervised learning protocol?
• How many examples does an algorithm need to guarantee good performance?
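One standard handle on the first question (not stated on this slide, but a fact worth keeping in mind for the bounds that follow) is Hoeffding's inequality: for a single fixed hypothesis h and m i.i.d. training examples,

```latex
\[
\Pr_{S \sim D^m}\Big[\, \big|\mathrm{err}_S(h) - \mathrm{err}_D(h)\big| > \epsilon \,\Big]
\;\le\; 2\, e^{-2 m \epsilon^2}.
\]
```

Turning this per-hypothesis statement into a guarantee about the hypothesis the learner actually selects is exactly what the PAC framework below formalizes.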
Expectations of learning
• We cannot expect a learner to learn a concept exactly
  – There will generally be multiple concepts consistent with the available data (which represent a small fraction of the instance space)
  – Unseen examples could potentially have any label
  – Let’s “agree” to misclassify uncommon examples that do not show up in the training set
• We cannot always expect to learn a close approximation to the target concept
  – Sometimes (hopefully only rarely) the training set will not be representative (will contain uncommon examples)
Probably Approximately Correct learning
The only realistic expectation of a good learner is that, with high probability, it will learn a close approximation to the target concept.
• In Probably Approximately Correct (PAC) learning, one requires that
  – given small parameters ε and δ,
  – with probability at least 1 − δ, the learner produces a hypothesis with error at most ε
• The only reason we can hope for this is the consistent distribution assumption: training and future examples are drawn from the same fixed distribution D
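Written out in the notation above (a standard formalization of this requirement), the statement is about the randomness of the training sample:

```latex
\[
\Pr_{S \sim D^m}\Big[\, \mathrm{err}_D(h_S) \le \epsilon \,\Big] \;\ge\; 1 - \delta,
\]
```

where h_S ∈ H denotes the hypothesis the learner outputs when trained on S. The "approximately correct" part is err_D(h_S) ≤ ε; the "probably" part is the 1 − δ.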
PAC Learnability
Consider a concept class C defined over an instance space X (containing instances of length n), and a learner L using a hypothesis space H.
The concept class C is PAC learnable by L using H if, for all f ∈ C, for all distributions D over X, and for fixed 0 < ε, δ < 1: given m examples sampled independently according to D, with probability at least (1 − δ), the algorithm L produces a hypothesis h ∈ H that has error at most ε, where m is polynomial in 1/ε, 1/δ, n and size(H).
(Recall that err_D(h) = Pr_D[f(x) ≠ h(x)].)
The concept class C is efficiently learnable if L can produce the hypothesis in time that is polynomial in 1/ε, 1/δ, n and size(H).
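As a concrete, illustrative instance of this definition (a sketch, not from the slides), the code below continues the earlier conjunction example. The elimination algorithm is a standard PAC learner for monotone conjunctions; the Monte Carlo estimate of err_D(h) on fresh samples is only a stand-in for the true error, used to visualize the "with probability at least 1 − δ, error at most ε" guarantee. The parameter values and helper names are assumptions chosen for the demo.

```python
import random

n = 10
target_vars = {0, 3, 7}          # hidden target conjunction (illustrative assumption)

def f(x):
    return +1 if all(x[i] == 1 for i in target_vars) else -1

def draw_example(rng):
    x = tuple(rng.randint(0, 1) for _ in range(n))
    return x, f(x)

def learn_conjunction(S):
    """Elimination algorithm: start with all n variables and drop any variable
    that is 0 in some positive example. Returns the set of kept variables."""
    kept = set(range(n))
    for x, y in S:
        if y == +1:
            kept -= {i for i in kept if x[i] == 0}
    return kept

def h_of(kept):
    """The learned hypothesis: the conjunction of the kept variables."""
    return lambda x: +1 if all(x[i] == 1 for i in kept) else -1

def estimate_error(h, rng, trials=5000):
    """Monte Carlo estimate of err_D(h) using fresh examples drawn from D."""
    return sum(h(x) != y for x, y in (draw_example(rng) for _ in range(trials))) / trials

# Repeat the whole experiment many times: with m training examples,
# how often does the learned hypothesis have (estimated) error above epsilon?
epsilon, m, runs = 0.05, 300, 100
rng = random.Random(0)
failures = 0
for _ in range(runs):
    S = [draw_example(rng) for _ in range(m)]
    h = h_of(learn_conjunction(S))
    if estimate_error(h, rng) > epsilon:
        failures += 1
print(f"fraction of runs with estimated error > {epsilon}: {failures / runs}")
```

For consistent learners over a finite hypothesis space, the familiar Occam-style bound m ≥ (1/ε)(ln|H| + ln(1/δ)) gives one concrete polynomial sample size of the kind this definition requires; that connection to Occam's razor is the second goal announced for this section.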