Learning Theory
CE-717: Machine Learning, Sharif University of Technology
M. Soleymani, Fall 2016
Topics
- Feasibility of learning
- PAC learning
- VC dimension
- Structural Risk Minimization (SRM)
Feasibility of learning
- Does the training set D tell us anything outside of D?
- D does not tell us anything certain about f outside of D.
- However, it can tell us something likely about f outside of D.
- Probability is what makes a theory of learning possible.
Feasibility of learning
Learning splits into two questions:
- Can we make sure that E_true(g) is close to E_train(g)?
- Can we make E_train(g) small enough?
Generalizability of learning
- Generalization error is what we ultimately care about.
- Why should doing well on the training set tell us anything about generalization error?
- Can we relate the error on the training set to the generalization error?
- Under which conditions can we actually prove that learning algorithms will work well?
A related example: a bin of marbles
- Pr[picking a red marble] = ν, Pr[picking a green marble] = 1 − ν
- The value of ν is unknown to us.
- We pick N marbles independently.
- The fraction of red marbles in the sample is ν̂.
Does ν̂ say anything about ν?
- No: the sample can be mostly green while the bin is mostly red.
- Yes: the sample frequency ν̂ is likely close to the bin frequency ν.
What does ν̂ say about ν?
- In a big sample (large N), ν̂ is probably close to ν (within ε):
  Pr[|ν̂ − ν| > ε] ≤ 2e^(−2ε²N)   (Hoeffding's inequality)
- Valid for all N and ε.
- The bound does not depend on ν.
- Tradeoff between N, ε, and the bound.
- In other words, "ν̂ = ν" is Probably Approximately Correct (PAC).
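The inequality above can be checked numerically. A minimal Monte-Carlo sketch for the marble example, with assumed toy values ν = 0.6, N = 100, ε = 0.1 (none of these numbers come from the slides):

```python
import math
import random

# Assumed toy setting: true red fraction nu = 0.6, sample size N = 100, tolerance eps = 0.1.
random.seed(0)
nu, N, eps, trials = 0.6, 100, 0.1, 20000

violations = 0
for _ in range(trials):
    # Draw N marbles i.i.d. and measure the sample red fraction nu_hat.
    nu_hat = sum(random.random() < nu for _ in range(N)) / N
    if abs(nu_hat - nu) > eps:
        violations += 1

empirical = violations / trials            # observed Pr[|nu_hat - nu| > eps]
bound = 2 * math.exp(-2 * eps**2 * N)      # Hoeffding bound: 2e^{-2 eps^2 N} ~ 0.27
```

The observed violation frequency (a few percent here) sits well below the bound, which is expected: Hoeffding is distribution-free and therefore often loose.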
Recall: learning diagram
- Training set D: (x⁽¹⁾, y⁽¹⁾), …, (x⁽ᴺ⁾, y⁽ᴺ⁾); the goal is g ≈ f.
- We assume that some random process proposes instances and a teacher labels them (i.e., instances are drawn i.i.d. from a distribution P(x)).
[Y.S. Abu-Mostafa et al., "Learning From Data", 2012]
Learning: problem settings
- Set of all instances X
- Set of hypotheses H
- Set of possible target functions C = {f : X → Y}
- Sequence of N training instances D = {(x⁽ⁿ⁾, f(x⁽ⁿ⁾))}, n = 1, …, N
  - Each x is drawn at random from an unknown distribution P(x).
  - The teacher provides the noise-free label f(x).
- The learner observes a set of training examples for the target function f and outputs a hypothesis h ∈ H estimating f.
Connection of Hoeffding's inequality to learning
- In the bin example, the unknown is a number ν.
- In the learning problem, the unknown is a function f : X → Y.
Two notions of error
- Training error of h: how often h(x) ≠ f(x) on the training instances
  E_train(h) ≡ (1/N) Σ_{x ∈ D} I[h(x) ≠ f(x)]
- True error of h: how often h(x) ≠ f(x) on future instances drawn at random from the probability distribution P(x)
  E_true(h) ≡ E_{x~P(x)} I[h(x) ≠ f(x)]
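The two definitions can be made concrete with a tiny assumed setup (x uniform on {0, …, 9}, target f(x) = [x ≥ 5], fixed hypothesis h(x) = [x ≥ 6]; all of these are illustrative choices, not from the slides):

```python
import random

# Hypothetical target and hypothesis over a 10-point instance space.
f = lambda x: x >= 5   # target function
h = lambda x: x >= 6   # a fixed hypothesis; it disagrees with f only at x = 5

# E_true: exact expectation under the uniform distribution P(x) on {0,...,9}.
E_true = sum(h(x) != f(x) for x in range(10)) / 10   # = 1/10

# E_train: average disagreement on a finite i.i.d. training sample.
random.seed(1)
train = [random.randrange(10) for _ in range(20)]
E_train = sum(h(x) != f(x) for x in train) / len(train)
```

E_true is fixed at 0.1 here, while E_train fluctuates with the sample; the theory in the following slides quantifies how far apart the two can be.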
Notation for learning
- Both ν and ν̂ depend on which hypothesis h we consider:
  - ν̂ is the "in-sample" error, denoted E_train(h).
  - ν is the "out-of-sample" error, denoted E_true(h).
- The Hoeffding inequality becomes:
  Pr[|E_train(h) − E_true(h)| > ε] ≤ 2e^(−2ε²N)
Are we done?
- We cannot use this bound for the g learned from the data.
- Indeed, h is assumed fixed in this inequality, and for this fixed h, E_train(h) generalizes to E_true(h).
- This is "verification" of h, not learning.
- In learning we must choose among multiple h's: g is not fixed in advance but is selected according to the samples.
Hypothesis space as multiple bins
- Generalizing the bin model to more than one hypothesis: each hypothesis h_i gets its own bin.
Hypothesis space: coin example
- Question: if you toss a fair coin 10 times, what is the probability that it comes up heads all 10 times?
  Answer: (1/2)^10 ≈ 0.1%
- Question: if you toss 1000 fair coins 10 times each, what is the probability that some coin comes up heads all 10 times?
  Answer: 1 − (1 − 2^(−10))^1000 ≈ 62%
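Both answers follow from elementary probability and need no simulation; a quick check:

```python
# One fair coin: probability of 10 heads in 10 independent tosses.
p_one = 0.5 ** 10                        # = 1/1024, about 0.098%

# 1000 fair coins: probability that at least one of them gets 10 heads.
# Complement rule: 1 - Pr[no coin gets 10 heads].
p_some = 1 - (1 - p_one) ** 1000         # about 0.62
```

The point of the example: with many hypotheses (coins), some hypothesis will look perfect on the sample purely by chance, which is why the single-h Hoeffding bound is not enough.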
A bound for the learning problem: using Hoeffding's inequality
  Pr[|E_true(g) − E_train(g)| > ε]
    ≤ Pr[ |E_true(h₁) − E_train(h₁)| > ε or |E_true(h₂) − E_train(h₂)| > ε or … or |E_true(h_M) − E_train(h_M)| > ε ]
    ≤ Σ_{i=1}^{M} Pr[|E_true(h_i) − E_train(h_i)| > ε]      (union bound)
    ≤ Σ_{i=1}^{M} 2e^(−2ε²N)
    = 2|H| e^(−2ε²N)        where |H| = M
PAC bound: using Hoeffding's inequality
  Pr[|E_true(h) − E_train(h)| > ε] ≤ 2|H| e^(−2ε²N) = δ
  ⇒ Pr[|E_true(h) − E_train(h)| ≤ ε] ≥ 1 − δ
- With probability at least 1 − δ, every h ∈ H satisfies
  E_true(h) < E_train(h) + sqrt( (ln(2|H|) + ln(1/δ)) / (2N) )
- Thus we can bound E_true(h) − E_train(h), which measures the amount of overfitting.
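The bound above is easy to evaluate numerically. A sketch with assumed toy numbers (|H| = 1000, N = 10000, δ = 0.05, chosen only for illustration):

```python
import math

def generalization_gap(H_size, N, delta):
    """Width of the PAC bound: sqrt((ln(2|H|) + ln(1/delta)) / (2N))."""
    return math.sqrt((math.log(2 * H_size) + math.log(1 / delta)) / (2 * N))

# With probability >= 0.95, E_true(h) exceeds E_train(h) by less than ~0.023.
gap = generalization_gap(1000, 10000, 0.05)
```

Note the logarithmic dependence on |H| and 1/δ versus the 1/sqrt(N) dependence on the sample size: doubling |H| barely widens the gap, while halving it requires roughly four times as much data.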
Sample complexity
- How many training examples suffice? Given ε and δ, the bound yields the sample complexity
  N ≥ (1 / (2ε²)) (ln(2|H|) + ln(1/δ))
- Thus, we have a theory that relates:
  - the number of training examples,
  - the complexity of the hypothesis space,
  - the accuracy to which the target function is approximated,
  - the probability that the learner outputs a successful hypothesis.
Another problem setting
- Finite number of possible hypotheses (e.g., decision trees of depth at most d₀)
- A learner finds a hypothesis h that is consistent with the training data: E_train(h) = 0
- What is the probability that the true error of h is more than ε, i.e., E_true(h) ≥ ε?
True error of a hypothesis
- Target f(x); the true error of h is the probability that it misclassifies an example drawn at random from P(x):
  E_true(h) ≡ E_{x~P(x)} I[h(x) ≠ f(x)]
How likely is a consistent learner to pick a bad hypothesis?
- Bound on the probability that any consistent learner will output an h with E_true(h) > ε.
- Theorem [Haussler, 1988]: for target concept f, for all 0 ≤ ε ≤ 1, if H is finite and D contains N ≥ 1 independent random samples, then
  Pr[∃h ∈ H : E_train(h) = 0 ∧ E_true(h) > ε] ≤ |H| e^(−εN)
Haussler bound: proof
- What does the theorem mean?
  Pr[∃h ∈ H : E_train(h) = 0 ∧ E_true(h) > ε] ≤ |H| e^(−εN)
- For a fixed h, how likely is a bad hypothesis (i.e., one with E_true(h) > ε) to label N training data points correctly?
  Pr[h labels one data point correctly | E_true(h) > ε] ≤ 1 − ε
  Pr[h labels N i.i.d. data points correctly | E_true(h) > ε] ≤ (1 − ε)^N
Haussler bound: proof (cont'd)
- There may be many bad hypotheses h₁, …, h_k (i.e., E_true(h₁) > ε, …, E_true(h_k) > ε) that are consistent with the N training data: E_train(h₁) = 0, …, E_train(h_k) = 0.
- How likely is the learner to pick a bad hypothesis (E_true(h) > ε) among the consistent ones {h₁, …, h_k}?
  Pr[∃h ∈ H : E_true(h) > ε ∧ E_train(h) = 0]
    = Pr[(E_true(h₁) > ε ∧ E_train(h₁) = 0) or … or (E_true(h_k) > ε ∧ E_train(h_k) = 0)]
    ≤ Σ_{i=1}^{k} Pr[E_train(h_i) = 0 ∧ E_true(h_i) > ε]     [P(A ∪ B) ≤ P(A) + P(B)]
    ≤ Σ_{i=1}^{k} Pr[E_train(h_i) = 0 | E_true(h_i) > ε]
    ≤ Σ_{i=1}^{k} (1 − ε)^N
    ≤ |H| (1 − ε)^N          [k ≤ |H|]
    ≤ |H| e^(−εN)            [1 − ε ≤ e^(−ε) for 0 ≤ ε ≤ 1]
Haussler PAC bound
- Theorem [Haussler '88]: consider a finite hypothesis space H and a training set D with N i.i.d. samples, 0 < ε < 1:
  Pr[∃h ∈ H : E_train(h) = 0 ∧ E_true(h) > ε] ≤ |H| e^(−εN) ≤ δ
- Suppose we want this probability to be at most δ. Then for any learned hypothesis h ∈ H that is consistent on the training set (i.e., E_train(h) = 0), with probability at least 1 − δ:
  E_true(h) ≤ (ln|H| + ln(1/δ)) / N
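Setting |H| e^(−εN) ≤ δ and solving for N gives the sample complexity in the consistent (realizable) case. A sketch with the same assumed toy numbers as before (|H| = 1000, ε = 0.05, δ = 0.05):

```python
import math

def haussler_sample_complexity(H_size, eps, delta):
    """Smallest integer N with N >= (1/eps) (ln|H| + ln(1/delta)),
    obtained from |H| e^{-eps N} <= delta."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# About 199 samples suffice here.
N = haussler_sample_complexity(1000, 0.05, 0.05)
```

Compare with the agnostic Hoeffding-based bound on the earlier slide (roughly 2000 samples for the same numbers): requiring a consistent hypothesis improves the dependence on ε from 1/ε² to 1/ε.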