
Learning Theory, CE-717: Machine Learning, Sharif University of Technology - PowerPoint PPT Presentation



  1. Learning Theory CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016

  2. Topics
     - Feasibility of learning
     - PAC learning
     - VC dimension
     - Structural Risk Minimization (SRM)

  3. Feasibility of learning
     - Does the training set D tell us anything outside of D?
     - D does not tell us anything certain about f outside of D.
     - However, it can tell us something likely about f outside of D.
     - Probability helps us build a theory of learning.

  4. Feasibility of learning
     - Learning comes down to these two questions:
       - Can we make sure E_true(g) is close to E_train(g)?
       - Can we make E_train(g) small enough?

  5. Generalizability of Learning
     - Generalization error is what matters to us.
     - Why should doing well on the training set tell us anything about generalization error?
     - Can we relate error on the training set to generalization error?
     - What are the conditions under which we can actually prove that learning algorithms will work well?

  6. A related example
     - Pr[picking a red marble] = μ
     - Pr[picking a green marble] = 1 − μ
     - The value of μ is unknown to us.
     - We pick N marbles independently.
     - The fraction of red marbles in the sample = ν.

  7. Does ν say anything about μ?
     - No: the sample can be mostly green while the bin is mostly red.
     - Yes: the sample frequency ν is likely close to the bin frequency μ.

  8. What does ν say about μ?
     - In a big sample (large N), ν is probably close to μ (within ε):
          Pr[ |ν − μ| > ε ] ≤ 2 exp(−2 ε² N)     (Hoeffding's inequality)
     - Valid for all N and ε.
     - The bound does not depend on μ.
     - Tradeoff between N, ε, and the bound.
     - In other words, "μ = ν" is Probably Approximately Correct (PAC).
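
A quick simulation can make Hoeffding's inequality concrete. The sketch below is my own illustration, not from the slides; the bin frequency mu = 0.6, sample size N, tolerance eps, and trial count are arbitrary choices. It estimates Pr[|ν − μ| > ε] by repeated sampling and compares it with the bound 2 exp(−2ε²N).

    import math
    import random

    def hoeffding_check(mu=0.6, N=100, eps=0.1, trials=20000, seed=0):
        """Estimate Pr[|nu - mu| > eps] by simulation and compare it with Hoeffding's bound."""
        rng = random.Random(seed)
        violations = 0
        for _ in range(trials):
            # draw N marbles; nu is the sample frequency of red marbles
            nu = sum(rng.random() < mu for _ in range(N)) / N
            if abs(nu - mu) > eps:
                violations += 1
        return violations / trials, 2 * math.exp(-2 * eps**2 * N)

    empirical, bound = hoeffding_check()
    print(f"empirical Pr[|nu - mu| > eps] ~= {empirical:.4f}, Hoeffding bound = {bound:.4f}")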

  9. Recall: Learning diagram
     - Unlabeled instances x^(1), ..., x^(N) are labeled by the target function to give the training
       examples (x^(1), y^(1)), ..., (x^(N), y^(N)); the learned hypothesis g should approximate the
       target (f ≈ g).
     - We assume that some random process proposes instances and a teacher labels them
       (i.e., instances are drawn i.i.d. according to a distribution P(x)).
       [Y.S. Abu-Mostafa et al., "Learning From Data", 2012]

  10. Learning: Problem settings
      - Set of all instances X
      - Set of hypotheses H
      - Set of possible target functions C = {c : X → Y}
      - Sequence of N training instances D = { (x^(n), c(x^(n))) }, n = 1, ..., N
        - x is drawn at random from the unknown distribution P(x)
        - The teacher provides the noise-free label c(x) for it
      - The learner observes a set of training examples D for the target function c and outputs a
        hypothesis h ∈ H estimating c.

  11. Connection of Hoeffding inequality to learning
      - In the bin example, the unknown is μ.
      - In the learning problem, the unknown is a function c : X → Y.

  12. Two notions of error
      - Training error of h: how often h(x) ≠ c(x) on the training instances
           E_train(h) ≡ E_{x~D}[ I(h(x) ≠ c(x)) ] = (1/|D|) Σ_{x∈D} I(h(x) ≠ c(x))      (D: training data)
      - Test error of h: how often h(x) ≠ c(x) on future instances drawn at random from P(x)
           E_true(h) ≡ E_{x~P(x)}[ I(h(x) ≠ c(x)) ]                                      (P(x): probability distribution)
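
As a concrete illustration of the training-error definition, the snippet below is my own toy example (the threshold hypothesis and the five labeled points are made up for illustration); it computes the empirical 0/1 error of a hypothesis on a small dataset.

    def training_error(h, D):
        """Empirical (in-sample) 0/1 error of hypothesis h on labeled data D = [(x, label), ...]."""
        return sum(h(x) != label for x, label in D) / len(D)

    # Toy 1-D dataset and a simple threshold hypothesis (illustrative values only).
    D = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.35, 1)]
    h = lambda x: int(x > 0.5)
    print(training_error(h, D))  # 0.2: one of the five points is misclassified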

  13. Notation for learning
      - Both ν and μ depend on which hypothesis h is considered.
      - ν is the "in sample" quantity, denoted by E_train(h).
      - μ is the "out of sample" quantity, denoted by E_true(h).
      - The Hoeffding inequality becomes:
           Pr[ |E_train(h) − E_true(h)| > ε ] ≤ 2 exp(−2 ε² N)

  14. Are we done?
      - We cannot use this bound for the hypothesis g learned from the data.
      - Indeed, h is assumed to be fixed in this inequality, and for this fixed h, E_train(h)
        generalizes to E_true(h).
      - This is "verification" of h, not learning.
      - We need to choose from multiple h's; g is not fixed but is instead selected according to the samples.

  15. Hypothesis space as multiple bins
      - Generalizing the bin model to more than one hypothesis: each hypothesis h corresponds to its own bin.

  16. Hypothesis space: Coin example
      - Question: if you toss a fair coin 10 times, what is the probability that it comes up heads all 10 times?
      - Answer: ≈ 0.1%
      - Question: if you toss 1000 fair coins 10 times each, what is the probability that some of them come up
        heads all 10 times?
      - Answer: ≈ 63%
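
These two numbers follow directly from the binomial model; a two-line check (my own, not from the slides):

    p_single = 0.5 ** 10                 # one fair coin shows 10 heads in 10 tosses: ~0.000977 (about 0.1%)
    p_some = 1 - (1 - p_single) ** 1000  # at least one of 1000 independent coins does: ~0.624 (about 63%)
    print(p_single, p_some)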

  17. A bound for the learning problem: Using Hoeffding inequality
         Pr[ |E_true(g) − E_train(g)| > ε ]
           ≤ Pr[ |E_true(h_1) − E_train(h_1)| > ε  or  |E_true(h_2) − E_train(h_2)| > ε
                 or ... or  |E_true(h_M) − E_train(h_M)| > ε ]
           ≤ Σ_{i=1}^{M} Pr[ |E_true(h_i) − E_train(h_i)| > ε ]
           ≤ Σ_{i=1}^{M} 2 exp(−2 ε² N)
           = 2 |H| exp(−2 ε² N)          (where |H| = M)
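
To see how the union bound behaves with many hypotheses, the sketch below is my own illustration: it treats the M hypotheses as M independent "bins" with the same out-of-sample error mu (matching the multiple-bins picture above), with arbitrary values for mu, N, eps, M, and the trial count. It estimates the probability that some bin's sample frequency deviates from mu by more than eps and compares it with 2 M exp(−2ε²N).

    import math
    import random

    def union_bound_check(mu=0.5, N=100, eps=0.2, M=50, trials=1000, seed=1):
        """Estimate Pr[ max_i |nu_i - mu| > eps ] over M independent bins vs. the union bound."""
        rng = random.Random(seed)
        violations = 0
        for _ in range(trials):
            # worst deviation among M bins, each estimated from N draws
            worst = max(abs(sum(rng.random() < mu for _ in range(N)) / N - mu) for _ in range(M))
            if worst > eps:
                violations += 1
        return violations / trials, 2 * M * math.exp(-2 * eps**2 * N)

    print(union_bound_check())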

  18. PAC bound: Using Hoeffding inequality
         Pr[ |E_true(h) − E_train(h)| > ε ] ≤ 2 |H| exp(−2 ε² N) = δ
         ⇒ Pr[ |E_true(h) − E_train(h)| ≤ ε ] ≥ 1 − δ
      - With probability at least (1 − δ), every h satisfies:
           E_true(h) < E_train(h) + sqrt( (ln(2|H|) + ln(1/δ)) / (2N) )
      - Thus, we can bound E_true(h) − E_train(h), which quantifies the amount of overfitting.
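
A small helper (my own sketch; the example values for |H|, N, and delta are arbitrary) evaluates this generalization-gap term:

    import math

    def generalization_gap_bound(H_size, N, delta):
        """Bound on E_true(h) - E_train(h) that holds simultaneously for all h in a finite H
        with probability at least 1 - delta (two-sided Hoeffding plus the union bound)."""
        return math.sqrt((math.log(2 * H_size) + math.log(1 / delta)) / (2 * N))

    print(generalization_gap_bound(H_size=1000, N=10000, delta=0.05))  # ~0.023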

  19. Sample complexity
      - How many training examples suffice?
      - Given ε and δ, the bound yields the sample complexity:
           N ≥ (1 / (2 ε²)) ( ln(2|H|) + ln(1/δ) )
      - Thus, we have a theory that relates:
        - the number of training examples,
        - the complexity of the hypothesis space,
        - the accuracy to which the target function is approximated,
        - the probability that the learner outputs a successful hypothesis.
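
The same formula can be turned around to ask how many examples are enough; a minimal sketch (my own; the example values are arbitrary):

    import math

    def sample_complexity(H_size, eps, delta):
        """Smallest integer N satisfying N >= (1 / (2 eps^2)) * (ln(2|H|) + ln(1/delta))."""
        return math.ceil((math.log(2 * H_size) + math.log(1 / delta)) / (2 * eps**2))

    print(sample_complexity(H_size=10**6, eps=0.05, delta=0.05))  # about 3500 examples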

  20. Another problem setting
      - Finite number of possible hypotheses (e.g., decision trees of depth d_0)
      - A learner finds a hypothesis h that is consistent with the training data: E_train(h) = 0
      - What is the probability that the true error of h will be more than ε, i.e., E_true(h) ≥ ε?

  21. True error of a hypothesis
      - Target c(x)
      - True error of h: the probability that it will misclassify an example drawn at random from P(x)
           E_true(h) ≡ E_{x~P(x)}[ I(h(x) ≠ c(x)) ]

  22. How likely is a consistent learner to pick a bad hypothesis?
      - Bound on the probability that any consistent learner will output an h with E_true(h) > ε.
      - Theorem [Haussler, 1988]: For a target concept c and any ε with 0 ≤ ε ≤ 1, if H is finite and D
        contains N ≥ 1 independent random examples, then
           Pr[ ∃h ∈ H : E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |H| exp(−ε N)

  23. Haussler bound: Proof
      - What does the theorem mean?
           Pr[ ∃h ∈ H : E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |H| exp(−ε N)
      - For a fixed h, how likely is a bad hypothesis (i.e., one with E_true(h) > ε) to label N training
        data points right?
           Pr( h labels one data point correctly | E_true(h) > ε ) ≤ 1 − ε
           Pr( h labels N i.i.d. data points correctly | E_true(h) > ε ) ≤ (1 − ε)^N

  24. Haussler bound: Proof (cont'd)
      - There may be many bad hypotheses h_1, ..., h_k (i.e., E_true(h_1) > ε, ..., E_true(h_k) > ε) that
        are consistent with the N training data points: E_train(h_1) = 0, E_train(h_2) = 0, ..., E_train(h_k) = 0.
      - How likely is the learner to pick a bad hypothesis (E_true(h) > ε) among the consistent ones {h_1, ..., h_k}?
           Pr[ ∃h ∈ H : E_true(h) > ε ∧ E_train(h) = 0 ]
             = Pr[ (E_true(h_1) > ε ∧ E_train(h_1) = 0) or ... or (E_true(h_k) > ε ∧ E_train(h_k) = 0) ]
             ≤ Σ_{i=1}^{k} Pr( E_train(h_i) = 0 ∧ E_true(h_i) > ε )        [P(A ∪ B) ≤ P(A) + P(B)]
             ≤ Σ_{i=1}^{k} Pr( E_train(h_i) = 0 | E_true(h_i) > ε ) ≤ Σ_{i=1}^{k} (1 − ε)^N
             ≤ |H| (1 − ε)^N                                               [k ≤ |H|]
             ≤ |H| exp(−ε N)                                               [1 − ε ≤ exp(−ε) for 0 ≤ ε ≤ 1]
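
The last step of the proof relies on 1 − ε ≤ exp(−ε), hence (1 − ε)^N ≤ exp(−εN); a quick numerical spot check (my own, with arbitrary values of eps and N):

    import math

    N = 50
    for eps in (0.05, 0.1, 0.3, 0.9):
        lhs, rhs = (1 - eps) ** N, math.exp(-eps * N)
        print(f"eps={eps}: (1-eps)^N = {lhs:.3e} <= exp(-eps*N) = {rhs:.3e}: {lhs <= rhs}")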

  25. Haussler PAC Bound
      - Theorem [Haussler '88]: Consider a finite hypothesis space H, a training set D with N i.i.d. samples,
        and 0 < ε < 1:
           Pr[ ∃h ∈ H : E_train(h) = 0 ∧ E_true(h) > ε ] ≤ |H| exp(−ε N)
      - Suppose we want this probability to be at most δ: set |H| exp(−ε N) ≤ δ.
      - Then, for any learned hypothesis h ∈ H that is consistent on the training set D (i.e., E_train(h) = 0),
        with probability at least (1 − δ):
           E_true(h) ≤ ε,  where ε = (1/N) ( ln|H| + ln(1/δ) )
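
Solving |H| exp(−εN) = δ for ε gives the error bound in the realizable (consistent) case; a minimal sketch (my own; the example values are arbitrary):

    import math

    def haussler_error_bound(H_size, N, delta):
        """With probability >= 1 - delta, every h in a finite H that is consistent with
        N i.i.d. training examples has E_true(h) at most the returned value."""
        return (math.log(H_size) + math.log(1 / delta)) / N

    print(haussler_error_bound(H_size=10**6, N=5000, delta=0.05))  # ~0.0034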
