  1. Computational Learning Theory: Occam’s Razor Machine Learning 1 Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

  2. This lecture: Computational Learning Theory • The Theory of Generalization • Probably Approximately Correct (PAC) learning • Positive and negative learnability results • Agnostic Learning • Shattering and the VC dimension

  3. Where are we? • The Theory of Generalization • Probably Approximately Correct (PAC) learning • Positive and negative learnability results • Agnostic Learning • Shattering and the VC dimension

  4. This section 1. Define the PAC model of learning 2. Make formal connections to the principle of Occam’s razor

  5. This section ✓ Define the PAC model of learning 2. Make formal connections to the principle of Occam’s razor

  6. Occam’s Razor: Named after William of Occam (AD 1300s). Prefer simpler explanations over more complex ones. “Numquam ponenda est pluralitas sine necessitate” (Never posit plurality without necessity.) Historically, a widely prevalent idea across different schools of philosophy.

  7-8. Why would a consistent learner fail? Consistent learner: Suppose we have a learner that produces a hypothesis that is consistent with a training set… but the training set is not a representative sample of the instance space. Then the hypothesis we learned could be bad even though it is consistent with the entire training set. We can try to 1. quantify the probability of such a bad situation occurring, and 2. ask what it will take for this probability to be low, i.e., how to bound it.
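
As a rough illustrative sketch of step 1 (not from the original slides; the parameters below are made up), the following Python snippet estimates by simulation how often a single fixed bad hypothesis, one whose true error is ε, happens to agree with the target on all of m independently drawn examples. The estimate matches the exact value (1 − ε)^m, which is the quantity bounded on the next slides.

    import random

    def prob_bad_hypothesis_looks_consistent(eps=0.2, m=20, trials=100_000):
        # A hypothesis with true error eps disagrees with the target label on each
        # i.i.d. example independently with probability eps. Count how often it
        # agrees with all m training examples and so "looks consistent".
        consistent_runs = sum(
            all(random.random() > eps for _ in range(m)) for _ in range(trials)
        )
        return consistent_runs / trials

    print(prob_bad_hypothesis_looks_consistent())  # Monte Carlo estimate, ~0.0115
    print((1 - 0.2) ** 20)                         # exact value (1 - eps)^m, ~0.0115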

  9-15. Towards formalizing Occam’s Razor (assuming consistency)
  Claim: The probability that there is a hypothesis h ∈ H that
  1. is consistent with m examples, and
  2. has Err_D(h) > ε
  (that is, a hypothesis that is consistent yet bad) is less than |H|(1 − ε)^m.
  Proof: Let h be such a bad hypothesis, with error greater than ε. The probability that h is consistent with one example is Pr[f(x) = h(x)] < 1 − ε. The training set consists of m examples drawn independently, so the probability that h is consistent with all m examples is less than (1 − ε)^m. By the union bound, the probability that some bad hypothesis in H is consistent with m examples is less than |H|(1 − ε)^m.
  Union bound: For a set of events, the probability that at least one of them happens is at most the sum of the probabilities of the individual events.
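
To get a feel for the numbers in this claim, here is a small illustrative check (the class size and parameters are invented for the example, not taken from the slides): with |H| = 1000 hypotheses, ε = 0.1, and m = 100 examples, the claim says the chance that any ε-bad hypothesis is still consistent with the whole training set is below 1000 × 0.9^100 ≈ 0.027.

    # Evaluating the bound |H| * (1 - eps)^m for illustrative values of |H|, eps, m.
    from math import exp

    H_size, eps, m = 1000, 0.1, 100
    print(H_size * (1 - eps) ** m)  # ~0.0266: upper bound on P(some bad h looks consistent)
    print(H_size * exp(-eps * m))   # ~0.0454: looser bound via 1 - eps <= e^(-eps),
                                    # the relaxation the next slides use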

  16-20. Occam’s Razor
  The probability that there is a hypothesis h ∈ H that
  1. is consistent with m examples, and
  2. has Err_D(h) > ε
  is less than |H|(1 − ε)^m. This situation is a bad one; let us see what we need to do to ensure that it is rare.
  We want to make this probability small, say smaller than δ:
  |H|(1 − ε)^m < δ
  ln |H| + m ln(1 − ε) < ln δ
  We know that e^(−x) = 1 − x + x^2/2! − x^3/3! + ⋯ > 1 − x, so ln(1 − ε) < −ε. Let’s use ln(1 − ε) < −ε to get a safer (more conservative) δ. If δ is small, then the probability that there is a consistent yet bad hypothesis is also small (because of this inequality).
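
Solving the relaxed inequality |H| e^(−εm) < δ for m gives the standard Occam’s-razor sample-size requirement m > (1/ε)(ln|H| + ln(1/δ)). That step is not spelled out on the slides above, so the helper below is only an illustrative sketch of it (the function name and the example numbers are mine, not from the slides).

    from math import ceil, log

    def occam_sample_size(H_size, eps, delta):
        # Smallest integer m such that |H| * exp(-eps * m) <= delta,
        # i.e. m >= (1/eps) * (ln|H| + ln(1/delta)).
        return ceil((log(H_size) + log(1 / delta)) / eps)

    # Illustrative use: 1000 hypotheses, error tolerance 10%, failure probability 5%
    print(occam_sample_size(1000, eps=0.1, delta=0.05))  # 100 examples suffice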
