

SLIDE 1

Machine Learning

Computational Learning Theory: Occam's Razor

Slides based on material from Dan Roth, Avrim Blum, Tom Mitchell and others

SLIDE 2

This lecture: Computational Learning Theory

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension


SLIDE 3

Where are we?

  • The Theory of Generalization
  • Probably Approximately Correct (PAC) learning
  • Positive and negative learnability results
  • Agnostic Learning
  • Shattering and the VC dimension


SLIDE 4

This section

  • 1. Define the PAC model of learning
  • 2. Make formal connections to the principle of Occam’s razor


SLIDE 5

This section

  ✓ Define the PAC model of learning

  • 2. Make formal connections to the principle of Occam's razor

SLIDE 6

Occam's Razor

Named after William of Occam (AD 1300s)

Prefer simpler explanations over more complex ones

"Numquam ponenda est pluralitas sine necessitate"
(Never posit plurality without necessity.)

Historically, a widely prevalent idea across different schools of philosophy

SLIDES 7-8

Why would a consistent learner fail?

Consistent learner: Suppose we have a learner that produces a hypothesis that is consistent with a training set…
…but the training set is not a representative sample of the instance space.
Then the hypothesis we learned could be bad even if it is consistent with the entire training set.

We can try to:
  1. quantify the probability of such a bad situation occurring, and
  2. then ask what it will take for this probability to be low.

SLIDES 9-15

Towards formalizing Occam's Razor

Claim: The probability that there is a hypothesis h ∈ H that:
  1. is consistent with m examples, and
  2. has err_D(h) > ε
is less than |H|(1 − ε)^m.

(Such an h is the bad case we worry about: consistent with the training set, yet with true error greater than ε.)

Proof: Let h be such a bad hypothesis, i.e., one with error greater than ε.
  • The probability that h is consistent with one example is Pr_{x ~ D}[f(x) = h(x)] < 1 − ε.
  • The training set consists of m examples drawn independently, so the probability that h is consistent with all m examples is less than (1 − ε)^m.
  • The probability that some bad hypothesis in H is consistent with m examples is therefore less than |H|(1 − ε)^m, by the union bound.

Union bound: for a set of events, the probability that at least one of them happens is at most the sum of the probabilities of the individual events.
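To make the claim concrete, here is a small Monte Carlo sketch in Python. Everything in it is an illustrative assumption rather than something from the slides: a toy instance space of 4 boolean features under a uniform distribution D, the monotone conjunctions over those features as H, one particular target concept f, and the chosen ε, m, and trial count. It estimates the probability of the "consistent yet bad" event and compares it with the bound |H|(1 − ε)^m.

```python
# Monte Carlo check of the claim: Pr[some h in H is consistent with m examples
# yet has err_D(h) > eps] should stay below |H| * (1 - eps)^m.
import itertools
import random

d = 4                                             # number of boolean features (toy choice)
X = list(itertools.product([0, 1], repeat=d))     # instance space; D is uniform over X

# Hypothesis space H: all monotone conjunctions, each represented by the set of
# feature indices it requires to be 1.
H = [S for r in range(d + 1) for S in itertools.combinations(range(d), r)]

def h(S, x):
    return int(all(x[i] == 1 for i in S))

target = (0, 2)                                   # assumed target concept f = x0 AND x2
def f(x):
    return h(target, x)

def true_error(S):                                # err_D under the uniform distribution
    return sum(h(S, x) != f(x) for x in X) / len(X)

eps, m, trials = 0.2, 20, 5000
bad_event = 0
for _ in range(trials):
    sample = [random.choice(X) for _ in range(m)]           # m i.i.d. draws from D
    # Does some hypothesis with error > eps survive as consistent with the sample?
    if any(true_error(S) > eps and all(h(S, x) == f(x) for x in sample) for S in H):
        bad_event += 1

print("empirical Pr[consistent yet bad] ~", bad_event / trials)
print("bound |H| * (1 - eps)^m         =", len(H) * (1 - eps) ** m)
```

For these values the bound is about 0.18, and the empirical frequency typically comes out far below it, which is expected since the union bound is loose.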

SLIDES 16-26

Occam's Razor

The probability that there is a hypothesis h ∈ H that:
  1. is consistent with m examples, and
  2. has err_D(h) > ε
is less than |H|(1 − ε)^m.

This situation is a bad one. Let us try to see what we need to do to ensure that it is rare.

We want to make this probability small, say smaller than δ:

  |H|(1 − ε)^m < δ

If δ is small, then the probability that there is a consistent yet bad hypothesis is also small (because of this inequality).

Taking logs:

  ln|H| + m ln(1 − ε) < ln δ

We know that e^(−x) = 1 − x + x²/2! − x³/3! + … ≥ 1 − x, so ln(1 − ε) ≤ −ε. Substituting gives a safer (more demanding) condition:

  ln|H| − mε < ln δ

That is, if

  m > (1/ε)(ln|H| + ln(1/δ))

is true, then ln|H| − mε < ln δ holds, and the bad situation (a consistent yet bad hypothesis) is improbable: its probability is below δ.
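A quick numeric sanity check of the approximation step, with assumed values: since (1 − ε)^m ≤ e^(−εm), any m that makes |H|·e^(−εm) fall below δ also makes the original quantity |H|(1 − ε)^m fall below δ.

```python
# Sanity check of the step (1 - eps)^m <= e^(-eps * m), using assumed values.
import math

H_size, eps, delta = 1000, 0.1, 0.01
m = math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)   # the derived bound: m = 116

exact   = H_size * (1 - eps) ** m        # the probability bound we actually want below delta
relaxed = H_size * math.exp(-eps * m)    # the easier-to-solve upper bound on it

print(m, exact, relaxed, delta)          # 116, ~0.0049, ~0.0091, 0.01  (exact <= relaxed <= delta)
```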

SLIDES 27-33

Occam's Razor

Let H be any hypothesis space. With probability 1 − δ, a hypothesis h ∈ H that is consistent with a training set of size m will have an error < ε on future examples if

  m > (1/ε)(ln|H| + ln(1/δ))

This is called Occam's Razor because it expresses a preference towards smaller hypothesis spaces: it shows when a hypothesis that is consistent with m examples generalizes well (i.e., has error < ε). Complicated/larger hypothesis spaces are not necessarily bad, but simpler ones are unlikely to fool us by being consistent with many examples!

Reading the bound:
  1. Expecting lower error increases the sample complexity (i.e., more examples are needed for the guarantee).
  2. A larger hypothesis space makes learning harder (i.e., higher sample complexity).
  3. Wanting higher confidence in the classifier we produce also increases the sample complexity.
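The three observations can be read straight off the bound numerically. The sketch below uses illustrative values only; |H| = 3^n is the usual count for conjunctions over n boolean variables (each variable positive, negated, or absent).

```python
# How the sample-size requirement m > (1/eps) * (ln|H| + ln(1/delta)) reacts to each knob.
import math

def m_bound(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

print(m_bound(3 ** 10, eps=0.10, delta=0.05))    # 140   baseline
print(m_bound(3 ** 10, eps=0.01, delta=0.05))    # 1399  (1) lower error -> more examples
print(m_bound(3 ** 20, eps=0.10, delta=0.05))    # 250   (2) larger H -> more examples
print(m_bound(3 ** 10, eps=0.10, delta=0.001))   # 179   (3) higher confidence -> more examples
```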

SLIDE 34

Consistent Learners and Occam's Razor

From the definition, we get the following general scheme for PAC learning, given a set of m training examples:

  • Find some h ∈ H that is consistent with all m examples.
    – If m is large enough, a consistent hypothesis must be close enough to f.
    – Check that m does not have to be too large (i.e., polynomial in the relevant parameters): we showed that the "closeness" guarantee requires
        m > (1/ε)(ln|H| + ln(1/δ))
  • Show that the consistent hypothesis h ∈ H can be computed efficiently.
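As a concrete illustration of this two-step scheme, here is a minimal Python sketch for a finite hypothesis space given as a list of callables; the helper names and the numbers in the example are assumptions for illustration. Note that the brute-force search in find_consistent is only practical for small H; the last bullet above asks for something better than plain enumeration.

```python
# Minimal sketch of the consistent-learner scheme over a finite hypothesis space.
import math

def find_consistent(H, data):
    """Step 1: return some h in H that agrees with every (x, y) pair in data, or None."""
    for h in H:
        if all(h(x) == y for x, y in data):
            return h
    return None

def occam_error_guarantee(H_size, m, delta):
    """Step 2: with probability at least 1 - delta, a consistent h has error at most this value."""
    return (math.log(H_size) + math.log(1 / delta)) / m

# e.g., with |H| = 3^10 and m = 1000 consistent training examples, delta = 0.05:
print(occam_error_guarantee(3 ** 10, m=1000, delta=0.05))   # ~0.014
```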

SLIDE 35

Exercises

  • 1. We have seen the decision tree learning algorithm. Suppose our problem has n binary features. What is the size of the hypothesis space?
  • 2. Are decision trees efficiently PAC learnable?
  • 3. Are conjunctions PAC learnable? Can you think of a PAC algorithm for monotone conjunctions?