

1. The Perceptron Mistake Bound. Machine Learning. Some slides based on lectures from Dan Roth, Avrim Blum, and others.

2. Where are we?
• The Perceptron Algorithm
• Variants of Perceptron
• Perceptron Mistake Bound

3. Convergence. Convergence theorem: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge.

4. Convergence. Convergence theorem: If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the perceptron algorithm will converge. Cycling theorem: If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and enter an infinite loop.
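As an aside (not part of the deck): a minimal Python sketch of the mistake-driven perceptron whose convergence is being discussed, using the setting assumed later in the proof (learning rate 1, zero initial weight vector, no bias). The function name perceptron_train and the stopping rule (a full pass with no mistakes) are my own choices.

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Mistake-driven perceptron: learning rate 1, no bias term, weights start at zero.
    Returns the final weight vector and the total number of mistakes made."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        made_mistake = False
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified: sgn(w.x) disagrees with y
                w += yi * xi              # the perceptron update
                mistakes += 1
                made_mistake = True
        if not made_mistake:              # consistent with all the data: converged
            return w, mistakes
    return w, mistakes                    # not separable (or needs more epochs)
```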

5. Perceptron Learnability
• Obviously the Perceptron cannot learn what it cannot represent: only linearly separable functions.
• Minsky and Papert (1969) wrote an influential book demonstrating the Perceptron's representational limitations:
– Parity functions can't be learned (XOR). We have already seen that XOR is not linearly separable (a short recap follows below).
– In vision, if patterns are represented with local features, it can't represent symmetry or connectivity.
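The recap of why XOR is not linearly separable (my own summary of the standard argument, not text from the slide): suppose some linear threshold unit $\mathrm{sgn}(w_1 x_1 + w_2 x_2 + b)$ computed XOR on $\{0,1\}^2$. Then

$$
\begin{aligned}
(0,0) \mapsto -1 &\;\Rightarrow\; b < 0, \\
(1,0) \mapsto +1 &\;\Rightarrow\; w_1 + b > 0, \\
(0,1) \mapsto +1 &\;\Rightarrow\; w_2 + b > 0, \\
(1,1) \mapsto -1 &\;\Rightarrow\; w_1 + w_2 + b < 0.
\end{aligned}
$$

Adding the middle two inequalities gives $w_1 + w_2 + 2b > 0$, so $w_1 + w_2 + b > -b > 0$, contradicting the last line.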

6. Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. [Figure: positive and negative points on either side of a hyperplane, with the margin with respect to this hyperplane marked.]

7. Margin. The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it. The margin of a dataset ($\gamma$) is the maximum margin possible for that dataset using any weight vector. [Figure: the same points, with the margin of the data marked.]
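In symbols (my restatement of the two definitions above, using the notation of the theorem that follows): for a hyperplane through the origin with normal $\mathbf{w}$ that separates the data, and for the dataset as a whole,

$$
\text{margin of } \mathbf{w} \;=\; \min_i \frac{y_i\,\mathbf{w}^T\mathbf{x}_i}{\|\mathbf{w}\|},
\qquad
\gamma \;=\; \max_{\mathbf{w} \ne \mathbf{0}} \; \min_i \frac{y_i\,\mathbf{w}^T\mathbf{x}_i}{\|\mathbf{w}\|}.
$$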

8. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \mathbb{R}^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$.

9. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \mathbb{R}^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. We can always find such an $R$: just look for the farthest data point from the origin.

10. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \mathbb{R}^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \mathbb{R}^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \mathbb{R}$, $\gamma > 0$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$.

11. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \mathbb{R}^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \mathbb{R}^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \mathbb{R}$, $\gamma > 0$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$. The data has a margin $\gamma$; importantly, the data is separable. $\gamma$ is the complexity parameter that defines the separability of the data.

12. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \mathbb{R}^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \mathbb{R}^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \mathbb{R}$, $\gamma > 0$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$. Then, the perceptron algorithm will make no more than $R^2 / \gamma^2$ mistakes on the training sequence.

13. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \mathbb{R}^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$. Suppose there is a unit vector $\mathbf{u} \in \mathbb{R}^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \mathbb{R}$, $\gamma > 0$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$. Then, the perceptron algorithm will make no more than $R^2 / \gamma^2$ mistakes on the training sequence. If $\mathbf{u}$ hadn't been a unit vector, then we could scale $\gamma$ in the mistake bound; this would change the final mistake bound to $(\|\mathbf{u}\| R / \gamma)^2$.
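To spell out the rescaling remark: if $\mathbf{u}$ satisfies $y_i\,\mathbf{u}^T\mathbf{x}_i \ge \gamma$ but is not a unit vector, then the unit vector $\mathbf{u}/\|\mathbf{u}\|$ separates the data with margin $\gamma/\|\mathbf{u}\|$, and substituting this margin into the bound gives

$$
y_i \left(\frac{\mathbf{u}}{\|\mathbf{u}\|}\right)^{\!T} \mathbf{x}_i \;\ge\; \frac{\gamma}{\|\mathbf{u}\|}
\quad\Longrightarrow\quad
\#\text{mistakes} \;\le\; \left(\frac{R}{\gamma / \|\mathbf{u}\|}\right)^{2} \;=\; \left(\frac{\|\mathbf{u}\|\,R}{\gamma}\right)^{2}.
$$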

14. Mistake Bound Theorem [Novikoff 1962, Block 1962]. Let $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots$ be a sequence of training examples such that every feature vector $\mathbf{x}_i \in \mathbb{R}^n$ with $\|\mathbf{x}_i\| \le R$ and the label $y_i \in \{-1, 1\}$ (in other words, suppose we have a binary classification dataset with $n$-dimensional inputs). Suppose there is a unit vector $\mathbf{u} \in \mathbb{R}^n$ (i.e., $\|\mathbf{u}\| = 1$) such that for some positive number $\gamma \in \mathbb{R}$, $\gamma > 0$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$ for every example $(\mathbf{x}_i, y_i)$ (if the data is separable, ...). Then, the perceptron algorithm will make no more than $R^2 / \gamma^2$ mistakes on the training sequence (... then the Perceptron algorithm will find a separating hyperplane after making a finite number of mistakes).
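To connect the statement to the algorithm, one can check the bound numerically on a toy separable dataset, reusing the perceptron_train sketch given earlier. The dataset and the choice of separating direction below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Toy linearly separable data in 2D: the label is the sign of the first coordinate,
# and points too close to the boundary are removed to enforce a margin.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
X = X[np.abs(X[:, 0]) > 0.1]
y = np.sign(X[:, 0])

u = np.array([1.0, 0.0])                  # a unit vector that separates this data
R = np.linalg.norm(X, axis=1).max()       # radius of the smallest origin-centered ball
gamma = (y * (X @ u)).min()               # margin achieved by u

w, mistakes = perceptron_train(X, y)
print(f"mistakes = {mistakes}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")
# The observed mistake count should never exceed the bound.
```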

15. Proof (preliminaries). The setting:
• The update rule: receive an input $(\mathbf{x}_i, y_i)$; if $\mathrm{sgn}(\mathbf{w}_t^T \mathbf{x}_i) \ne y_i$, update $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i \mathbf{x}_i$.
• The initial weight vector $\mathbf{w}_0$ is all zeros.
• Learning rate = 1. (It effectively scales inputs, but does not change the behavior.)
• All training examples are contained in a ball of size $R$. That is, for every example $(\mathbf{x}_i, y_i)$, we have $\|\mathbf{x}_i\| \le R$.
• The training data is separable by margin $\gamma$ using a unit vector $\mathbf{u}$. That is, for every example $(\mathbf{x}_i, y_i)$, we have $y_i \mathbf{u}^T \mathbf{x}_i \ge \gamma$.

16. Proof (1/3). (Recall the update: on a mistake on $(\mathbf{x}_i, y_i)$, $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i \mathbf{x}_i$.) Claim 1: After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$.

17. Proof (1/3). Claim 1: After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. This is because the data is separable by a margin $\gamma$.

18. Proof (1/3). Claim 1: After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. This is because the data is separable by a margin $\gamma$, and because $\mathbf{w}_0 = \mathbf{0}$ (that is, $\mathbf{u}^T \mathbf{w}_0 = 0$); straightforward induction gives us $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$.
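Written out, the induction step behind Claim 1 (my reconstruction of the standard algebra; the slide's own equations did not survive the transcript): if the $t$-th mistake is made on example $(\mathbf{x}_i, y_i)$, then

$$
\mathbf{u}^T\mathbf{w}_{t}
\;=\; \mathbf{u}^T\!\left(\mathbf{w}_{t-1} + y_i\,\mathbf{x}_i\right)
\;=\; \mathbf{u}^T\mathbf{w}_{t-1} + y_i\,\mathbf{u}^T\mathbf{x}_i
\;\ge\; \mathbf{u}^T\mathbf{w}_{t-1} + \gamma,
$$

and since $\mathbf{u}^T\mathbf{w}_0 = 0$, unrolling over the $t$ mistakes gives $\mathbf{u}^T\mathbf{w}_t \ge t\gamma$.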

19. Proof (2/3). (Recall the update: on a mistake on $(\mathbf{x}_i, y_i)$, $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_i \mathbf{x}_i$.) Claim 2: After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$.

20. Proof (2/3). Claim 2: After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Note that $\|\mathbf{x}_i\| \le R$ by definition of $R$, and the weight is updated only when there is a mistake, that is, when $y_i \mathbf{w}_t^T \mathbf{x}_i < 0$.

21. Proof (2/3). Claim 2: After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Because $\mathbf{w}_0 = \mathbf{0}$ (so $\|\mathbf{w}_0\|^2 = 0$), straightforward induction gives us $\|\mathbf{w}_t\|^2 \le tR^2$.
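Similarly, the induction step behind Claim 2 (again my reconstruction of the standard algebra): if the $t$-th mistake is made on example $(\mathbf{x}_i, y_i)$, then

$$
\|\mathbf{w}_{t}\|^2
\;=\; \|\mathbf{w}_{t-1} + y_i\,\mathbf{x}_i\|^2
\;=\; \|\mathbf{w}_{t-1}\|^2 + 2\,y_i\,\mathbf{w}_{t-1}^T\mathbf{x}_i + \|\mathbf{x}_i\|^2
\;\le\; \|\mathbf{w}_{t-1}\|^2 + R^2,
$$

because an update happens only when $y_i\,\mathbf{w}_{t-1}^T\mathbf{x}_i \le 0$ and because $\|\mathbf{x}_i\| \le R$. With $\mathbf{w}_0 = \mathbf{0}$, unrolling gives $\|\mathbf{w}_t\|^2 \le tR^2$.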

22. Proof (3/3). What we know: (1) After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. (2) After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$.

23. Proof (3/3). What we know: (1) After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. (2) After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$.

24. Proof (3/3). What we know: (1) After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. (2) After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$. Also, $\mathbf{u}^T \mathbf{w}_t = \|\mathbf{u}\|\,\|\mathbf{w}_t\| \cos(\text{angle between them})$; but $\|\mathbf{u}\| = 1$ and the cosine is at most 1, so $\mathbf{u}^T \mathbf{w}_t \le \|\mathbf{w}_t\|$.

25. Proof (3/3). What we know: (1) After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. (2) After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$. Also, $\mathbf{u}^T \mathbf{w}_t = \|\mathbf{u}\|\,\|\mathbf{w}_t\| \cos(\text{angle between them})$; but $\|\mathbf{u}\| = 1$ and the cosine is at most 1, so $\mathbf{u}^T \mathbf{w}_t \le \|\mathbf{w}_t\|$ (the Cauchy-Schwarz inequality).

26. Proof (3/3). What we know: (1) After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. (2) After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. From (1): $t\gamma \le \mathbf{u}^T \mathbf{w}_t$. From (2): $\|\mathbf{w}_t\| \le \sqrt{t}\,R$. Since $\mathbf{u}^T \mathbf{w}_t = \|\mathbf{u}\|\,\|\mathbf{w}_t\| \cos(\text{angle between them})$, $\|\mathbf{u}\| = 1$, and the cosine is at most 1, we have $\mathbf{u}^T \mathbf{w}_t \le \|\mathbf{w}_t\|$. Putting these together: $t\gamma \le \mathbf{u}^T \mathbf{w}_t \le \|\mathbf{w}_t\| \le \sqrt{t}\,R$.

27. Proof (3/3). What we know: (1) After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. (2) After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Combining the two gives $t\gamma \le \sqrt{t}\,R$, where $t$ is the number of mistakes.

28. Proof (3/3). What we know: (1) After $t$ mistakes, $\mathbf{u}^T \mathbf{w}_t \ge t\gamma$. (2) After $t$ mistakes, $\|\mathbf{w}_t\|^2 \le tR^2$. Combining the two gives $t\gamma \le \sqrt{t}\,R$, where $t$ is the number of mistakes; therefore $t \le R^2/\gamma^2$, which bounds the total number of mistakes!
