The Perceptron Mistake Bound
Machine Learning
Some slides based on lectures from Dan Roth, Avrim Blum, and others
Where are we?
• The Perceptron Algorithm
• Variants of Perceptron
• Perceptron Mistake Bound
Convergence
Convergence theorem – If there exists a set of weights consistent with the data (i.e., the data is linearly separable), then the perceptron algorithm will converge.
Cycling theorem – If the training data is not linearly separable, then the learning algorithm will eventually repeat the same set of weights and thereby enter an infinite loop.
Perceptron Learnability
• Obviously, Perceptron cannot learn what it cannot represent
  – Only linearly separable functions
• Minsky and Papert (1969) wrote an influential book demonstrating Perceptron’s representational limitations
  – Parity functions can’t be learned (XOR)
    • We have already seen that XOR is not linearly separable (see the sketch below)
  – In vision, if patterns are represented with local features, it cannot represent symmetry or connectivity
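To illustrate the XOR limitation concretely, here is a minimal sketch (Python with NumPy; my own illustration, not part of the slides) that runs the perceptron update repeatedly on the four XOR points. Because no linear separator exists, the mistake count keeps growing with every pass instead of levelling off.

```python
import numpy as np

# XOR dataset: feature vectors (with a bias feature) and labels in {-1, +1}.
# XOR(0,0)=0 and XOR(1,1)=0 -> label -1; XOR(0,1)=1 and XOR(1,0)=1 -> label +1.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
z = np.array([-1, +1, +1, -1])

w = np.zeros(3)          # initial weight vector (bias folded in as the last feature)
mistakes = 0
for epoch in range(50):  # many passes over the four points
    for xi, zi in zip(X, z):
        if zi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
            w += zi * xi              # perceptron update
            mistakes += 1

print("mistakes after 50 epochs:", mistakes)  # grows with every pass; no convergence
```

With this ordering of the points, the weight vector even returns to the zero vector after each pass, which is the cycling behaviour described above.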
Margin
The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
The margin of a dataset (δ) is the maximum margin achievable for that dataset over all possible weight vectors.
[Figure: positive and negative points in the plane, showing the margin with respect to a particular hyperplane and the margin of the data.]
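As a concrete companion to these definitions, here is a minimal Python/NumPy sketch (my own illustration; the dataset and hyperplane are hypothetical) that computes the margin of a dataset with respect to a given hyperplane through the origin, wᵀx = 0: each point’s distance to the hyperplane is |wᵀx|/‖w‖, and the margin is the smallest such distance.

```python
import numpy as np

def margin_wrt_hyperplane(X, w):
    """Distance from the hyperplane {x : w^T x = 0} to the nearest point in X."""
    distances = np.abs(X @ w) / np.linalg.norm(w)
    return distances.min()

# Hypothetical 2D dataset (illustration only).
X = np.array([[ 2.0,  1.0],
              [ 1.0,  3.0],
              [-1.5, -2.0],
              [-3.0, -0.5]])
w = np.array([1.0, 1.0])   # a candidate separating hyperplane through the origin

print("margin w.r.t. this hyperplane:", margin_wrt_hyperplane(X, w))
```

The margin of the data, δ, is then the largest value this quantity can attain over all choices of weight vector.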
Mistake Bound Theorem [Novikoff 1962, Block 1962]
Suppose we have a binary classification dataset with n-dimensional inputs. Let (𝐲₁, z₁), (𝐲₂, z₂), ⋯ be a sequence of training examples such that every feature vector 𝐲ᵢ ∈ ℜⁿ satisfies ‖𝐲ᵢ‖ ≤ S and every label zᵢ ∈ {−1, 1}. (We can always find such an S: just look for the data point farthest from the origin.)
Suppose there is a unit vector 𝐯 ∈ ℜⁿ (i.e., ‖𝐯‖ = 1) such that for some positive number δ ∈ ℜ, δ > 0, we have zᵢ 𝐯ᵀ𝐲ᵢ ≥ δ for every example (𝐲ᵢ, zᵢ). (That is, the data has a margin δ. Importantly, the data is separable; δ is the complexity parameter that defines the separability of the data.)
Then, the perceptron algorithm will make no more than S²/δ² mistakes on the training sequence. (In other words, if the data is separable, the Perceptron algorithm will find a separating hyperplane after making only a finite number of mistakes.)
Note: if 𝐯 had not been a unit vector, we could rescale δ in the mistake bound; the final bound would then become (‖𝐯‖S/δ)².
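To make S, δ, and the bound concrete, here is a minimal Python/NumPy sketch (my own illustration; the dataset and the separating unit vector are hypothetical) that computes each quantity for a small separable dataset.

```python
import numpy as np

# Hypothetical linearly separable dataset (illustration only).
X = np.array([[ 2.0,  2.0],
              [ 1.0,  3.0],
              [-2.0, -1.0],
              [-1.0, -3.0]])
z = np.array([+1, +1, -1, -1])

v = np.array([1.0, 1.0])
v = v / np.linalg.norm(v)               # a unit vector that separates the data

S = np.linalg.norm(X, axis=1).max()     # radius: ||y_i|| <= S for all i
delta = (z * (X @ v)).min()             # margin certified by v: z_i v^T y_i >= delta

print("S =", S, " delta =", delta)
print("mistake bound S^2 / delta^2 =", (S / delta) ** 2)
```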
Proof (preliminaries)
The perceptron update, on each example:
• Receive an input (𝐲ᵢ, zᵢ)
• If sgn(𝐱ₜᵀ𝐲ᵢ) ≠ zᵢ: update 𝐱ₜ₊₁ ← 𝐱ₜ + zᵢ𝐲ᵢ
The setting (a code sketch of it appears below):
• The initial weight vector 𝐱₀ is all zeros.
• Learning rate = 1
  – Effectively scales the inputs, but does not change the behavior.
• All training examples are contained in a ball of size S.
  – That is, for every example (𝐲ᵢ, zᵢ), we have ‖𝐲ᵢ‖ ≤ S.
• The training data is separable by margin δ using a unit vector 𝐯.
  – That is, for every example (𝐲ᵢ, zᵢ), we have zᵢ 𝐯ᵀ𝐲ᵢ ≥ δ.
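Here is a minimal sketch of the perceptron in exactly this setting (Python/NumPy; the function and variable names are my own): weights initialized to zero, learning rate 1, and a counter for the mistakes that the theorem bounds.

```python
import numpy as np

def perceptron(X, z, max_epochs=100):
    """Run the perceptron update over the data until a full pass makes no
    mistakes (or max_epochs is reached). Returns (weights, number of mistakes)."""
    w = np.zeros(X.shape[1])              # initial weight vector x_0 = 0
    mistakes = 0
    for _ in range(max_epochs):
        made_mistake = False
        for yi, zi in zip(X, z):
            if zi * np.dot(w, yi) <= 0:   # prediction disagrees with the label
                w = w + zi * yi           # x_{t+1} <- x_t + z_i y_i (learning rate 1)
                mistakes += 1
                made_mistake = True
        if not made_mistake:              # converged: separates the training data
            break
    return w, mistakes
```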
Proof (1/3)
Recall the update: if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ zᵢ, then 𝐱ₜ₊₁ ← 𝐱ₜ + zᵢ𝐲ᵢ.
1. Claim: After t mistakes, 𝐯ᵀ𝐱ₜ ≥ tδ.
Each update increases 𝐯ᵀ𝐱ₜ by at least δ, because the data is separable by a margin δ. Because 𝐱₀ = 𝟎 (that is, 𝐯ᵀ𝐱₀ = 0), straightforward induction gives us 𝐯ᵀ𝐱ₜ ≥ tδ.
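The single induction step behind Claim 1, written out explicitly (my own rendering in LaTeX; assumes the amsmath package):

```latex
\begin{align*}
\mathbf{v}^\top \mathbf{x}_{t+1}
  &= \mathbf{v}^\top (\mathbf{x}_t + z_i \mathbf{y}_i) \\
  &= \mathbf{v}^\top \mathbf{x}_t + z_i\, \mathbf{v}^\top \mathbf{y}_i \\
  &\ge \mathbf{v}^\top \mathbf{x}_t + \delta .
\end{align*}
```

Since 𝐯ᵀ𝐱₀ = 0, applying this once per mistake gives 𝐯ᵀ𝐱ₜ ≥ tδ after t mistakes.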
Proof (2/3)
Recall the update: if sgn(𝐱ₜᵀ𝐲ᵢ) ≠ zᵢ, then 𝐱ₜ₊₁ ← 𝐱ₜ + zᵢ𝐲ᵢ.
2. Claim: After t mistakes, ‖𝐱ₜ‖² ≤ tS².
Each update increases ‖𝐱ₜ‖² by at most S²: we have ‖𝐲ᵢ‖ ≤ S by the definition of S, and the weights are updated only when there is a mistake, that is, when zᵢ 𝐱ₜᵀ𝐲ᵢ < 0. Because 𝐱₀ = 𝟎, straightforward induction gives us ‖𝐱ₜ‖² ≤ tS².
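The corresponding induction step for Claim 2 (my own rendering in LaTeX; assumes amsmath):

```latex
\begin{align*}
\|\mathbf{x}_{t+1}\|^2
  &= \|\mathbf{x}_t + z_i \mathbf{y}_i\|^2 \\
  &= \|\mathbf{x}_t\|^2 + 2 z_i\, \mathbf{x}_t^\top \mathbf{y}_i + \|\mathbf{y}_i\|^2 \\
  &\le \|\mathbf{x}_t\|^2 + S^2 .
\end{align*}
```

The cross term is negative because the update happens only on a mistake (zᵢ 𝐱ₜᵀ𝐲ᵢ < 0), zᵢ² = 1, and ‖𝐲ᵢ‖² ≤ S². Since 𝐱₀ = 𝟎, after t mistakes this gives ‖𝐱ₜ‖² ≤ tS².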
Proof (3/3)
What we know:
1. After t mistakes, 𝐯ᵀ𝐱ₜ ≥ tδ
2. After t mistakes, ‖𝐱ₜ‖² ≤ tS²
Also, 𝐯ᵀ𝐱ₜ = ‖𝐯‖ ‖𝐱ₜ‖ cos(angle between them). But ‖𝐯‖ = 1 and the cosine is at most 1, so 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖ (the Cauchy–Schwarz inequality).
Combining these: from (1), tδ ≤ 𝐯ᵀ𝐱ₜ, and from (2), 𝐯ᵀ𝐱ₜ ≤ ‖𝐱ₜ‖ ≤ S√t.
So tδ ≤ S√t, which gives √t ≤ S/δ, and therefore the number of mistakes satisfies t ≤ S²/δ².
This bounds the total number of mistakes!
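Putting the two claims together numerically: this self-contained Python/NumPy sketch (my own illustration, hypothetical data and separator) runs the perceptron on a small separable dataset, counts its mistakes, and compares the count against the bound S²/δ².

```python
import numpy as np

# Hypothetical separable dataset and a unit vector v that separates it.
X = np.array([[ 2.0,  2.0], [ 1.0,  3.0], [ 3.0,  1.0],
              [-2.0, -1.0], [-1.0, -3.0], [-3.0, -2.0]])
z = np.array([+1, +1, +1, -1, -1, -1])
v = np.array([1.0, 1.0]) / np.sqrt(2.0)

S = np.linalg.norm(X, axis=1).max()    # ||y_i|| <= S
delta = (z * (X @ v)).min()            # z_i v^T y_i >= delta
bound = (S / delta) ** 2               # theorem: at most S^2 / delta^2 mistakes

# Run the perceptron (zero initialization, learning rate 1) and count mistakes.
w, mistakes = np.zeros(2), 0
for _ in range(1000):                  # enough passes to converge
    errors = 0
    for yi, zi in zip(X, z):
        if zi * np.dot(w, yi) <= 0:    # mistake (or on the boundary)
            w += zi * yi
            mistakes += 1
            errors += 1
    if errors == 0:                    # a clean pass: training data separated
        break

print(f"mistakes = {mistakes}, bound = {bound:.2f}")   # mistakes <= bound
```

On any separable dataset the printed mistake count should never exceed the printed bound, which is exactly what the theorem guarantees.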