The Perceptron Algorithm
Machine Learning
Some slides based on lectures from Dan Roth, Avrim Blum, and others
Outline
• The Perceptron Algorithm
• Variants of Perceptron
• Perceptron Mistake Bound
Where are we?
• The Perceptron Algorithm
• Variants of Perceptron
• Perceptron Mistake Bound
Recall: Linear Classifiers
Inputs are d-dimensional vectors, denoted by 𝐱. The output is a label y ∈ {−1, 1}.
Linear Threshold Units classify an example 𝐱 using parameters 𝐰 (a d-dimensional vector) and b (a real number) according to the following classification rule:
Output = sgn(𝐰ᵀ𝐱 + b) = sgn(∑ᵢ wᵢxᵢ + b)
𝐰ᵀ𝐱 + b ≥ 0 ⇒ y = +1
𝐰ᵀ𝐱 + b < 0 ⇒ y = −1
b is called the bias term.
Recall: Linear Classifiers (cont.)
[Figure: the same rule drawn as a linear threshold unit. The inputs x₁, …, xₙ and a constant 1 are multiplied by the weights w₁, …, wₙ and the bias b, summed, and passed through sgn to produce the output.]
The geometry of a linear classifier
sgn(b + w₁x₁ + w₂x₂)
We only care about the sign, not the magnitude.
[Figure: in the (x₁, x₂) plane, the line b + w₁x₁ + w₂x₂ = 0 separates the + points from the − points, and the weight vector [w₁ w₂] is normal to it.]
In higher dimensions, a linear classifier represents a hyperplane that separates the space into two half-spaces.
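To make the classification rule concrete, here is a minimal sketch of a linear threshold unit in Python (the function and the example weights are illustrative, not from the slides):

```python
import numpy as np

def predict(w, b, x):
    """Linear threshold unit: return +1 if w.x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# Example: a 2D classifier whose decision boundary is x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(predict(w, b, np.array([2.0, 2.0])))   # +1, above the boundary
print(predict(w, b, np.array([0.0, 0.0])))   # -1, below the boundary
```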
The Perceptron
The Perceptron algorithm
• Rosenblatt, 1958
  – (Though there were hints of a similar idea earlier, e.g., Agmon, 1954)
• The goal is to find a separating hyperplane
  – For separable data, the algorithm is guaranteed to find one
• An online algorithm
  – Processes one example at a time
• Several variants exist
  – We will see these briefly towards the end
The Perceptron algorithm
Input: A sequence of training examples (𝐱₁, y₁), (𝐱₂, y₂), ⋯ where each 𝐱ᵢ ∈ ℝᵈ and yᵢ ∈ {−1, 1}.
1. Initialize 𝐰₀ = 𝟎 ∈ ℝᵈ
2. For each training example (𝐱ᵢ, yᵢ):
   a. Predict ŷ = sgn(𝐰ₜᵀ𝐱ᵢ)
   b. If ŷ ≠ yᵢ: update 𝐰ₜ₊₁ ← 𝐰ₜ + r (yᵢ 𝐱ᵢ)
3. Return the final weight vector

Written out by case:
Mistake on a positive example: 𝐰ₜ₊₁ ← 𝐰ₜ + r 𝐱ᵢ
Mistake on a negative example: 𝐰ₜ₊₁ ← 𝐰ₜ − r 𝐱ᵢ

Notes:
• Remember: Prediction = sgn(𝐰ᵀ𝐱). There is typically a bias term as well (𝐰ᵀ𝐱 + b), but the bias may be treated as a constant feature and folded into 𝐰.
• r is the learning rate, a small positive number less than 1.
• The weights are updated only on an error: this is a mistake-driven algorithm.
• A mistake can be written as yᵢ 𝐰ₜᵀ𝐱ᵢ ≤ 0.
• This is the simplest version. We will see more robust versions shortly.

Footnote: For some algorithms it is mathematically easier to represent False as −1, and at other times as 0. For the Perceptron algorithm, treat −1 as false and +1 as true.
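As a concrete sketch, the loop above might look as follows in Python (the multi-epoch wrapper and the toy dataset are assumptions for illustration; the slides describe a single pass over a stream of examples):

```python
import numpy as np

def perceptron(X, y, r=1.0, epochs=1):
    """Mistake-driven Perceptron. X: (n, d) examples, y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)                                     # step 1: w_0 = 0
    for _ in range(epochs):                             # optionally repeat passes over the data
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else -1    # predict sgn(w.x)
            if y_hat != y_i:                            # update only on a mistake
                w = w + r * y_i * x_i                   # w <- w + r (y_i x_i)
    return w

# Tiny separable example, with the bias folded in as a constant 1 feature
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y, r=0.5, epochs=5)
print(w, np.sign(X @ w))   # the signs should match y
```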
Intuition behind the update
Mistake on a positive example: 𝐰ₜ₊₁ ← 𝐰ₜ + r 𝐱
Mistake on a negative example: 𝐰ₜ₊₁ ← 𝐰ₜ − r 𝐱
Suppose we have made a mistake on a positive example. That is, y = +1 and 𝐰ₜᵀ𝐱 ≤ 0.
Call the new weight vector 𝐰ₜ₊₁ = 𝐰ₜ + 𝐱 (say r = 1).
The new dot product is 𝐰ₜ₊₁ᵀ𝐱 = (𝐰ₜ + 𝐱)ᵀ𝐱 = 𝐰ₜᵀ𝐱 + 𝐱ᵀ𝐱 ≥ 𝐰ₜᵀ𝐱.
For a positive example, the Perceptron update will increase the score assigned to the same input.
Similar reasoning applies for negative examples.
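A quick numeric check of this claim (the particular vectors are made up for illustration):

```python
import numpy as np

w_old = np.array([1.0, -2.0])
x = np.array([0.5, 1.0])        # a positive example that w_old misclassifies
assert np.dot(w_old, x) <= 0    # mistake on a positive example: w.x = -1.5 <= 0

w_new = w_old + x               # update with r = 1
# The score rises by exactly x.x = ||x||^2 >= 0
print(np.dot(w_old, x), np.dot(w_new, x))   # -1.5 -> -0.25
```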
Geometry of the perceptron update: mistake on a positive example
Mistake on a positive example: 𝐰ₜ₊₁ ← 𝐰ₜ + r 𝐱; mistake on a negative example: 𝐰ₜ₊₁ ← 𝐰ₜ − r 𝐱
[Figure sequence: Predict, Update, After. A positive example (𝐱, +1) falls on the wrong side of the hyperplane defined by 𝐰_old. The update 𝐰 ← 𝐰 + y𝐱 adds 𝐱 to 𝐰_old; the resulting 𝐰_new points closer to 𝐱, so the decision boundary rotates toward classifying 𝐱 correctly.]
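To see this rotation numerically, here is a small sketch (using the same illustrative vectors as above) that measures the angle between 𝐰 and the misclassified positive example before and after the update:

```python
import numpy as np

def cos_angle(u, v):
    """Cosine of the angle between u and v; larger means the vectors are more aligned."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

w_old = np.array([1.0, -2.0])
x = np.array([0.5, 1.0])          # positive example, misclassified by w_old

w_new = w_old + x                 # update for a mistake on a positive example (r = 1)
print(cos_angle(w_old, x), cos_angle(w_new, x))  # cosine grows: w rotates toward x
```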
Geometry of the perceptron update: mistake on a negative example
[Figure: Predict. A negative example (𝐱, −1) falls on the positive side of the hyperplane defined by 𝐰_old; the update for a mistake on a negative example subtracts r𝐱 from 𝐰_old, moving the weight vector away from 𝐱.]