Machine Learning: A Geometric Approach
Linear Classification: Perceptron
Professor Liang Huang (some slides from Alex Smola, CMU)
Perceptron Frank Rosenblatt
(concept map: the perceptron and its relatives — linear regression, SVM, CRF, multilayer perceptron / deep learning, structured perceptron)
Brief History of Perceptron
• 1959 Rosenblatt: invention
• 1962 Novikoff: convergence proof
• 1969* Minsky/Papert: book kills it (DEAD)
• 1997 Cortes/Vapnik: SVM (batch; +max margin, +soft margin, +kernels)
• 1999 Freund/Schapire: voted/averaged perceptron revives it (inseparable case)
• 2002 Collins: structured perceptron
• 2003 Crammer/Singer: MIRA (online approx. max margin; conservative updates)
• 2005* McDonald/Crammer/Pereira: structured MIRA
• 2006 Singer group: aggressive MIRA
• 2007–2010* Singer group: Pegasos (subgradient descent; minibatch)
*mentioned in lectures but optional (the other papers are all covered in detail); much of this line of work comes from AT&T Research, ex-AT&T researchers, and their students
Neurons
• Soma (CPU): cell body, combines the signals
• Dendrite (input bus): combines the inputs from several other nerve cells
• Synapse (interface): interface and parameter store between neurons
• Axon (output cable): may be up to 1 m long and transports the activation signal to neurons at other locations
Neurons
(figure: inputs $x_1, x_2, x_3, \ldots, x_n$ with synaptic weights $w_1, \ldots, w_n$ feeding a single output unit)
$f(x) = \sigma\left(\sum_i w_i x_i\right) = \sigma(\langle w, x \rangle)$
Frank Rosenblatt’s Perceptron
Multilayer Perceptron (Neural Net)
Perceptron w/ bias
(figure: inputs $x_1, x_2, \ldots, x_n$ with synaptic weights $w_1, \ldots, w_n$)
• Weighted linear combination
• Nonlinear decision function
• Linear offset (bias)
$f(x) = \sigma(\langle w, x \rangle + b)$
• Linear separating hyperplanes
• Learning: $w$ and $b$
Perceptron w/o bias
(figure: inputs $x_0 = 1, x_1, \ldots, x_n$ with weights $w_0, w_1, \ldots, w_n$)
• Weighted linear combination
• Nonlinear decision function
• No linear offset (bias): hyperplane through the origin
$f(x) = \sigma(\langle w, x \rangle)$
• Linear separating hyperplanes through the origin
• Learning: $w$ only
Augmented Space
• data that can't be separated in 1D from the origin can be separated in 2D from the origin
• data that can't be separated in 2D from the origin can be separated in 3D from the origin
• (figure: adding the constant feature $x_0 = 1$ lifts the data so that the separating hyperplane passes through the origin)
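A minimal sketch of the augmentation trick: prepend the constant feature $x_0 = 1$ so the bias folds into $w_0$ and a hyperplane through the origin in the higher-dimensional space can do the separation; the helper name and the numbers are illustrative.

    import numpy as np

    def augment(x):
        """Prepend the constant feature x_0 = 1 so the bias becomes w_0."""
        return np.concatenate(([1.0], np.asarray(x, dtype=float)))

    # 1D data that no separator through the origin can handle ...
    xs, ys = [np.array([2.0]), np.array([3.0])], [+1, -1]
    # ... becomes separable through the origin in 2D after augmentation:
    w = np.array([2.5, -1.0])                            # an illustrative 2D separator
    print([int(np.sign(w @ augment(x))) for x in xs])    # [1, -1] matches ys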
Perceptron (figure: spam vs. ham email classification)
The Perceptron w/o bias

    initialize $w = 0$
    repeat
        if $y_i \langle w, x_i \rangle \le 0$ then
            $w \leftarrow w + y_i x_i$
        end if
    until all classified correctly

• Nothing happens if classified correctly
• Weight vector is a linear combination: $w = \sum_{i \in I} y_i x_i$
• Classifier is a linear combination of inner products: $f(x) = \sigma\left(\sum_{i \in I} y_i \langle x_i, x \rangle\right)$
The Perceptron w/ bias

    initialize $w = 0$ and $b = 0$
    repeat
        if $y_i [\langle w, x_i \rangle + b] \le 0$ then
            $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
        end if
    until all classified correctly

• Nothing happens if classified correctly
• Weight vector is a linear combination: $w = \sum_{i \in I} y_i x_i$
• Classifier is a linear combination of inner products: $f(x) = \sigma\left(\sum_{i \in I} y_i \langle x_i, x \rangle + b\right)$
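To make the algorithm concrete, here is a minimal NumPy sketch of the perceptron with bias as given above; the function name, epoch cap, and toy dataset are illustrative choices, not part of the original slides.

    import numpy as np

    def perceptron_train(data, max_epochs=100):
        """Perceptron with bias; data is a list of (x, y) pairs with y in {-1, +1}."""
        dim = len(data[0][0])
        w, b = np.zeros(dim), 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for x, y in data:
                x = np.asarray(x, dtype=float)
                if y * (np.dot(w, x) + b) <= 0:   # mistake (or exactly on the boundary)
                    w += y * x                    # w <- w + y_i x_i
                    b += y                        # b <- b + y_i
                    mistakes += 1
            if mistakes == 0:                     # all classified correctly
                break
        return w, b

    # usage on a tiny linearly separable toy set
    data = [([1.0, 2.0], +1), ([2.0, 1.0], +1), ([-1.0, -2.0], -1), ([-2.0, -1.0], -1)]
    w, b = perceptron_train(data)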
Demo (figure: perceptron updates on example $x_i$ with weight vector $w$; bias = 0)
Convergence Theorem
• If there exists some oracle unit vector $u$ with $\|u\| = 1$ such that $y_i (u \cdot x_i) \ge \delta$ for all $i$, then the perceptron converges to a linear separator after a number of updates bounded by $R^2/\delta^2$, where $R = \max_i \|x_i\|$
• Dimensionality independent
• Order independent (but order matters for the output)
• Dataset size independent
• Scales with the 'difficulty' of the problem
Geometry of the Proof
• part 1: progress (alignment) on the oracle projection
Assume $w_i$ is the weight vector before the $i$-th update (on $\langle x_i, y_i \rangle$), with initial $w_0 = 0$:
$w_{i+1} = w_i + y_i x_i$
$u \cdot w_{i+1} = u \cdot w_i + y_i (u \cdot x_i) \ge u \cdot w_i + \delta$ (since $y_i (u \cdot x_i) \ge \delta$ for all $i$)
$\Rightarrow u \cdot w_{i+1} \ge i\delta$
The projection onto $u$ increases with every update (more agreement with the oracle), and since $\|u\| = 1$:
$\|w_{i+1}\| = \|u\| \, \|w_{i+1}\| \ge u \cdot w_{i+1} \ge i\delta$
Geometry of the Proof
• part 2: bound the norm of the weight vector
$\|w_{i+1}\|^2 = \|w_i + y_i x_i\|^2 = \|w_i\|^2 + \|x_i\|^2 + 2 y_i (w_i \cdot x_i) \le \|w_i\|^2 + R^2$
(the cross term is $\le 0$ because we made a mistake on $x_i$, and $\|x_i\| \le R$, the radius)
$\Rightarrow \|w_{i+1}\|^2 \le i R^2$
Combine with part 1: $i^2 \delta^2 \le \|w_{i+1}\|^2 \le i R^2$, hence $i \le R^2/\delta^2$
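As a quick sanity check (not part of the proof), one can generate data that is separable with margin $\delta$ by a known oracle $u$, run the perceptron, and confirm that the number of mistakes stays below $R^2/\delta^2$; the constants below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    u = np.array([0.6, 0.8])                      # oracle unit vector, ||u|| = 1
    delta = 0.1
    X = rng.uniform(-1, 1, size=(500, 2))
    X = X[np.abs(X @ u) >= delta]                 # enforce margin delta w.r.t. u
    y = np.sign(X @ u)
    R = np.max(np.linalg.norm(X, axis=1))

    w, mistakes = np.zeros(2), 0
    for _ in range(1000):                         # epochs
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:                # mistake: update
                w += yi * xi
                mistakes += 1
                errors += 1
        if errors == 0:
            break

    print(mistakes, "<=", R**2 / delta**2)        # empirical count vs. theoretical bound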
Convergence Bound
• The bound $R^2/\delta^2$ is independent of:
  • dimensionality
  • order of examples
  • number of examples
  • starting weight vector $w$
  • constant learning rate
• and is dependent on:
  • separation difficulty
  • feature scale
• but test accuracy is dependent on:
  • order of examples (shuffling helps)
  • variable learning rate (1/total#errors helps) — can you still prove convergence?
Hardness: margin vs. size (figure: small margin relative to the data radius = hard; large margin = easy)
XOR
• XOR: not linearly separable
• Nonlinear separation is trivial
• Caveat from "Perceptrons" (Minsky & Papert, 1969): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s)
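A tiny illustration of both bullets above: no linear classifier on $(x_1, x_2)$ alone labels all four XOR points correctly, but adding one nonlinear feature (the product $x_1 x_2$, chosen here purely for illustration) makes separation trivial.

    import numpy as np

    # XOR with inputs in {-1, +1}: label is +1 iff the two inputs differ
    X = np.array([[-1, -1], [-1, +1], [+1, -1], [+1, +1]], dtype=float)
    y = np.array([-1, +1, +1, -1])

    # one extra nonlinear feature separates the data with a single weight
    phi = X[:, 0] * X[:, 1]                    # product feature x1 * x2
    w_phi = -1.0                               # weight on the product feature
    print(np.all(np.sign(w_phi * phi) == y))   # True: all four points correct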
Brief History of Perceptron (recap)
• 1959 Rosenblatt invention → 1962 Novikoff proof → 1969* Minsky/Papert book (DEAD) → 1997 Cortes/Vapnik SVM → 1999 Freund/Schapire voted/avg revives it → 2002 Collins structured perceptron → 2003 Crammer/Singer MIRA → 2005* structured MIRA → 2006 aggressive MIRA → 2007–2010* Pegasos
Extensions of Perceptron • Problems with Perceptron • doesn’t converge with inseparable data • update might often be too “bold” • doesn’t optimize margin • is sensitive to the order of examples • Ways to alleviate these problems • voted perceptron and average perceptron • MIRA (margin-infused relaxation algorithm)
Voted/Avged Perceptron • motivation: updates on later examples taking over! • voted perceptron (Freund and Schapire, 1999) • record the weight vector after each example in D • (not just after each update) • and vote on a new example using |D| models • shown to have better generalization power • averaged perceptron (from the same paper) • an approximation of voted perceptron • just use the average of all weight vectors • can be implemented efficiently
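A rough sketch of the voted perceptron of Freund and Schapire (1999) as described above; here each intermediate model is stored once with a survival count (equivalent bookkeeping to storing a copy after every example), and the function names are illustrative.

    import numpy as np

    def voted_perceptron_train(data, epochs=10):
        """Store every intermediate (w, b) together with the number of examples it survived."""
        dim = len(data[0][0])
        w, b, c = np.zeros(dim), 0.0, 0
        models = []                                # list of (w, b, survival count)
        for _ in range(epochs):
            for x, y in data:
                x = np.asarray(x, dtype=float)
                if y * (np.dot(w, x) + b) <= 0:    # mistake: freeze the current model
                    models.append((w.copy(), b, c))
                    w, b, c = w + y * x, b + y, 1
                else:
                    c += 1                         # current model survives one more example
        models.append((w.copy(), b, c))
        return models

    def voted_predict(models, x):
        """Each stored model casts `count` votes on sign(<w, x> + b)."""
        x = np.asarray(x, dtype=float)
        score = sum(c * np.sign(np.dot(w, x) + b) for w, b, c in models)
        return +1 if score >= 0 else -1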
Voted Perceptron
Voted/Avged Perceptron (low dim - less separable) test error
Voted/Avged Perceptron (high dim - more separable) test error
Averaged Perceptron
• voted perceptron is not scalable, and does not output a single model
• averaged perceptron is an approximation of the voted perceptron:

    initialize $w = 0$, $b = 0$, $w' = 0$, $c = 0$
    repeat
        $c \leftarrow c + 1$
        if $y_i [\langle w, x_i \rangle + b] \le 0$ then
            $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
        end if
        $w' \leftarrow w' + w$   (after each example, not just after each update)
    until all classified correctly
    output $w'/c$
Efficient Implementation of Averaging
• the naive implementation (keeping a running sum of $w$) doesn't scale
• a very clever trick from Daumé (2006, PhD thesis):

    initialize $w = 0$, $b = 0$, $w_a = 0$, $c = 0$
    repeat
        $c \leftarrow c + 1$
        if $y_i [\langle w, x_i \rangle + b] \le 0$ then
            $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
            $w_a \leftarrow w_a + c\, y_i x_i$
        end if
    until all classified correctly
    output $w - w_a / c$

• intuition: since $w^{(t)} = \Delta w^{(1)} + \cdots + \Delta w^{(t)}$, summing the weight vectors over all examples counts each update $\Delta w^{(k)}$ once for every later time step, and $w - w_a/c$ computes exactly that average without storing the running sum
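A sketch of the lazy-averaging trick above in NumPy, reusing the (x, y) data format from the earlier perceptron sketch; the variable names mirror the pseudocode ($w_a$ accumulates $c\, y_i x_i$ on each mistake) and are otherwise illustrative.

    import numpy as np

    def averaged_perceptron_train(data, epochs=10):
        """Averaged perceptron without a running sum: accumulate c * y_i * x_i on mistakes
        and correct once at the end, returning w - w_a / c (and likewise for the bias)."""
        dim = len(data[0][0])
        w, b = np.zeros(dim), 0.0          # current model
        wa, ba = np.zeros(dim), 0.0        # auxiliary sums weighted by the time step c
        c = 0                              # example counter
        for _ in range(epochs):
            for x, y in data:
                c += 1
                x = np.asarray(x, dtype=float)
                if y * (np.dot(w, x) + b) <= 0:
                    w += y * x
                    b += y
                    wa += c * y * x
                    ba += c * y
        return w - wa / c, b - ba / c      # the averaged model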
MIRA
• perceptron often makes too bold updates, yet the learning rate is hard to tune
• the smallest update that corrects the mistake:
$w_{i+1} = w_i + \dfrac{y_i - w_i \cdot x_i}{\|x_i\|^2}\, x_i$
• easy to show: $y_i (w_{i+1} \cdot x_i) = y_i \left( w_i + \dfrac{y_i - w_i \cdot x_i}{\|x_i\|^2}\, x_i \right) \cdot x_i = 1$
• this is the margin-infused relaxation algorithm (MIRA)
• (figure: geometry of the perceptron vs. MIRA update on a mistaken example $x_i$; the perceptron over-corrects this mistake)
Perceptron (figure: on example $x_i$ the perceptron update under-corrects this mistake; bias = 0)
MIRA
$\min_{w'} \|w' - w\|^2 \quad \text{s.t.} \quad y_i (w' \cdot x_i) \ge 1$
• MIRA makes the minimal change to ensure margin: after the update, the dot product satisfies $y_i (w \cdot x_i) = 1$, i.e., a margin of $1/\|x_i\|$
• MIRA ≈ 1-step SVM
• (figure: on this example the perceptron under-corrects the mistake; bias = 0)
Aggressive MIRA
• aggressive version of MIRA: also update if the example is classified correctly but the margin isn't big enough
• functional margin: $y_i (w \cdot x_i)$;  geometric margin: $\dfrac{y_i (w \cdot x_i)}{\|w\|}$
• update if the functional margin is $\le p$ (where $0 \le p < 1$); the update rule is the same as MIRA
• called $p$-aggressive MIRA (plain MIRA: $p = 0$)
• larger $p$ leads to a larger geometric margin, but slower convergence
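A sketch of $p$-aggressive MIRA as described above, with bias = 0 as in the demos; setting $p = 0$ gives plain MIRA. The function name, epoch cap, and stopping rule are my own choices, and nonzero feature vectors are assumed.

    import numpy as np

    def p_aggressive_mira_train(data, p=0.0, epochs=100):
        """p-aggressive MIRA (bias = 0): update whenever the functional margin y * <w, x>
        is <= p, using the minimal change that makes y * <w, x> exactly 1."""
        dim = len(data[0][0])
        w = np.zeros(dim)
        for _ in range(epochs):
            updated = False
            for x, y in data:
                x = np.asarray(x, dtype=float)          # assumed nonzero
                if y * np.dot(w, x) <= p:               # mistake, or margin not big enough
                    w += ((y - np.dot(w, x)) / np.dot(x, x)) * x
                    updated = True
            if not updated:
                break
        return w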
Aggressive MIRA (figure: decision boundaries of the perceptron, p = 0.2, and p = 0.9 aggressive MIRA)
Demo
• perceptron vs. 0.2-aggressive vs. 0.9-aggressive MIRA (figure: resulting decision boundaries)
Demo
• perceptron vs. 0.2-aggressive vs. 0.9-aggressive MIRA
• why is this dataset so slow to converge?
• perceptron: 22 epochs, p = 0.2: 87 epochs, p = 0.9: 2,518 epochs
• answer: the margin shrinks in the augmented space! (figure: a big margin in 1D becomes a small margin in 2D after augmenting with $x_0 = 1$)