  1. Machine Learning: A Geometric Approach. Linear Classification: Perceptron. Professor Liang Huang (some slides from Alex Smola, CMU)

  2. Perceptron Frank Rosenblatt

  3. (Concept map: the perceptron and its relatives: multilayer perceptron / deep learning, linear regression, SVM, CRF, structured perceptron)

  4. Brief History of Perceptron
  • 1959 Rosenblatt: invention of the perceptron
  • 1962 Novikoff: convergence proof
  • 1969* Minsky/Papert: their book killed it ("DEAD")
  • 1997 Cortes/Vapnik: SVM (batch; + soft margin, + kernels, + max margin)
  • 1999 Freund/Schapire: voted/averaged perceptron revived it (online approximation of max margin; handles the inseparable case)
  • 2002 Collins: structured perceptron
  • 2003 Crammer/Singer: MIRA (conservative updates)
  • 2005* McDonald/Crammer/Pereira: structured MIRA
  • 2006 Singer group: aggressive updates
  • 2007-2010* Singer group: Pegasos (minibatch online; subgradient descent)
  (* mentioned in lectures but optional; the other papers are all covered in detail. Much of this work comes from AT&T Research, ex-AT&T researchers, and their students.)

  5. Neurons
  • Soma (CPU): the cell body; combines the incoming signals
  • Dendrite (input bus): combines the inputs from several other nerve cells
  • Synapse (interface): interface and parameter store between neurons
  • Axon (output cable): may be up to 1 m long and transports the activation signal to neurons at different locations

  6. Neurons: a single neuron computes a weighted sum of its inputs $x_1, \dots, x_n$ with synaptic weights $w_1, \dots, w_n$ and passes it through an activation $\sigma$: $f(x) = \sigma\left(\sum_i w_i x_i\right) = \sigma(\langle w, x \rangle)$
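
A minimal sketch of this computation (the step activation np.sign stands in for the slide's unspecified σ):

    import numpy as np

    def neuron(w, x, sigma=np.sign):
        # f(x) = sigma(<w, x>): weighted sum of the inputs passed through an activation
        return sigma(np.dot(w, x))

    w = np.array([0.5, -1.0, 2.0])   # synaptic weights w_1 .. w_n
    x = np.array([1.0, 3.0, 0.5])    # inputs x_1 .. x_n
    print(neuron(w, x))              # sign(0.5 - 3.0 + 1.0) = -1.0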

  7. Frank Rosenblatt’s Perceptron

  8. Multilayer Perceptron (Neural Net)

  9. Perceptron w/ bias
  • Weighted linear combination of the inputs $x_1, \dots, x_n$ with synaptic weights $w_1, \dots, w_n$
  • Nonlinear decision function $\sigma$
  • Linear offset (bias) $b$
  • Output: $f(x) = \sigma(\langle w, x \rangle + b)$
  • Linear separating hyperplanes
  • Learning: $w$ and $b$

  10. Perceptron w/o bias
  • Augment the input with a constant feature $x_0 = 1$, whose weight $w_0$ plays the role of the bias
  • Weighted linear combination with synaptic weights $w_0, w_1, \dots, w_n$
  • Nonlinear decision function $\sigma$
  • No separate linear offset (bias): the separating hyperplane passes through the origin
  • Output: $f(x) = \sigma(\langle w, x \rangle)$
  • Linear separating hyperplanes
  • Learning: $w$

  11. Augmented Space: data that can't be separated from the origin in 1D can be separated from the origin in 2D once the constant coordinate $x_0 = 1$ is added; likewise, data that can't be separated from the origin in 2D can be separated from the origin in 3D.
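
A minimal sketch of the augmentation trick on made-up 1D data (the points and weights below are illustrative, not from the slides): two 1D points with the same sign can never get different labels from a hyperplane through the origin, but after prepending $x_0 = 1$ a 2D separator through the origin exists:

    import numpy as np

    X = np.array([[2.0], [4.0]])     # 1D inputs (both positive)
    y = np.array([+1, -1])           # different labels: not separable through the 1D origin

    X_aug = np.hstack([np.ones((len(X), 1)), X])   # prepend the constant feature x_0 = 1

    w = np.array([3.0, -1.0])        # a separator through the origin in the augmented 2D space
    print(y * (X_aug @ w))           # both margins positive => correctly separated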

  12. Perceptron (figure: ham vs. spam example)

  13. The Perceptron w/o bias
    initialize $w = 0$
    repeat
      for each example $(x_i, y_i)$: if $y_i \langle w, x_i \rangle \le 0$ then $w \leftarrow w + y_i x_i$
    until all classified correctly
  • Nothing happens if an example is classified correctly
  • The weight vector is a linear combination of the mistaken examples: $w = \sum_{i \in I} y_i x_i$
  • The classifier is a linear combination of inner products: $f(x) = \sigma\left(\sum_{i \in I} y_i \langle x_i, x \rangle\right)$

  14. The Perceptron w/ bias
    initialize $w = 0$ and $b = 0$
    repeat
      for each example $(x_i, y_i)$: if $y_i [\langle w, x_i \rangle + b] \le 0$ then $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
    until all classified correctly
  • Nothing happens if an example is classified correctly
  • The weight vector is a linear combination of the mistaken examples: $w = \sum_{i \in I} y_i x_i$
  • The classifier is a linear combination of inner products: $f(x) = \sigma\left(\sum_{i \in I} y_i \langle x_i, x \rangle + b\right)$
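
A minimal sketch of the algorithm above in Python (max_epochs is an added safeguard for data that never becomes "all classified correctly"):

    import numpy as np

    def perceptron_train(X, y, max_epochs=100):
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                    w += yi * xi                    # w <- w + y_i x_i
                    b += yi                         # b <- b + y_i
                    mistakes += 1
            if mistakes == 0:                       # all classified correctly
                break
        return w, b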

  15. Demo (figure: example $x_i$ and weight vector $w$; bias = 0)

  16. Demo

  17. Demo

  18. Demo

  19. Convergence Theorem
  • If there exists an oracle unit vector $u$ with $\|u\| = 1$ and $y_i (u \cdot x_i) \ge \delta$ for all $i$, then the perceptron converges to a linear separator after a number of updates bounded by $R^2 / \delta^2$, where $R = \max_i \|x_i\|$
  • Dimensionality independent
  • Order independent (but the order matters for which separator is output)
  • Dataset-size independent
  • Scales with the 'difficulty' of the problem
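
A small sketch that checks the bound numerically on synthetic separable data (the oracle u, the margin threshold, and the data are all made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    u = np.array([0.6, 0.8])                    # oracle unit vector, ||u|| = 1
    X = rng.uniform(-1, 1, size=(200, 2))
    X = X[np.abs(X @ u) > 0.2]                  # keep only points with margin > 0.2
    y = np.sign(X @ u)

    R = np.max(np.linalg.norm(X, axis=1))       # R = max_i ||x_i||
    delta = np.min(y * (X @ u))                 # guaranteed margin w.r.t. u
    bound = R**2 / delta**2

    w, updates, converged = np.zeros(2), 0, False
    while not converged:
        converged = True
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:              # mistake (no bias: separator through the origin)
                w += yi * xi
                updates += 1
                converged = False

    print(f"updates: {updates}, bound R^2/delta^2: {bound:.1f}")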

  20. Geometry of the Proof
  • Part 1: progress (alignment) on the oracle projection
  • Assume $w_i$ is the weight vector before the $i$-th update (on $\langle x_i, y_i \rangle$) and the initial $w_0 = 0$. Each update is $w_{i+1} = w_i + y_i x_i$, so
    $u \cdot w_{i+1} = u \cdot w_i + y_i (u \cdot x_i) \ge u \cdot w_i + \delta$, since $y_i (u \cdot x_i) \ge \delta$ for all $i$
  • Hence $u \cdot w_{i+1} \ge i \delta$: the projection on $u$ increases with every update (more agreement with the oracle)
  • Since $\|u\| = 1$: $\|w_{i+1}\| = \|u\| \, \|w_{i+1}\| \ge u \cdot w_{i+1} \ge i \delta$

  21. Geometry of the Proof
  • Part 2: bound the norm of the weight vector
  • From $w_{i+1} = w_i + y_i x_i$:
    $\|w_{i+1}\|^2 = \|w_i + y_i x_i\|^2 = \|w_i\|^2 + \|x_i\|^2 + 2 y_i (w_i \cdot x_i) \le \|w_i\|^2 + R^2$, since the update was a mistake on $x_i$ (so $y_i (w_i \cdot x_i) \le 0$) and $R$ is the radius ($\|x_i\| \le R$)
  • Hence $\|w_{i+1}\|^2 \le i R^2$
  • Combine with part 1 ($\|w_{i+1}\| \ge u \cdot w_{i+1} \ge i \delta$): $i \le R^2 / \delta^2$
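
Spelling out the combination as one chain of inequalities (just a restatement of the two parts above):

    \[
      i\,\delta \;\le\; u \cdot w_{i+1} \;\le\; \|u\|\,\|w_{i+1}\| \;=\; \|w_{i+1}\| \;\le\; \sqrt{i\,R^2}
      \quad\Longrightarrow\quad i\,\delta^2 \le R^2
      \quad\Longrightarrow\quad i \le R^2/\delta^2 .
    \]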

  22. Convergence Bound
  • The bound $R^2/\delta^2$ is independent of:
    • dimensionality
    • number of examples
    • order of examples
    • starting weight vector
    • constant learning rate
  • and is dependent on:
    • separation difficulty (the margin $\delta$)
    • feature scale (the radius $R$)
  • But test accuracy is dependent on:
    • order of examples (shuffling helps)
    • variable learning rate (1/total#errors helps; can you still prove convergence?)

  23. Hardness: margin vs. size (figure: hard vs. easy cases)

  24. XOR
  • XOR is not linearly separable
  • Nonlinear separation is trivial
  • Caveat from "Perceptrons" (Minsky & Papert, 1969): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s)
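
A small sketch of both claims (the feature construction is one illustrative choice, not the slide's): the perceptron never converges on raw XOR, but adding the product feature $x_1 x_2$ makes the data linearly separable, so "nonlinear separation is trivial":

    import numpy as np

    def perceptron(X, y, epochs=200):
        # plain perceptron without bias; returns (weights, converged?)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:
                    w += yi * xi
                    mistakes += 1
            if mistakes == 0:
                return w, True
        return w, False

    # XOR with an augmented constant feature x_0 = 1: not linearly separable
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    y = np.array([-1, +1, +1, -1])
    print(perceptron(X, y)[1])                            # False

    # adding the product feature x_1 * x_2 makes separation easy
    X_nl = np.hstack([X, (X[:, 1] * X[:, 2])[:, None]])
    print(perceptron(X_nl, y)[1])                         # True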

  25. Brief History of Perceptron (the timeline from slide 4, repeated before discussing the extensions)

  26. Extensions of Perceptron
  • Problems with the perceptron:
    • doesn't converge on inseparable data
    • updates are often too "bold"
    • doesn't optimize the margin
    • is sensitive to the order of examples
  • Ways to alleviate these problems:
    • voted perceptron and averaged perceptron
    • MIRA (margin-infused relaxed algorithm)

  27. Voted/Avged Perceptron
  • Motivation: updates on later examples take over!
  • Voted perceptron (Freund and Schapire, 1999):
    • record the weight vector after each example in D (not just after each update)
    • vote on a new example using the |D| models
    • shown to have better generalization power
  • Averaged perceptron (from the same paper):
    • an approximation of the voted perceptron
    • just use the average of all weight vectors
    • can be implemented efficiently
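
A sketch of the voted perceptron in its usual compressed form (each weight vector is stored once with the number of examples it survived, which gives the same vote as storing a copy after every example; the bias is omitted for brevity):

    import numpy as np

    def voted_perceptron_train(X, y, epochs=10):
        w, c = np.zeros(X.shape[1]), 0
        models = []                          # list of (weight vector, survival count)
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:       # mistake: retire the current model
                    models.append((w.copy(), c))
                    w = w + yi * xi
                    c = 1
                else:
                    c += 1                   # current model survives one more example
        models.append((w.copy(), c))
        return models

    def voted_predict(models, x):
        # each stored model votes, weighted by how long it survived
        return np.sign(sum(c * np.sign(w @ x) for w, c in models))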

  28. Voted Perceptron

  29. Voted/Avged Perceptron (figure: test error, low dimension / less separable data)

  30. Voted/Avged Perceptron (figure: test error, high dimension / more separable data)

  31. Averaged Perceptron
  • The voted perceptron is not scalable and does not output a single model
  • The averaged perceptron is an approximation of the voted perceptron:
    initialize $w = 0$, $b = 0$, $w' = 0$, $c = 0$
    repeat
      for each example $(x_i, y_i)$:
        if $y_i [\langle w, x_i \rangle + b] \le 0$ then $w \leftarrow w + y_i x_i$ and $b \leftarrow b + y_i$
        $c \leftarrow c + 1$; $w' \leftarrow w' + w$   (accumulate after each example, not just after each update)
    until all classified correctly
    output $w' / c$

  32. Efficient Implementation of Averaging
  • The naive implementation (a running sum of weight vectors) doesn't scale
  • A very clever trick from Daume (2006, PhD thesis):
    initialize $w = 0$, $b = 0$, $w_a = 0$, $c = 0$
    repeat
      for each example $(x_i, y_i)$:
        $c \leftarrow c + 1$
        if $y_i [\langle w, x_i \rangle + b] \le 0$ then $w \leftarrow w + y_i x_i$, $b \leftarrow b + y_i$, and $w_a \leftarrow w_a + c\, y_i x_i$
    until all classified correctly
    output $w - w_a / c$
  • (Figure: $w^{(t)} = \Delta w^{(1)} + \dots + \Delta w^{(t)}$; summing the $w^{(t)}$ weights each update by how long it persists, which is what $c\,w - w_a$ accumulates)
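
A sketch of the lazy-averaging trick above in Python (variable names follow the pseudocode; the bias is averaged the same way with a scalar b_a):

    import numpy as np

    def averaged_perceptron(X, y, max_epochs=100):
        n, d = X.shape
        w, b = np.zeros(d), 0.0              # current weights and bias
        w_a, b_a = np.zeros(d), 0.0          # c-weighted sums of the updates
        c = 0                                # example counter
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                c += 1
                if yi * (w @ xi + b) <= 0:   # mistake: update model and accumulator
                    w += yi * xi
                    b += yi
                    w_a += c * yi * xi
                    b_a += c * yi
                    mistakes += 1
            if mistakes == 0:
                break
        return w - w_a / c, b - b_a / c      # the averaged model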

  33. MIRA (margin-infused relaxed algorithm)
  • The perceptron often makes updates that are too bold, but its learning rate is hard to tune
  • Idea: make the smallest update that corrects the mistake:
    $w_{i+1} = w_i + \dfrac{y_i - w_i \cdot x_i}{\|x_i\|^2} \, x_i$
  • Easy to show: $y_i (w_{i+1} \cdot x_i) = y_i \left( w_i + \frac{y_i - w_i \cdot x_i}{\|x_i\|^2} x_i \right) \cdot x_i = 1$
  • (Figure: geometric comparison of the perceptron step and the MIRA step along $x_i$; here the perceptron over-corrects the mistake)
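
A minimal sketch of the update rule, with a tiny check of the claim $y_i (w_{i+1} \cdot x_i) = 1$ (the numbers are made up):

    import numpy as np

    def mira_update(w, x, y):
        # smallest change to w that makes y * (w . x) equal to 1
        return w + ((y - np.dot(w, x)) / np.dot(x, x)) * x

    w = np.array([0.0, 1.0])
    x = np.array([2.0, -1.0])
    y = 1.0
    w_new = mira_update(w, x, y)
    print(y * np.dot(w_new, x))              # 1.0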

  34. Perceptron (figure: the perceptron update under-corrects this mistake; bias = 0)

  35. MIRA
  • MIRA makes the minimal change to the weights that enforces a margin on the current example:
    $\min_{w'} \|w' - w\|^2$ s.t. $y_i (w' \cdot x_i) \ge 1$
  • MIRA makes sure that after the update $y_i (w \cdot x_i) = 1$, i.e., a margin of $1 / \|x_i\|$
  • MIRA ≈ a 1-step SVM
  • (Figure: the perceptron under-corrects this mistake; bias = 0)
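
Solving the minimization above in closed form (with the label $y_i$ written explicitly) recovers the update rule from slide 33:

    \[
      \min_{w'} \; \|w' - w\|^2 \quad \text{s.t.} \quad y_i (w' \cdot x_i) \ge 1 .
    \]
    When the constraint is violated ($y_i (w \cdot x_i) < 1$), the solution is the projection onto the constraint boundary:
    \[
      w' \;=\; w + \frac{1 - y_i (w \cdot x_i)}{\|x_i\|^2}\, y_i x_i
          \;=\; w + \frac{y_i - w \cdot x_i}{\|x_i\|^2}\, x_i \qquad (\text{using } y_i^2 = 1),
    \]
    so that $y_i (w' \cdot x_i) = 1$, i.e., a margin of $1/\|x_i\|$.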

  36. Aggressive MIRA
  • An aggressive version of MIRA: also update when the example is classified correctly but the margin isn't big enough
  • Functional margin: $y_i (w \cdot x_i)$; geometric margin: $y_i (w \cdot x_i) / \|w\|$
  • Update whenever the functional margin is $\le p$, with $0 \le p < 1$
  • The update rule is the same as MIRA's
  • Called p-aggressive MIRA (plain MIRA: $p = 0$)
  • A larger $p$ leads to a larger geometric margin but slower convergence
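
A minimal sketch of one p-aggressive step ($p = 0$ recovers plain MIRA; the default value 0.2 below is arbitrary):

    import numpy as np

    def p_aggressive_mira_update(w, x, y, p=0.2):
        # update whenever the functional margin y*(w.x) is <= p (not only on mistakes),
        # using the same minimal update as MIRA, which restores y*(w.x) = 1
        if y * np.dot(w, x) <= p:
            w = w + ((y - np.dot(w, x)) / np.dot(x, x)) * x
        return w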

  37. Aggressive MIRA (figure: perceptron vs. $p = 0.2$ vs. $p = 0.9$)

  38. Demo: perceptron vs. 0.2-aggressive vs. 0.9-aggressive MIRA

  39. Demo: perceptron vs. 0.2-aggressive vs. 0.9-aggressive MIRA
  • Why is this dataset so slow to converge? (perceptron: 22 epochs, $p = 0.2$: 87, $p = 0.9$: 2,518)
  • Answer: the margin shrinks in the augmented space; a big margin in 1D becomes a small margin in 2D
