
Applied Machine Learning: CIML Chap 4 (A Geometric Approach)



  1. Applied Machine Learning, CIML Chap 4 (A Geometric Approach). “Equations are just the boring part of mathematics. I attempt to see things in terms of geometry.” ― Stephen Hawking. Week 4: Linear Classification: Perceptron. Professor Liang Huang; some slides from Alex Smola (CMU/Amazon).

  2. Roadmap for Unit 2 (Weeks 4-5) • Week 4: Linear Classifier and Perceptron • Part I: Brief History of the Perceptron • Part II: Linear Classifier and Geometry (testing time) • Part III: Perceptron Learning Algorithm (training time) • Part IV: Convergence Theorem and Geometric Proof • Part V: Limitations of Linear Classifiers, Non-Linearity, and Feature Maps • Week 5: Extensions of Perceptron and Practical Issues • Part I: My Perceptron Demo in Python • Part II: Voted and Averaged Perceptrons • Part III: MIRA and Aggressive MIRA • Part IV: Practical Issues • Part V: Perceptron vs. Logistic Regression (hard vs. soft); Gradient Descent 2

  3. Part I • Brief History of the Perceptron 3

  4. Perceptron (1959-now) Frank Rosenblatt

  5. [Timeline figure: the perceptron and its descendants]
  • perceptron (1958)
  • kernels (1964)
  • SVM (1964; 1995)
  • logistic regression
  • multilayer perceptron / deep learning (~1986; 2006-now)
  • conditional random fields (2001)
  • structured perceptron (2002)
  • structured SVM (2003)

  6. Neurons
  • Soma (CPU): cell body; combines signals
  • Dendrite (input bus): combines the inputs from several other nerve cells
  • Synapse (interface): interface and parameter store between neurons
  • Axon (output cable): may be up to 1 m long and transports the activation signal to neurons at different locations

  7. Frank Rosenblatt’s Perceptron 7

  8. Multilayer Perceptron (Neural Net) 8

  9. Brief History of Perceptron [timeline figure]
  • 1958 Rosenblatt: invention
  • 1962 Novikoff: convergence proof
  • 1969* Minsky/Papert: book killed it (the perceptron declared dead)
  • 1997 Cortes/Vapnik: SVM (batch; +soft-margin, +kernels, +max margin)
  • 1999 Freund/Schapire: voted/averaged perceptron; revived it (handles the inseparable case)
  • 2002 Collins: structured perceptron
  • 2003 Crammer/Singer: MIRA (online, conservative updates)
  • 2005* McDonald/Crammer/Pereira: structured MIRA
  • 2006 Singer group: aggressive MIRA
  • 2007-2010 Singer group: Pegasos (minibatch, subgradient descent; online approximation of max margin)
  *mentioned in lectures but optional (the other papers are all covered in detail); figure annotations: AT&T Research; ex-AT&T researchers and students

  10. Part II • Linear Classifier and Geometry (testing time) • decision boundary and normal vector w • not separable through the origin: add bias b • geometric review of linear algebra • augmented space (no explicit bias; implicit as w_0 = b) [Diagram: test time: input x + model w → linear classifier → prediction σ(w · x); training time: input x + output y → perceptron learner → model w]

  11. Linear Classifier and Geometry
  • linear classifiers: perceptron, logistic regression, (linear) SVMs, etc.
  • output f(x) = σ(w · x); cos θ = (w · x) / (‖w‖ ‖x‖)
  • the weight vector w is a “prototype” of positive examples; it is also the normal vector of the decision boundary, the separating hyperplane w · x = 0
  • meaning of w · x: agreement with the positive direction (w · x > 0 positive, w · x < 0 negative)
  • test: input x, w; output 1 if w · x > 0 else -1
  • training: input (x, y) pairs; output w
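A minimal test-time sketch of this decision rule in Python (the function and example numbers are mine, not from the slides; labels are ±1 and there is no bias term):

```python
import numpy as np

def predict(w, x):
    """Linear classifier at test time: the sign of the dot product w · x."""
    return 1 if np.dot(w, x) > 0 else -1

# w is the normal vector of the decision boundary w · x = 0
w = np.array([2.0, -1.0])
print(predict(w, np.array([3.0, 1.0])))   # w · x =  5 > 0  -> +1
print(predict(w, np.array([1.0, 4.0])))   # w · x = -2 < 0  -> -1
```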

  12. What if not separable through the origin? Solution: add bias b
  • output f(x) = σ(w · x + b)
  • decision boundary w · x + b = 0 (w · x + b > 0 positive, w · x + b < 0 negative)
  • the boundary’s distance from the origin O is |b| / ‖w‖

  13. Geometric Review of Linear Algebra
  • a line in 2D, w_1 x_1 + w_2 x_2 + b = 0, generalizes to an (n-1)-dim hyperplane w · x + b = 0 in n-dim
  • required: algebraic and geometric meanings of the dot product
  • point-to-line distance from x* = (x*_1, x*_2): |w_1 x*_1 + w_2 x*_2 + b| / √(w_1² + w_2²)
  • point-to-hyperplane distance: |w · x + b| / ‖w‖
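A quick illustration of the point-to-hyperplane distance formula (a sketch with made-up numbers, not from the slides):

```python
import numpy as np

def point_to_hyperplane_distance(w, b, x):
    """Distance from point x to the hyperplane w · x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# 2D example: the line 3*x1 + 4*x2 - 5 = 0
w, b = np.array([3.0, 4.0]), -5.0
print(point_to_hyperplane_distance(w, b, np.array([0.0, 0.0])))  # |-5| / 5 = 1.0
```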

  14. Augmented Space: dimensionality + 1
  • explicit bias: f(x) = σ(w · x + b); the 1D data can’t be separated through the origin O
  • augmented space: prepend x_0 = 1 to the input and w_0 = b to the weights, so f(x) = σ((b; w) · (1; x))
  • the same data can be separated through the origin in 2D

  15. Augmented Space: dimensionality + 1
  • explicit bias: f(x) = σ(w · x + b); the 2D data can’t be separated through the origin
  • augmented space: with x_0 = 1 and w_0 = b, f(x) = σ((b; w) · (1; x))
  • the same data can be separated through the origin in 3D
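A small sketch of the augmentation trick (helper names are mine; assuming NumPy): prepending the constant feature 1 makes the bias just another weight, so the augmented classifier through the origin computes exactly w · x + b.

```python
import numpy as np

def augment(x):
    """Prepend the constant feature x0 = 1, so the bias b becomes weight w0."""
    return np.concatenate(([1.0], x))

x = np.array([2.0, 3.0])
w, b = np.array([0.5, -1.0]), 0.25
w_aug = np.concatenate(([b], w))                       # (b; w)
assert np.isclose(np.dot(w, x) + b,                    # explicit bias
                  np.dot(w_aug, augment(x)))           # augmented, no bias
```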

  16. Part III • The Perceptron Learning Algorithm (training time) • the version without bias (augmented space) • side note on mathematical notations • mini-demo [Diagram: test time: input x + model w → linear classifier → prediction σ(w · x); training time: input x + output y → perceptron learner → model w]

  17. Perceptron [figure: classifying email as ham vs. spam]

  18. The Perceptron Algorithm
  input: training data D; output: weights w
  initialize w ← 0
  while not converged
      for (x, y) ∈ D
          if y(w · x) ≤ 0
              w ← w + y x
  • the simplest machine learning algorithm
  • keep cycling through the training data
  • update w if there is a mistake on example (x, y)
  • until all examples are classified correctly
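A runnable version of this loop in Python (a minimal sketch, assuming NumPy arrays and labels y ∈ {+1, -1}; the names are mine, not from the course code):

```python
import numpy as np

def perceptron(data, max_epochs=100):
    """Train a bias-free perceptron on (x, y) pairs with y in {+1, -1}."""
    dim = len(data[0][0])
    w = np.zeros(dim)                     # initialize w <- 0
    for _ in range(max_epochs):           # "while not converged" (capped for safety)
        mistakes = 0
        for x, y in data:                 # keep cycling through the training data
            if y * np.dot(w, x) <= 0:     # mistake (or exactly on the boundary)
                w += y * x                # update: w <- w + y x
                mistakes += 1
        if mistakes == 0:                 # converged: all examples classified correctly
            return w
    return w

# toy usage with augmented inputs (x0 = 1 plays the role of the bias)
data = [(np.array([1.0, 2.0, 1.0]), +1), (np.array([1.0, -1.0, -2.0]), -1)]
print(perceptron(data))
```

Capping the number of epochs is a practical safeguard; the convergence theorem in Part IV guarantees termination only when the data are linearly separable.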

  19. Side Note on Mathematical Notations
  • I’ll try my best to be consistent in notations
  • e.g., bold-face for vectors, italic for scalars, etc.
  • avoid unnecessary superscripts and subscripts by using a “Pythonic” rather than a “C” notational style
  • most textbooks have consistent but bad notations
  bad notations (inconsistent, unnecessary i and b):
      initialize w = 0 and b = 0
      repeat
          if y_i [⟨w, x_i⟩ + b] ≤ 0 then
              w ← w + y_i x_i and b ← b + y_i
          end if
      until all classified correctly
  good notations (consistent, Pythonic style):
      initialize w ← 0
      while not converged
          for (x, y) ∈ D
              if y(w · x) ≤ 0
                  w ← w + y x

  20. Demo [figure: 2D example stepped through the algorithm loop from slide 18; first update w ← w + y x, bias = 0]

  21. Demo [figure: demo continued; next update]

  22. Demo [figure: demo continued]

  23. Demo [figure: demo continued]

  24. [image-only slide]

  25. Part IV • Linear Separation, Convergence Theorem and Proof • formal definition of linear separation • perceptron convergence theorem • geometric proof • what variables affect convergence bound? 25

  26. Linear Separation; Convergence Theorem
  • dataset D is said to be “linearly separable” if there exists some unit oracle vector u (‖u‖ = 1) which correctly classifies every example (x, y) with a margin of at least δ: y(u · x) ≥ δ for all (x, y) ∈ D
  • then the perceptron must converge to a linear separator after at most R²/δ² mistakes (updates), where R = max_{(x,y) ∈ D} ‖x‖
  • the convergence bound R²/δ² is dimensionality independent, dataset-size independent, and order independent (though order matters in the output), and scales with the ‘difficulty’ of the problem
  [figure: margin δ on both sides of the separator defined by the unit oracle u; R bounds ‖x‖]
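To make the quantities in the theorem concrete, here is a small sketch that computes R, the margin δ achieved by a given unit vector u, and the resulting mistake bound R²/δ² (the helper name, toy dataset, and u are made up for illustration, not from the slides):

```python
import numpy as np

def mistake_bound(data, u):
    """Return R, the margin delta of unit vector u on data, and the bound R^2/delta^2."""
    u = u / np.linalg.norm(u)                        # ensure ||u|| = 1
    R = max(np.linalg.norm(x) for x, _ in data)      # R = max ||x||
    delta = min(y * np.dot(u, x) for x, y in data)   # worst-case margin of u on D
    bound = (R / delta) ** 2 if delta > 0 else float("inf")
    return R, delta, bound

data = [(np.array([1.0, 2.0]), +1), (np.array([-2.0, -1.0]), -1)]
u = np.array([1.0, 1.0])
print(mistake_bound(data, u))   # bound is finite iff u separates D with positive margin
```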

  27. Geometric Proof, part 1
  • part 1: progress (alignment) on the oracle projection
  • assume w^(0) = 0, and let w^(i) be the weight vector before the i-th update (on example (x, y))
  • update: w^(i+1) = w^(i) + y x
  • so u · w^(i+1) = u · w^(i) + y(u · x) ≥ u · w^(i) + δ, since y(u · x) ≥ δ for all (x, y) ∈ D
  • by induction, u · w^(i+1) ≥ i δ: the projection on u increases (more agreement with the oracle direction)
  • therefore ‖w^(i+1)‖ = ‖u‖ ‖w^(i+1)‖ ≥ u · w^(i+1) ≥ i δ

  28. Geometric Proof, part 2
  • part 2: upper bound on the norm of the weight vector
  • update: w^(i+1) = w^(i) + y x
  • ‖w^(i+1)‖² = ‖w^(i) + y x‖² = ‖w^(i)‖² + ‖x‖² + 2 y (w^(i) · x) ≤ ‖w^(i)‖² + R², since the update was a mistake on x (y (w^(i) · x) ≤ 0, i.e. θ ≥ 90°, cos θ ≤ 0) and R = max_{(x,y) ∈ D} ‖x‖
  • by induction, ‖w^(i+1)‖² ≤ i R², so ‖w^(i+1)‖ ≤ √i R
  • combine with part 1: i δ ≤ ‖w^(i+1)‖ ≤ √i R, hence i ≤ R²/δ²
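The two parts chain together as follows (a compact LaTeX restatement of the argument on slides 27-28):

```latex
\begin{align*}
  i\,\delta \;\le\; \mathbf{u}\cdot\mathbf{w}^{(i+1)}
            \;\le\; \|\mathbf{u}\|\,\|\mathbf{w}^{(i+1)}\|
            \;=\;   \|\mathbf{w}^{(i+1)}\|
            \;\le\; \sqrt{i}\,R
  \quad\Longrightarrow\quad
  \sqrt{i} \;\le\; \frac{R}{\delta}
  \quad\Longrightarrow\quad
  i \;\le\; \frac{R^2}{\delta^2}.
\end{align*}
```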

  29. Convergence Bound R²/δ²
  • the bound is independent of: dimensionality, number of examples, order of examples, constant learning rate
  • and depends on: separation difficulty (margin δ: narrow margin is hard to separate, wide margin is easy) and feature scale (radius R)
  • the initial weight w^(0) changes how fast it converges, but not whether it converges

  30. Part V • Limitations of Linear Classifiers and Feature Maps • XOR: not linearly separable • perceptron cycling theorem • solving XOR: non-linear feature map • “preview demo”: SVM with non-linear kernel • redefining “linear” separation under feature map 30
