Lecture 26: Support Vector Classification, Unsupervised Learning

Lecture 26: Support Vector Classification, Unsupervised Learning - PowerPoint PPT Presentation



1. Lecture 26: Support Vector Classification, Unsupervised Learning. Instructor: Prof. Ganesh Ramakrishnan. October 27, 2016.

2. Support Vector Classification

3. The perceptron does not find the best separating hyperplane; it finds any separating hyperplane. In case the initial w does not classify all the examples, the separating hyperplane corresponding to the final w∗ will often pass through an example. The separating hyperplane does not provide enough breathing space; this is what SVMs address, and we already saw that for regression! We now quickly do the same for classification.

4. Support Vector Classification: Separable Case. With w, φ ∈ ℝ^m:
   w^⊤ φ(x) + b ≥ +1 for y = +1
   w^⊤ φ(x) + b ≤ −1 for y = −1
   There is a large margin separating the +ve and −ve examples.
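As a quick illustration, here is a minimal numpy sketch that checks these separable-case constraints for a candidate (w, b). The toy data and the choice of the identity feature map φ(x) = x are assumptions made only for this example, not part of the lecture.

```python
# Check the separable-case constraints w^T phi(x) + b >= +1 (y = +1) and
# w^T phi(x) + b <= -1 (y = -1) for a candidate hyperplane (w, b).
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.5], [-2.0, -1.0], [-1.5, -2.5]])  # phi(x) = x
y = np.array([+1, +1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0   # a candidate separating hyperplane

scores = X @ w + b
satisfied = y * scores >= 1        # combined form: y (w^T x + b) >= 1
print(satisfied)                   # True for every example => separated with margin
```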

5. Support Vector Classification: Non-separable Case. When the examples are not linearly separable, we need to consider the slackness ξ_i (always ≥ 0) of each example x^(i) (how far a misclassified point is from the separating hyperplane):
   w^⊤ φ(x^(i)) + b ≥ +1 − ξ_i (for y^(i) = +1)
   w^⊤ φ(x^(i)) + b ≤ −1 + ξ_i (for y^(i) = −1)
   Multiplying both sides by y^(i), we get: y^(i) (w^⊤ φ(x^(i)) + b) ≥ 1 − ξ_i, ∀ i = 1, …, n
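A small numeric sketch of this idea: for a fixed candidate (w, b), the smallest slack that satisfies the combined constraint is ξ_i = max(0, 1 − y^(i)(w^⊤ φ(x^(i)) + b)). The data points below are assumptions for illustration, again with φ taken to be the identity map.

```python
# Compute the minimal slacks xi_i = max(0, 1 - y^(i) (w^T x^(i) + b)) for a given (w, b).
import numpy as np

X = np.array([[2.0, 2.0], [0.2, -0.1], [-2.0, -1.0], [1.0, 0.5]])
y = np.array([+1, +1, -1, -1])     # the last point lies on the wrong side of the hyperplane
w, b = np.array([1.0, 1.0]), 0.0

margins = y * (X @ w + b)          # y^(i) (w^T x^(i) + b)
xi = np.maximum(0.0, 1.0 - margins)
print(xi)                          # zero slack where the margin constraint already holds
```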

6. Maximize the margin: we maximize (φ(x+) − φ(x−))^⊤ [w / ∥w∥]. Here, x+ and x− lie on the boundaries of the margin. Recall that w is perpendicular to the separating surface. We project the vectors φ(x+) and φ(x−) onto w and normalize by ∥w∥, since we are only concerned with the direction of w and not its magnitude.

7. Simplifying the margin expression. Maximize the margin (φ(x+) − φ(x−))^⊤ [w / ∥w∥].
   At x+: y+ = +1, ξ+ = 0; hence (w^⊤ φ(x+) + b) = 1 … (1)
   At x−: y− = −1, ξ− = 0; hence −(w^⊤ φ(x−) + b) = 1 … (2)
   Adding (2) to (1): w^⊤ (φ(x+) − φ(x−)) = 2.
   Thus, the margin expression to maximize is 2 / ∥w∥.
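A rough numerical check of this identity (my own construction, not from the slides): fit a linear SVM on separable toy data with a large C so the slacks stay near zero, then compare the projection (φ(x+) − φ(x−))^⊤ [w / ∥w∥] with 2 / ∥w∥. scikit-learn is assumed to be available, and φ is the identity map.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

sv = clf.support_vectors_
sv_labels = y[clf.support_]
x_plus = sv[sv_labels == +1][0]               # a support vector on the +1 boundary
x_minus = sv[sv_labels == -1][0]              # a support vector on the -1 boundary

proj = (x_plus - x_minus) @ (w / np.linalg.norm(w))
print(proj, 2.0 / np.linalg.norm(w))          # the two numbers should (roughly) agree
```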

8. Formulating the objective. Problem at hand: find w∗, b∗ that maximize the margin:
   (w∗, b∗) = argmax_{w,b} 2 / ∥w∥ s.t. y^(i) (w^⊤ φ(x^(i)) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀ i = 1, …, n
   However, as ξ_i → ∞, 1 − ξ_i → −∞. Thus, with arbitrarily large values of ξ_i, the constraints become easily satisfiable for any w, which defeats the purpose. Hence, we also want to minimize the ξ_i's, e.g., minimize ∑ ξ_i.

9. Objective:
   (w∗, b∗, ξ∗_i) = argmin_{w,b,ξ_i} (1/2) ∥w∥² + C ∑_{i=1}^{n} ξ_i
   s.t. y^(i) (w^⊤ φ(x^(i)) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, ∀ i = 1, …, n
   Instead of maximizing 2 / ∥w∥, we minimize (1/2) ∥w∥², since 2 / ∥w∥ is monotonically decreasing with respect to (1/2) ∥w∥².
   C determines the trade-off between the error ∑ ξ_i and the margin 2 / ∥w∥.
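This primal can be written down almost verbatim as a convex program. Below is a minimal sketch using cvxpy (the library, the toy data, and the identity feature map are assumptions for illustration only): minimize (1/2)∥w∥² + C ∑ ξ_i subject to y^(i)(w^⊤ x^(i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0.

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 1.0], [0.3, 0.2], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([+1, +1, +1, -1, -1])
n, m = X.shape
C = 1.0

w = cp.Variable(m)
b = cp.Variable()
xi = cp.Variable(n)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)   # learned hyperplane
print(xi.value)           # per-example slacks; a larger C pushes these toward 0
```

Varying C illustrates the trade-off stated on the slide: large C penalizes slack heavily (narrower margin, fewer violations), small C tolerates slack (wider margin).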

10. Support Vector Machines: Dual Objective

11. 2 Approaches to Showing Kernelized Form for Dual (generalized from the derivation of Kernel Logistic Regression, Tutorial 7, Problem 3). See http://qwone.com/~jason/writing/kernel.pdf for a list of kernelized objectives.
   Approach 1: The Reproducing Kernel Hilbert Space and the Representer Theorem.
   Approach 2: Derive using first principles (provided for completeness in Tutorial 9).

12. Approach 1: Special case of the Representer Theorem & Reproducing Kernel Hilbert Space (RKHS).
   Let X be the space of examples such that D = {x^(1), x^(2), …, x^(m)} ⊆ X and, for any x ∈ X, K(·, x) : X → ℜ.
   (Optional) The solution f∗ ∈ H (Hilbert space) to the following problem
     f∗ = argmin_{f ∈ H} ∑_{i=1}^{m} E(f(x^(i)), y^(i)) + Ω(∥f∥_K)
   can always be written as f∗(x) = ∑_{i=1}^{m} α_i K(x, x^(i)), provided Ω(∥f∥_K) is a monotonically increasing function of ∥f∥_K. H is the Hilbert space and K(·, x) : X → ℜ is called the Reproducing (RKHS) Kernel. (Generalized from the derivation of Kernel Logistic Regression, Tutorial 7, Problem 3; see http://qwone.com/~jason/writing/kernel.pdf for a list of kernelized objectives. Proof provided in the optional slide deck at the end.)
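One concrete instance of this form (my own illustrative sketch, not from the lecture) is kernel ridge regression: squared loss E plus a monotone penalty on ∥f∥_K. scikit-learn exposes the expansion coefficients α_i as dual_coef_, so we can verify numerically that the learned function really is f∗(x) = ∑_i α_i K(x, x^(i)); the toy data and kernel parameters below are assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=20)

model = KernelRidge(kernel="rbf", gamma=0.5, alpha=1.0).fit(X, y)

x_new = rng.normal(size=(3, 2))
pred_library = model.predict(x_new)
# rebuild the prediction by hand from the kernel expansion over the training points
pred_manual = rbf_kernel(x_new, X, gamma=0.5) @ model.dual_coef_
print(np.allclose(pred_library, pred_manual))   # True: f*(x) = sum_i alpha_i K(x, x^(i))
```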

13. Approach 1: Special case of the Representer Theorem & Reproducing Kernel Hilbert Space (RKHS), contd.
   (Optional) The solution f∗ ∈ H (Hilbert space) to the following problem
     f∗ = argmin_{f ∈ H} ∑_{i=1}^{m} E(f(x^(i)), y^(i)) + Ω(∥f∥_K)
   can always be written as f∗(x) = ∑_{i=1}^{m} α_i K(x, x^(i)), provided Ω(∥f∥_K) is a ….
   More specifically, if f(x) = w^T φ(x) + b and K(x′, x) = φ^T(x) φ(x′), then the solution w∗ ∈ ℜ^n to the following problem
     (w∗, b∗) = argmin_{w,b} ∑_{i=1}^{m} E(f(x^(i)), y^(i)) + Ω(∥w∥²)
   can always be written as φ^T(x) w∗ + b = ∑_{i=1}^{m} α_i K(x, x^(i)), provided Ω(∥w∥²) is a monotonically increasing function of ∥w∥². ℜ^(n+1) is the Hilbert space and K(·, x) : X → ℜ is the Reproducing (RKHS) Kernel.
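For the SVM specifically, this is exactly the form a fitted kernel classifier exposes: the decision function is a kernel expansion over the support vectors. The sketch below (toy data and kernel settings are assumptions; scikit-learn is assumed available) rebuilds the decision function from dual_coef_, which stores the signed α_i, and checks it against the library's output.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

x_new = rng.normal(size=(5, 2))
f_library = clf.decision_function(x_new)
# rebuild the decision function from the kernel expansion over support vectors
K = rbf_kernel(x_new, clf.support_vectors_, gamma=0.5)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]
print(np.allclose(f_library, f_manual))   # True: the dual solution is a kernel expansion
```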
