
SVMs, Duality and the Kernel Trick (cont.) – Machine Learning



  1. Two SVM tutorials are linked on the class website (please read both): a high-level presentation with applications (Hearst 1998) and a detailed tutorial (Burges 1998). SVMs, Duality and the Kernel Trick (cont.), Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University, March 1st, 2006.

  2. SVMs reminder

  3. Today’s lecture: learn one of the most interesting and exciting recent advancements in machine learning – the “kernel trick” – high-dimensional feature spaces at no extra cost! But first, a detour: constrained optimization!

  4. Dual SVM interpretation: w . x + b = 0

  5. Dual SVM formulation – the linearly separable case

  6. Reminder from last time: what if the data is not linearly separable? Use features of features of features of features… The feature space can get really large really quickly!

  7. Higher order polynomials. m – number of input features, d – degree of polynomial. [Plot: number of monomial terms vs. number of input dimensions, for d = 2, 3, 4 – it grows fast!] For d = 6 and m = 100 input dimensions: about 1.6 billion terms.
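
A quick sanity check of the 1.6-billion figure (a sketch, assuming the slide counts monomials of degree exactly d, of which there are C(m + d - 1, d)):

    from math import comb

    def num_monomials(m, d):
        # monomials of degree exactly d in m variables: C(m + d - 1, d);
        # counting all degrees up to d gives comb(m + d, d), the same order of magnitude
        return comb(m + d - 1, d)

    print(num_monomials(100, 6))  # 1609344100 -- about 1.6 billion terms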

  8. The dual formulation only depends on dot products, not on w!

  9. Finally: the “kernel trick”! Never represent features explicitly – compute dot products in closed form. Constant-time high-dimensional dot products for many classes of features. Very interesting theory – Reproducing Kernel Hilbert Spaces (not covered in detail in 10701/15781, more in 10702).
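
A minimal sketch of the trick for the degree-2 polynomial kernel K(x, z) = (x . z)^2 on 2-d inputs: its explicit feature map is phi(x) = (x1^2, x2^2, sqrt(2) x1 x2), yet the kernel never builds it. The example vectors are made up for illustration.

    import numpy as np

    def phi(x):
        # explicit degree-2 feature map (only needed here to verify the identity)
        return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

    def poly2_kernel(x, z):
        # closed-form dot product in the feature space: (x . z)^2
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    print(np.dot(phi(x), phi(z)), poly2_kernel(x, z))  # both print 1.0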

  10. Common kernels: polynomials of degree d, polynomials of degree up to d, Gaussian kernels, sigmoid.
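
The usual closed forms of these kernels, written as a sketch (the parameter names d, sigma, kappa, theta are illustrative, not taken from the slides):

    import numpy as np

    def poly_kernel(x, z, d):
        # polynomial of degree d
        return np.dot(x, z) ** d

    def poly_up_to_kernel(x, z, d):
        # polynomial of degree up to d
        return (1.0 + np.dot(x, z)) ** d

    def gaussian_kernel(x, z, sigma):
        # Gaussian (RBF) kernel
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

    def sigmoid_kernel(x, z, kappa, theta):
        # sigmoid kernel
        return np.tanh(kappa * np.dot(x, z) + theta)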

  11. Overfitting? Huge feature space with kernels – what about overfitting??? Maximizing the margin leads to a sparse set of support vectors. Some interesting theory says that SVMs search for simple hypotheses with a large margin. Often robust to overfitting.

  12. What about at classification time? For a new input x, if we need to represent Φ(x), we are in trouble! Recall the classifier: sign(w . Φ(x) + b). Using kernels we are cool!

  13. SVMs with kernels: choose a set of features and a kernel function; solve the dual problem to obtain the support vectors and their weights α_i; at classification time, compute the kernel expansion over the support vectors and classify by its sign (sketch below).
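
A minimal sketch of that classification step, assuming the weights alpha_i, the bias b, and the support vectors with their labels come from a dual solver (not shown). The standard decision rule is f(x) = sum_i alpha_i y_i K(x_i, x) + b, classified as sign(f(x)).

    import numpy as np

    def svm_predict(x, support_x, support_y, alphas, b, kernel):
        # kernel expansion over the support vectors, then take the sign
        f = sum(a * y * kernel(sx, x)
                for a, y, sx in zip(alphas, support_y, support_x)) + b
        return np.sign(f)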

  14. Remember kernel regression??? 1. w_i = exp(-D(x_i, query)^2 / K_w^2). 2. How to fit with the local points? Predict the weighted average of the outputs: predict = Σ_i w_i y_i / Σ_i w_i (sketch below).
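
A short sketch of that predictor, using the same Gaussian weights; X (one row per stored point), y, and the bandwidth Kw are assumed given.

    import numpy as np

    def kernel_regression_predict(query, X, y, Kw):
        d2 = np.sum((X - query) ** 2, axis=1)  # squared distances D(x_i, query)^2
        w = np.exp(-d2 / Kw ** 2)              # w_i = exp(-D^2 / Kw^2)
        return np.sum(w * y) / np.sum(w)       # weighted average of the outputs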

  15. SVMs v. Kernel Regression: the two prediction rules side by side.

  16. SVMs v. Kernel Regression – differences. SVMs: learn the weights α_i (and the bandwidth); often a sparse solution. KR: fixed “weights”, learn the bandwidth; the solution may not be sparse; much simpler to implement.

  17. What’s the difference between SVMs and Logistic Regression? Loss function: SVMs – hinge loss, logistic regression – log-loss. High-dimensional features with kernels: SVMs – yes!, logistic regression – no.

  18. Kernels in logistic regression: define the weights in terms of the support vectors, then derive a simple gradient descent rule on the α_i (sketch below).
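
A minimal sketch of that idea, assuming the weight vector is written as w = sum_j alpha_j Phi(x_j), so that w . Phi(x) = sum_j alpha_j K(x_j, x); the step size eta and the iteration count are illustrative choices, not from the slides.

    import numpy as np

    def kernel_logistic_fit(K, y, eta=0.01, n_steps=1000):
        # K: n x n kernel matrix with K[i, j] = K(x_i, x_j); labels y in {0, 1}
        alpha = np.zeros(K.shape[0])
        for _ in range(n_steps):
            p = 1.0 / (1.0 + np.exp(-K @ alpha))  # P(y_i = 1 | x_i)
            alpha += eta * K @ (y - p)            # gradient step on the log-likelihood (K is symmetric)
        return alpha

Prediction on a new x then uses sum_j alpha_j K(x_j, x) inside the sigmoid.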

  19. What’s the difference between SVMs and Logistic Regression? (Revisited) Loss function: SVMs – hinge loss, LR – log-loss. High-dimensional features with kernels: SVMs – yes!, LR – yes! Sparse solution: SVMs – often yes!, LR – almost always no! Semantics of output: SVMs – a “margin”, LR – real probabilities.

  20. What you need to know: the dual SVM formulation and how it’s derived; the kernel trick; how to derive the polynomial kernel; common kernels; kernelized logistic regression; differences between SVMs and logistic regression.

  21. Acknowledgment – SVM applet: http://www.site.uottawa.ca/~gcaron/applets.htm

  22. More details – General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based Bounds, Machine Learning – 10701/15781, Carlos Guestrin, Carnegie Mellon University, March 1st, 2005.

  23. What now… We have explored many ways of learning from data. But… how good is our classifier, really? How much data do I need to make it “good enough”?

  24. A simple setting… Classification with m data points and a finite number of possible hypotheses (e.g., decision trees of depth d). A learner finds a hypothesis h that is consistent with the training data – it gets zero error in training, error_train(h) = 0. What is the probability that h has more than ε true error, error_true(h) ≥ ε?

  25. How likely is a bad hypothesis to get m data points right? A hypothesis h that is consistent with the training data got m i.i.d. points right. Prob. that an h with error_true(h) ≥ ε gets one data point right; prob. that an h with error_true(h) ≥ ε gets m data points right (bounds below).
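
The standard bounds behind those two statements, as a sketch of the usual argument:

    % one i.i.d. point is classified correctly with probability at most 1 - epsilon
    \Pr\big[\text{$h$ with } \mathrm{error}_{true}(h) \ge \epsilon \text{ gets one point right}\big] \;\le\; 1 - \epsilon
    % the m points are i.i.d., so the probabilities multiply
    \Pr\big[\text{$h$ with } \mathrm{error}_{true}(h) \ge \epsilon \text{ gets all $m$ points right}\big] \;\le\; (1 - \epsilon)^m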

  26. But there are many possible hypotheses that are consistent with the training data.

  27. How likely is the learner to pick a bad hypothesis? Prob. that an h with error_true(h) ≥ ε gets m data points right; there are k hypotheses consistent with the data; how likely is the learner to pick a bad one?

  28. Union bound: P(A or B or C or D or …) ≤ P(A) + P(B) + P(C) + P(D) + …

  29. How likely is the learner to pick a bad hypothesis? Prob. that an h with error_true(h) ≥ ε gets m data points right; there are k hypotheses consistent with the data; how likely is the learner to pick a bad one?
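
Combining this with the union bound over the k consistent hypotheses (a sketch of the standard step, using 1 - ε ≤ e^{-ε}):

    \Pr\big[\text{learner picks an $h$ with } \mathrm{error}_{true}(h) \ge \epsilon\big] \;\le\; k\,(1 - \epsilon)^m \;\le\; k\, e^{-m\epsilon}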

  30. Review: generalization error in finite hypothesis spaces [Haussler ’88]. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h that is consistent with the training data, the bound below holds:
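
The usual way the Haussler ’88 bound is written (it follows from the previous step with k replaced by |H|):

    P\big(\exists\, h \in H \text{ consistent with } D \text{ and } \mathrm{error}_{true}(h) > \epsilon\big) \;\le\; |H|\, e^{-m\epsilon}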

  31. Using a PAC bound – typically two use cases: 1: pick ε and δ, and the bound gives you m; 2: pick m and δ, and the bound gives you ε (sketch below).
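
A minimal sketch of both use cases, obtained by setting |H| e^{-mε} ≤ δ in the bound above and solving; the numbers in the example call are illustrative.

    from math import log, ceil

    def sample_size(H_size, eps, delta):
        # use case 1: pick eps and delta, get the number of samples m
        return ceil((log(H_size) + log(1.0 / delta)) / eps)

    def error_bound(H_size, m, delta):
        # use case 2: pick m and delta, get eps
        return (log(H_size) + log(1.0 / delta)) / m

    print(sample_size(H_size=2**20, eps=0.1, delta=0.05))  # 169 samples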

  32. Review: generalization error in finite hypothesis spaces [Haussler ’88]. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, the bound above holds for any learned hypothesis h that is consistent with the training data. Even if h makes zero errors on the training data, it may make errors on the test set.

  33. Limitations of the Haussler ’88 bound: it requires a consistent classifier, and it depends on the size of the hypothesis space.

  34. What if our classifier does not have zero error on the training data? A learner with zero training error may make mistakes on the test set. What about a learner with error_train(h) on the training set?

  35. Simpler question: what’s the expected error of a hypothesis? The error of a hypothesis is like estimating the parameter of a coin! Chernoff bound: for m i.i.d. coin flips x_1, …, x_m, where x_i ∈ {0, 1}, and for 0 < ε < 1, the bound below holds:
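
The usual one-sided form of that bound, with true bias θ and empirical mean θ̂ = (1/m) Σ_i x_i (a sketch of the standard statement):

    P\big(\theta > \hat{\theta} + \epsilon\big) \;\le\; e^{-2 m \epsilon^2}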

  36. Using the Chernoff bound to estimate the error of a single hypothesis

  37. But we are comparing many hypotheses: the union bound. For each hypothesis h_i, the Chernoff bound above applies. What if I am comparing two hypotheses, h_1 and h_2?

  38. Generalization bound for |H| hypotheses. Theorem: for a finite hypothesis space H, a dataset D with m i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h, the bound below holds:
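
A standard way to write that bound, combining the Chernoff bound above with the union bound over the |H| hypotheses:

    P\big(\mathrm{error}_{true}(h) - \mathrm{error}_{train}(h) > \epsilon\big) \;\le\; |H|\, e^{-2 m \epsilon^2}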

  39. PAC bound and Bias-Variance tradeoff. Or, after moving some terms around, with probability at least 1 - δ, the rearranged bound below holds. Important: the PAC bound holds for all h, but it doesn’t guarantee that the algorithm finds the best h!!!
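
The rearranged form (a sketch: set |H| e^{-2mε²} = δ and solve for ε); with probability at least 1 - δ:

    \mathrm{error}_{true}(h) \;\le\; \mathrm{error}_{train}(h) + \sqrt{\frac{\ln|H| + \ln\frac{1}{\delta}}{2m}}

The first term behaves like bias (it shrinks as H grows richer) and the second like variance (it grows with ln|H| and shrinks with m), which is the tradeoff the slide title refers to.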

  40. What about the size of the hypothesis space? How large is the hypothesis space?

  41. Boolean formulas with n binary features

  42. Number of decision trees of depth k. Recursive solution: given n attributes, let H_k = number of decision trees of depth k. H_0 = 2. H_{k+1} = (# choices of root attribute) * (# possible left subtrees) * (# possible right subtrees) = n * H_k * H_k. Write L_k = log_2 H_k: L_0 = 1, L_{k+1} = log_2 n + 2 L_k, so L_k = (2^k - 1)(1 + log_2 n) + 1.
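
A quick check of the recursion against the closed form for L_k (the values of n and k below are illustrative):

    from math import log2

    def count_depth_k_trees(n, k):
        H = 2                 # H_0 = 2
        for _ in range(k):
            H = n * H * H     # H_{k+1} = n * H_k * H_k
        return H

    n, k = 10, 4
    H_k = count_depth_k_trees(n, k)
    print(log2(H_k), (2**k - 1) * (1 + log2(n)) + 1)  # both about 65.83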

  43. PAC bound for decision trees of depth k: bad!!! The number of points needed is exponential in the depth! But, for m data points, the decision tree can’t get too big… the number of leaves is never more than the number of data points.

  44. Number of decision trees with k leaves. H_k = number of decision trees with k leaves. H_0 = 2. Loose bound: … Reminder: …

  45. PAC bound for decision trees with k leaves – Bias-Variance revisited
