CS6220: DATA MINING TECHNIQUES Matrix Data: Classification: Part 3 - PowerPoint PPT Presentation

CS6220: DATA MINING TECHNIQUES
Matrix Data: Classification: Part 3
Instructor: Yizhou Sun
yzsun@ccs.neu.edu
September 28, 2014

Methods to Learn (by task and data type)
• Classification
  • Matrix Data: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN
  • Sequence Data: HMM
  • Graph & Network: Label Propagation
• Clustering
  • Matrix Data: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means
  • Graph & Network: SCAN; Spectral Clustering
• Frequent Pattern Mining
  • Set Data: Apriori; FP-growth
  • Sequence Data: GSP; PrefixSpan
• Prediction
  • Matrix Data: Linear Regression
  • Time Series: Autoregression
• Similarity Search
  • Time Series: DTW
  • Graph & Network: P-PageRank
• Ranking
  • Graph & Network: PageRank

Matrix Data: Classification: Part 3
• SVM (Support Vector Machine)
• kNN (k Nearest Neighbor)
• Other Issues
• Summary

Classification: A Mathematical Mapping
• Classification: predicts categorical class labels
• E.g., personal homepage classification
  • $\mathbf{x}_i = (x_1, x_2, x_3, \ldots)$, $y_i = +1$ or $-1$
  • $x_1$: # of word “homepage”
  • $x_2$: # of word “welcome”
• Mathematically, $\mathbf{x} \in X = \mathbb{R}^n$, $y \in Y = \{+1, -1\}$
• We want to derive a function $f: X \to Y$
[Figure: scatter plot of two classes, marked x and o]

SVM: Support Vector Machines
• A relatively new classification method for both linear and nonlinear data
• It uses a nonlinear mapping to transform the original training data into a higher dimension
• In the new dimension, it searches for the optimal linear separating hyperplane (i.e., the “decision boundary”)
• With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors (“essential” training tuples) and margins (defined by the support vectors)

SVM: History and Applications
• Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis’ statistical learning theory in the 1960s
• Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
• Used for: classification and numeric prediction
• Applications:
  • handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

SVM: Margins and Support Vectors
[Figure: a separating hyperplane with a small margin versus one with a large margin; support vectors marked]

SVM: When Data Is Linearly Separable
• Let data D be $(X_1, y_1), \ldots, (X_{|D|}, y_{|D|})$, where each $X_i$ is a training tuple with associated class label $y_i$
• There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
• SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)

SVM: Linearly Separable
• A separating hyperplane can be written as $W \cdot X + b = 0$, where $W = \{w_1, w_2, \ldots, w_n\}$ is a weight vector and $b$ a scalar (bias)
• For 2-D, writing the bias $b$ as $w_0$, it can be written as $w_0 + w_1 x_1 + w_2 x_2 = 0$
• The hyperplanes defining the sides of the margin:
  • $H_1$: $w_0 + w_1 x_1 + w_2 x_2 \geq 1$ for $y_i = +1$, and
  • $H_2$: $w_0 + w_1 x_1 + w_2 x_2 \leq -1$ for $y_i = -1$
• Any training tuples that fall on hyperplanes $H_1$ or $H_2$ (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem:
  • Quadratic objective function and linear constraints
  • Quadratic Programming (QP)
  • Lagrangian multipliers
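
The margin conditions above can be checked numerically. The following is a minimal sketch, assuming a toy 2-D data set and a hand-picked hyperplane; the data points, the weights (0.5, 0.5), the bias 0, and the helper names are illustrative assumptions, not values from the lecture.

```python
# Toy check of the margin constraints y_i * (w . x_i + b) >= 1.
# All numbers and names below are illustrative assumptions.

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def functional_margin(w, b, x, y):
    """Return y * (w . x + b); it equals 1 exactly on H1 or H2."""
    return y * (dot(w, x) + b)

w, b = (0.5, 0.5), 0.0
data = [
    ((1.0, 1.0), +1),
    ((3.0, 2.0), +1),
    ((-1.0, -1.0), -1),
    ((-2.0, -3.0), -1),
]

margins = [functional_margin(w, b, x, y) for x, y in data]

# Every training tuple must lie on or outside its side of the margin.
all_satisfied = all(m >= 1.0 - 1e-12 for m in margins)

# Tuples that lie exactly on H1 or H2 are the support vectors.
support_vectors = [x for (x, _), m in zip(data, margins) if abs(m - 1.0) < 1e-12]

print(all_satisfied, support_vectors)
```

In this sketch only the two points closest to the boundary satisfy the constraint with equality, so only they would count as support vectors.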

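Maximum Margin Calculation
• $\mathbf{w}$: decision hyperplane normal vector
• $\mathbf{x}_i$: data point $i$
• $y_i$: class of data point $i$ (+1 or -1)
• Decision boundary: $\mathbf{w}^T \mathbf{x} + b = 0$
• Margin boundaries: $\mathbf{w}^T \mathbf{x}_a + b = 1$ and $\mathbf{w}^T \mathbf{x}_b + b = -1$
• Margin width: $\rho = \frac{2}{\|\mathbf{w}\|}$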
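
As a small numeric illustration of the margin width, the sketch below computes 2 / ||w|| and the distance of a point to the decision boundary. The weight vector, bias, and test point are made-up values, and the function names are assumptions chosen for this example.

```python
import math

def norm(w):
    """Euclidean norm ||w||."""
    return math.sqrt(sum(wi * wi for wi in w))

def margin_width(w):
    """Distance between the two margin boundaries: rho = 2 / ||w||."""
    return 2.0 / norm(w)

def distance_to_boundary(w, b, x):
    """Unsigned distance from x to the hyperplane w . x + b = 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / norm(w)

# Illustrative values (assumptions): the point (1, 1) satisfies w . x + b = 1.
w, b = (0.5, 0.5), 0.0
rho = margin_width(w)
d = distance_to_boundary(w, b, (1.0, 1.0))

print(rho, d)
```

A point on a margin boundary sits at distance 1 / ||w|| from the decision boundary, which is half of the full margin width.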

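SVM as a Quadratic Programming Problem
• QP
  • Objective: Find $\mathbf{w}$ and $b$ such that $\rho = \frac{2}{\|\mathbf{w}\|}$ is maximized;
  • Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $\mathbf{w}^T \mathbf{x}_i + b \geq 1$ if $y_i = 1$; $\mathbf{w}^T \mathbf{x}_i + b \leq -1$ if $y_i = -1$
• A better form
  • Objective: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized;
  • Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1$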
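
The step from the first form to the second can be spelled out as follows. This is a sketch of the standard argument, not a derivation taken from the slides: maximizing the margin is the same as minimizing the norm of the weight vector, and the two class-specific constraints merge once each is multiplied by its label.

```latex
% Objective: the three problems below share the same optimal (w, b).
\[
\max_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|}
\quad\Longleftrightarrow\quad
\min_{\mathbf{w}, b} \|\mathbf{w}\|
\quad\Longleftrightarrow\quad
\min_{\mathbf{w}, b} \tfrac{1}{2}\|\mathbf{w}\|^{2}
= \min_{\mathbf{w}, b} \tfrac{1}{2}\mathbf{w}^{T}\mathbf{w}
\]

% Constraints: multiply each inequality by its label y_i.
\[
y_i = +1:\quad \mathbf{w}^{T}\mathbf{x}_i + b \geq 1
\;\Longleftrightarrow\; y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \geq 1
\]
\[
y_i = -1:\quad \mathbf{w}^{T}\mathbf{x}_i + b \leq -1
\;\Longleftrightarrow\; -(\mathbf{w}^{T}\mathbf{x}_i + b) \geq 1
\;\Longleftrightarrow\; y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \geq 1
\]
```

Squaring the norm keeps the minimizer unchanged while making the objective a smooth quadratic, which is what makes standard QP solvers applicable.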

Solve QP
• This is now optimizing a quadratic function subject to linear constraints
• Quadratic optimization problems are a well-known class of mathematical programming problems, and many (intricate) algorithms exist for solving them (with many special ones built for SVMs)
• The solution involves constructing a dual problem where a Lagrange multiplier $\alpha_i$ is associated with every constraint in the primal problem:

Primal Form and Dual Form
• Primal
  • Objective: Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized;
  • Constraints: for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1$
• Equivalent under some conditions: KKT conditions
• Dual
  • Objective: Find $\alpha_1 \ldots \alpha_n$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and
  • Constraints: (1) $\sum_i \alpha_i y_i = 0$; (2) $\alpha_i \geq 0$ for all $\alpha_i$
• More derivations: http://cs229.stanford.edu/notes/cs229-notes3.pdf
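
For readers who want the intermediate steps, the dual can be obtained from the primal through the Lagrangian. The following is a sketch of the standard derivation; the symbol $L$ for the Lagrangian is an assumption of this example and does not appear on the slides.

```latex
% Lagrangian of the primal problem, with one multiplier alpha_i >= 0 per constraint.
\[
L(\mathbf{w}, b, \alpha)
= \tfrac{1}{2}\mathbf{w}^{T}\mathbf{w}
- \sum_{i} \alpha_i \left[ y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \right]
\]

% Stationarity with respect to w and b.
\[
\frac{\partial L}{\partial \mathbf{w}} = 0
\;\Rightarrow\; \mathbf{w} = \sum_{i} \alpha_i y_i \mathbf{x}_i,
\qquad
\frac{\partial L}{\partial b} = 0
\;\Rightarrow\; \sum_{i} \alpha_i y_i = 0
\]

% Substituting both conditions back into L eliminates w and b.
\[
Q(\alpha) = \sum_{i} \alpha_i
- \tfrac{1}{2} \sum_{i}\sum_{j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^{T}\mathbf{x}_j
\]
```

The second stationarity condition reappears as constraint (1) of the dual, and the sign requirement on the multipliers gives constraint (2).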

The Optimization Problem Solution
• The solution has the form:
  • $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
  • $b = y_k - \mathbf{w}^T \mathbf{x}_k$ for any $\mathbf{x}_k$ such that $\alpha_k \neq 0$
• Each non-zero $\alpha_i$ indicates that the corresponding $\mathbf{x}_i$ is a support vector.
• Then the classifying function will have the form: $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
• Notice that it relies on an inner product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$
  • We will return to this later.
• Also keep in mind that solving the optimization problem involved computing the inner products $\mathbf{x}_i^T \mathbf{x}_j$ between all pairs of training points.
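
To make the solution form concrete, here is a minimal sketch that recovers w and b from given multipliers on a two-point toy problem. The data and the multiplier values are assumptions chosen for illustration: with one point per class at (1, 1) and (-1, -1), the dual reduces to maximizing 2a - 4a^2 with both multipliers equal to a, which gives a = 0.25.

```python
# Recover w, b and the classifying function from dual multipliers.
# Toy data and alpha values are illustrative assumptions (see lead-in).

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

X = [(1.0, 1.0), (-1.0, -1.0)]
y = [+1, -1]
alpha = [0.25, 0.25]

# w = sum_i alpha_i * y_i * x_i, computed one coordinate at a time.
w = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]

# b = y_k - w . x_k for any k with a non-zero multiplier.
k = next(i for i, a in enumerate(alpha) if a > 0)
b = y[k] - dot(w, X[k])

def f(x):
    """Classifying function that uses only inner products with training points."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + b

print(w, b, f((2.0, 0.0)), f((-3.0, 1.0)))
```

Note that `f` never uses `w` directly: it evaluates the same score through inner products with the training points, which is the property the kernel trick later exploits.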

Sec. 15.2.1 Soft Margin Classification
• If the training data is not linearly separable, slack variables $\xi_i$ can be added to allow misclassification of difficult or noisy examples.
• Allow some errors
  • Let some points be moved to where they belong, at a cost
• Still, try to minimize training set errors, and to place the hyperplane “far” from each class (large margin)
[Figure: two points with slack $\xi_i$ and $\xi_j$ relative to the margin]

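Sec. 15.2.1 Soft Margin Classification Mathematically
• The old formulation:
  • Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ is minimized and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1$
• The new formulation incorporating slack variables:
  • Find $\mathbf{w}$ and $b$ such that $\Phi(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_i \xi_i$ is minimized and for all $\{(\mathbf{x}_i, y_i)\}$: $y_i (\mathbf{w}^T \mathbf{x}_i + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all $i$
• Parameter $C$ can be viewed as a way to control overfitting
  • It acts as a regularization parameter: the penalty term $C \sum_i \xi_i$ is an L1 penalty on the slack variables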
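
The role of the slack variables and of C can be illustrated numerically. The sketch below evaluates the soft-margin objective for a fixed hyperplane, using the standard fact that for given w and b the smallest feasible slack is max(0, 1 - y_i (w . x_i + b)). The data, the hyperplane, and the function names are illustrative assumptions.

```python
# Evaluate the soft-margin objective 1/2 w.w + C * sum(xi) for a fixed (w, b).
# Data, hyperplane, and names are illustrative assumptions.

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def slack(w, b, x, y):
    """Smallest xi >= 0 satisfying y * (w . x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def soft_margin_objective(w, b, data, C):
    """Return the objective value and the per-point slacks."""
    slacks = [slack(w, b, x, y) for x, y in data]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks

w, b = (0.5, 0.5), 0.0
data = [
    ((1.0, 1.0), +1),    # on the margin: slack 0
    ((-1.0, -1.0), -1),  # on the margin: slack 0
    ((1.0, 0.0), +1),    # inside the margin but correctly classified
    ((0.0, 0.5), -1),    # misclassified: slack greater than 1
]

obj_small_C, slacks = soft_margin_objective(w, b, data, C=1.0)
obj_large_C, _ = soft_margin_objective(w, b, data, C=10.0)
print(slacks, obj_small_C, obj_large_C)
```

A larger C makes the same margin violations more expensive, so the optimizer is pushed toward fitting the training data more tightly; a smaller C tolerates violations in exchange for a wider margin.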

Sec. 15.2.1 Soft Margin Classification: Solution
• The dual problem for soft margin classification:
  • Find $\alpha_1 \ldots \alpha_N$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and
  • (1) $\sum_i \alpha_i y_i = 0$
  • (2) $0 \leq \alpha_i \leq C$ for all $\alpha_i$
• Neither slack variables $\xi_i$ nor their Lagrange multipliers appear in the dual problem!
• Again, $\mathbf{x}_i$ with non-zero $\alpha_i$ will be support vectors.
• Solution to the dual problem is:
  • $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$
  • $b = y_k (1 - \xi_k) - \mathbf{w}^T \mathbf{x}_k$ where $k = \arg\max_{k'} \alpha_{k'}$
• $\mathbf{w}$ is not needed explicitly for classification: $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$

Sec. 15.1 Classification with SVMs
• Given a new point $\mathbf{x}$, we can score its projection onto the hyperplane normal:
  • I.e., compute score: $\mathbf{w}^T \mathbf{x} + b = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$
  • Decide class based on whether the score is < or > 0
• Can set confidence threshold $t$:
  • Score > $t$: yes
  • Score < $-t$: no
  • Else: don’t know
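
The thresholded decision rule can be sketched in a few lines. The weight vector, bias, threshold, and test points below are made-up values, and the function names are assumptions for this example.

```python
# Score a point against a linear SVM and apply a confidence threshold t.
# Weights, bias, threshold, and points are illustrative assumptions.

def score(w, b, x):
    """Linear SVM score w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def decide(s, t=0.0):
    """Return 'yes', 'no', or "don't know" for score s and threshold t >= 0."""
    if s > t:
        return "yes"
    if s < -t:
        return "no"
    return "don't know"

w, b, t = (0.5, 0.5), 0.0, 1.0
points = [(3.0, 2.0), (-2.0, -3.0), (0.5, 0.0)]
decisions = [decide(score(w, b, x), t) for x in points]
print(decisions)
```

With t = 0 the rule reduces to the plain sign test, except that a score of exactly 0 is still reported as undecided.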

Sec. 15.2.1 Linear SVMs: Summary
• The classifier is a separating hyperplane.
• The most “important” training points are the support vectors; they define the hyperplane.
• Quadratic optimization algorithms can identify which training points $\mathbf{x}_i$ are support vectors with non-zero Lagrangian multipliers $\alpha_i$.
• Both in the dual formulation of the problem and in the solution, training points appear only inside inner products:
  • Find $\alpha_1 \ldots \alpha_N$ such that $Q(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$ is maximized and (1) $\sum_i \alpha_i y_i = 0$; (2) $0 \leq \alpha_i \leq C$ for all $\alpha_i$
  • $f(\mathbf{x}) = \sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$

Sec. 15.2.3 Non-linear SVMs
• Datasets that are linearly separable (with some noise) work out great:
  [Figure: 1-D data along the x axis]
• But what are we going to do if the dataset is just too hard?
  [Figure: 1-D data along the x axis]
• How about … mapping data to a higher-dimensional space:
  [Figure: the same data plotted with axes $x$ and $x^2$]
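
The idea of mapping to a higher-dimensional space can be tried on a tiny example. In this sketch, 1-D points whose positive class lies on both sides of the negative class cannot be split by a single threshold, but they become linearly separable after the mapping x -> (x, x^2). The data, the separating hyperplane in the mapped space, and the helper names are illustrative assumptions.

```python
# 1-D data that no single threshold separates, made separable by x -> (x, x^2).
# Data, hyperplane, and names are illustrative assumptions.

def phi(x):
    """Map a 1-D point to the 2-D feature space (x, x^2)."""
    return (x, x * x)

def separable_by_threshold(points):
    """A 1-D linear classifier is a threshold, so along the sorted axis
    the label may change at most once."""
    labels = [label for _, label in sorted(points)]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return changes <= 1

points = [(-3.0, +1), (-2.5, +1), (-1.0, -1), (0.0, -1), (1.5, -1), (3.0, +1)]

# In the mapped space the line x^2 = 4, i.e. w = (0, 1) and b = -4, separates the classes.
w, b = (0.0, 1.0), -4.0
predictions = [
    1 if sum(wi * zi for wi, zi in zip(w, phi(x))) + b > 0 else -1
    for x, _ in points
]

print(separable_by_threshold(points), predictions)
```

The hyperplane in the mapped space corresponds to the nonlinear rule |x| > 2 in the original space.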

Sec. 15.2.3 Non-linear SVMs: Feature spaces
• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$

Sec. 15.2.3 The “Kernel Trick”
• The linear classifier relies on an inner product between vectors: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
• If every data point is mapped into a high-dimensional space via some transformation $\Phi: \mathbf{x} \to \varphi(\mathbf{x})$, the inner product becomes: $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$
• A kernel function is some function that corresponds to an inner product in some expanded feature space.
• Example: 2-dimensional vectors $\mathbf{x} = [x_1 \; x_2]$; let $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$. Need to show that $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$:
  $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^2$
  $= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2}$
  $= [1 \;\; x_{i1}^2 \;\; \sqrt{2} x_{i1} x_{i2} \;\; x_{i2}^2 \;\; \sqrt{2} x_{i1} \;\; \sqrt{2} x_{i2}]^T \, [1 \;\; x_{j1}^2 \;\; \sqrt{2} x_{j1} x_{j2} \;\; x_{j2}^2 \;\; \sqrt{2} x_{j1} \;\; \sqrt{2} x_{j2}]$
  $= \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$
  where $\varphi(\mathbf{x}) = [1 \;\; x_1^2 \;\; \sqrt{2} x_1 x_2 \;\; x_2^2 \;\; \sqrt{2} x_1 \;\; \sqrt{2} x_2]$
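
The identity in the example can also be checked numerically. The sketch below compares the kernel value computed in the original 2-D space with the inner product of the explicitly mapped 6-D vectors; the two test vectors and the function names are arbitrary assumptions.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def kernel(xi, xj):
    """Polynomial kernel K(xi, xj) = (1 + xi . xj)^2, computed in 2-D."""
    return (1.0 + dot(xi, xj)) ** 2

def phi(x):
    """Explicit 6-D feature map whose inner product equals the kernel."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, x1 * x1, r2 * x1 * x2, x2 * x2, r2 * x1, r2 * x2)

# Arbitrary test vectors (assumptions for illustration).
xi, xj = (1.0, 2.0), (3.0, -1.0)
lhs = kernel(xi, xj)          # one 2-D inner product, then a square
rhs = dot(phi(xi), phi(xj))   # explicit mapping, then a 6-D inner product
print(lhs, rhs)
```

The kernel side needs only a 2-D inner product and one squaring, while the explicit side first builds two 6-D vectors; this gap grows quickly with the input dimension and the polynomial degree, which is why the kernel form is preferred.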
