  1. MIRA, SVM, k-NN Lirong Xia

  2. Linear Classifiers (perceptrons) • Inputs are feature values • Each feature has a weight • Sum is the activation: activation_w(x) = w · f(x) = ∑_i w_i f_i(x) • If the activation is: • Positive: output +1 • Negative: output -1

  3. Classification: Weights • Binary case: compare features to a weight vector • Learning: figure out the weight vector from examples

  4. Binary Decision Rule • In the space of feature vectors • Examples are points • Any weight vector is a hyperplane • One side corresponds to Y = +1 • The other side corresponds to Y = -1

  5. Learning: Binary Perceptron • Start with weights = 0 • For each training instance: • Classify with current weights: y = +1 if w · f(x) ≥ 0, y = -1 if w · f(x) < 0 • If correct (i.e. y = y*), no change! • If wrong: adjust the weight vector by adding or subtracting the feature vector. Subtract if y* is -1: w = w + y* · f(x)
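A minimal sketch of this update rule in Python (the NumPy representation and training loop are illustrative assumptions, not part of the slides):

```python
import numpy as np

def train_binary_perceptron(features, labels, passes=10):
    """features: (n, d) array of feature vectors f(x); labels: array of +1/-1."""
    w = np.zeros(features.shape[1])         # start with weights = 0
    for _ in range(passes):
        for f, y_star in zip(features, labels):
            y = 1 if w.dot(f) >= 0 else -1  # classify with current weights
            if y != y_star:                 # if wrong, add/subtract the feature vector
                w = w + y_star * f
    return w
```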

  6. Multiclass Decision Rule • If we have multiple classes: • A weight vector for each class: w_y • Score (activation) of a class y: w_y · f(x) • Prediction: highest score wins: y = argmax_y w_y · f(x) • Binary = multiclass where the negative class has weight zero

  7. Learning: Multiclass Perceptron • Start with all weights = 0 • Pick up training examples one by one • Predict with current weights: y = argmax_y w_y · f(x) = argmax_y ∑_i w_{y,i} f_i(x) • If correct, no change! • If wrong: lower score of wrong answer, raise score of right answer: w_y = w_y − f(x), w_{y*} = w_{y*} + f(x)
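The multiclass update can be sketched the same way; the data layout (one weight row per class) is again an assumption for illustration:

```python
import numpy as np

def train_multiclass_perceptron(features, labels, num_classes, passes=10):
    """features: (n, d) array; labels: ints in [0, num_classes)."""
    w = np.zeros((num_classes, features.shape[1]))  # one weight vector per class
    for _ in range(passes):
        for f, y_star in zip(features, labels):
            y = int(np.argmax(w @ f))               # highest score wins
            if y != y_star:
                w[y] -= f                           # lower score of the wrong answer
                w[y_star] += f                      # raise score of the right answer
    return w
```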

  8. Today • Fixing the Perceptron: MIRA • Support Vector Machines • k-nearest neighbor (KNN)

  9. Properties of Perceptrons • Separability: some parameters get the training set perfectly correct • Convergence: if the training set is separable, the perceptron will eventually converge (binary case)

  10. Examples: Perceptron • Non-Separable Case

  11. Problems with the Perceptron • Noise: if the data isn’t separable, weights might thrash • Averaging weight vectors over time can help (averaged perceptron) • Mediocre generalization: finds a “barely” separating solution • Overtraining: test / held-out accuracy usually rises, then falls • Overtraining is a kind of overfitting

  12. Fixing the Perceptron • Idea: adjust the weight update to mitigate these effects • MIRA*: choose an update size that fixes the current mistake • …but minimizes the change to w: min_w (1/2) ∑_y ||w_y − w'_y||², subject to w_{y*} · f(x) ≥ w_y · f(x) + 1 (we guessed y instead of y* on example x with features f(x)) • The +1 helps to generalize • Update: w_y = w'_y − τ f(x), w_{y*} = w'_{y*} + τ f(x) *Margin Infused Relaxed Algorithm

  13. Minimum Correcting Update • Find the smallest change to the weights that corrects the mistake: min_w (1/2) ∑_y ||w_y − w'_y||², with w_y = w'_y − τ f(x) and w_{y*} = w'_{y*} + τ f(x), subject to w_{y*} · f ≥ w_y · f + 1 • Equivalently: min_τ ||τ f||² subject to w_{y*} · f ≥ w_y · f + 1, i.e. subject to (w'_{y*} + τ f) · f ≥ (w'_y − τ f) · f + 1 • τ is not 0, or we would not have made an error, so the minimum is where equality holds: τ = ((w'_y − w'_{y*}) · f + 1) / (2 f · f)

  14. Maximum Step Size • In practice, it’s also bad to make updates that are too large • Example may be labeled incorrectly • You may not have enough features • Solution: cap the maximum possible value of τ with some constant C: τ* = min( ((w'_y − w'_{y*}) · f + 1) / (2 f · f), C ) • Corresponds to an optimization that assumes non-separable data • Usually converges faster than the perceptron • Usually better, especially on noisy data
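Putting slides 12-14 together, a single MIRA step might look like the sketch below (the in-place weight matrix and the default C are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def mira_update(w, f, y, y_star, C=0.01):
    """One MIRA step after guessing class y instead of y_star on features f.
    w is a (num_classes, d) weight matrix, modified in place."""
    tau = (w[y].dot(f) - w[y_star].dot(f) + 1.0) / (2.0 * f.dot(f))
    tau = min(tau, C)          # cap the step size at C
    w[y]      -= tau * f       # push the wrongly guessed class away
    w[y_star] += tau * f       # pull the correct class closer
    return w
```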

  15. Outline • Fixing the Perceptron: MIRA • Support Vector Machines • k-nearest neighbor (KNN)

  16. Linear Separators • Which of these linear separators is optimal?

  17. Support Vector Machines • Maximizing the margin: good according to intuition, theory, practice • Only support vectors matter; other training examples are ignorable • Support vector machines (SVMs) find the separator with max margin • Basically, SVMs are MIRA where you optimize over all examples at once • MIRA: min_w (1/2) ∑_y ||w_y − w'_y||², subject to w_{y*} · f(x_i) ≥ w_y · f(x_i) + 1 • SVM: min_w (1/2) ∑_y ||w_y||², subject to ∀ i, y: w_{y*_i} · f(x_i) ≥ w_y · f(x_i) + 1
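In practice the max-margin problem above is handed to an off-the-shelf solver; a minimal sketch using scikit-learn's LinearSVC (the library choice and toy data are assumptions of this example, not part of the slides):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy 2-D data with two classes, for illustration only
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = LinearSVC(C=1.0)     # C trades margin size against training error
clf.fit(X, y)
print(clf.predict(X))      # predicted labels for the training points
```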

  18. Classification: Comparison • Naive Bayes: • Builds a model of the training data • Gives prediction probabilities • Strong assumptions about feature independence • One pass through data (counting) • Perceptrons / MIRA: • Makes fewer assumptions about data • Mistake-driven learning • Multiple passes through data (prediction) • Often more accurate

  19. Outline • Fixing the Perceptron: MIRA • Support Vector Machines • k-nearest neighbor (KNN)

  20. Case-Based Reasoning • Similarity for classification • Case-based reasoning: predict an instance’s label using similar instances • Nearest-neighbor classification • 1-NN: copy the label of the most similar data point • k-NN: let the k nearest neighbors vote (have to devise a weighting scheme) • Key issue: how to define similarity • Trade-off: • Small k gives relevant neighbors • Large k gives smoother functions [Figures: generated data; 1-NN decision regions]
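A minimal k-NN sketch with a pluggable similarity function (the majority-vote scheme and defaults are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3, sim=np.dot):
    """Label `query` by a majority vote of its k most similar training points."""
    scores = [sim(x, query) for x in train_x]   # similarity to every training example
    top_k = np.argsort(scores)[-k:]             # indices of the k most similar
    votes = Counter(train_y[i] for i in top_k)
    return votes.most_common(1)[0][0]
```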

  21. Parametric / Non-parametric • Parametric models: • Fixed set of parameters • More data means better settings • Non-parametric models: • Complexity of the classifier increases with data • Better in the limit, often worse in the non-limit • (K)NN is non-parametric

  22. Nearest-Neighbor Classification • Nearest neighbor for digits: • Take a new image • Compare to all training images • Assign based on closest example • Encoding: image is a vector of intensities: x = (0.0, 0.0, 0.3, 0.8, 0.7, 0.1, …, 0.0) • What’s the similarity function? • Dot product of two image vectors? sim(x, x') = x · x' = ∑_i x_i x'_i • Usually normalize vectors so ||x|| = 1 • min = 0 (when?), max = 1 (when?)
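The normalized dot-product similarity mentioned above, as a short sketch (assumes non-zero intensity vectors):

```python
import numpy as np

def normalized_dot_similarity(x, x_prime):
    """Dot product of unit-normalized intensity vectors; for non-negative
    pixel values this lies in [0, 1], and is 1 when the images are proportional."""
    x = x / np.linalg.norm(x)
    x_prime = x_prime / np.linalg.norm(x_prime)
    return float(np.dot(x, x_prime))
```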

  23. Basic Similarity • Many similarities are based on feature dot products: sim(x, x') = f(x) · f(x') = ∑_i f_i(x) f_i(x') • If features are just the pixels: sim(x, x') = x · x' = ∑_i x_i x'_i • Note: not all similarities are of this form

  24. Invariant Metrics • Better distances use knowledge about vision • Invariant metrics: • Similarities are invariant under certain transformations • Rotation, scaling, translation, stroke-thickness… • E.g.: 16×16 = 256 pixels; a point in 256-dim space • Small similarity in R^256 (why?) • How to incorporate invariance into similarities? (This and next few slides adapted from Xiao Hu, UIUC)

  25. Invariant Metrics • Each example is now a curve in R^256 • Rotation-invariant similarity: s' = max s(r(x), r(x')), i.e. the highest similarity between points on the two images’ rotation curves
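One crude way to approximate such a rotation-invariant similarity is to rotate one image over a range of angles and keep the best match; the angle grid and the use of scipy are assumptions of this sketch, not the slides' method:

```python
import numpy as np
from scipy.ndimage import rotate

def rotation_invariant_similarity(img_a, img_b, angles=range(-30, 31, 5)):
    """Max normalized dot product between img_a and small rotations of img_b."""
    a = img_a.ravel() / np.linalg.norm(img_a)
    best = -1.0
    for angle in angles:
        rotated = rotate(img_b, angle, reshape=False)   # rotate img_b by `angle` degrees
        b = rotated.ravel() / np.linalg.norm(rotated)
        best = max(best, float(np.dot(a, b)))
    return best
```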
