Multi-class Classifiers
Machine Learning
Hamid Beigy
Sharif University of Technology
Fall 1396
Table of contents
1. Introduction
2. One-against-all classification
3. One-against-one classification
4. C-class discriminant function
5. Hierarchical classification
6. Error correcting coding classification
Introduction

In classification, the goal is to find a mapping from inputs $x \in X$ to outputs $t \in \{1, 2, \ldots, C\}$ given a labeled set of input-output pairs. We can either extend a binary classifier directly to the C-class problem or combine several binary classifiers. For a C-class problem, the following approaches use binary classifiers:

- One-against-all: a straightforward extension of the two-class problem that treats it as a set of C two-class problems.
- One-against-one: $C(C-1)/2$ binary classifiers are trained, each separating a pair of classes. The decision is made on the basis of a majority vote.
- Single C-class discriminant: a single C-class discriminant function comprising C linear functions is used.
- Hierarchical classification: the output space is hierarchically divided, i.e. the classes are arranged into a tree.
- Error correcting coding: for a C-class problem, L binary classifiers are used, where L is appropriately chosen by the designer. Each class is represented by a binary code word of length L.
One-against-all classification

The extension is to consider a set of C two-class problems. For each class, we seek to design an optimal discriminant function $g_i(x)$ (for $i = 1, 2, \ldots, C$) so that $g_i(x) > g_j(x)$ for all $j \neq i$ whenever $x \in C_i$. Adopting the SVM methodology, we can design the discriminant functions so that $g_i(x) = 0$ is the optimal hyperplane separating class $C_i$ from all the others. Thus, each classifier is designed to give $g_i(x) > 0$ for $x \in C_i$ and $g_i(x) < 0$ otherwise. Classification is then achieved according to the following rule:

Assign x to class $C_i$ if $i = \arg\max_k g_k(x)$.
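A minimal sketch of this scheme, not the lecture's own code: it assumes classes labeled 0, ..., C-1 and uses scikit-learn's LinearSVC as the binary learner, but any binary classifier with a real-valued score would do.

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_ova(X, y, num_classes):
        """Train one binary classifier per class: class i vs. all the others."""
        classifiers = []
        for i in range(num_classes):
            t = np.where(y == i, 1, -1)      # relabel: +1 for class i, -1 otherwise
            classifiers.append(LinearSVC().fit(X, t))
        return classifiers

    def predict_ova(classifiers, X):
        """Assign each x to the class whose score g_k(x) is largest."""
        scores = np.column_stack([clf.decision_function(X) for clf in classifiers])
        return np.argmax(scores, axis=1)

Using the signed score g_k(x) rather than the hard +1/-1 decision lets the argmax rule resolve cases where no classifier, or more than one, fires positive.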
Properties of one-against-all classification

- The number of classifiers equals C.
- Each binary classifier faces a rather asymmetric problem, in the sense that training is carried out with many more negative than positive examples. This becomes more serious when the number of classes is relatively large.
- This technique, however, may lead to indeterminate regions, where more than one $g_i(x)$ is positive.
Properties of one-against-all classification (cont.)

The implementation of OVA is easy, but it is not robust to errors of the individual classifiers: if one classifier makes a mistake, the entire prediction may be erroneous.

Theorem (OVA error bound). Suppose the average binary error of the C binary classifiers is $\epsilon$. Then the error rate of the OVA multi-class classifier is at most $(C-1)\epsilon$.

Please prove the above theorem.
One-against-one classification

In this case, $C(C-1)/2$ binary classifiers are trained and each classifier separates a pair of classes. The decision is made on the basis of a majority vote. The obvious disadvantage of the technique is that a relatively large number of binary classifiers has to be trained.
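A minimal sketch of pairwise training and majority voting, under the same assumptions as before (classes labeled 0, ..., C-1, LinearSVC as the binary learner):

    import numpy as np
    from itertools import combinations
    from sklearn.svm import LinearSVC

    def train_ovo(X, y, num_classes):
        """Train C(C-1)/2 binary classifiers, one per pair of classes (i, j)."""
        classifiers = {}
        for i, j in combinations(range(num_classes), 2):
            mask = (y == i) | (y == j)                  # keep only examples of classes i and j
            t = np.where(y[mask] == i, 1, -1)           # +1 for class i, -1 for class j
            classifiers[(i, j)] = LinearSVC().fit(X[mask], t)
        return classifiers

    def predict_ovo(classifiers, X, num_classes):
        """Each pairwise classifier votes for one of its two classes; take the majority."""
        votes = np.zeros((X.shape[0], num_classes), dtype=int)
        for (i, j), clf in classifiers.items():
            pred = clf.predict(X)                       # +1 votes for i, -1 votes for j
            votes[:, i] += (pred == 1)
            votes[:, j] += (pred == -1)
        return np.argmax(votes, axis=1)                 # ties broken by the lowest class index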
One-against-one classification: AVA error bound

Theorem (AVA error bound). Suppose the average binary error of the $C(C-1)/2$ binary classifiers is at most $\epsilon$. Then the error rate of the AVA multi-class classifier is at most $2(C-1)\epsilon$.

Please prove the above theorem. The bound for AVA is $2(C-1)\epsilon$ and the bound for OVA is $(C-1)\epsilon$. Does this mean that OVA is necessarily better than AVA? Why or why not? Please do this as a homework.
C-class discriminant function

We can avoid the difficulties of the previous methods by considering a single C-class discriminant comprising C linear functions of the form

$g_k(x) = w_k^T x + w_{k0}$

and assigning a point x to class $C_k$ if $g_k(x) > g_j(x)$ for all $j \neq k$. The decision boundary between class $C_k$ and class $C_j$ is given by $g_k(x) = g_j(x)$ and corresponds to the hyperplane

$(w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0.$

This has the same form as the decision boundary in the two-class case.

(Figure: decision regions $R_i$, $R_j$, $R_k$ separated by pairwise linear boundaries.)
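The prediction rule reduces to a single matrix product followed by an argmax; a minimal numpy sketch (the array shapes are assumptions for illustration):

    import numpy as np

    def predict_linear(W, w0, X):
        """Single C-class linear discriminant: g_k(x) = w_k^T x + w_k0, predict argmax_k.

        W  : (C, d) array whose rows are the weight vectors w_k
        w0 : (C,)   array of biases w_k0
        X  : (n, d) array of input points
        """
        scores = X @ W.T + w0        # (n, C) matrix whose (n, k) entry is g_k(x_n)
        return np.argmax(scores, axis=1)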
Hierarchical classification

In hierarchical classification, the output space is hierarchically divided, i.e. the classes are arranged into a tree. For example, with eight classes:

Root: {C1, C2, C3, C4} vs {C5, C6, C7, C8}
  Left subtree:  {C1, C2} vs {C3, C4}, then C1 vs C2 and C3 vs C4
  Right subtree: {C5, C6} vs {C7, C8}, then C5 vs C6 and C7 vs C8
Leaves: C1, C2, C3, C4, C5, C6, C7, C8
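A minimal sketch mirroring the eight-class tree above; it assumes a balanced split of the class list at every node and uses LinearSVC as the binary learner (any binary classifier would do):

    import numpy as np
    from sklearn.svm import LinearSVC

    def train_tree(X, y, classes):
        """Recursively split the class list in half and train a binary classifier per node."""
        if len(classes) == 1:
            return classes[0]                                  # leaf: a single class label
        left, right = classes[:len(classes) // 2], classes[len(classes) // 2:]
        mask = np.isin(y, classes)                             # examples belonging to this subtree
        t = np.isin(y[mask], right).astype(int)                # 0 -> go left, 1 -> go right
        clf = LinearSVC().fit(X[mask], t)
        return (clf, train_tree(X, y, left), train_tree(X, y, right))

    def predict_tree(node, x):
        """Classify one point with ceil(log2 C) binary decisions, walking root to leaf."""
        while isinstance(node, tuple):
            clf, left, right = node
            node = right if clf.predict(x.reshape(1, -1))[0] == 1 else left
        return node

For example, train_tree(X, y, list(range(8))) builds the three-level tree shown above.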
Hierarchical classification: error bound

Theorem (Hierarchical classification error bound). Suppose the average error of the binary classifiers is $\epsilon$. Then the error rate of the hierarchical classifier is at most $\lceil \log_2 C \rceil \epsilon$.

One thing to keep in mind with hierarchical classifiers is that you have control over how the tree is defined, whereas in OVA and AVA you have no control over the way the classification problems are created. In a hierarchical classifier, the only constraint is that, at the root, half of the classes are considered positive and half are considered negative; you want to split the classes in such a way that this classification decision is as easy as possible. Can you do better than $\lceil \log_2 C \rceil \epsilon$? Yes, using error-correcting codes.
Error correcting coding classification

In this approach, the classification task is treated in the context of error correcting coding. For a C-class problem, a number of, say, L binary classifiers are used, where L is appropriately chosen by the designer. Each class is now represented by a binary code word of length L. During training, for the i-th classifier, $i = 1, 2, \ldots, L$, the desired labels y for each class are chosen to be either -1 or +1. For each class, the desired labels may be different for the various classifiers. This is equivalent to constructing a $C \times L$ matrix of desired labels.
Error correcting coding classification (cont.)

For example, take C = 4 and L = 6. During training, the first classifier (corresponding to the first column of the code matrix) is designed to respond (-1, +1, +1, -1) for examples of classes C1, C2, C3, C4, respectively. The second classifier is trained to respond (-1, -1, +1, -1), and so on. The procedure is equivalent to grouping the classes into L different pairs of groups and, for each pair, training a binary classifier accordingly. Each row of the matrix must be distinct and corresponds to a class.
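A minimal training sketch. The code matrix below is illustrative only: its first two columns follow the description above, while the remaining four columns are made up here, the only requirement being that all rows (class code words) are distinct. LinearSVC is again an assumed binary learner.

    import numpy as np
    from sklearn.svm import LinearSVC

    # Illustrative 4 x 6 code matrix (C = 4 classes, L = 6 classifiers).
    # Columns 3-6 are hypothetical; only columns 1-2 are given in the text.
    CODE = np.array([
        [-1, -1, +1, -1, +1, +1],   # code word of class C1
        [+1, -1, -1, +1, -1, +1],   # code word of class C2
        [+1, +1, +1, +1, +1, -1],   # code word of class C3
        [-1, -1, -1, -1, -1, -1],   # code word of class C4
    ])

    def train_ecoc(X, y, code):
        """Train one binary classifier per column of the code matrix."""
        classifiers = []
        for l in range(code.shape[1]):
            t = code[y, l]          # desired +/-1 label of each example for classifier l
            classifiers.append(LinearSVC().fit(X, t))
        return classifiers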
Error correcting coding classification (cont.)

When an unknown pattern is presented, the output of each of the L binary classifiers is recorded, resulting in a code word. Then the Hamming distance of this code word to each of the C class code words is measured, and the pattern is assigned to the class corresponding to the smallest distance. This is the power of the technique: if the code words are designed so that the minimum Hamming distance between any pair of them is d, then a correct decision is still reached even if at most $\lfloor (d-1)/2 \rfloor$ of the L classifiers are wrong.

Theorem (Error-correcting error bound). Suppose the average error of the binary classifiers is $\epsilon$. Then the error rate of the classifier created using error correcting codes is at most $2\epsilon$.

One can also prove a lower bound stating that the best you could possibly do is $\epsilon / 2$.
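A minimal decoding sketch; it assumes classifiers and a code matrix of the form produced in the previous sketch (e.g. train_ecoc and CODE):

    import numpy as np

    def predict_ecoc(classifiers, code, X):
        """Record the L binary outputs and decode by minimum Hamming distance."""
        outputs = np.column_stack([clf.predict(X) for clf in classifiers])  # (n, L), entries in {-1, +1}
        # Hamming distance from each output word to each of the C class code words
        dists = (outputs[:, None, :] != code[None, :, :]).sum(axis=2)       # (n, C)
        return np.argmin(dists, axis=1)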