Lecture 17: More on binary vs. multi-class classifiers (Polychotomizers: One-Hot Vectors, Softmax, and Cross-Entropy) Mark Hasegawa-Johnson, 3/9/2019. CC-BY 3.0: You are free to share and adapt these slides if you cite the original. Modified by Julia Hockenmaier
More on supervised learning
The supervised learning task
• Given a labeled training data set of N items x_n ∈ X with labels y_n ∈ Y: D_train = {(x_1, y_1), …, (x_N, y_N)}
  (y_n is determined by some unknown target function f(x))
• Return a model g: X ⟼ Y that is a good approximation of f(x)
  (g should assign correct labels y to unseen x ∉ D_train)
Supervised learning terms
• Input items/data points x_n ∈ X (e.g. emails) are drawn from an instance space X.
• Output labels y_n ∈ Y (e.g. ‘spam’/‘nospam’) are drawn from a label space Y.
• Every data point x_n ∈ X has a single correct label y_n ∈ Y, defined by an (unknown) target function f(x) = y.
Supervised learning
• Input: an item x drawn from an instance space X.
• Output: an item y drawn from a label space Y.
• Target function: y' = f(x). Learned model: y = g(x).
(You often see f̂(x) instead of g(x), and ŷ instead of y', but PowerPoint can’t really typeset that, so g(x) and y' will have to do.)
Supervised learning: Training
• Give the learner the labeled training data D_train = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}.
• The learning algorithm returns a model g(x).
Supervised learning: Testing
• Reserve some labeled data for testing: D_test = {(x'_1, y'_1), (x'_2, y'_2), …, (x'_M, y'_M)}.
Supervised learning: Testing
• Split the labeled test data D_test into raw test data X_test = {x'_1, x'_2, …, x'_M} and test labels Y_test = {y'_1, y'_2, …, y'_M}.
Supervised learning: Testing
• Apply the learned model g(x) to the raw test data X_test to obtain predicted labels g(X_test) = {g(x'_1), g(x'_2), …, g(x'_M)}.
Evaluating supervised learners
• Use a test data set D_test = {(x'_1, y'_1), …, (x'_M, y'_M)} that is disjoint from D_train: the learner has not seen the test items during learning.
• Split your labeled data into two parts: training and test.
• Take all items x'_i in D_test and compare the predicted labels g(x'_i) with the correct labels y'_i.
• This requires an evaluation metric (e.g. accuracy).
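As a minimal sketch of this evaluation loop in Python: the toy model g and the toy test pairs below are illustrative placeholders, not part of the lecture.

```python
# Minimal sketch of test-set evaluation with accuracy as the metric.
# `g` stands for the learned model and `D_test` for the held-out labeled
# pairs; both are made-up placeholders for illustration.

def accuracy(g, D_test):
    """Fraction of test items whose predicted label matches the true label."""
    correct = sum(1 for x, y in D_test if g(x) == y)
    return correct / len(D_test)

# Toy "model": label an email as spam (1) iff it mentions money.
g = lambda x: 1 if "money" in x else 0
D_test = [("send money now", 1), ("meeting at noon", 0), ("free money", 1)]
print(accuracy(g, D_test))  # 1.0 on this toy test set
```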
1. The instance space
1. The instance space X
The learned model y = g(x) maps an item x drawn from an instance space X to an item y drawn from a label space Y.
Designing an appropriate instance space X is crucial for how well we can predict y.
1. The instance space X
When we apply machine learning to a task, we first need to define the instance space X.
Instances x ∈ X are defined by features:
• Boolean features: Does this email contain the word ‘money’?
• Numerical features: How often does ‘money’ occur in this email? What is the width/height of this bounding box?
X as a vector space
• X is an N-dimensional vector space (e.g. ℝ^N). Each dimension = one feature.
• Each x is a feature vector (hence the boldface x).
• Think of x = [x_1 … x_N] as a point in X.
[Figure: a point x in the (x_1, x_2) plane]
From feature templates to vectors
When designing features, we often think in terms of templates, not individual features:
• What is the 2nd letter?
  Naoki → [1 0 0 0 …]
  Abe → [0 1 0 0 …]
  Scrooge → [0 0 1 0 …]
• What is the i-th letter?
  Abe → [1 0 0 0 0 …  0 1 0 0 0 0 …  0 0 0 0 1 …]
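A small sketch of how the “what is the 2nd letter?” template expands into one indicator feature per letter; the function name and the 26-letter layout are illustrative choices, not from the slides.

```python
import string

# One feature template ("what is the 2nd letter?") expands into 26 Boolean
# features, one per possible letter. This mirrors the Naoki/Abe/Scrooge example.

def second_letter_features(word):
    """One-hot vector of length 26: which letter appears in position 2?"""
    vec = [0] * 26
    letter = word[1].lower()
    vec[string.ascii_lowercase.index(letter)] = 1
    return vec

print(second_letter_features("Naoki")[:4])    # [1, 0, 0, 0]  (second letter 'a')
print(second_letter_features("Abe")[:4])      # [0, 1, 0, 0]  (second letter 'b')
print(second_letter_features("Scrooge")[:4])  # [0, 0, 1, 0]  (second letter 'c')
```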
Good features are essential • The choice of features is crucial for how well a task can be learned. • In many application areas (language, vision, etc.), a lot of work goes into designing suitable features. • This requires domain expertise. • We can’t teach you what specific features to use for your task. • But we will touch on some general principles
2. The label space
2. The label space Y
The learned model y = g(x) maps an item x drawn from an instance space X to an item y drawn from a label space Y.
The label space Y determines what kind of supervised learning task we are dealing with.
Supervised learning tasks I
• Output labels y ∈ Y are categorical: CLASSIFICATION
  Binary classification: two possible labels
  Multiclass classification: k possible labels
• Output labels y ∈ Y are structured objects (sequences of labels, parse trees, etc.): structure learning, etc.
Supervised learning tasks II
Output labels y ∈ Y are numerical:
• Regression (linear/polynomial): labels are continuous-valued; learn a linear/polynomial function f(x).
• Ranking: labels are ordinal; learn an ordering f(x_1) > f(x_2) over inputs.
3. Models (The hypothesis space)
3. The model g(x)
The learned model y = g(x) maps an item x drawn from an instance space X to an item y drawn from a label space Y.
We need to choose what kind of model we want to learn.
More terminology
• For classification tasks (Y is categorical, e.g. {0, 1} or {0, 1, …, k}), the model is called a classifier.
• For binary classification tasks (Y = {0, 1} or Y = {-1, +1}), we can either think of the two values of Y as Boolean or as positive/negative.
A learning problem
    x_1  x_2  x_3  x_4 | y
1:   0    0    1    0  | 0
2:   0    1    0    0  | 0
3:   0    0    1    1  | 1
4:   1    0    0    1  | 1
5:   0    1    1    0  | 0
6:   1    1    0    0  | 0
7:   0    1    0    1  | 0
A learning problem
• Each x has 4 bits: |X| = 2^4 = 16.
• Since Y = {0, 1}, each f(x) defines one subset of X. X has 2^16 = 65536 subsets, so there are 2^16 possible f(x) (2^9 of them are consistent with our data).
• We would need to see all of X to learn f(x).
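The counting argument can be checked by brute force; this sketch enumerates all 4-bit inputs and counts how many are left unconstrained by the seven training rows from the table above (the dictionary layout is just one way to encode that table).

```python
from itertools import product

# The seven training rows from the table: (x_1, x_2, x_3, x_4) -> y.
train = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

all_inputs = list(product([0, 1], repeat=4))      # |X| = 2^4 = 16
unseen = [x for x in all_inputs if x not in train]

print(len(all_inputs))          # 16
print(2 ** len(all_inputs))     # 65536 possible functions f: X -> {0, 1}
print(len(unseen))              # 9 inputs we have never observed
print(2 ** len(unseen))         # 512 = 2^9 functions consistent with the data
```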
A learning problem
We would need to see all of X to learn f(x):
• Easy with |X| = 16.
• Not feasible in general (for real-world problems).
Learning = generalization, not memorization of the training data.
Classifiers in vector spaces
[Figure: decision boundary f(x) = 0 in the (x_1, x_2) plane, separating the region f(x) > 0 from the region f(x) < 0]
Binary classification: we assume f separates the positive and negative examples:
• Assign y = 1 to all x where f(x) > 0.
• Assign y = 0 (or -1) to all x where f(x) < 0.
Learning a classifier
The learning task: find a function f(x) that best separates the (training) data.
• What kind of function is f?
• How do we define best?
• How do we find f?
Which model should we pick?
Criteria for choosing models
• Accuracy: prefer models that make fewer mistakes. We only have access to the training data, but we care about accuracy on unseen (test) examples.
• Simplicity (Occam’s razor): prefer simpler models (e.g. fewer parameters). These (often) generalize better, and need less data for training.
Linear classifiers
Linear classifiers
[Figure: a linear decision boundary f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other]
Many learning algorithms restrict the hypothesis space to linear classifiers: f(x) = w_0 + w·x
Linear Separability
• Not all data sets are linearly separable.
• Sometimes, feature transformations help: e.g. mapping x_1 to x_1^2, or using |x_2 − x_1| as a feature, can make the data linearly separable.
[Figures: a non-separable data set in the original features, and the same data after the transformations x_1^2 and |x_2 − x_1|]
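A minimal sketch of the feature-transformation idea, using a made-up one-dimensional data set inspired by the x_1^2 transformation above (the data values and threshold are illustrative, not from the slides).

```python
# A 1-D data set where the label depends on |x_1| is not linearly separable
# in x_1 (positives lie on both sides of the negatives), but it is separable
# by a single threshold in the transformed feature x_1^2.

data = [(-3.0, 1), (-2.0, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (2.0, 1), (3.0, 1)]

transformed = [(x * x, y) for x, y in data]  # replace x_1 with x_1^2
threshold = 1.0

# Every positive example now lies above the threshold, every negative below it.
print(all((phi > threshold) == (y == 1) for phi, y in transformed))  # True
```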
Linear classifiers: f(x) = w_0 + w·x
[Figure: the hyperplane f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other]
• Linear classifiers are defined over vector spaces.
• Every hypothesis is a hyperplane f(x) = w_0 + w·x; the set of points where f(x) = 0 is also called the decision boundary.
• Assign ŷ = +1 to all x where f(x) > 0.
• Assign ŷ = -1 to all x where f(x) < 0.
• In short: ŷ = sgn(f(x)).
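A minimal sketch of this decision rule in Python; the weight values are arbitrary illustration values, not learned parameters.

```python
# Decision rule of a linear classifier: y_hat = sgn(w_0 + w . x).
# The weights below are arbitrary values chosen for illustration.

def predict(w, w0, x):
    f = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return +1 if f > 0 else -1

w, w0 = [2.0, -1.0], -0.5
print(predict(w, w0, [1.0, 0.5]))   # f = -0.5 + 2.0 - 0.5 =  1.0 -> +1
print(predict(w, w0, [0.0, 1.0]))   # f = -0.5 + 0.0 - 1.0 = -1.5 -> -1

# An example (x, y) is classified correctly iff y * f(x) > 0 (next slide).
```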
y·f(x) > 0: Correct classification
An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
• Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0
• Case 2 (y = -1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0
• Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0 ⇒ y·f(x) < 0
• Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0
With a separate bias term w_0: f(x) = w·x + w_0
• The instance space X is a d-dimensional vector space (each x ∈ X has d elements).
• The decision boundary f(x) = 0 is a (d−1)-dimensional hyperplane in the instance space.
• The weight vector w is orthogonal (normal) to the decision boundary: for any two points x_A and x_B on the decision boundary, f(x_A) = f(x_B) = 0, so for any vector (x_B − x_A) along the decision boundary, w·(x_B − x_A) = f(x_B) − w_0 − f(x_A) + w_0 = 0.
• The bias term w_0 determines the distance of the decision boundary from the origin: for x with f(x) = 0, we have w·x = −w_0, so the distance of the boundary to the origin is −w_0/‖w‖, where ‖w‖ = √(Σ_{i=1}^d w_i^2).
With a separate bias term w_0: f(x) = w·x + w_0
[Figure: in the (x_1, x_2) plane, the decision boundary f(x) = 0; the weight vector w, normal to the boundary; an arbitrary point x at distance f(x)/‖w‖ from the boundary; and the boundary at distance −w_0/‖w‖ from the origin]
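The two distances in the figure can be computed directly from the formulas above; this sketch uses arbitrary illustration values for w and w_0.

```python
import math

# Distances associated with the hyperplane f(x) = w . x + w_0 = 0.
# w and w_0 are arbitrary illustration values.

w, w0 = [3.0, 4.0], -5.0
norm_w = math.sqrt(sum(wi * wi for wi in w))   # ||w|| = 5.0

def f(x):
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

# Signed distance of an arbitrary point x to the decision boundary: f(x) / ||w||.
x = [2.0, 1.0]
print(f(x) / norm_w)    # (6.0 + 4.0 - 5.0) / 5.0 = 1.0

# Distance of the decision boundary to the origin: -w_0 / ||w||.
print(-w0 / norm_w)     # 5.0 / 5.0 = 1.0
```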
Canonical representation: getting rid of the bias term
• With w = (w_1, …, w_N)^T and x = (x_1, …, x_N)^T:
  f(x) = w_0 + w·x = w_0 + Σ_{i=1…N} w_i x_i
  w_0 is called the bias term.
• The canonical representation redefines w and x as w = (w_0, w_1, …, w_N)^T and x = (1, x_1, …, x_N)^T, so that f(x) = w·x.
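A small sketch of the canonical representation: prepend the constant feature 1 to every x and the bias w_0 to w, so the classifier becomes a single dot product. The numeric values are illustrative.

```python
# Canonical representation: fold the bias w_0 into the weight vector by
# adding a constant feature 1 to every input. Values are illustrative.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

w, w0 = [2.0, -1.0], -0.5
x = [1.0, 0.5]

f_with_bias = w0 + dot(w, x)        # f(x) = w_0 + w . x

w_aug = [w0] + w                    # w = (w_0, w_1, ..., w_N)
x_aug = [1.0] + x                   # x = (1, x_1, ..., x_N)
f_canonical = dot(w_aug, x_aug)     # f(x) = w . x

print(f_with_bias, f_canonical)     # 1.0 1.0  (identical by construction)
```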