Lecture 17: More on binary vs. multi-class classifiers (Polychotomizers: One-Hot Vectors, Softmax, and Cross-Entropy) Mark Hasegawa-Johnson, 3/9/2019. CC-BY 3.0: You are free to share and adapt these slides if you cite the original. Modified by Julia Hockenmaier
More on supervised learning
The supervised learning task
• Given a labeled training data set of N items x_n ∈ X with labels y_n ∈ Y: D_train = {(x_1, y_1), …, (x_N, y_N)}
  (y_n is determined by some unknown target function f(x))
• Return a model g: X ⟼ Y that is a good approximation of f(x)
  (g should assign correct labels y to unseen x ∉ D_train)
Supervised learning terms
• Input items/data points x_n ∈ X (e.g. emails) are drawn from an instance space X.
• Output labels y_n ∈ Y (e.g. ‘spam’/‘nospam’) are drawn from a label space Y.
• Every data point x_n ∈ X has a single correct label y_n ∈ Y, defined by an (unknown) target function f(x) = y.
Supervised learning
• Input: an item x drawn from an instance space X.
• Output: an item y drawn from a label space Y.
• Target function: y' = f(x). Learned model: y = g(x).
(You often see f̂(x) instead of g(x), and ŷ instead of y', but PowerPoint can’t really typeset that, so g(x) and y' will have to do.)
Supervised learning: Training
• Give the learner the labeled training data D_train = {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}.
• The learning algorithm returns a model g(x).
Supervised learning: Testing
• Reserve some labeled data for testing: D_test = {(x'_1, y'_1), (x'_2, y'_2), …, (x'_M, y'_M)}.
Supervised learning: Testing
• Split the labeled test data D_test into raw test data X_test = {x'_1, x'_2, …, x'_M} and test labels Y_test = {y'_1, y'_2, …, y'_M}.
Supervised learning: Testing
• Apply the learned model g(x) to the raw test data X_test to obtain predicted labels g(X_test) = {g(x'_1), g(x'_2), …, g(x'_M)}.
Evaluating supervised learners
• Use a test data set D_test = {(x'_1, y'_1), …, (x'_M, y'_M)} that is disjoint from D_train: the learner has not seen the test items during learning.
• Split your labeled data into two parts: training and test.
• Take all items x'_i in D_test and compare the predicted labels g(x'_i) with the correct labels y'_i.
• This requires an evaluation metric (e.g. accuracy).
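As a minimal sketch of this evaluation loop in Python: the toy model g and the toy test pairs below are illustrative placeholders, not part of the lecture.

```python
# Minimal sketch of test-set evaluation with accuracy as the metric.
# `g` stands for the learned model and `D_test` for the held-out labeled
# pairs; both are made-up placeholders for illustration.

def accuracy(g, D_test):
    """Fraction of test items whose predicted label matches the true label."""
    correct = sum(1 for x, y in D_test if g(x) == y)
    return correct / len(D_test)

# Toy "model": label an email as spam (1) iff it mentions money.
g = lambda x: 1 if "money" in x else 0
D_test = [("send money now", 1), ("meeting at noon", 0), ("free money", 1)]
print(accuracy(g, D_test))  # 1.0 on this toy test set
```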
1. The instance space
1. The instance space X
The learned model y = g(x) maps an item x drawn from an instance space X to an item y drawn from a label space Y.
Designing an appropriate instance space X is crucial for how well we can predict y.
1. The instance space X
When we apply machine learning to a task, we first need to define the instance space X.
Instances x ∈ X are defined by features:
• Boolean features: Does this email contain the word ‘money’?
• Numerical features: How often does ‘money’ occur in this email? What is the width/height of this bounding box?
X as a vector space
• X is an N-dimensional vector space (e.g. ℝ^N). Each dimension = one feature.
• Each x is a feature vector (hence the boldface x).
• Think of x = [x_1 … x_N] as a point in X.
[Figure: a point x in the (x_1, x_2) plane]
From feature templates to vectors
When designing features, we often think in terms of templates, not individual features:
• What is the 2nd letter?
  Naoki → [1 0 0 0 …]
  Abe → [0 1 0 0 …]
  Scrooge → [0 0 1 0 …]
• What is the i-th letter?
  Abe → [1 0 0 0 0 …  0 1 0 0 0 0 …  0 0 0 0 1 …]
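A small sketch of how the “what is the 2nd letter?” template expands into one indicator feature per letter; the function name and the 26-letter layout are illustrative choices, not from the slides.

```python
import string

# One feature template ("what is the 2nd letter?") expands into 26 Boolean
# features, one per possible letter. This mirrors the Naoki/Abe/Scrooge example.

def second_letter_features(word):
    """One-hot vector of length 26: which letter appears in position 2?"""
    vec = [0] * 26
    letter = word[1].lower()
    vec[string.ascii_lowercase.index(letter)] = 1
    return vec

print(second_letter_features("Naoki")[:4])    # [1, 0, 0, 0]  (second letter 'a')
print(second_letter_features("Abe")[:4])      # [0, 1, 0, 0]  (second letter 'b')
print(second_letter_features("Scrooge")[:4])  # [0, 0, 1, 0]  (second letter 'c')
```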
Good features are essential • The choice of features is crucial for how well a task can be learned. • In many application areas (language, vision, etc.), a lot of work goes into designing suitable features. • This requires domain expertise. • We can’t teach you what specific features to use for your task. • But we will touch on some general principles
2. The label space
2. The label space Y
The learned model y = g(x) maps an item x drawn from an instance space X to an item y drawn from a label space Y.
The label space Y determines what kind of supervised learning task we are dealing with.
Supervised learning tasks I
• Output labels y ∈ Y are categorical: CLASSIFICATION
  Binary classification: two possible labels
  Multiclass classification: k possible labels
• Output labels y ∈ Y are structured objects (sequences of labels, parse trees, etc.): structure learning, etc.
Supervised learning tasks II
Output labels y ∈ Y are numerical:
• Regression (linear/polynomial): labels are continuous-valued; learn a linear/polynomial function f(x).
• Ranking: labels are ordinal; learn an ordering f(x_1) > f(x_2) over inputs.
3. Models (The hypothesis space)
3. The model g(x)
The learned model y = g(x) maps an item x drawn from an instance space X to an item y drawn from a label space Y.
We need to choose what kind of model we want to learn.
More terminology
• For classification tasks (Y is categorical, e.g. {0, 1} or {0, 1, …, k}), the model is called a classifier.
• For binary classification tasks (Y = {0, 1} or Y = {-1, +1}), we can either think of the two values of Y as Boolean or as positive/negative.
A learning problem
    x_1  x_2  x_3  x_4 | y
1:   0    0    1    0  | 0
2:   0    1    0    0  | 0
3:   0    0    1    1  | 1
4:   1    0    0    1  | 1
5:   0    1    1    0  | 0
6:   1    1    0    0  | 0
7:   0    1    0    1  | 0
A learning problem
• Each x has 4 bits: |X| = 2^4 = 16.
• Since Y = {0, 1}, each f(x) defines one subset of X. X has 2^16 = 65536 subsets, so there are 2^16 possible f(x) (2^9 of them are consistent with our data).
• We would need to see all of X to learn f(x).
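The counting argument can be checked by brute force; this sketch enumerates all 4-bit inputs and counts how many are left unconstrained by the seven training rows from the table above (the dictionary layout is just one way to encode that table).

```python
from itertools import product

# The seven training rows from the table: (x_1, x_2, x_3, x_4) -> y.
train = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

all_inputs = list(product([0, 1], repeat=4))      # |X| = 2^4 = 16
unseen = [x for x in all_inputs if x not in train]

print(len(all_inputs))          # 16
print(2 ** len(all_inputs))     # 65536 possible functions f: X -> {0, 1}
print(len(unseen))              # 9 inputs we have never observed
print(2 ** len(unseen))         # 512 = 2^9 functions consistent with the data
```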
A learning problem
We would need to see all of X to learn f(x):
• Easy with |X| = 16.
• Not feasible in general (for real-world problems).
Learning = generalization, not memorization of the training data.
Classifiers in vector spaces
[Figure: decision boundary f(x) = 0 in the (x_1, x_2) plane, separating the region f(x) > 0 from the region f(x) < 0]
Binary classification: we assume f separates the positive and negative examples:
• Assign y = 1 to all x where f(x) > 0.
• Assign y = 0 (or -1) to all x where f(x) < 0.
Learning a classifier
The learning task: find a function f(x) that best separates the (training) data.
• What kind of function is f?
• How do we define best?
• How do we find f?
Which model should we pick?
Criteria for choosing models
• Accuracy: prefer models that make fewer mistakes. We only have access to the training data, but we care about accuracy on unseen (test) examples.
• Simplicity (Occam’s razor): prefer simpler models (e.g. fewer parameters). These (often) generalize better, and need less data for training.
Linear classifiers
Linear classifiers
[Figure: a linear decision boundary f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other]
Many learning algorithms restrict the hypothesis space to linear classifiers: f(x) = w_0 + w·x
Linear Separability
• Not all data sets are linearly separable.
• Sometimes, feature transformations help: e.g. mapping x_1 to x_1^2, or using |x_2 − x_1| as a feature, can make the data linearly separable.
[Figures: a non-separable data set in the original features, and the same data after the transformations x_1^2 and |x_2 − x_1|]
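A minimal sketch of the feature-transformation idea, using a made-up one-dimensional data set inspired by the x_1^2 transformation above (the data values and threshold are illustrative, not from the slides).

```python
# A 1-D data set where the label depends on |x_1| is not linearly separable
# in x_1 (positives lie on both sides of the negatives), but it is separable
# by a single threshold in the transformed feature x_1^2.

data = [(-3.0, 1), (-2.0, 1), (-0.5, 0), (0.0, 0), (0.5, 0), (2.0, 1), (3.0, 1)]

transformed = [(x * x, y) for x, y in data]  # replace x_1 with x_1^2
threshold = 1.0

# Every positive example now lies above the threshold, every negative below it.
print(all((phi > threshold) == (y == 1) for phi, y in transformed))  # True
```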
Linear classifiers: f(x) = w_0 + w·x
[Figure: the hyperplane f(x) = 0 in the (x_1, x_2) plane, with f(x) > 0 on one side and f(x) < 0 on the other]
• Linear classifiers are defined over vector spaces.
• Every hypothesis is a hyperplane f(x) = w_0 + w·x; the set of points where f(x) = 0 is also called the decision boundary.
• Assign ŷ = +1 to all x where f(x) > 0.
• Assign ŷ = -1 to all x where f(x) < 0.
• In short: ŷ = sgn(f(x)).
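A minimal sketch of this decision rule in Python; the weight values are arbitrary illustration values, not learned parameters.

```python
# Decision rule of a linear classifier: y_hat = sgn(w_0 + w . x).
# The weights below are arbitrary values chosen for illustration.

def predict(w, w0, x):
    f = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return +1 if f > 0 else -1

w, w0 = [2.0, -1.0], -0.5
print(predict(w, w0, [1.0, 0.5]))   # f = -0.5 + 2.0 - 0.5 =  1.0 -> +1
print(predict(w, w0, [0.0, 1.0]))   # f = -0.5 + 0.0 - 1.0 = -1.5 -> -1

# An example (x, y) is classified correctly iff y * f(x) > 0 (next slide).
```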
y·f(x) > 0: Correct classification
An example (x, y) is correctly classified by f(x) if and only if y·f(x) > 0:
• Case 1 (y = +1 = ŷ): f(x) > 0 ⇒ y·f(x) > 0
• Case 2 (y = -1 = ŷ): f(x) < 0 ⇒ y·f(x) > 0
• Case 3 (y = +1 ≠ ŷ = -1): f(x) < 0 ⇒ y·f(x) < 0
• Case 4 (y = -1 ≠ ŷ = +1): f(x) > 0 ⇒ y·f(x) < 0
With a separate bias term w_0: f(x) = w·x + w_0
• The instance space X is a d-dimensional vector space (each x ∈ X has d elements).
• The decision boundary f(x) = 0 is a (d−1)-dimensional hyperplane in the instance space.
• The weight vector w is orthogonal (normal) to the decision boundary: for any two points x_A and x_B on the decision boundary, f(x_A) = f(x_B) = 0, so for any vector (x_B − x_A) along the decision boundary, w·(x_B − x_A) = f(x_B) − w_0 − f(x_A) + w_0 = 0.
• The bias term w_0 determines the distance of the decision boundary from the origin: for x with f(x) = 0, we have w·x = −w_0, so the distance of the boundary to the origin is −w_0/‖w‖, where ‖w‖ = √(Σ_{i=1}^d w_i^2).
With a separate bias term w_0: f(x) = w·x + w_0
[Figure: in the (x_1, x_2) plane, the decision boundary f(x) = 0; the weight vector w, normal to the boundary; an arbitrary point x at distance f(x)/‖w‖ from the boundary; and the boundary at distance −w_0/‖w‖ from the origin]
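The two distances in the figure can be computed directly from the formulas above; this sketch uses arbitrary illustration values for w and w_0.

```python
import math

# Distances associated with the hyperplane f(x) = w . x + w_0 = 0.
# w and w_0 are arbitrary illustration values.

w, w0 = [3.0, 4.0], -5.0
norm_w = math.sqrt(sum(wi * wi for wi in w))   # ||w|| = 5.0

def f(x):
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

# Signed distance of an arbitrary point x to the decision boundary: f(x) / ||w||.
x = [2.0, 1.0]
print(f(x) / norm_w)    # (6.0 + 4.0 - 5.0) / 5.0 = 1.0

# Distance of the decision boundary to the origin: -w_0 / ||w||.
print(-w0 / norm_w)     # 5.0 / 5.0 = 1.0
```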
Canonical representation: getting rid of the bias term
• With w = (w_1, …, w_N)^T and x = (x_1, …, x_N)^T:
  f(x) = w_0 + w·x = w_0 + Σ_{i=1…N} w_i x_i
  w_0 is called the bias term.
• The canonical representation redefines w and x as w = (w_0, w_1, …, w_N)^T and x = (1, x_1, …, x_N)^T, so that f(x) = w·x.
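A small sketch of the canonical representation: prepend the constant feature 1 to every x and the bias w_0 to w, so the classifier becomes a single dot product. The numeric values are illustrative.

```python
# Canonical representation: fold the bias w_0 into the weight vector by
# adding a constant feature 1 to every input. Values are illustrative.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

w, w0 = [2.0, -1.0], -0.5
x = [1.0, 0.5]

f_with_bias = w0 + dot(w, x)        # f(x) = w_0 + w . x

w_aug = [w0] + w                    # w = (w_0, w_1, ..., w_N)
x_aug = [1.0] + x                   # x = (1, x_1, ..., x_N)
f_canonical = dot(w_aug, x_aug)     # f(x) = w . x

print(f_with_bias, f_canonical)     # 1.0 1.0  (identical by construction)
```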