CS 188: Artificial Intelligence
Lecture 21: Perceptrons
Pieter Abbeel – UC Berkeley
Many slides adapted from Dan Klein.

Outline
§ Generative vs. Discriminative
§ Binary Linear Classifiers
§ Perceptron
§ Multi-class Linear Classifiers
§ Multi-class Perceptron
§ Fixing the Perceptron: MIRA
§ Support Vector Machines*

Classification: Feature Vectors
§ An input is represented as a vector of feature values, e.g.:
§ "Hello, Do you want free printr or cartriges? Why pay more when you can get them ABSOLUTELY FREE! Just ..."
  → # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ... → SPAM (+)
§ An image of a handwritten digit
  → PIXEL-7,12 : 1, PIXEL-7,13 : 0, ..., NUM_LOOPS : 1, ... → "2"

Generative vs. Discriminative
§ Generative classifiers:
  § E.g. naïve Bayes
  § A causal model with evidence variables
  § Query model for causes given evidence
§ Discriminative classifiers:
  § No causal model, no Bayes rule, often no probabilities at all!
  § Try to predict the label Y directly from X
  § Robust, accurate with varied features
  § Loosely: mistake driven rather than model driven

Some (Simplified) Biology
§ Very loose inspiration: human neurons
Linear Classifiers
§ Inputs are feature values
§ Each feature has a weight
§ Sum is the activation: activation_w(x) = Σ_i w_i · f_i(x) = w · f(x)
§ If the activation is:
  § Positive, output +1
  § Negative, output -1
§ (Diagram: features f_1, f_2, f_3 feed weights w_1, w_2, w_3 into a sum Σ, followed by a >0? threshold)

Classification: Weights
§ Binary case: compare features to a weight vector
§ Learning: figure out the weight vector from examples
§ Example weight vector: # free : 4, YOUR_NAME : -1, MISSPELLED : 1, FROM_FRIEND : -3, ...
§ Example feature vectors:
  § # free : 2, YOUR_NAME : 0, MISSPELLED : 2, FROM_FRIEND : 0, ...
  § # free : 0, YOUR_NAME : 1, MISSPELLED : 1, FROM_FRIEND : 1, ...
§ Dot product positive means the positive class

Linear Classifiers Mini Exercise
§ Feature vectors: (# free : 2, YOUR_NAME : 0), (# free : 4, YOUR_NAME : 1), (# free : 1, YOUR_NAME : 1), ...; weight vector w = (2, -1)
§ 1. Draw the feature vectors and the weight vector w
§ 2. Which feature vectors are classified as +? As -?
§ 3. Draw the line separating feature vectors classified as + and -.

Linear Classifiers Mini Exercise 2 --- Bias Term
§ Feature vectors: (Bias : 1, # free : 2), (Bias : 1, # free : 4), (Bias : 1, # free : 1), ...; weight vector w = (-3, 2)
§ 1. Draw the feature vectors and the weight vector w
§ 2. Which feature vectors are classified as +? As -?
§ 3. Draw the line separating feature vectors classified as + and -.

Binary Decision Rule
§ In the space of feature vectors
  § Examples are points
  § Any weight vector is a hyperplane
  § One side corresponds to Y = +1, the other to Y = -1
§ Example weights: BIAS : -3, free : 4, money : 2, ...
  § In the (free, money) plane the boundary is 4·free + 2·money - 3 = 0, with +1 = SPAM on the positive side and -1 = HAM on the negative side
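To make the binary decision rule concrete, here is a minimal Python sketch of a linear classifier using the BIAS/free/money weights above; the helper names (activation, classify) are my own, not from the lecture:

    # Sparse feature vectors and weight vectors as dicts: feature name -> value.
    weights = {"BIAS": -3.0, "free": 4.0, "money": 2.0}

    def activation(w, features):
        # Dot product w . f(x) over the features present in x.
        return sum(w.get(name, 0.0) * value for name, value in features.items())

    def classify(w, features):
        # Output +1 if the activation is positive, -1 otherwise.
        return +1 if activation(w, features) > 0 else -1

    # An email mentioning "free" and "money" once each (BIAS is always 1):
    email = {"BIAS": 1.0, "free": 1.0, "money": 1.0}
    print(classify(weights, email))   # +1 (SPAM), since 4 + 2 - 3 = 3 > 0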
Binary Perceptron Update
§ The perceptron: how to find the weight vector w from data
§ Start with zero weights
§ For each training instance:
  § Classify with current weights: y = +1 if w · f(x) > 0, else y = -1
  § If correct (i.e., y = y*), no change!
  § If wrong: adjust the weight vector by adding or subtracting the feature vector (subtract if y* is -1): w ← w + y* · f(x), as in the code sketch after the exercises below
[demo]

Multiclass Decision Rule
§ If we have multiple classes:
  § A weight vector w_y for each class y
  § Score (activation) of a class y: w_y · f(x)
  § Prediction: highest score wins, y = argmax_y w_y · f(x)
§ Binary = multiclass where the negative class has weight zero

Exercise --- Which Category is Chosen?
§ For "win the vote", with feature vector BIAS : 1, win : 1, game : 0, vote : 1, the : 1, ... and three class weight vectors:
  § w_1: BIAS : -2, win : 4, game : 4, vote : 0, the : 0, ...
  § w_2: BIAS : 1, win : 2, game : 0, vote : 4, the : 0, ...
  § w_3: BIAS : 2, win : 0, game : 2, vote : 0, the : 0, ...

Exercise: Multiclass Linear Classifier for 2 Classes vs. Binary Linear Classifier
§ Consider the multiclass linear classifier for two classes with weight vectors w_1 = (-1, 1) and w_2 = (2, 2)
§ Is there an equivalent binary linear classifier, i.e., one that classifies all points x = (x_1, x_2) the same way?
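A minimal sketch of the binary perceptron update referenced above (the function name and the dict-based sparse vectors are my choices, not course code; activation() is the helper defined earlier):

    def perceptron_train(data, num_passes=10):
        # data: list of (features, y_star) pairs with y_star in {+1, -1}.
        w = {}  # start with zero weights (missing keys read as 0)
        for _ in range(num_passes):
            for features, y_star in data:
                y = +1 if activation(w, features) > 0 else -1  # classify
                if y != y_star:  # wrong: add/subtract the feature vector
                    for name, value in features.items():
                        w[name] = w.get(name, 0.0) + y_star * value
        return w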
Learning: Multiclass Perceptron
§ Start with zero weights
§ Pick up training instances one by one
§ Classify with current weights
§ If correct, no change!
§ If wrong: lower the score of the wrong answer, raise the score of the right answer:
  w_y* ← w_y* + f(x); w_y ← w_y - f(x)
§ (A code sketch follows the properties below.)

Example
§ Training instances: "win the vote", "win the election", "win the game"
§ (Blank weight-vector tables over BIAS, win, game, vote, the, ..., filled in during lecture)

Examples: Perceptron
§ Separable case (figure)
§ Non-separable case (figure)

Properties of Perceptrons
§ Separability: some parameters get the training set perfectly correct
§ Convergence: if the training set is separable, the perceptron will eventually converge (binary case)
§ Mistake Bound: the maximum number of mistakes (binary case) is related to the margin or degree of separability
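A sketch of the multiclass perceptron referenced above, covering both the decision rule (highest score wins) and the learning update; the helper names (score, predict) are assumptions, not course code:

    def score(w_y, features):
        # Activation of one class: w_y . f(x)
        return sum(w_y.get(name, 0.0) * value for name, value in features.items())

    def predict(weights, features):
        # weights: dict mapping class label -> weight vector; highest score wins.
        return max(weights, key=lambda y: score(weights[y], features))

    def multiclass_perceptron_train(data, classes, num_passes=10):
        weights = {y: {} for y in classes}  # a zero weight vector per class
        for _ in range(num_passes):
            for features, y_star in data:
                y = predict(weights, features)
                if y != y_star:  # raise the right answer, lower the wrong one
                    for name, value in features.items():
                        weights[y_star][name] = weights[y_star].get(name, 0.0) + value
                        weights[y][name] = weights[y].get(name, 0.0) - value
        return weights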
Problems with the Perceptron
§ Noise: if the data isn't separable, weights might thrash
  § Averaging weight vectors over time can help (averaged perceptron)
§ Mediocre generalization: finds a "barely" separating solution
§ Overtraining: test / held-out accuracy usually rises, then falls
  § Overtraining is a kind of overfitting

Fixing the Perceptron
§ Idea: adjust the weight update to mitigate these effects
§ MIRA*: choose an update size that fixes the current mistake...
§ ...but, minimizes the change to w
(* Margin Infused Relaxed Algorithm)

Minimum Correcting Update
§ Scale the update by τ: w_y* ← w_y* + τ f(x), w_y ← w_y - τ f(x), choosing the smallest τ that corrects the mistake with a margin of 1
§ min not τ = 0, or we would not have made an error, so the min will be where equality holds:
  τ = ((w_y - w_y*) · f(x) + 1) / (2 f(x) · f(x))
§ The +1 helps to generalize

Maximum Step Size
§ In practice, it's also bad to make updates that are too large
  § Example may be labeled incorrectly
  § You may not have enough features
§ Solution: cap the maximum possible value of τ with some constant C:
  τ = min( ((w_y - w_y*) · f(x) + 1) / (2 f(x) · f(x)), C )
§ Corresponds to an optimization that assumes non-separable data
§ Usually converges faster than the perceptron
§ Usually better, especially on noisy data
§ (A code sketch follows below.)

Linear Separators
§ Which of these linear separators is optimal?
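A sketch of the capped MIRA update referenced above, reusing score() from the multiclass sketch; the default cap C = 0.01 is an illustrative assumption, not a value from the lecture:

    def mira_update(weights, features, y, y_star, C=0.01):
        # Scale the multiclass perceptron update by tau, capped at C.
        # Assumes at least one nonzero feature, so f . f > 0.
        f_dot_f = sum(value * value for value in features.values())
        margin = score(weights[y], features) - score(weights[y_star], features)
        tau = min((margin + 1.0) / (2.0 * f_dot_f), C)  # tau formula from above
        for name, value in features.items():
            weights[y_star][name] = weights[y_star].get(name, 0.0) + tau * value
            weights[y][name] = weights[y].get(name, 0.0) - tau * value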
Support Vector Machines
§ Maximizing the margin: good according to intuition, theory, practice
§ Only support vectors matter; other training examples are ignorable
§ Support vector machines (SVMs) find the separator with max margin
§ Basically, SVMs are MIRA where you optimize over all examples at once (figures: MIRA vs. SVM solutions)

Mini-Exercise
§ Give an example dataset that would be overfit by SVM, by MIRA, and by running the perceptron until convergence
§ Could running the perceptron for fewer steps lead to better generalization?

Classification: Comparison
§ Naïve Bayes:
  § Builds a model of the training data
  § Gives prediction probabilities
  § Strong assumptions about feature independence
  § One pass through data (counting)
§ Perceptrons / MIRA:
  § Makes fewer assumptions about the data
  § Mistake-driven learning
  § Multiple passes through data (prediction)
  § Often more accurate

Extension: Web Search
§ Information retrieval:
  § Given information needs, produce information
  § Includes, e.g., web search, question answering, and classic IR
§ Web search: not exactly classification, but rather ranking
§ Example query: x = "Apple Computers"

Feature-Based Ranking
§ For a query x = "Apple Computers", each candidate result y yields a feature vector f(x, y)

Perceptron for Ranking
§ Inputs x
§ Candidates y
§ Many feature vectors: f(x, y)
§ One weight vector: w
§ Prediction: y = argmax_y w · f(x, y)
§ Update (if wrong): w ← w + f(x, y*) - f(x, y)
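The ranking perceptron differs from the multiclass one only in that the feature vectors, not the weight vectors, vary per candidate. A sketch under that reading (candidates represented by their feature dicts; names assumed), reusing score() from earlier:

    def rank_predict(w, candidates):
        # candidates: list of feature dicts f(x, y), one per candidate y.
        return max(range(len(candidates)), key=lambda i: score(w, candidates[i]))

    def rank_update(w, candidates, y, y_star):
        # If the predicted candidate y is wrong, move w toward f(x, y*)
        # and away from f(x, y).
        for name, value in candidates[y_star].items():
            w[name] = w.get(name, 0.0) + value
        for name, value in candidates[y].items():
            w[name] = w.get(name, 0.0) - value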
Pacman Apprenticeship!
§ Examples are states s
§ Candidates are pairs (s, a)
§ "Correct" actions a*: those taken by the expert
§ Features defined over (s, a) pairs: f(s, a)
§ Score of a q-state (s, a) given by: w · f(s, a)
§ How is this VERY different from reinforcement learning?
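Apprenticeship learning here is the ranking perceptron with states as inputs and (s, a) pairs as candidates. A sketch, assuming a feature function f(s, a) returning a dict and an expert action a* (all names hypothetical), reusing score() from earlier:

    def apprenticeship_step(w, s, legal_actions, a_star, f):
        # One perceptron update toward the expert's action a*.
        a = max(legal_actions, key=lambda act: score(w, f(s, act)))  # argmax w.f(s,a)
        if a != a_star:
            for name, value in f(s, a_star).items():
                w[name] = w.get(name, 0.0) + value
            for name, value in f(s, a).items():
                w[name] = w.get(name, 0.0) - value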