1. Part III: Machine Learning

CS 188: Artificial Intelligence
Lecture 20: Dynamic Bayes Nets, Naïve Bayes
Pieter Abbeel – UC Berkeley (slides adapted from Dan Klein)

§ Up until now: how to reason in a model and how to make optimal decisions
§ Machine learning: how to acquire a model on the basis of data / experience
  § Learning parameters (e.g. probabilities)
  § Learning structure (e.g. BN graphs)
  § Learning hidden concepts (e.g. clustering)

This Set of Slides
§ An ML example: parameter estimation
  § Maximum likelihood
  § Smoothing
§ Applications
§ Main concepts
§ Naïve Bayes

Parameter Estimation
§ Estimating the distribution of a random variable
§ Elicitation: ask a human (why is this hard?)
§ Empirically: use training data (learning!)
§ E.g.: for each outcome x, look at the empirical rate of that value in the training sample (the slide's jelly-bean draws: r g g r g g r g g r r g g g g):
    P_ML(x) = count(x) / total samples
  For the small sample r g g this gives P_ML(r) = 1/3 (sketched in code after this page).
§ This is the estimate that maximizes the likelihood of the data
§ Issue: overfitting. E.g., what if you only observed 1 jelly bean?

Estimation: Smoothing
§ Relative frequencies are the maximum likelihood estimates
§ In Bayesian statistics, we think of the parameters as just another random variable, with its own distribution

Estimation: Laplace Smoothing
§ Laplace's estimate: pretend you saw every outcome once more than you actually did (e.g. for the coin flips H H T):
    P_LAP(x) = (count(x) + 1) / (N + |X|)
§ Can derive this as a MAP estimate with Dirichlet priors (see cs281a); a derivation follows below
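To make the relative-frequency estimate concrete, here is a minimal Python sketch; the sample and the function name are illustrative, not from the slides:

    from collections import Counter

    def ml_estimate(samples):
        """Relative-frequency (maximum likelihood) estimate of P(x)."""
        counts = Counter(samples)
        total = len(samples)
        return {x: c / total for x, c in counts.items()}

    # The slide's small sample r g g: P_ML(r) = 1/3, P_ML(g) = 2/3
    print(ml_estimate(["r", "g", "g"]))  # {'r': 0.333..., 'g': 0.666...}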

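The claim that Laplace's estimate is a MAP estimate with a Dirichlet prior is stated but not derived on the slide. A standard version of the derivation, assuming a symmetric Dirichlet prior with parameter α = 2:

    \hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid X)
                       = \arg\max_\theta P(X \mid \theta)\, P(\theta),
    \qquad P(\theta) \propto \prod_x \theta_x^{\alpha - 1}

    \Rightarrow\; \hat{\theta}_x = \frac{c(x) + \alpha - 1}{N + |X|(\alpha - 1)}
                                 = \frac{c(x) + 1}{N + |X|} \quad (\alpha = 2)

which is exactly P_LAP(x).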
2. Estimation: Laplace Smoothing (continued)

§ Laplace's estimate (extended): pretend you saw every outcome k extra times (again for the coin flips H H T):
    P_LAP,k(x) = (count(x) + k) / (N + k|X|)
§ What's Laplace with k = 0?
§ k is the strength of the prior
§ Laplace for conditionals: smooth each condition independently (a code sketch of the extended estimate follows this page):
    P_LAP,k(x | y) = (count(x, y) + k) / (count(y) + k|X|)

Example: Spam Filter
§ Input: email
§ Output: spam/ham
§ Setup:
  § Get a large collection of example emails, each labeled "spam" or "ham"
  § Note: someone has to hand label all this data!
  § Want to learn to predict labels of new, future emails
§ Features: the attributes used to make the ham/spam decision
  § Words: FREE!
  § Text patterns: $dd, CAPS
  § Non-text: SenderInContacts
  § …
§ Example emails from the slide (quoted verbatim, typos included):
  § Spam: "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …"
  § Spam: "TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99"
  § Ham: "Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened."

Example: Digit Recognition
§ Input: images / pixel grids
§ Output: a digit 0-9
§ Setup:
  § Get a large collection of example images, each labeled with a digit (the slide shows sample images labeled 0, 1, 2, 1, and one ambiguous "??")
  § Note: someone has to hand label all this data!
  § Want to learn to predict labels of new, future digit images
§ Features: the attributes used to make the digit decision
  § Pixels: (6,8)=ON
  § Shape patterns: NumComponents, AspectRatio, NumLoops
  § …

Other Classification Tasks
§ In classification, we predict labels y (classes) for inputs x
§ Examples:
  § Spam detection (input: document, classes: spam/ham)
  § OCR (input: images, classes: characters)
  § Medical diagnosis (input: symptoms, classes: diseases)
  § Automatic essay grader (input: document, classes: grades)
  § Fraud detection (input: account activity, classes: fraud/no fraud)
  § Customer service email routing
  § … many more
§ Classification is an important commercial technology!

Important Concepts
§ Data: labeled instances, e.g. emails marked spam/ham
  § Training set
  § Held-out set
  § Test set
§ Features: attribute-value pairs which characterize each x
§ Experimentation cycle (a sketch follows this page):
  § Learn parameters (e.g. model probabilities) on the training set
  § (Tune hyperparameters on the held-out set)
  § Compute accuracy on the test set
  § Very important: never "peek" at the test set!
§ Evaluation
  § Accuracy: fraction of instances predicted correctly
§ Overfitting and generalization
  § Want a classifier which does well on test data
  § Overfitting: fitting the training data very closely, but not generalizing well
  § We'll investigate overfitting and generalization formally in a few lectures

Bayes Nets for Classification
§ One method of classification: use a probabilistic model!
§ Features are observed random variables F_i
§ Y is the query variable
§ Use probabilistic inference to compute the most likely Y:
    y* = argmax_y P(y | f_1 … f_n)
§ You already know how to do this inference
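A minimal sketch of the extended (strength-k) Laplace estimate above; the function name and samples are illustrative:

    from collections import Counter

    def laplace_k(samples, domain, k=1):
        """Laplace-smoothed estimate: P(x) = (count(x) + k) / (N + k|X|)."""
        counts = Counter(samples)
        n = len(samples)
        return {x: (counts[x] + k) / (n + k * len(domain)) for x in domain}

    # The slide's coin flips H H T:
    print(laplace_k(["H", "H", "T"], ["H", "T"], k=1))  # {'H': 0.6, 'T': 0.4}
    print(laplace_k(["H", "H", "T"], ["H", "T"], k=0))  # {'H': 0.666..., 'T': 0.333...}

With k = 0 the formula reduces to the unsmoothed maximum likelihood estimate, answering the slide's question; larger k pulls the estimate toward uniform, which is why k is called the strength of the prior.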

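The experimentation cycle from the "Important Concepts" slide, as a short code sketch; the split fractions and helper names are assumptions for illustration, not part of the course material:

    def split_data(labeled, train_frac=0.8, heldout_frac=0.1):
        """Partition labeled instances into training, held-out, and test sets."""
        n = len(labeled)
        a, b = int(train_frac * n), int((train_frac + heldout_frac) * n)
        return labeled[:a], labeled[a:b], labeled[b:]

    def accuracy(predict, labeled):
        """Fraction of instances predicted correctly."""
        return sum(predict(x) == y for x, y in labeled) / len(labeled)

    # Learn parameters on train, tune hyperparameters (e.g. Laplace k) on
    # held-out, and evaluate on test exactly once -- never "peek" earlier.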
3. Simple Classification

§ Simple example: two binary features (the slide shows a small Bayes net relating a class M to features S and F)
§ Direct estimate
§ Bayes estimate (no assumptions)
§ Conditional independence

General Naïve Bayes
§ A general naïve Bayes model: a class variable Y with feature children F_1, F_2, …, F_n:
    P(Y, F_1 … F_n) = P(Y) ∏_i P(F_i | Y)
§ Parameter counts:
  § |Y| × |F|^n parameters with no assumptions (the full joint)
  § |Y| prior parameters plus n × |F| × |Y| conditional parameters under conditional independence
§ We only specify how each feature depends on the class
§ Total number of parameters is linear in n

Inference for Naïve Bayes
§ Goal: compute posterior over causes (a code sketch follows this page)
§ Step 1: get the joint probability of causes and evidence:
    P(Y, f_1 … f_n) = P(Y) ∏_i P(f_i | Y)
§ Step 2: get the probability of the evidence:
    P(f_1 … f_n) = Σ_y P(y, f_1 … f_n)
§ Step 3: renormalize:
    P(Y | f_1 … f_n) = P(Y, f_1 … f_n) / P(f_1 … f_n)

General Naïve Bayes (continued)
§ What do we need in order to use naïve Bayes?
§ Inference (you know this part):
  § Start with a bunch of conditionals, P(Y) and the P(F_i | Y) tables
  § Use standard inference to compute P(Y | F_1 … F_n)
  § Nothing new here
§ Estimates of local conditional probability tables:
  § P(Y), the prior over labels
  § P(F_i | Y) for each feature (evidence variable)
  § These probabilities are collectively called the parameters of the model and denoted by θ
  § Up until now, we assumed these appeared by magic, but they typically come from training data: we'll look at this now

A Digit Recognizer
§ Input: pixel grids
§ Output: a digit 0-9

Naïve Bayes for Digits
§ Simple version:
  § One feature F_ij for each grid position <i,j>
  § Possible feature values are on/off, based on whether intensity is more or less than 0.5 in the underlying image
  § Each input maps to a feature vector of on/off values
  § Here: lots of features, each binary valued
§ Naïve Bayes model:
    P(Y | all F_ij) ∝ P(Y) ∏_ij P(F_ij | Y)
§ What do we need to learn?
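The three inference steps above fit in a few lines of Python; the table layout and example numbers are illustrative assumptions, not values from the slides:

    def naive_bayes_posterior(prior, cond, evidence):
        """Posterior P(Y | f_1 ... f_n) via the slide's three steps.

        prior:    {y: P(y)}
        cond:     per-feature tables, cond[i][y][f] = P(F_i = f | Y = y)
        evidence: the observed feature values (f_1, ..., f_n)
        """
        # Step 1: joint probability of each cause with the evidence
        joint = {y: p for y, p in prior.items()}
        for i, f in enumerate(evidence):
            for y in joint:
                joint[y] *= cond[i][y][f]
        # Step 2: probability of the evidence
        z = sum(joint.values())
        # Step 3: renormalize
        return {y: p / z for y, p in joint.items()}

    prior = {"spam": 0.33, "ham": 0.67}
    cond = [{"spam": {"FREE!": 0.20, "other": 0.80},
             "ham":  {"FREE!": 0.01, "other": 0.99}}]
    print(naive_bayes_posterior(prior, cond, ["FREE!"]))  # spam ~ 0.91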

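The on/off pixel features on the "Naïve Bayes for Digits" slide use the 0.5 intensity threshold named there; the rest of this small extraction helper (grayscale values in [0, 1], list-of-rows layout) is an assumption:

    def pixel_features(image):
        """Map a grayscale pixel grid to binary on/off features.

        F_ij is on iff the intensity at position <i,j> exceeds 0.5.
        """
        return [[1 if pixel > 0.5 else 0 for pixel in row] for row in image]

    # A tiny 2x2 "image": only the bright pixel comes out on
    print(pixel_features([[0.9, 0.2], [0.4, 0.1]]))  # [[1, 0], [0, 0]]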
4. Examples: CPTs

§ The slide shows three learned tables for the digit model: the uniform prior P(Y) and P(F = on | Y) for two example pixel features. (The slide's headers name two specific grid positions; the extraction lost them, so the features are shown here as F and F′.)

    Y   P(Y)   P(F=on|Y)   P(F′=on|Y)
    1   0.1    0.01        0.05
    2   0.1    0.05        0.01
    3   0.1    0.05        0.90
    4   0.1    0.30        0.80
    5   0.1    0.80        0.90
    6   0.1    0.90        0.90
    7   0.1    0.05        0.25
    8   0.1    0.60        0.85
    9   0.1    0.50        0.60
    0   0.1    0.80        0.80

Parameter Estimation
§ Estimating distributions of random variables like X or X | Y
§ Empirically: use training data
  § For each outcome x, look at the empirical rate of that value:
      P_ML(x) = count(x) / total samples   (e.g. for r g g: P_ML(r) = 1/3)
  § This is the estimate that maximizes the likelihood of the data
§ Elicitation: ask a human!
  § Usually need domain experts, and sophisticated ways of eliciting probabilities (e.g. betting games)
  § Trouble calibrating

A Spam Filter
§ Naïve Bayes spam filter
§ Data:
  § Collection of emails, labeled spam or ham
  § Note: someone has to hand label all this data!
  § Split into training, held-out, and test sets
§ Classifiers:
  § Learn on the training set
  § (Tune it on a held-out set)
  § Test it on new emails
§ (The slide repeats the example spam and ham emails shown earlier.)

Naïve Bayes for Text
§ Bag-of-words naïve Bayes:
  § Predict unknown class label (spam vs. ham)
  § Assume evidence features (e.g. the words) are independent
  § Warning: subtly different assumptions than before!
§ Generative model:
    P(C, W_1 … W_n) = P(C) ∏_i P(W_i | C)
  § W_i is the word at position i, not the i-th word in the dictionary!
§ Tied distributions and bag-of-words:
  § Usually, each variable gets its own conditional probability distribution P(F | Y)
  § In a bag-of-words model:
    § Each position is identically distributed
    § All positions share the same conditional probabilities P(W | C)
  § Why make this assumption?

Example: Spam Filtering
§ Model:
    P(C, W_1 … W_n) = P(C) ∏_i P(W_i | C)
§ What are the parameters? The learned values shown on the slide:

    P(C):       ham: 0.66    spam: 0.33
    P(W|ham):   the: 0.0156  to: 0.0153   and: 0.0115   of: 0.0095
                you: 0.0093  a: 0.0086    with: 0.0080  from: 0.0075  ...
    P(W|spam):  the: 0.0210  to: 0.0133   of: 0.0119    2002: 0.0110
                with: 0.0108 from: 0.0107 and: 0.0105   a: 0.0100     ...

§ Where do these tables come from?

Spam Example
§ Running totals of log probability as each word of the message "Gary would you like to lose weight while you sleep" is processed (a code sketch follows):

    Word      P(w|spam)   P(w|ham)   Tot Spam   Tot Ham
    (prior)   0.33333     0.66666    -1.1       -0.4
    Gary      0.00002     0.00021    -11.8      -8.9
    would     0.00069     0.00084    -19.1      -16.0
    you       0.00881     0.00304    -23.8      -21.8
    like      0.00086     0.00083    -30.9      -28.9
    to        0.01517     0.01339    -35.1      -33.2
    lose      0.00008     0.00002    -44.5      -44.0
    weight    0.00016     0.00002    -53.3      -55.0
    while     0.00027     0.00027    -61.5      -63.2
    you       0.00881     0.00304    -66.2      -69.0
    sleep     0.00006     0.00001    -76.0      -80.5

§ Result: P(spam | w) = 98.9%
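A sketch of the running-total computation behind the "Spam Example" table, assuming natural-log scores; the function is a reconstruction, not the course's code:

    import math

    def log_posteriors(priors, word_probs, words):
        """Accumulate log P(C) + sum_i log P(w_i | C), then renormalize."""
        totals = {c: math.log(p) for c, p in priors.items()}  # the (prior) row
        for w in words:
            for c in totals:
                totals[c] += math.log(word_probs[c][w])  # one table row per word
        # Turn the final totals back into P(class | words); for very long
        # messages, subtract max(totals.values()) first to avoid underflow.
        z = sum(math.exp(t) for t in totals.values())
        return {c: math.exp(t) / z for c, t in totals.items()}

    priors = {"spam": 0.33333, "ham": 0.66666}
    word_probs = {"spam": {"Gary": 0.00002, "would": 0.00069, "you": 0.00881},
                  "ham":  {"Gary": 0.00021, "would": 0.00084, "you": 0.00304}}
    print(log_posteriors(priors, word_probs, ["Gary", "would", "you"]))

Run on the first three words, this reproduces the table's running totals (about -23.8 for spam and -21.8 for ham); the full ten-word message ends near -76.0 vs. -80.5, which renormalizes to P(spam | w) ≈ 98.9%.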
