MAP for Gaussian Mean and Variance


  1. MAP for Gaussian mean and variance • Conjugate priors – Mean: Gaussian prior – Variance: Wishart distribution • Prior for mean: μ ~ N(η, λ²)

  2. MAP for Gaussian Mean (assuming known variance σ²) • Independent of σ² if λ² = σ²/s • MAP under Gauss-Wishart prior – Homework 2
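For reference, with the prior μ ~ N(η, λ²) and known variance σ², the MAP (posterior-mode) estimate of the mean is μ_MAP = (σ²η + λ² Σⱼ xⱼ) / (σ² + nλ²). The sketch below (hypothetical data and prior values, not from the slides) shows that choosing λ² = σ²/s reduces the estimate to (sη + Σⱼ xⱼ)/(s + n), i.e. independent of σ²:

```python
import numpy as np

def map_gaussian_mean(x, eta, lam2, sigma2):
    """MAP estimate of a Gaussian mean under a N(eta, lam2) prior,
    assuming the observation variance sigma2 is known."""
    n = len(x)
    return (sigma2 * eta + lam2 * np.sum(x)) / (sigma2 + n * lam2)

# With lam2 = sigma2 / s the estimate reduces to (s*eta + sum(x)) / (s + n),
# i.e. it no longer depends on sigma2: the prior acts like s extra samples at eta.
x = np.random.normal(loc=2.0, scale=1.0, size=50)
s = 10.0
print(map_gaussian_mean(x, eta=0.0, lam2=1.0 / s, sigma2=1.0))
```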

  3. Bayes Optimal Classifier • Aarti Singh, Machine Learning 10-701/15-781, Sept 15, 2010

  4. Classification • Goal: predict labels Y (e.g., Sports, Science, News) from features X • Performance measured by the probability of error

  5. Optimal Classification • Optimal predictor (Bayes classifier): f*(x) = argmax_y P(Y=y|X=x) • [figure: P(Y=1|X=x) with the Bayes risk indicated] • Even the optimal classifier makes mistakes: R(f*) > 0 • The optimal classifier depends on the unknown distribution

  6. Optimal Classifier • Bayes rule: P(Y=y|X=x) = P(X=x|Y=y) P(Y=y) / P(X=x) • Optimal classifier: f*(x) = argmax_y P(X=x|Y=y) P(Y=y), where P(Y=y) is the class prior and P(X=x|Y=y) is the class conditional density
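To make the decision rule concrete, here is a minimal sketch of a Bayes classifier for a single continuous feature, assuming the class priors and class-conditional densities are known exactly (the two Gaussians and priors below are made-up illustrative values):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D example: two classes with known class-conditional
# densities P(X|Y) and class priors P(Y). The Bayes classifier picks
# argmax_y P(X=x|Y=y) P(Y=y).
priors = {0: 0.7, 1: 0.3}
cond = {0: norm(loc=-1.0, scale=1.0), 1: norm(loc=2.0, scale=1.5)}

def bayes_classify(x):
    # Return the class with the largest prior * class-conditional density.
    return max(priors, key=lambda y: priors[y] * cond[y].pdf(x))

print([bayes_classify(x) for x in (-2.0, 0.5, 3.0)])
```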

  7. Example Decision Boundaries • Gaussian class conditional densities (1 dimension/feature) • [figure: two 1-D Gaussian densities and the resulting decision boundary]
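With equal priors and equal variances, the 1-D decision boundary sits where the two class-conditional densities cross, the midpoint of the means. A small sketch, with hypothetical parameters, that finds the crossing point numerically:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Hypothetical equal-variance example: with equal priors and a shared sigma,
# the boundary is where the two class-conditional densities cross.
p0, p1 = 0.5, 0.5
g0, g1 = norm(loc=-1.0, scale=1.0), norm(loc=2.0, scale=1.0)

def score_diff(x):
    # Positive where class 0 is more probable, negative where class 1 is.
    return p0 * g0.pdf(x) - p1 * g1.pdf(x)

boundary = brentq(score_diff, -1.0, 2.0)   # root bracketed between the two means
print(boundary)  # equal priors & variances -> midpoint of the means (0.5)
```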

  8. Example Decision Boundaries • Gaussian class conditional densities (2 dimensions/features) • [figure: contours of two 2-D Gaussian densities and the resulting decision boundary]

  9. Learning the Optimal Classifier • Optimal classifier: f*(x) = argmax_y P(X=x|Y=y) P(Y=y) • Need to know: – Class prior P(Y=y) for all y – Class conditional density (likelihood) P(X=x|Y=y) for all x, y

  10. Learning the Optimal Classifier • Task: predict whether or not a picnic spot is enjoyable • Training data: X = (X_1, X_2, X_3, …, X_d), Y; n rows • Let's learn P(Y|X) – how many parameters? – Prior P(Y=y) for all y: K − 1 if K labels – Likelihood P(X=x|Y=y) for all x, y: (2^d − 1)K if d binary features

  11. Learning the Optimal Classifier • Task: predict whether or not a picnic spot is enjoyable • Training data: X = (X_1, X_2, X_3, …, X_d), Y; n rows • Let's learn P(Y|X) – how many parameters? 2^d K − 1 in total (K classes, d binary features) • Need n ≫ 2^d K − 1 training examples to learn all parameters

  12. Conditional Independence • X is conditionally independent of Y given Z: the probability distribution governing X is independent of the value of Y, given the value of Z, i.e., P(X|Y, Z) = P(X|Z) • Equivalent to: P(X, Y|Z) = P(X|Z) P(Y|Z) • e.g., P(Thunder|Rain, Lightning) = P(Thunder|Lightning) • Note: this does NOT mean Thunder is independent of Rain

  13. Conditional vs. Marginal Independence • C calls A and B separately and tells them a number n ∈ {1, …, 10} • Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was • A thinks the number was n_a and B thinks it was n_b • Are n_a and n_b marginally independent? – No; we expect, e.g., P(n_a = 1 | n_b = 1) > P(n_a = 1) • Are n_a and n_b conditionally independent given n? – Yes: once we know the true number, the outcomes n_a and n_b are determined purely by the independent noise in each phone, so P(n_a = 1 | n_b = 1, n = 2) = P(n_a = 1 | n = 2)
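A quick simulation of this example (with a made-up noise model: each listener hears the true number with probability 0.8, otherwise a uniformly random digit) illustrates both claims: n_a and n_b are marginally dependent but conditionally independent given n.

```python
import numpy as np

# Simulate the phone example: C picks n, then A and B each hear n
# correctly with prob. 0.8, otherwise a random digit in 1..10.
rng = np.random.default_rng(0)

def noisy_hear(true_n):
    return true_n if rng.random() < 0.8 else rng.integers(1, 11)

n = rng.integers(1, 11, size=200_000)
na = np.array([noisy_hear(v) for v in n])
nb = np.array([noisy_hear(v) for v in n])

# Marginally dependent: knowing n_b = 1 makes n_a = 1 more likely.
print(np.mean(na == 1), np.mean(na[nb == 1] == 1))
# Conditionally independent given n: within n = 2, n_b tells us nothing extra.
mask = n == 2
print(np.mean(na[mask] == 1), np.mean(na[mask & (nb == 1)] == 1))
```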

  14. Prediction using Conditional Independence • Predict Lightning from two conditionally independent features: Thunder and Rain • Parameters needed to learn the likelihood P(T, R|L): (2² − 1)·2 = 6 • With the conditional independence assumption P(T, R|L) = P(T|L) P(R|L): (2 − 1)·2 + (2 − 1)·2 = 4

  15. Naïve Bayes Assumption • Naïve Bayes assumption: features are independent given the class – P(X_1, X_2|Y) = P(X_1|Y) P(X_2|Y) – More generally: P(X_1, …, X_d|Y) = ∏_i P(X_i|Y) • Suppose X is composed of d binary features – how many parameters now? (2 − 1)dK vs. (2^d − 1)K
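The gap between the two counts grows exponentially with d; a tiny worked example with illustrative values d = 30, K = 2:

```python
# Parameter counts for d binary features and K classes (illustrative numbers):
# full joint likelihood vs. the naive Bayes factorization.
d, K = 30, 2
full_joint = (2**d - 1) * K      # P(X_1,...,X_d | Y) modeled directly
naive_bayes = (2 - 1) * d * K    # P(X_i | Y) for each feature separately
print(full_joint, naive_bayes)   # 2147483646 vs. 60
```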

  16. Naïve Bayes Classifier • Given: – Class prior P(Y) – d conditionally independent features X given the class Y – For each X_i, the likelihood P(X_i|Y) • Decision rule: f_NB(x) = argmax_y P(y) ∏_i P(x_i|y) • If the conditional independence assumption holds, NB is the optimal classifier! But it can be worse otherwise.

  17. Naïve Bayes Algorithm – Discrete features • Training data: n labeled examples (x^(j), y^(j)) • Maximum likelihood estimates: – Class prior: P̂(Y=y) = #{j : y^(j)=y} / n – Likelihood: P̂(X_i=x|Y=y) = #{j : x_i^(j)=x, y^(j)=y} / #{j : y^(j)=y} • NB prediction for test data: ŷ = argmax_y P̂(y) ∏_i P̂(x_i|y)
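A minimal sketch of this algorithm for discrete features, using plain maximum-likelihood counts (the toy data set and feature values are made up for illustration):

```python
from collections import Counter, defaultdict

def train_nb(X, Y):
    # Maximum likelihood estimates from counts.
    n = len(Y)
    class_counts = Counter(Y)
    feat_counts = defaultdict(Counter)          # (i, y) -> Counter over values
    for xs, y in zip(X, Y):
        for i, x in enumerate(xs):
            feat_counts[(i, y)][x] += 1
    prior = {y: c / n for y, c in class_counts.items()}
    def likelihood(i, x, y):
        return feat_counts[(i, y)][x] / class_counts[y]   # MLE, zero if unseen
    return prior, likelihood

def predict_nb(prior, likelihood, xs):
    # argmax_y P(y) * prod_i P(x_i | y)
    def score(y):
        p = prior[y]
        for i, x in enumerate(xs):
            p *= likelihood(i, x, y)
        return p
    return max(prior, key=score)

X = [("sunny", "warm"), ("rainy", "cold"), ("sunny", "cold"), ("sunny", "warm")]
Y = ["yes", "no", "no", "yes"]
prior, lik = train_nb(X, Y)
print(predict_nb(prior, lik, ("sunny", "warm")))   # -> "yes"
```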

  18. Subtlety 1 – Violation of NB Assumption • Usually, features are not conditionally independent: P(X_1, …, X_d|Y) ≠ ∏_i P(X_i|Y) • Actual probabilities P(Y|X) are often biased towards 0 or 1 (why?) • Nonetheless, NB is the single most used classifier out there – NB often performs well, even when the assumption is violated – [Domingos & Pazzani ’96] discuss some conditions for good performance

  19. Subtlety 2 – Insufficient training data • What if you never see a training instance where X_1 = a when Y = b? – e.g., Y = {SpamEmail}, X_1 = {'Earn'} – P(X_1=a | Y=b) = 0 • Thus, no matter what values X_2, …, X_d take: P(Y=b | X_1=a, X_2, …, X_d) = 0 • What now???

  20. MLE vs. MAP • What if we toss the coin too few times? – You say: probability the next toss is a head = 0 – Billionaire says: you're fired! …with prob 1 • Beta prior is equivalent to extra coin flips • As N → ∞, the prior is "forgotten" • But for small sample sizes, the prior is important!
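Concretely, a Beta(a, b) prior on the head probability acts like a − 1 extra heads and b − 1 extra tails, and the MAP estimate is (heads + a − 1)/(N + a + b − 2). A small sketch with made-up numbers:

```python
# MLE vs. MAP for a coin, with a Beta(a, b) prior acting like
# (a - 1) extra heads and (b - 1) extra tails (values here are made up).
def mle(heads, n):
    return heads / n

def map_beta(heads, n, a, b):
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 0, 3                      # three tosses, no heads observed
print(mle(heads, n))                 # 0.0 -- "next toss is never a head"
print(map_beta(heads, n, a=3, b=3))  # 2/7 -- prior keeps the estimate away from 0
# As n grows, the data dominate and the MAP estimate approaches the MLE.
```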

  21. Naïve Bayes Algorithm – Discrete features • Training data • Maximum a posteriori estimates: assume a prior and add m "virtual" examples to the counts (m = # virtual examples with Y = b) • Now, even if you never observe a class/feature combination, the posterior probability is never zero.
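One common way to realize the "m virtual examples" idea is add-m (Laplace-style) smoothing of the likelihood counts; the exact parameterization on the slide may differ, but the effect is the same: no estimate is ever exactly zero. A sketch with illustrative values:

```python
# Smoothed ("m virtual examples") likelihood estimate for discrete naive
# Bayes; num_values and m below are illustrative choices, not the slide's.
def smoothed_likelihood(count_xy, count_y, num_values, m=1):
    """P(X_i = x | Y = y) with m virtual examples spread over the
    num_values possible values of X_i (m = 1 gives Laplace smoothing)."""
    return (count_xy + m) / (count_y + m * num_values)

# Even a value never seen with class y gets non-zero probability:
print(smoothed_likelihood(count_xy=0, count_y=50, num_values=2))  # ~0.019
```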

  22. Case Study: Text Classification • Classify e-mails – Y = {Spam, NotSpam} • Classify news articles – Y = {what is the topic of the article?} • Classify webpages – Y = {Student, Professor, Project, …} • What about the features X? – The text!

  23. Features X are the entire document – X_i is the i-th word in the article

  24. NB for Text Classification • P(X|Y) is huge!!! – An article has at least 1000 words: X = {X_1, …, X_1000} – X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster's Dictionary (or more): 10,000 words, etc. • The NB assumption helps a lot!!! – P(X_i=x_i|Y=y) is just the probability of observing word x_i at the i-th position in a document on topic y

  25. Bag of words model • Typical additional assumption: position in the document doesn't matter, i.e., P(X_i=x_i|Y=y) = P(X_k=x_i|Y=y) – "Bag of words" model: the order of words on the page is ignored – Sounds really silly, but often works very well! • Example text: "When the lecture is over, remember to wake up the person sitting next to you in the lecture room."

  26. Bag of words model • Same assumption: position in the document doesn't matter, P(X_i=x_i|Y=y) = P(X_k=x_i|Y=y); the order of words on the page is ignored • The example text as a bag of words: in, is, lecture, lecture, next, over, person, remember, room, sitting, the, the, the, to, to, up, wake, when, you

  27. Bag of words approach • Each document becomes a vector of word counts over the vocabulary, e.g.: aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, …, gas 1, …, oil 1, …, Zaire 0
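A minimal sketch of building such a count vector from the example sentence on the earlier slide (tokenization here is a simple lowercase regex, chosen only for illustration):

```python
from collections import Counter
import re

# Bag of words: lowercase the text, split into word tokens, and keep
# only the counts -- word order is discarded.
sentence = ("When the lecture is over, remember to wake up the person "
            "sitting next to you in the lecture room.")
counts = Counter(re.findall(r"[a-z]+", sentence.lower()))
print(counts["the"], counts["lecture"], counts["aardvark"])   # 3 2 0
```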

  28. NB with Bag of Words for text classification • Learning phase: – Class prior P(Y) – P(X_i|Y) (explored in HW) • Test phase: – For each document, use the naïve Bayes decision rule
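A sketch of the test-phase decision rule with a bag-of-words model, scoring each class by log P(y) plus count-weighted log word probabilities. The priors and word probabilities below are invented illustrative numbers, not learned values from the homework:

```python
import math
from collections import Counter

# score(y) = log P(y) + sum over words w of count(w) * log P(w | y)
log_prior = {"spam": math.log(0.4), "notspam": math.log(0.6)}
log_word = {
    "spam":    {"earn": math.log(0.05), "free": math.log(0.04), "lecture": math.log(0.001)},
    "notspam": {"earn": math.log(0.005), "free": math.log(0.01), "lecture": math.log(0.02)},
}

def classify(document_words, default=math.log(1e-6)):
    counts = Counter(document_words)
    def score(y):
        return log_prior[y] + sum(c * log_word[y].get(w, default)
                                  for w, c in counts.items())
    return max(log_prior, key=score)

print(classify(["free", "earn", "free"]))        # -> "spam"
print(classify(["lecture", "room", "lecture"]))  # -> "notspam"
```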

  29. Twenty newsgroups results • [figure: classification results on the twenty newsgroups dataset]

  30. Learning curve for twenty newsgroups • [figure: accuracy as a function of the amount of training data]

  31. What if features are continuous? • E.g., character recognition: X_i is the intensity at the i-th pixel • Gaussian Naïve Bayes (GNB): P(X_i=x|Y=k) = N(x; μ_ik, σ²_ik) – a different mean and variance for each class k and each pixel i • Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ)

  32. Estimating parameters: Y discrete, X_i continuous • Maximum likelihood estimates (k indexes the class, j the training image, i the pixel in the j-th training image; N_k = #{j : y^(j) = k}): – μ̂_ik = (1 / N_k) Σ_{j : y^(j)=k} x_i^(j) – σ̂²_ik = (1 / N_k) Σ_{j : y^(j)=k} (x_i^(j) − μ̂_ik)²
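A minimal sketch of Gaussian naive Bayes putting these pieces together: per-class, per-feature means and (MLE) variances, followed by a log-density decision rule. The toy data is generated on the spot purely for illustration:

```python
import numpy as np

def fit_gnb(X, y):
    # Per-class prior plus per-class, per-feature mean and variance (MLE).
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),          # class prior
                     Xk.mean(axis=0),           # mu_ik for every feature i
                     Xk.var(axis=0) + 1e-9)     # sigma^2_ik (small floor for stability)
    return params

def predict_gnb(params, x):
    def log_score(k):
        prior, mu, var = params[k]
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return np.log(prior) + log_lik
    return max(params, key=log_score)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gnb(X, y)
print(predict_gnb(params, np.array([2.8, 3.1, 2.9])))   # -> 1
```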

  33. Example: GNB for classifying mental states [Mitchell et al.] • ~1 mm resolution, ~2 images per sec., 15,000 voxels/image • Non-invasive, safe; measures the Blood Oxygen Level Dependent (BOLD) response

  34. Gaussian Naïve Bayes: learned μ_{voxel,word} [Mitchell et al.] • 15,000 voxels or features • 10 training examples or subjects per class

  35. Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory) [Mitchell et al.] • Pairwise classification accuracy: 85% • [figure: mean activation maps for "People words" vs. "Animal words"]

  36. What you should know… • Optimal decision using the Bayes classifier • Naïve Bayes classifier – What the assumption is – Why we use it – How we learn it – Why Bayesian estimation is important • Text classification – Bag of words model • Gaussian NB – Features are still conditionally independent – Each feature has a Gaussian distribution given the class
