ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics – Bayes Nets – (Finish) Parameter Learning – Structure Learning Readings: KF 18.1, 18.3; Barber 9.5, 10.4 Dhruv Batra Virginia Tech
Administrativia • HW1 – Out – Due in 2 weeks: Feb 17, Feb 19, 11:59pm – Please please please please start early – Implementation: TAN, structure + parameter learning – Please post questions on Scholar Forum. (C) Dhruv Batra 2
Recap of Last Time (C) Dhruv Batra 3
Learning Bayes nets
                          Known structure    Unknown structure
  Fully observable data   Very easy          Hard
  Missing data            Somewhat easy      Very very hard (EM)
[Figure: data x(1), …, x(m) → structure + parameters (CPTs P(X_i | Pa_X_i))]
(C) Dhruv Batra Slide Credit: Carlos Guestrin 4
Learning the CPTs • For each discrete variable X_i, given data x(1), …, x(m):
  P_MLE(X_i = a | Pa_X_i = b) = Count(X_i = a, Pa_X_i = b) / Count(Pa_X_i = b)
(C) Dhruv Batra Slide Credit: Carlos Guestrin 5
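A minimal sketch of this counting procedure, assuming each training case is a dict from variable names to observed values; the function and variable names below are illustrative, not from the lecture:

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child | parents) by counting, as on the slide:
    P_MLE(X_i = a | Pa_X_i = b) = Count(X_i = a, Pa_X_i = b) / Count(Pa_X_i = b)."""
    joint = Counter()   # counts of (parent assignment, child value)
    margin = Counter()  # counts of parent assignment alone
    for x in data:      # each x is a dict: variable name -> observed value
        b = tuple(x[p] for p in parents)
        joint[(b, x[child])] += 1
        margin[b] += 1
    return {(b, a): c / margin[b] for (b, a), c in joint.items()}

# Toy example: one parent, Flu -> Sinus
data = [{"Flu": 1, "Sinus": 1}, {"Flu": 1, "Sinus": 0},
        {"Flu": 0, "Sinus": 0}, {"Flu": 0, "Sinus": 0}]
print(mle_cpt(data, child="Sinus", parents=["Flu"]))
# {((1,), 1): 0.5, ((1,), 0): 0.5, ((0,), 0): 1.0}
```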
Plan for today • (Finish) BN Parameter Learning – Parameter Sharing – Plate notation • (Start) BN Structure Learning – Log-likelihood score – Decomposability – Information never hurts (C) Dhruv Batra 6
Meta BN • Explicitly showing parameters as variables • Example on board – One variable X; parameter θ X – Two variables X,Y; parameters θ X , θ Y|X (C) Dhruv Batra 7
Global parameter independence • Global parameter independence: – All CPT parameters are independent – Prior over parameters is the product of priors over the individual CPTs [Figure: Flu, Allergy, Sinus, Nose, Headache BN] • Proposition: For fully observable data D, if the prior satisfies global parameter independence, then so does the posterior: it factorizes as a product of posteriors over the individual CPTs
Parameter Sharing • What if X 1 , … , X n are n random variables for coin tosses of the same coin? (C) Dhruv Batra 9
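If all n tosses share one parameter θ_X, the MLE pools the counts, a standard consequence of parameter sharing (sketched here, not taken verbatim from the slide):
\[
\hat{\theta}_X \;=\; \frac{\sum_{i=1}^{n} \mathbb{1}[x_i = \text{heads}]}{n}
\]
rather than a separate θ_{X_i} estimated from a single toss each.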
Naïve Bayes vs Bag-of-Words • What’s the difference? • Parameter sharing! (C) Dhruv Batra 10
Text classification • Classify e-mails – Y = {Spam, NotSpam} • What about the features X? – X_i represents the i-th word in the document, i = 1 to doc-length – X_i takes values in a vocabulary (e.g., 10,000 words) (C) Dhruv Batra 11
Bag of Words • Position in document doesn’t matter : P(X i =x i |Y=y) = P(X k =x i |Y=y) – Order of words on the page ignored – Parameter sharing When the lecture is over, remember to wake up the person sitting next to you in the lecture room. (C) Dhruv Batra Slide Credit: Carlos Guestrin 12
Bag of Words • Position in document doesn’t matter : P(X i =x i |Y=y) = P(X k =x i |Y=y) – Order of words on the page ignored – Parameter sharing in is lecture lecture next over person remember room sitting the the the to to up wake when you (C) Dhruv Batra Slide Credit: Carlos Guestrin 13
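A minimal bag-of-words Naïve Bayes sketch that makes the parameter sharing explicit: every position in a document uses the same per-class word distribution P(word | Y). The class labels, smoothing constant, and function names are illustrative assumptions, not from the slides:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Bag-of-words Naive Bayes: one shared P(word | y) for all positions."""
    vocab = {w for d in docs for w in d}
    word_counts = defaultdict(Counter)   # y -> Counter over words
    class_counts = Counter(labels)
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    log_prior = {y: math.log(c / len(docs)) for y, c in class_counts.items()}
    log_lik = {}
    for y in class_counts:
        total = sum(word_counts[y].values()) + alpha * len(vocab)
        log_lik[y] = {w: math.log((word_counts[y][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(doc, log_prior, log_lik):
    # Words outside the training vocabulary are simply ignored in this sketch.
    scores = {y: log_prior[y] + sum(log_lik[y].get(w, 0.0) for w in doc)
              for y in log_prior}
    return max(scores, key=scores.get)

docs = [["win", "money", "now"], ["lecture", "is", "over"]]
labels = ["Spam", "NotSpam"]
print(predict(["money", "now"], *train_nb(docs, labels)))  # -> "Spam"
```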
HMMs semantics: Details • Hidden states X_1, …, X_5, each taking values in {a, …, z}; observations O_1, …, O_5 • Just 3 distributions: the initial distribution P(X_1), the transition distribution P(X_i | X_i-1), and the emission distribution P(O_i | X_i) (C) Dhruv Batra Slide Credit: Carlos Guestrin 14
N-grams • Learnt from Darwin's On the Origin of Species [Figure: unigram letter frequencies (e.g., space ≈ 0.161, 'e' ≈ 0.111, 't' ≈ 0.076) and a bigram letter-pair matrix over the alphabet {_, a, …, z}] (C) Dhruv Batra Image Credit: Kevin Murphy 15
Plate Notation • X 1 , … , X n are n random variables for coin tosses of the same coin • Plate denotes replication (C) Dhruv Batra 16
Plate Notation • Plates denote replication of random variables [Figure: Y with X_j inside a plate indexed j = 1, …, D] (C) Dhruv Batra 17
Hierarchical Bayesian Models • Why stop with a single prior? (C) Dhruv Batra 18
BN: Parameter Learning: What you need to know • Parameter Learning – MLE • Decomposes; results in counting procedure • Will shatter dataset if too many parents – Bayesian Estimation • Conjugate priors • Priors = regularization (also viewed as smoothing) • Hierarchical priors – Plate notation – Shared parameters (C) Dhruv Batra 19
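One way to see "priors = regularization / smoothing": under a symmetric Dirichlet(α) prior on each CPT row (a standard conjugate choice; the symbol α is introduced here for illustration), the posterior-mean estimate simply adds pseudo-counts to the MLE counts:
\[
\hat{P}_{\text{Bayes}}(X_i = a \mid \mathrm{Pa}_{X_i} = b) \;=\; \frac{\mathrm{Count}(X_i = a, \mathrm{Pa}_{X_i} = b) + \alpha}{\mathrm{Count}(\mathrm{Pa}_{X_i} = b) + \alpha\,|\mathrm{Val}(X_i)|}
\]
Here α = 1 gives Laplace smoothing, and α → 0 recovers the MLE.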
Learning Bayes nets
                          Known structure    Unknown structure
  Fully observable data   Very easy          Hard
  Missing data            Somewhat easy      Very very hard (EM)
[Figure: data x(1), …, x(m) → structure + parameters (CPTs P(X_i | Pa_X_i))]
(C) Dhruv Batra Slide Credit: Carlos Guestrin 20
Goals of Structure Learning • Prediction – Care about a good structure because presumably it will lead to good predictions • Discovery – I want to understand some system [Figure: data x(1), …, x(m) → structure + parameters (CPTs P(X_i | Pa_X_i))] (C) Dhruv Batra 21
Types of Errors • Truth: [Figure: true BN over Flu, Allergy, Sinus, Nose, Headache] • Recovered: [Figure: two candidate structures over the same variables] (C) Dhruv Batra 22
Learning the structure of a BN • Constraint-based approach – Test conditional independencies in data – Find an I-map • Score-based approach – Finding a structure and parameters is a density estimation task – Evaluate model as we evaluated parameters: Maximum likelihood, Bayesian, etc. [Figure: data <x_1(1), …, x_n(1)>, …, <x_1(m), …, x_n(m)> → learn structure and parameters of the Flu/Allergy/Sinus/Nose/Headache BN] (C) Dhruv Batra Slide Credit: Carlos Guestrin 23
Score-based approach • From data <x_1(1), …, x_n(1)>, …, <x_1(m), …, x_n(m)>: for each possible structure, learn parameters and score the structure [Figure: three candidate structures over Flu, Allergy, Sinus, Nose, Headache, scored −52, −60, −500] (C) Dhruv Batra Slide Credit: Carlos Guestrin 24
How many graphs? • N vertices. • How many (undirected) graphs? • How many (undirected) trees? (C) Dhruv Batra 25
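For reference, the standard counts (not shown in the extracted slide text): on N labeled vertices there are
\[
2^{\binom{N}{2}} \ \text{undirected graphs} \qquad\text{and}\qquad N^{\,N-2} \ \text{labeled trees (Cayley's formula).}
\]
Both counts are super-exponential, which motivates score-based search and, in particular, the tree-structured restriction where the optimum can still be found efficiently.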
What’s a good score? • Score(G) = log-likelihood(G : D, θ MLE ) (C) Dhruv Batra 26
Information-theoretic interpretation of Maximum Likelihood Score • Consider two node graph – Derived on board (C) Dhruv Batra 27
Information-theoretic interpretation of Maximum Likelihood Score • For a general graph G [Figure: Flu, Allergy, Sinus, Nose, Headache BN] (C) Dhruv Batra Slide Credit: Carlos Guestrin 28
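Since the slide's equation is an image, here is the standard decomposition it refers to (see KF 18.3), where M is the number of samples and Î, Ĥ denote empirical mutual information and entropy:
\[
\mathrm{score}_{\mathrm{ML}}(G : D) \;=\; M \sum_{i} \hat{I}\!\left(X_i;\, \mathrm{Pa}^{G}_{X_i}\right) \;-\; M \sum_{i} \hat{H}(X_i)
\]
Only the first term depends on the graph, so maximizing the score means choosing parents with high empirical mutual information with each node.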
Information-theoretic interpretation of Maximum Likelihood Score [Figure: Flu, Allergy, Sinus, Nose, Headache BN] • Implications: – Intuitive: higher mutual information → higher score – Decomposes over families in the BN (a node and its parents) – Same score for I-equivalent structures! – Information never hurts! (C) Dhruv Batra 29
Chow-Liu tree learning algorithm 1 • For each pair of variables X i ,X j – Compute empirical distribution: – Compute mutual information: • Define a graph – Nodes X 1 , … ,X n – Edge (i,j) gets weight (C) Dhruv Batra Slide Credit: Carlos Guestrin 30
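The quantities on this slide are images in the original; in the standard Chow-Liu setup they are the empirical pairwise distribution, its mutual information, and the resulting edge weight:
\[
\hat{P}(x_i, x_j) = \frac{\mathrm{Count}(x_i, x_j)}{M}, \qquad
\hat{I}(X_i; X_j) = \sum_{x_i, x_j} \hat{P}(x_i, x_j)\,\log \frac{\hat{P}(x_i, x_j)}{\hat{P}(x_i)\,\hat{P}(x_j)}, \qquad
w_{ij} = \hat{I}(X_i; X_j).
\]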
Chow-Liu tree learning algorithm 2 • Optimal tree BN – Compute maximum weight spanning tree – Directions in BN: pick any node as root, and direct edges away from root • breadth-first-search defines directions (C) Dhruv Batra Slide Credit: Carlos Guestrin 31
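Putting the two slides together, a minimal end-to-end sketch in Python, assuming discrete data in a NumPy array; the function names, the pure-Python Prim step, and the toy data are illustrative rather than taken from the lecture:

```python
import numpy as np
from collections import deque

def empirical_mi(x, y):
    """Empirical mutual information between two discrete columns."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def chow_liu(data, root=0):
    """data: (M, n) array of discrete values. Returns directed tree edges."""
    n = data.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = empirical_mi(data[:, i], data[:, j])
    # Maximum-weight spanning tree via Prim's algorithm on the complete graph
    in_tree, undirected = {root}, []
    while len(in_tree) < n:
        i, j = max(((u, v) for u in in_tree for v in range(n) if v not in in_tree),
                   key=lambda e: W[e])
        undirected.append((i, j))
        in_tree.add(j)
    # Direct edges away from the root by breadth-first search
    adj = {v: [] for v in range(n)}
    for i, j in undirected:
        adj[i].append(j); adj[j].append(i)
    edges, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                edges.append((u, v)); seen.add(v); queue.append(v)
    return edges  # list of (parent, child) pairs

data = np.array([[0, 0, 1], [1, 1, 0], [1, 1, 1], [0, 0, 0]])
print(chow_liu(data))  # e.g. [(0, 1), (0, 2)] on this toy data
```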
Can we extend Chow-Liu? • Tree augmented naïve Bayes (TAN) [Friedman et al. '97] – The naïve Bayes model over-counts evidence because correlations between features are not considered – Same as Chow-Liu, but score edges with: (C) Dhruv Batra Slide Credit: Carlos Guestrin 32
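In the standard TAN construction [Friedman et al. '97], the edge weight replacing the Chow-Liu score is the conditional mutual information given the class label (reconstructed here; the slide's formula is an image):
\[
\hat{I}(X_i; X_j \mid Y) \;=\; \sum_{x_i, x_j, y} \hat{P}(x_i, x_j, y)\,\log \frac{\hat{P}(x_i, x_j \mid y)}{\hat{P}(x_i \mid y)\,\hat{P}(x_j \mid y)}
\]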