Multinomial Naïve Bayes: A Generative Story

Generative story
π = distribution over L labels
for label k = 1 to L: θ_k = distribution over J feature values
for item i = 1 to N:
  z_i ~ Cat(π)
  for each feature j: x_{i,j} ~ Cat(θ_{z_i})

Maximize the log-likelihood
ℒ = Σ_i Σ_j log θ_{z_i, x_{i,j}} + Σ_i log π_{z_i}
s.t. Σ_j θ_{k,j} = 1 ∀k, θ_{k,j} ≥ 0, Σ_k π_k = 1, π_k ≥ 0

via Lagrange multipliers (the ≥ 0 constraints not shown):
ℒ = Σ_i Σ_j log θ_{z_i, x_{i,j}} + Σ_i log π_{z_i} − λ(Σ_k π_k − 1) − Σ_k μ_k (Σ_j θ_{k,j} − 1)
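Setting the derivatives of this Lagrangian to zero yields the count-based estimates on the next slide; a brief sketch of the π step (the θ step is analogous):

```latex
\frac{\partial \mathcal{L}}{\partial \pi_k}
  = \frac{\#\{i : z_i = k\}}{\pi_k} - \lambda = 0
  \;\Rightarrow\; \pi_k = \frac{\#\{i : z_i = k\}}{\lambda},
\qquad
\sum_k \pi_k = 1 \;\Rightarrow\; \lambda = N
\;\Rightarrow\; \pi_k = \frac{\#\{i : z_i = k\}}{N}.
```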
Multinomial Naïve Bayes: Learning

Calculate class priors
For each class k:
  items_k = all items with label k
  π_k = |items_k| / # items

Calculate feature generation terms
For each class k:
  obs_k = single object containing all items labeled k
  for each feature j:
    n_{k,j} = # of occurrences of j in obs_k
    θ_{k,j} = n_{k,j} / Σ_{j'} n_{k,j'}
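A minimal sketch of these counting steps in Python; the `items`/`labels` input format and the optional add-one smoothing are illustrative assumptions, not part of the slide.

```python
from collections import Counter

def train_multinomial_nb(items, labels, smoothing=0.0):
    """items: list of lists of feature values (e.g., word tokens);
    labels: list of class labels, one per item.
    Returns class priors pi[k] and per-class feature distributions theta[k][j]."""
    n_items = len(items)
    classes = set(labels)
    vocab = {j for item in items for j in item}
    pi, theta = {}, {}

    for k in classes:
        # class prior: fraction of items labeled k
        items_k = [item for item, y in zip(items, labels) if y == k]
        pi[k] = len(items_k) / n_items

        # obs_k: pool all items labeled k into one bag of feature counts
        counts = Counter(j for item in items_k for j in item)
        total = sum(counts.values()) + smoothing * len(vocab)
        theta[k] = {j: (counts[j] + smoothing) / total for j in vocab}

    return pi, theta

# usage: pi, theta = train_multinomial_nb([["w1", "w2"], ["w2", "w3"]], ["A", "B"])
```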
Brill and Banko (2001): with enough data, the classifier may not matter. Adapted from Jurafsky & Martin (draft)
Summary: Naïve Bayes is Not So Naïve, but Not Without Issue

Pro: Very fast, low storage requirements | Con: Why not model the posterior in one go (e.g., use conditional maxent)?
Pro: Robust to irrelevant features | Con: Are the features really uncorrelated?
Pro: Very good in domains with many equally important features | Con: Are plain counts always appropriate?
Pro: Optimal if the independence assumptions hold | Con: Are there "better" (automated, more principled) ways of handling missing/noisy data?
Pro: Dependable baseline for text classification (but often not the best)

Adapted from Jurafsky & Martin (draft)
Outline Directed Graphical Models NaΓ―ve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Undirected Graphical Models

An undirected graph G = (V, E) that represents a probability distribution over random variables X_1, …, X_n

Joint probability factorizes based on cliques in the graph

Common name: Markov Random Fields

Undirected graphs can have an alternative formulation as Factor Graphs
Markov Random Fields: Undirected Graphs

clique: subset of nodes, where nodes are pairwise connected
maximal clique: a clique that cannot add a node and remain a clique

p(x_1, x_2, x_3, …, x_n) = (1/Z) ∏_C ψ_C(x_C)

Z: global normalization
C: ranges over the maximal cliques
x_C: the variables that are part of clique C
ψ_C: potential function (not necessarily a probability!)

Q: What restrictions should we place on the potentials ψ_C?
A: ψ_C ≥ 0 (or ψ_C > 0)
Terminology: Potential Functions

p(x_1, x_2, x_3, …, x_n) = (1/Z) ∏_C ψ_C(x_C)

ψ_C(x_C) = exp(−E(x_C))    (Boltzmann distribution)

E(x_C): energy function for clique C
(get the total energy of a configuration by summing the individual clique energy functions)
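A small Python sketch of how clique energies combine into potentials and a joint distribution; the three-node triangle graph and the energy values are made-up examples, not from the slides.

```python
import math
from itertools import product

# made-up pairwise energies on a triangle graph over binary variables x, y, z
def E_xy(x, y): return 0.0 if x == y else 1.0   # prefer x and y to agree
def E_yz(y, z): return 0.0 if y == z else 1.0
def E_xz(x, z): return 0.5 * x * z              # arbitrary example term

def unnormalized_p(x, y, z):
    # total energy = sum of clique energies; potential = exp(-energy)
    total_energy = E_xy(x, y) + E_yz(y, z) + E_xz(x, z)
    return math.exp(-total_energy)

# global normalizer Z: sum the unnormalized potentials over all configurations
Z = sum(unnormalized_p(x, y, z) for x, y, z in product([0, 1], repeat=3))
p = {cfg: unnormalized_p(*cfg) / Z for cfg in product([0, 1], repeat=3)}
```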
Ambiguity in Undirected Model Notation

The same undirected graph (a triangle over X, Y, Z) is consistent with either factorization:

p(x, y, z) ∝ ψ(x, y, z)
p(x, y, z) ∝ ψ_1(x, y) ψ_2(y, z) ψ_3(x, z)
Outline Directed Graphical Models NaΓ―ve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
MRFs as Factor Graphs

Undirected graphs: G = (V, E) that represents p(X_1, …, X_n)

Factor graph of p: bipartite graph of evidence nodes X, factor nodes F, and edges T
Evidence nodes X are the random variables
Factor nodes F take values associated with the potential functions
Edges show what variables are used in which factors
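One way to make the bipartite structure concrete is a small data structure like the sketch below; the class and field names are illustrative assumptions, not a reference implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class Factor:
    variables: Tuple[str, ...]        # which variable nodes this factor touches
    table: Callable[..., float]       # potential function over those variables

@dataclass
class FactorGraph:
    variables: Dict[str, List[int]] = field(default_factory=dict)  # name -> domain
    factors: List[Factor] = field(default_factory=list)

    def add_variable(self, name, domain):
        self.variables[name] = list(domain)

    def add_factor(self, variables, table):
        # the edge set is implicit: a factor is connected to every variable it names
        self.factors.append(Factor(tuple(variables), table))

# the triangle from the previous slide, written with three pairwise factors
g = FactorGraph()
for v in ("X", "Y", "Z"):
    g.add_variable(v, [0, 1])
g.add_factor(("X", "Y"), lambda x, y: 2.0 if x == y else 1.0)
g.add_factor(("Y", "Z"), lambda y, z: 2.0 if y == z else 1.0)
g.add_factor(("X", "Z"), lambda x, z: 1.0)
```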
Different Factor Graph Notation for the Same Graph

[Figure: the same X–Y–Z graph drawn with a single three-way factor vs. with separate pairwise factors]
Directed vs. Undirected Models: Moralization

Directed graph: x_1, x_2, x_3 are parents of x_4
p(x_1, …, x_4) = p(x_1) p(x_2) p(x_3) p(x_4 | x_1, x_2, x_3)

Moralization: parents of a node in the directed graph must be connected in the undirected graph
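A short sketch of the moralization step (the `parents` dict format is an assumption for illustration): connect all co-parents, then drop edge directions.

```python
from itertools import combinations

def moralize(parents):
    """parents: dict mapping each node to the list of its parents in the directed graph.
    Returns the undirected edge set of the moral graph."""
    edges = set()
    for child, pars in parents.items():
        # keep every parent-child edge, now undirected
        for p in pars:
            edges.add(frozenset((p, child)))
        # "marry" the parents: co-parents must be connected in the undirected graph
        for p1, p2 in combinations(pars, 2):
            edges.add(frozenset((p1, p2)))
    return edges

# the example from the slide: x1, x2, x3 are all parents of x4
print(moralize({"x4": ["x1", "x2", "x3"], "x1": [], "x2": [], "x3": []}))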
Example: Linear Chain

States z_1, z_2, z_3, z_4 with observations w_1, w_2, w_3, w_4

Directed (e.g., hidden Markov model [HMM]; generative)
Directed (e.g., maximum entropy Markov model [MEMM]; conditional)
Undirected, also drawn as a factor graph (e.g., conditional random field [CRF])
Example: Linear Chain Conditional Random Field

Widely used in applications like part-of-speech tagging
  Noun-Mod   Noun    Verb   Noun
  President  Obama   told   Congress …

and named entity recognition
  Person     Person  Other  Org.
  President  Obama   told   Congress …
Linear Chain CRFs for Part of Speech Tagging

A linear chain CRF is a conditional probabilistic model of the sequence of tags z_1, z_2, …, z_N conditioned on the entire input sequence x_{1:N}:

p(z_1, z_2, …, z_N | x_{1:N})
Linear Chain CRFs for Part of Speech Tagging

Factor graph: a pairwise factor f_i connects tags z_i and z_{i+1}; a unary factor g_i attaches to tag z_i

p(z_1, z_2, …, z_N | x_{1:N}) ∝ ∏_{i=1}^{N} exp( f_i(z_i, z_{i+1}) + g_i(z_i) )
Linear Chain CRFs for Part of Speech Tagging

f_i: inter-tag features (can depend on any/all input words x_{1:N})
g_i: solo tag features (can depend on any/all input words x_{1:N})

Feature design, just like in maxent models!

Example:
f_{j, N→V}(z_j, z_{j+1}) = 1 if z_j == N & z_{j+1} == V, else 0
f_{j, told, N→V}(z_j, z_{j+1}) = 1 if z_j == N & z_{j+1} == V & x_j == told, else 0
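A minimal sketch, assuming binary indicator features like those above, of how the unnormalized score of one tag sequence could be computed; the particular feature functions and weights are illustrative, not from the slides.

```python
import math

# illustrative indicator features; each takes (tags, words, i) and returns 0 or 1
def f_noun_to_verb(tags, words, i):
    return 1.0 if tags[i] == "N" and tags[i + 1] == "V" else 0.0

def f_told_noun_to_verb(tags, words, i):
    return 1.0 if tags[i] == "N" and tags[i + 1] == "V" and words[i] == "told" else 0.0

def g_capitalized_noun(tags, words, i):
    return 1.0 if tags[i] == "N" and words[i][0].isupper() else 0.0

# (feature, weight) pairs -- in a real CRF the weights would be learned
TRANSITION_FEATURES = [(f_noun_to_verb, 1.2), (f_told_noun_to_verb, 0.8)]
SOLO_FEATURES = [(g_capitalized_noun, 0.5)]

def unnormalized_score(tags, words):
    """exp of the summed, weighted feature scores; dividing by the sum of this
    quantity over all tag sequences (the normalizer) gives p(z | x)."""
    total = 0.0
    for i in range(len(tags)):
        total += sum(w * g(tags, words, i) for g, w in SOLO_FEATURES)
        if i + 1 < len(tags):
            total += sum(w * f(tags, words, i) for f, w in TRANSITION_FEATURES)
    return math.exp(total)

print(unnormalized_score(["N", "N", "V", "N"], ["President", "Obama", "told", "Congress"]))
```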
Outline Directed Graphical Models NaΓ―ve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Example: Ising Model

Image denoising (Bishop, 2006; Fig 8.30)
y_i: observed (noisy) pixel/state, w/ 10% noise
x_i: original pixel/state
[Figure: the original image, the noisy version, and two denoised solutions]

Q: What are the cliques?

E(x, y) = h Σ_i x_i − β Σ_{i,j} x_i x_j − η Σ_i x_i y_i
  h term: allow for a bias
  β term: neighboring pixels should be similar
  η term: x_i and y_i should be correlated

Q: Why subtract the β and η terms?
A: Better states → lower energy (higher potential), since ψ_C(x_C) = exp(−E(x_C))
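A small numpy sketch of this energy for ±1 images, with a greedy ICM-style update as a usage example; the parameter values (h, β, η) and the use of ICM here are assumptions for illustration.

```python
import numpy as np

def ising_energy(x, y, h=0.0, beta=1.0, eta=2.0):
    """E(x, y) = h*sum(x_i) - beta*sum over neighboring pairs x_i*x_j - eta*sum(x_i*y_i),
    for x, y in {-1, +1}^(H x W)."""
    pair_term = np.sum(x[:, :-1] * x[:, 1:]) + np.sum(x[:-1, :] * x[1:, :])  # right + down neighbors
    return h * np.sum(x) - beta * pair_term - eta * np.sum(x * y)

def icm_denoise(y, h=0.0, beta=1.0, eta=2.0, sweeps=5):
    """Iterated conditional modes: flip each pixel only if the flip lowers the energy."""
    x = y.copy()
    for _ in range(sweeps):
        for i in range(x.shape[0]):
            for j in range(x.shape[1]):
                before = ising_energy(x, y, h, beta, eta)
                x[i, j] *= -1
                if ising_energy(x, y, h, beta, eta) > before:  # flip made things worse, undo
                    x[i, j] *= -1
    return x

# usage: a random +/-1 image stands in for the noisy observation
noisy = np.where(np.random.rand(8, 8) < 0.9, 1, -1)
denoised = icm_denoise(noisy)
```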
Markov Random Fields with Factor Graph Notation

y_i: observed (noisy) pixel/state
x_i: original pixel/state

Variable nodes: the x_i and y_i
Factor nodes are added according to maximal cliques (unary and binary factors in this model)

Factor graphs are bipartite
Outline Directed Graphical Models NaΓ―ve Bayes Undirected Graphical Models Factor Graphs Ising Model Message Passing: Graphical Model Inference
Two Problems for Undirected Models

p(x_1, x_2, x_3, …, x_n) = (1/Z) ∏_C ψ_C(x_C)

Finding the normalizer
Z = Σ_x ∏_m f_m(x_m)

Computing the marginals: sum over all variable combinations, with the x_n coordinate fixed
Z_n(w) = Σ_{x : x_n = w} ∏_m f_m(x_m)

Example: 3 variables, fix the 2nd dimension
Z_2(w) = Σ_{x_1} Σ_{x_3} ∏_m f_m(x = (x_1, w, x_3))

Q: Why are these difficult?
A: Many different combinations (the sums range over exponentially many configurations)
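A brute-force sketch of both quantities for a tiny factor graph (the factors are illustrative); it makes the difficulty concrete, since the loops visit every configuration.

```python
from itertools import product

# illustrative factors over three binary variables (x1, x2, x3)
factors = [
    (("x1", "x2"), lambda a, b: 2.0 if a == b else 1.0),
    (("x2", "x3"), lambda a, b: 2.0 if a == b else 1.0),
]
variables = ["x1", "x2", "x3"]
domain = [0, 1]

def unnormalized(assignment):
    score = 1.0
    for vars_m, f_m in factors:
        score *= f_m(*(assignment[v] for v in vars_m))
    return score

# normalizer: sum over all |domain|^n configurations -- exponential in n
Z = sum(unnormalized(dict(zip(variables, cfg))) for cfg in product(domain, repeat=3))

# marginal of one variable: fix it to w, sum over everything else, divide by Z
def marginal(var, w):
    total = sum(unnormalized(dict(zip(variables, cfg)))
                for cfg in product(domain, repeat=3)
                if dict(zip(variables, cfg))[var] == w)
    return total / Z

print(Z, marginal("x2", 1))
```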
Message Passing: Count the Soldiers If you are the front soldier in the line, say the number βoneβ to the soldier behind you. If you are the rearmost soldier in the line, say the number βoneβ to the soldier in front of you. If a soldier ahead of or behind you says a number to you, add one to it, and say the new number to the soldier on the other side ITILA, Ch 16
Sum-Product Algorithm Main idea: message passing An exact inference algorithm for tree-like graphs Belief propagation (forward-backward for HMMs) is a special case
Sum-Product

Definition of the marginal:
p(x_n = w) = Σ_{x : x_n = w} p(x_1, x_2, …, x_n, …, x_N)

Main idea: use the bipartite nature of the factor graph to efficiently compute the marginals; the factor nodes can act as filters

Alternative marginal computation, via messages μ passed between factor and variable nodes:
p(x_n = w) ∝ ∏_{m ∈ M(n)} μ_{f_m → x_n}(w)
Sum-Product

From variables to factors:
μ_{x_n → f_m}(x_n) = ∏_{m' ∈ M(n) \ m} μ_{f_{m'} → x_n}(x_n)
M(n): set of factors in which variable n participates
(default value of 1 if the product is empty)

From factors to variables:
μ_{f_m → x_n}(x_n) = Σ_{x_m \ x_n} f_m(x_m) ∏_{n' ∈ N(m) \ n} μ_{x_{n'} → f_m}(x_{n'})
N(m): set of variables that the m-th factor depends on
(sum over configurations of the variables for the m-th factor, with variable n fixed)
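A compact sketch of these two updates on a small chain-structured factor graph; the particular variables, factors, and leaf-to-target message schedule are illustrative assumptions.

```python
from itertools import product

# chain factor graph over binary variables: x1 -- f12 -- x2 -- f23 -- x3
domain = [0, 1]
factors = {
    "f12": (("x1", "x2"), lambda a, b: 2.0 if a == b else 1.0),
    "f23": (("x2", "x3"), lambda a, b: 3.0 if a == b else 1.0),
}

def factor_to_var_message(fname, target, incoming):
    """mu_{f -> x}(x): sum the factor over its other variables, weighting each term
    by the product of the incoming variable-to-factor messages."""
    vars_m, f_m = factors[fname]
    others = [v for v in vars_m if v != target]
    msg = {}
    for w in domain:
        total = 0.0
        for cfg in product(domain, repeat=len(others)):
            assignment = dict(zip(others, cfg))
            assignment[target] = w
            term = f_m(*(assignment[v] for v in vars_m))
            for v in others:
                term *= incoming.get((v, fname), {u: 1.0 for u in domain})[assignment[v]]
            total += term
        msg[w] = total
    return msg

# leaf variables x1 and x3 send the all-ones message (empty product) to their factors
var_to_factor = {("x1", "f12"): {w: 1.0 for w in domain},
                 ("x3", "f23"): {w: 1.0 for w in domain}}

# both factors send messages into x2; their product gives the unnormalized marginal
m1 = factor_to_var_message("f12", "x2", var_to_factor)
m2 = factor_to_var_message("f23", "x2", var_to_factor)
unnorm = {w: m1[w] * m2[w] for w in domain}
Z = sum(unnorm.values())
print({w: unnorm[w] / Z for w in domain})   # marginal p(x2)
```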