Predicting Structures: Conditional Models and Local Classifiers CS 6355: Structured Prediction 1
Outline • Sequence models • Hidden Markov models – Inference with HMM – Learning • Conditional Models and Local Classifiers • Global models – Conditional Random Fields – Structured Perceptron for sequences 2
Today’s Agenda • Conditional models for predicting sequences • Log-linear models for multiclass classification • Maximum Entropy Markov Models 3
HMM redux • The independence assumption: $P(x_1, x_2, \cdots, x_n, y_1, y_2, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$ (the emission factor $P(x_t \mid y_t)$ is the probability of the input given the prediction!) • Training via maximum likelihood: $\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}^i, \mathbf{y}^i \mid \pi, A, B)$ We are optimizing the joint likelihood of the input and the output for training. At prediction time, we only care about the probability of the output given the input: $P(y_1, y_2, \cdots, y_n \mid x_1, x_2, \cdots, x_n)$. Why not directly optimize this conditional likelihood instead? 9
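To make the training objective above concrete, here is a minimal sketch (my own, not from the lecture) of the joint log-likelihood that HMM training maximizes. It assumes words and tags are already mapped to integer ids, and that pi, A, and B hold the initial, transition, and emission probabilities; all names are illustrative.

```python
import numpy as np

def joint_log_prob(x, y, pi, A, B):
    """log P(x_1..x_n, y_1..y_n) = log P(y_1) + sum_t log P(y_t | y_{t-1}) + sum_t log P(x_t | y_t)."""
    lp = np.log(pi[y[0]])                    # initial state term P(y_1)
    for t in range(1, len(y)):
        lp += np.log(A[y[t - 1], y[t]])      # transition term P(y_t | y_{t-1})
    for t in range(len(y)):
        lp += np.log(B[y[t], x[t]])          # emission term P(x_t | y_t)
    return lp

def joint_data_log_likelihood(data, pi, A, B):
    """Training objective: sum_i log P(x^i, y^i | pi, A, B) over labeled sequences (x, y)."""
    return sum(joint_log_prob(x, y, pi, A, B) for x, y in data)
```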
Modeling next-state directly • Instead of modeling the joint distribution $P(\mathbf{x}, \mathbf{y})$, focus on $P(\mathbf{y} \mid \mathbf{x})$ only – Which is what we care about eventually anyway (at least in this context) • For sequences, different formulations – Maximum Entropy Markov Model [McCallum et al., 2000] – Projection-based Markov Model [Punyakanok and Roth, 2001] (other names: discriminative/conditional Markov model, …) 10
Generative vs Discriminative models • Generative models – learn P(x, y) – Characterize how the data is generated (both inputs and outputs) – E.g.: Naïve Bayes, Hidden Markov Model • Discriminative models – learn P(y | x) – Directly characterize the decision boundary only – E.g.: Logistic Regression, Conditional models (several names) A generative model tries to characterize the distribution of the inputs; a discriminative model doesn't care. 15
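As a small, hedged illustration of this contrast (not part of the lecture), the snippet below fits a generative and a discriminative classifier on the same toy data with scikit-learn; the data and settings are assumptions for illustration only.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB        # generative: fits P(x | y) and P(y)
from sklearn.linear_model import LogisticRegression  # discriminative: fits P(y | x) directly

X = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 2, 1]])  # toy bag-of-words counts
y = np.array([0, 1, 0, 1])

generative = MultinomialNB().fit(X, y)             # learns class priors and per-class input distributions
discriminative = LogisticRegression().fit(X, y)    # learns only the decision boundary

print(generative.predict_proba(X[:1]))       # P(y | x) obtained via Bayes rule from the joint model
print(discriminative.predict_proba(X[:1]))   # P(y | x) modeled directly
```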
HMM redux • The independence assumption: $P(x_1, \cdots, x_n, y_1, \cdots, y_n) = P(y_1) \prod_{t=1}^{n-1} P(y_{t+1} \mid y_t) \prod_{t=1}^{n} P(x_t \mid y_t)$ (the emission factor is the probability of the input given the prediction!) • Training via maximum likelihood: $\max_{\pi, A, B} P(D \mid \pi, A, B) = \max_{\pi, A, B} \prod_i P(\mathbf{x}^i, \mathbf{y}^i \mid \pi, A, B)$ We are optimizing the joint likelihood of the input and the output for training. At prediction time, we only care about the probability of the output given the input. Why not directly optimize this conditional likelihood instead? 16
Let's revisit the independence assumptions • HMM: $P(y_t \mid y_{t-1}, \text{anything else}) = P(y_t \mid y_{t-1})$ and $P(x_t \mid y_t, \text{anything else}) = P(x_t \mid y_t)$ [Figure: HMM graphical model with state nodes $y_{t-1} \to y_t$ and emission $y_t \to x_t$] 17
Another independence assumption: $P(y_t \mid y_{t-1}, y_{t-2}, \cdots, x_t, x_{t-1}, \cdots) = P(y_t \mid y_{t-1}, x_t)$ [Figure: the HMM ($y_{t-1} \to y_t$, $y_t \to x_t$) vs. the conditional model ($y_{t-1} \to y_t$, $x_t \to y_t$)] This assumption lets us write the conditional probability of the entire output sequence $\mathbf{y}$ as $P(\mathbf{y} \mid \mathbf{x}) = \prod_t P(y_t \mid y_{t-1}, x_t)$. We need to learn this function. 21
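A minimal sketch (assumed names, not the lecture's code) of how this factorization scores a tag sequence: local_log_prob stands in for whatever learned local model returns log-probabilities over tags given the previous tag and the current input.

```python
import numpy as np

def sequence_log_prob(x, y, local_log_prob, start_tag=0):
    """log P(y | x) = sum_t log P(y_t | y_{t-1}, x_t) under the conditional factorization."""
    lp = 0.0
    prev = start_tag                             # a designated start symbol for t = 1
    for x_t, y_t in zip(x, y):
        lp += local_log_prob(prev, x_t)[y_t]     # log P(y_t | y_{t-1}, x_t)
        prev = y_t
    return lp

# Toy local model: ignores its inputs and returns uniform log-probabilities over 3 tags.
uniform = lambda prev, x_t: np.log(np.ones(3) / 3)
print(sequence_log_prob([10, 11, 12], [0, 2, 1], uniform))   # = 3 * log(1/3)
```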
Modeling $P(y_t \mid y_{t-1}, x_t)$ • Different approaches possible: 1. Train a maximum entropy classifier 2. Or, ignore the fact that we are predicting a probability; we only care about maximizing some score, so train any classifier, using, say, the perceptron algorithm • In either case: use rich features that depend on the input and the previous state (a sketch of option 1 follows below) – We can increase the dependency to arbitrary neighboring $x_i$'s – E.g., neighboring words influence this word's POS tag 22
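Here is a hedged sketch of option 1: training a maximum entropy (multinomial logistic regression) classifier for $P(y_t \mid y_{t-1}, x_t)$ with rich features of the input and the previous state. The feature template, toy sentence, and scikit-learn usage are illustrative assumptions, not the lecture's implementation.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def local_features(words, t, prev_tag):
    # Rich features: current word, neighboring words, and the previous tag.
    return {
        "word": words[t],
        "prev_word": words[t - 1] if t > 0 else "<s>",
        "next_word": words[t + 1] if t + 1 < len(words) else "</s>",
        "prev_tag": prev_tag,
    }

# Toy training data: one tagged sentence (words, tags).
sentences = [(["the", "dog", "barks"], ["DT", "NN", "VBZ"])]

examples, labels = [], []
for words, tags in sentences:
    prev = "<START>"
    for t, tag in enumerate(tags):
        examples.append(local_features(words, t, prev))
        labels.append(tag)
        prev = tag

vec = DictVectorizer()
X = vec.fit_transform(examples)                            # one-hot encode the feature dictionaries
clf = LogisticRegression(max_iter=1000).fit(X, labels)     # multinomial logistic regression = max-ent

# Predict the distribution over tags for a new (previous tag, input) pair.
print(clf.predict_proba(vec.transform([local_features(["a", "cat", "sleeps"], 1, "DT")])))
```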
Where are we? • Conditional models for predicting sequences • Log-linear models for multiclass classification • Maximum Entropy Markov Models 23
Log-linear models for multiclass • Consider multiclass classification – Inputs: $\mathbf{x}$ – Output: $y \in \{1, 2, \cdots, K\}$ – Feature representation: $\phi(\mathbf{x}, y)$ (we have seen this before: the Kesler construction) • Define the probability of an input $\mathbf{x}$ taking a label $y$ as $P(y \mid \mathbf{x}, \mathbf{w}) = \dfrac{\exp(\mathbf{w}^T \phi(\mathbf{x}, y))}{\sum_{y'} \exp(\mathbf{w}^T \phi(\mathbf{x}, y'))}$ Interpretation: a score for each label, converted to a well-formed probability distribution by exponentiating and normalizing. This is a generalization of logistic regression to the multiclass setting. 26
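A minimal sketch (assumed names) of this log-linear model: exponentiate the per-label scores $\mathbf{w}^T \phi(\mathbf{x}, y)$ and normalize. The toy Kesler-style feature map below is one common choice, included only for illustration.

```python
import numpy as np

def label_probabilities(x, w, phi, labels):
    """P(y | x, w) = exp(w . phi(x, y)) / sum_{y'} exp(w . phi(x, y'))."""
    scores = np.array([w @ phi(x, y) for y in labels])   # one score per label
    scores -= scores.max()                               # subtract max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()                             # exponentiate + normalize (softmax)

# Toy Kesler-style feature map: place x in the weight block belonging to label y.
def phi(x, y, num_labels=3):
    out = np.zeros(num_labels * len(x))
    out[y * len(x):(y + 1) * len(x)] = x
    return out

w = np.random.randn(3 * 4)                 # one weight block per label
x = np.array([1.0, 0.0, 2.0, 1.0])
print(label_probabilities(x, w, phi, labels=range(3)))   # sums to 1
```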