Generative and Discriminative Learning
Machine Learning
What we saw most of the semester
• A fixed, unknown distribution D over X × Y
  – X: Instance space, Y: label space (e.g., {+1, -1})
• Given a dataset S = {(x_i, y_i)}
• Learning
  – Identify a hypothesis space H, define a loss function L(h, x, y)
  – Minimize average loss over training data (plus regularization)
• The guarantee
  – If we find an algorithm that minimizes loss on the observed data
  – Then, learning theory guarantees good future behavior (as a function of H)

Aside: Is this different from assuming a distribution over X and a fixed oracle function f?
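The recipe above (pick H, pick a loss, minimize average training loss plus a regularizer) can be made concrete with a small sketch. This is only an illustration of the setup, not something from the slides; the hinge loss, the L2 regularizer, and all hyperparameter values are assumptions chosen for the example.

import numpy as np

def erm_train(X, y, reg=0.1, lr=0.01, epochs=200):
    """Minimize average hinge loss over S = {(x_i, y_i)} plus an L2 regularizer.

    X: (n, d) array of instances; y: labels in {+1, -1}.
    Hypothesis space H: linear predictors h(x) = sign(w . x).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        margins = y * (X @ w)                      # y_i * (w . x_i) for every example
        active = margins < 1                       # examples with nonzero hinge loss
        # Subgradient of (1/n) * sum_i max(0, 1 - y_i w.x_i) + (reg/2) * ||w||^2
        grad = -(y[active, None] * X[active]).sum(axis=0) / n + reg * w
        w -= lr * grad
    return w

# Usage sketch: w = erm_train(X_train, y_train); y_hat = np.sign(X_test @ w)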
Discriminative models
Goal: learn directly how to make predictions
• Look at many (positive/negative) examples
• Discover regularities in the data
• Use these to construct a prediction policy
• Assumptions come in the form of the hypothesis class

Bottom line: approximating h: X → Y is estimating the conditional probability P(Y | X)
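A standard instance of this (not spelled out on the slide) is logistic regression, which estimates the conditional directly as P(Y = +1 | x) = 1 / (1 + exp(-w · x)), with the weights w fit to the labeled examples.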
Generative models
• Explicitly model how instances in each category are generated, by modeling the joint probability of X and Y, that is P(X, Y)
• That is, learn P(X | Y) and P(Y)
• We did this for naïve Bayes
  – Naïve Bayes is a generative model
• Predict P(Y | X) using the Bayes rule
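Spelling out the last bullet (a standard Bayes rule step, not shown on the slide): P(Y | X) = P(X | Y) P(Y) / P(X). Since P(X) does not depend on the label, prediction reduces to the argmax over y of P(X | y) P(y), which uses exactly the two pieces the generative model learns.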
Example: Generative story of naïve Bayes
[Figure: a directed graphical model with the label Y at the root, drawn from P(Y), and feature nodes X_1, X_2, X_3, ..., X_d as its children, each drawn from P(X_i | Y)]
• First sample a label Y from P(Y)
• Given the label, sample the features independently from the conditional distributions P(X_1 | Y), ..., P(X_d | Y)
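This generative story can be run forward as a sampler. The sketch below is only an illustration: the Bernoulli features, d = 3, and all parameter values are made-up assumptions, not part of the slides.

import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters for a Bernoulli naïve Bayes with d = 3 features.
p_y = {+1: 0.6, -1: 0.4}                    # P(Y)
p_x_given_y = {+1: [0.9, 0.2, 0.7],         # P(X_i = 1 | Y = +1)
               -1: [0.1, 0.8, 0.3]}         # P(X_i = 1 | Y = -1)

def sample_example():
    # Step 1: sample the label from P(Y).
    y = int(rng.choice([+1, -1], p=[p_y[+1], p_y[-1]]))
    # Step 2: given the label, sample each feature independently from P(X_i | Y).
    x = rng.binomial(1, p_x_given_y[y])
    return x, y

# Repeating this draws a dataset S from the joint distribution P(X, Y).
S = [sample_example() for _ in range(5)]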
Generative vs Discriminative models
• Generative models – learn P(x, y)
  – Use the capacity of the model to characterize how the data is generated (both inputs and outputs)
  – E.g.: Naïve Bayes, Hidden Markov Model
• Discriminative models – learn P(y | x)
  – Use model capacity to characterize the decision boundary only
  – E.g.: Logistic Regression, conditional models (several names), most neural models

A generative model tries to characterize the distribution of the inputs; a discriminative model doesn't care.
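The contrast can be seen in a short sketch (a hedged illustration using scikit-learn; the toy data and the specific model choices are assumptions, not part of the slides): both models are fit on the same labeled sample, but they spend their capacity differently.

import numpy as np
from sklearn.naive_bayes import GaussianNB            # generative: estimates P(x | y) and P(y)
from sklearn.linear_model import LogisticRegression   # discriminative: estimates P(y | x) directly

rng = np.random.default_rng(0)
# Toy data: two Gaussian blobs in 2D, one per label (made up for the example).
X = np.vstack([rng.normal(+1.0, 1.0, size=(100, 2)),
               rng.normal(-1.0, 1.0, size=(100, 2))])
y = np.array([+1] * 100 + [-1] * 100)

gen = GaussianNB().fit(X, y)             # capacity spent modeling the inputs within each class
disc = LogisticRegression().fit(X, y)    # capacity spent only on the decision boundary

# Both expose P(y | x) for prediction; only the generative model also gives a
# handle on the input distribution, via its per-class means/variances and priors.
print(gen.predict_proba(X[:2]))
print(disc.predict_proba(X[:2]))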