Machine Learning 2007: Lecture 11
Instructor: Tim van Erven (Tim.van.Erven@cwi.nl)
Website: www.cwi.nl/~erven/teaching/0708/ml/
November 28, 2007
Overview
● Organisational Matters
● Models
● Maximum Likelihood Parameter Estimation
● Probability Theory
● Bayesian Learning
  ✦ The Bayesian Distribution
  ✦ From Prior to Posterior
  ✦ MAP Parameter Estimation
  ✦ Bayesian Predictions
  ✦ Discussion
  ✦ Advanced Issues
Organisational Matters
Guest lecture:
● Next week, Peter Grünwald will give a special guest lecture about minimum description length (MDL) learning.
This lecture versus Mitchell:
● Chapter 6 up to section 6.5.0, about Bayesian learning.
● I present things in a better order.
● Mitchell also covers the connection between MAP parameter estimation and least-squares linear regression: it is good for you to study this, but I will not ask an exam question about it.
Prediction Example without Noise
Training data:
D = (y_1, ..., y_8) = (0, 1, 0, 1, 0, 1, 0, 1)
Hypothesis space: H = {h_1, h_2, h_3}
● h_1: y_n = 0
● h_2: y_n = 0 if n is odd, 1 if n is even
● h_3: y_n = 1
By simple list-then-eliminate:
● Only h_2 is consistent with the training data.
● Therefore we predict, in accordance with h_2, that y_9 = 0.
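A minimal sketch of list-then-eliminate for this example, with the three hypotheses written as small Python functions. The function names and the 1-based indexing convention are my own choices, not from the slides.

```python
# Hypotheses as functions from the 1-based index n to a predicted label.
def h1(n):
    return 0

def h2(n):
    return 0 if n % 2 == 1 else 1

def h3(n):
    return 1

hypotheses = {"h1": h1, "h2": h2, "h3": h3}
D = [0, 1, 0, 1, 0, 1, 0, 1]  # y_1, ..., y_8

# List-then-eliminate: keep only the hypotheses consistent with every observation.
consistent = {
    name: h for name, h in hypotheses.items()
    if all(h(n) == y for n, y in enumerate(D, start=1))
}

print(list(consistent))                                # ['h2']
print({name: h(9) for name, h in consistent.items()})  # {'h2': 0}: predict y_9 = 0
```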
Turning Hypotheses into Distributions
Models:
● We may view each hypothesis as a probability distribution that gives probability 1 to a certain outcome.
● A hypothesis space that contains such probabilistic hypotheses is called a (statistical) model.
The previous hypotheses as distributions: M = {P_1, P_2, P_3}
● P_1: P_1(y_n = 0) = 1
● P_2: P_2(y_n = 0) = 1 if n is odd, 0 if n is even
● P_3: P_3(y_n = 1) = 1
List-then-eliminate still works:
● A probabilistic hypothesis is consistent with the data if it gives positive probability to the data.
Prediction Example with Noise
Noise:
● Using probabilistic hypotheses is natural when there is noise in the data.
● Suppose we observe a measurement error with some (small) probability ε.
This is easy to incorporate: M = {P_1, P_2, P_3}
● P_1: P_1(y_n = 0) = 1 − ε
● P_2: P_2(y_n = 0) = 1 − ε if n is odd, ε if n is even
● P_3: P_3(y_n = 1) = 1 − ε
List-then-eliminate does not work any more:
● For example, P_1(D = 0, 1, 0, 1, 0, 1, 0, 1) = ε^4 (1 − ε)^4, which is positive, so P_1 is not eliminated.
● Typically many or all probabilistic hypotheses in our model will be consistent with the data.
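A hedged sketch of why elimination fails here: computing the probability of the full data sequence under each noisy hypothesis shows that all three are positive. The concrete value ε = 0.1 is an arbitrary choice of mine; the slides leave ε unspecified.

```python
EPS = 0.1  # an arbitrary small noise probability; the slides leave epsilon unspecified

def p1(n, y):  # P_1: the intended outcome is always 0
    return 1 - EPS if y == 0 else EPS

def p2(n, y):  # P_2: the intended outcome is 0 if n is odd, 1 if n is even
    intended = 0 if n % 2 == 1 else 1
    return 1 - EPS if y == intended else EPS

def p3(n, y):  # P_3: the intended outcome is always 1
    return 1 - EPS if y == 1 else EPS

D = [0, 1, 0, 1, 0, 1, 0, 1]

def likelihood(p):
    # Probability of the whole data sequence, assuming independent observations.
    prob = 1.0
    for n, y in enumerate(D, start=1):
        prob *= p(n, y)
    return prob

for name, p in [("P1", p1), ("P2", p2), ("P3", p3)]:
    print(name, likelihood(p))
# P1 and P3 each get eps^4 * (1 - eps)^4 > 0 and P2 gets (1 - eps)^8:
# nothing is eliminated, although P2 is far more likely than the others.
```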
Parameters
Parameters index the elements of a hypothesis space:
H = {h_1, h_2, h_3}  ⟺  H = {h_θ | θ ∈ {1, 2, 3}}
Usually in a convenient way:
Hypotheses are often expressed in terms of the parameters. In linear regression, for example:
H = {h_w | w ∈ R^2}, where h_w: y = w_0 + w_1 x.
Example where the hypothesis space is a model:
For example, in prediction of binary outcomes:
M = {P_θ | θ ∈ {1/4, 1/2, 3/4}}, where P_θ(y_n = 1) = θ.
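A small sketch of how a parameter value picks out one hypothesis from a family, for both parametrisations mentioned on this slide. The function names and the concrete parameter values are my own, for illustration only.

```python
# Each parameter value picks out one hypothesis from a family.

def h(w):
    # Linear regression family: h_w(x) = w_0 + w_1 * x, indexed by w = (w_0, w_1).
    w0, w1 = w
    return lambda x: w0 + w1 * x

def P(theta):
    # Bernoulli family from the slide: P_theta(y_n = 1) = theta.
    return lambda y: theta if y == 1 else 1 - theta

h_w = h((2.0, 0.5))   # the single hypothesis y = 2 + 0.5 * x
print(h_w(4.0))       # 4.0

P_theta = P(3 / 4)    # theta = 3/4
print(P_theta(1), P_theta(0))  # 0.75 0.25
```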
Maximum Likelihood Parameter Estimation
Training data and model:
D = (y_1, ..., y_8) = (0, 1, 1, 1, 0, 1, 1, 1)
M = {P_θ | θ ∈ {1/4, 1/2, 3/4}}, where P_θ(y_n = 1) = θ.
Likelihood:
θ        | 1/4             | 1/2         | 3/4
P_θ(D)   | (1/4)^6 (3/4)^2 | (1/2)^8     | (3/4)^6 (1/4)^2
         | = 9/65536       | = 256/65536 | = 729/65536
Maximum likelihood parameter estimation:
θ̂ = arg max_θ P_θ(D) = 3/4
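A short sketch reproducing this likelihood table and the maximum likelihood estimate in Python; the helper name `likelihood` is my own.

```python
from math import prod

D = [0, 1, 1, 1, 0, 1, 1, 1]    # the training data from the slide
thetas = [1 / 4, 1 / 2, 3 / 4]  # the three candidate parameter values

def likelihood(theta, data):
    # P_theta(D): a factor theta for every 1 and (1 - theta) for every 0.
    return prod(theta if y == 1 else 1 - theta for y in data)

for theta in thetas:
    # 9, 256 and 729 (in units of 1/65536, up to floating-point rounding)
    print(theta, likelihood(theta, D) * 65536)

theta_hat = max(thetas, key=lambda t: likelihood(t, D))
print(theta_hat)  # 0.75, the maximum likelihood estimate
```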
Relating Unions and Intersections
For any two events A and B:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
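A quick numerical check of this identity, assuming a uniform distribution over a small sample space; the specific Ω and the events A and B below are my own example, not from the slide.

```python
from fractions import Fraction

# A uniform distribution over a small sample space (my own example).
omega = {"a", "b", "c", "d", "e", "f", "g"}

def P(event):
    return Fraction(len(event), len(omega))

A = {"a", "b", "c"}
B = {"b", "c", "d", "e"}

assert P(A | B) == P(A) + P(B) - P(A & B)  # 5/7 == 3/7 + 4/7 - 2/7
print(P(A | B))  # 5/7
```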
The Law of Total Probability
[Figure: Venn diagram of Ω = {a, b, c, d, e, f, g}, partitioned into A_1, A_2, A_3, with event B overlapping each part.]
● Suppose Ω = {a, b, c, d, e, f, g}.
● A partition of Ω cuts it into parts:
  ✦ Let the parts be A_1 = {a, b}, A_2 = {c, d, e} and A_3 = {f, g}.
  ✦ The parts do not overlap, and together cover Ω.
● B = {b, d, f}
Law of Total Probability:
P(B) = Σ_{i=1}^{3} P(B ∩ A_i) = Σ_{i=1}^{3} P(B | A_i) P(A_i)
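A sketch verifying the law on the partition and event B from this slide. The slide does not specify the probabilities of the individual outcomes, so a uniform distribution over Ω is assumed here purely for illustration.

```python
from fractions import Fraction

omega = {"a", "b", "c", "d", "e", "f", "g"}

def P(event):
    # Uniform distribution over Omega: an assumption, since the slide does not
    # specify the probabilities of the individual outcomes.
    return Fraction(len(event), len(omega))

A = [{"a", "b"}, {"c", "d", "e"}, {"f", "g"}]  # the partition A_1, A_2, A_3
B = {"b", "d", "f"}

direct = P(B)
via_partition = sum(P(B & A_i) for A_i in A)                       # sum of P(B ∩ A_i)
via_conditioning = sum(P(B & A_i) / P(A_i) * P(A_i) for A_i in A)  # sum of P(B | A_i) P(A_i)

assert direct == via_partition == via_conditioning == Fraction(3, 7)
print(direct)  # 3/7
```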