CS480/680 Machine Learning
Lecture 6: January 23rd, 2020
Maximum A Posteriori & Maximum Likelihood
Zahra Sheikhbahaee
Sources: A Tutorial on Energy-Based Learning
Outline
• Probabilistic Modeling
• Gibbs Distribution
• Maximum A Posteriori Estimation
• Maximum Likelihood Estimation
Probabilistic Modeling
Goal: Given a set of observations (the training set) S = {(x^i, y^i) : i = 1, ..., P}, we want to produce a model for regression, classification or decision making that predicts the best Y from X. We want to estimate a function that computes the conditional distribution P(Y|X) for any given X. We write this function as P(Y|X, S).
• Design an architecture: We decompose P(Y|X, S) into two parts:
$$P(Y|X, S) = \int P(Y|X, W)\, P(W|S)\, dW$$
Our estimate of P(Y|X, W) will be chosen from a family of functions f(W, Y, X), ∀W, where the functions are parameterized by a vector W. The internal structure of the parameterized function f(W, Y, X) is called the architecture, e.g. logistic regressors, neural networks, etc.
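As a concrete illustration (not from the slides; the parameter values and inputs below are made up), here is a minimal sketch of one such architecture, a logistic regressor over binary labels, viewed as a function f(W, Y, X) that returns P(Y|X, W):

```python
import numpy as np

def f(W, Y, X):
    """Hypothetical architecture: logistic regressor, returns P(Y | X, W) for Y in {0, 1}."""
    p1 = 1.0 / (1.0 + np.exp(-np.dot(W, X)))   # P(Y=1 | X, W)
    return p1 if Y == 1 else 1.0 - p1

W = np.array([0.5, -1.0])        # hypothetical parameter vector
x = np.array([2.0, 1.0])         # hypothetical input
print(f(W, 1, x), f(W, 0, x))    # the two probabilities sum to 1
```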
Gibbs Distribution
• Energy function: for convenience we will often define f(W, Y, X) as the normalized exponential of an energy function E(W, Y, X):
$$P(Y|X, W) \approx f(W, Y, X) = \frac{\exp(-\beta E(W, Y, X))}{Z_\beta(W, X, \beta)}$$
β: an arbitrary positive constant (inverse temperature)
$Z_\beta(W, X, \beta)$: a normalization term (the partition function)
$$Z_\beta(W, X, \beta) = \int \exp(-\beta E(W, y, X))\, dy$$
$Z_\beta$ ensures that our estimate of P(Y|X, W) is normalized. High-probability states correspond to low-energy configurations.
Condition: We can only transform energies into probabilities if $\int e^{-\beta E(W, y, x)}\, dy$ converges.
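A minimal sketch of this energy-to-probability conversion, assuming a discrete label set and made-up energy values (so the partition function is a finite sum rather than an integral):

```python
import numpy as np

def gibbs_probs(energies, beta=1.0):
    """Gibbs distribution over a discrete label set: lower energy -> higher probability."""
    scaled = -beta * np.asarray(energies, dtype=float)
    scaled -= scaled.max()                 # subtract the max for numerical stability
    unnorm = np.exp(scaled)
    return unnorm / unnorm.sum()           # division by the partition function Z_beta

print(gibbs_probs([0.2, 1.5, 3.0], beta=1.0))    # lowest energy gets the most mass
print(gibbs_probs([0.2, 1.5, 3.0], beta=10.0))   # large beta concentrates mass on the minimum-energy label
```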
Probabilistic Modeling
$$P(Y|X, S) = \int P(Y|X, W)\, P(W|S)\, dW$$
• The energy includes "hidden" variables W whose value is never given to us.
Learning: P(W|S) is the result of a learning procedure that assigns a probability (or an energy) to each possible value of W as a function of the training set. The learning procedure will assign high probabilities to values of W that assign high combined probability (low combined energy) to the observed data.
Likelihood of Observations
$$P(W|S) = P(W\,|\,\mathcal{Y}, \mathcal{X}) = \frac{P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})}{P(\mathcal{Y}|\mathcal{X})}$$
where $\mathcal{X} = (x^1, x^2, \dots, x^P)$ and $\mathcal{Y} = (y^1, y^2, \dots, y^P)$, and the denominator is a normalization term
$$P(\mathcal{Y}|\mathcal{X}) = \int P(\mathcal{Y}|\mathcal{X}, W)\, P(W|\mathcal{X})\, dW$$
Sample independence: We assume that the samples are independent, so the conditional probability of the training set under the model is a product over samples:
$$P(y^1, \dots, y^P \,|\, x^1, \dots, x^P, W) = \prod_{i=1}^{P} P(y^i | x^i, W) = \prod_{i=1}^{P} f(W, y^i, x^i)$$
Writing each factor as a Gibbs distribution, this becomes
$$P(\mathcal{Y}|\mathcal{X}, W) = \exp\Big(-\beta \sum_{i=1}^{P} \Big[E(W, y^i, x^i) + \frac{1}{\beta}\log Z_\beta(W, x^i, \beta)\Big]\Big)$$
where $Z_\beta(W, x^i, \beta) = \int e^{-\beta E(W, y, x^i)}\, dy$.
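A minimal sketch of the corresponding negative log-likelihood, assuming a discrete label set and a made-up quadratic energy (so the per-sample partition function is a finite sum):

```python
import numpy as np

def neg_log_likelihood(energy_fn, W, xs, ys, labels, beta=1.0):
    """Returns sum_i [E(W, y_i, x_i) + (1/beta)*log Z_beta(W, x_i, beta)],
    i.e. -(1/beta)*log P(Y | X, W) for a discrete label set `labels`."""
    total = 0.0
    for x, y in zip(xs, ys):
        energies = np.array([energy_fn(W, lab, x) for lab in labels])
        log_Z = np.log(np.exp(-beta * energies).sum())   # log Z_beta(W, x, beta)
        total += energy_fn(W, y, x) + log_Z / beta
    return total

# Hypothetical quadratic energy over the label set {-1, +1}.
energy = lambda W, y, x: 0.5 * (y - np.dot(W, x)) ** 2
W = np.array([1.0, -0.5])
xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
ys = [1, -1]
print(neg_log_likelihood(energy, W, xs, ys, labels=[-1, 1], beta=1.0))
```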
Choosing a Regularizer = 𝑄 𝑋 𝒵, 𝒴 = 𝑄 𝒵 𝒴, 𝑋 𝑄(𝑋|𝒴) 𝑄 𝑋 𝑇 𝑄(𝒵|𝒴) The term 𝑄 ( 𝑋|𝒴 ) is an arbitrary prior distribution over the values of 𝑋 that we can choose freely. 𝑋 . We will often represent this prior as the normalized exponential of a penalty term or regularizer H 𝑋 The term H is used to embed our prior knowledge about which energy function in our family are preferable to others in the absence of training data. 𝑄 𝑋 = 1 𝑓 DEb(F) 𝑎 b Parameters that produce low values of the regularizer will be favored over parameters that produce large values. “good” models (e.g. simple, smooth, well behaved) the regularizer is small “bad” models the regularizer is large University of Waterloo CS480/680 Winter 2020 Zahra Sheikhbahaee 7
Posterior of a Parameter
• The probability of a particular parameter value W given the observations S is
$$P(W|S) = \frac{\exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y^i, x^i) + \frac{1}{\beta}\log Z_\beta(W, x^i, \beta)\Big] + H(W)\Big\}\Big)}{Z_W(S, \beta)}$$
E(W, Y, X) can be a linear combination of basis functions. The advantage of the energy-based approach is that it puts very little restriction on the nature of the family $\mathcal{E} = \{E(W, Y, X) : W \in \mathcal{W}\}$.
H(W) is the regularizer that contains our preferences for "good" models over "bad" ones. Our choice of H(W) is somewhat arbitrary, but some choices work better than others for particular applications.
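For instance (one possible reading of the basis-function remark, with hypothetical features $\phi_k$), an energy that is linear in the parameters gives a log-linear conditional model:
$$E(W, Y, X) = \sum_{k=1}^{K} w_k\, \phi_k(Y, X) \;\Longrightarrow\; P(Y|X, W) = \frac{\exp\big(-\beta \sum_k w_k\, \phi_k(Y, X)\big)}{\int \exp\big(-\beta \sum_k w_k\, \phi_k(y, X)\big)\, dy}.$$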
Posterior of a Parameter
$$P(W|S) = \frac{\exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y^i, x^i) + \frac{1}{\beta}\log Z_\beta(W, x^i, \beta)\Big] + H(W)\Big\}\Big)}{Z_W(S, \beta)}$$
$Z_W(S, \beta)$ is the normalization term that ensures that the integral of P(W|S) over W is 1; it is the integral over W of the numerator.
$Z_\beta(W, x^i, \beta)$ are the normalization terms (one for each sample) that ensure that the integral of $P(Y|x^i, W)$ over Y is 1:
$$Z_\beta(W, x^i, \beta) = \int \exp(-\beta E(W, y, x^i))\, dy$$
β is a positive constant that we are free to choose as we like, or that we can estimate. It reflects the reliability of the data: low values should be used to get probability estimates with noisy data; large values should be used to get good discrimination. We can estimate β through learning too (we can fold it into E, as a component of W).
Intractability of Bayesian Learning
The Bayesian predictive distribution:
$$P(y^* \,|\, x^*, (x^1, y^1), \dots, (x^P, y^P)) = \int f(W, y^*, x^*)\, P(W \,|\, (x^1, y^1), \dots, (x^P, y^P))\, dW$$
• To compute the distribution of y* for a particular input x*, we are supposed to integrate the product of two complicated functions over all possible values of W.
• This is totally intractable in general.
• There are special classes of functions f for which the integral is tractable, but that class is fairly restricted.
Tractable Learning Methods
1. Maximum A Posteriori Estimation: simply replace the distribution P(W|S) by a Dirac delta function centered on its mode (maximum).
2. Maximum Likelihood Estimation: same as above, but drop the regularizer.
3. Restricted class of functions: simply restrict yourself to special forms of f(W, Y, X) for which the integral can be computed analytically (e.g. Gaussians).
4. Sampling: draw a bunch of samples of W from the distribution P(W|S), and replace the integral by a sum over those samples (a sketch follows below).
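A minimal sketch of option 4, with made-up ingredients: a hypothetical Gaussian conditional model and stand-in posterior samples (a real implementation would draw W from P(W|S) with MCMC or similar):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(W, y, x, beta=4.0):
    """Hypothetical Gaussian conditional: P(y | x, W) with mean W·x and variance 1/beta."""
    m = np.dot(W, x)
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (y - m) ** 2)

# Stand-in for samples from P(W|S); a real implementation would use MCMC or similar.
W_samples = rng.normal(loc=[1.0, -0.5], scale=0.2, size=(2000, 2))

x_star, y_star = np.array([1.0, 2.0]), 0.0
# Replace the intractable integral over W by an average over posterior samples.
print(np.mean([f(W, y_star, x_star) for W in W_samples]))
```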
Maximum A Posteriori Estimation
• Assume that the mode (maximum) of P(W|S) is so much larger than all other values that we can view P(W|S) as a Dirac delta function centered on its maximum:
$$P_{\text{MAP}}(W|S) \approx \delta(W - W_{\text{MAP}}), \qquad W_{\text{MAP}} = \arg\max_W P(W|S)$$
With this approximation, we get simply:
$$P(Y|X, S) = P(Y|X, W_{\text{MAP}})$$
If we take the limit β → ∞, P(W|S) does converge to a delta function around its maximum. So the MAP approximation is simply the large-β limit.
Computing $W_{\text{MAP}}$
$$\begin{aligned}
W_{\text{MAP}} &= \arg\max_W P(W|S) \\
&= \arg\max_W \frac{1}{Z_W(S, \beta)} \exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y^i, x^i) + \frac{1}{\beta}\log Z_\beta(W, x^i, \beta)\Big] + H(W)\Big\}\Big) \\
&= \arg\max_W \exp\Big(-\beta\Big\{\sum_{i=1}^{P}\Big[E(W, y^i, x^i) + \frac{1}{\beta}\log Z_\beta(W, x^i, \beta)\Big] + H(W)\Big\}\Big) \\
&= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y^i, x^i) + \frac{1}{\beta}\log Z_\beta(W, x^i, \beta)\Big] + H(W) \\
&= \arg\min_W \sum_{i=1}^{P}\Big[E(W, y^i, x^i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x^i))\, dy\Big] + H(W)
\end{aligned}$$
We can take the log because log is monotonic. To find the MAP parameter estimate, we need to find the value of W that minimizes:
$$L_{\text{MAP}}(W) = \sum_{i=1}^{P}\Big[E(W, y^i, x^i) + \frac{1}{\beta}\log \int \exp(-\beta E(W, y, x^i))\, dy\Big] + H(W)$$
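A minimal sketch of minimizing $L_{\text{MAP}}$ in a standard special case (not derived on the slide): with the squared-error energy $E(W, y, x) = \frac{1}{2}(y - W\cdot x)^2$ over $y \in \mathbb{R}$, the log-partition term does not depend on W, so the MAP estimate with a quadratic regularizer $H(W) = \frac{\lambda}{2}\|W\|^2$ is exactly ridge regression:

```python
import numpy as np

def w_map_ridge(X, y, lam):
    """Closed-form minimizer of sum_i 0.5*(y_i - W·x_i)^2 + (lam/2)*||W||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # hypothetical inputs
w_true = np.array([1.0, -2.0, 0.5])          # hypothetical "true" parameters
y = X @ w_true + 0.1 * rng.normal(size=50)   # hypothetical noisy targets
print(w_map_ridge(X, y, lam=1.0))            # W_MAP, close to w_true
```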