Model inference


  1. Model inference. Course of Machine Learning, Master Degree in Computer Science, University of Rome ``Tor Vergata''. Giorgio Gambosi, a.a. 2018-2019.

  2. Model inference — Purpose. Inferring a probabilistic model from a collection of observed data $X = \{x_1, \ldots, x_N\}$. A probabilistic model is a probability distribution over the data domain. A dataset $X$ is a collection of $N$ observed data, independent and identically distributed (iid): they can be seen as realizations of a single random variable.

  3. Model inference — Problems considered. Inference objectives:
  • Model selection: selecting the probabilistic model $M$ best suited for a given data collection.
  • Estimation: estimating the values of the set $\theta = (\theta_1, \ldots, \theta_D)$ of parameters of a given model type (probability distribution) which best model the observed data $X$.
  • Prediction: computing the probability $p(x \mid X)$ of a new observation from the set of already observed data.

  4. Bayesian learning — Context. Model space $\mathcal{M}$: a model $m \in \mathcal{M}$ is a probability distribution $p(x \mid m)$ over data. Let $p(m)$ be any prior distribution over models, with
$$\sum_{m \in \mathcal{M}} p(m) = 1$$
The corresponding predictive distribution of data is
$$p(x) = \sum_{m \in \mathcal{M}} p(x \mid m)\, p(m)$$

  5. Bayesian learning — Inference. After the observation of a dataset $X$, the updated probabilities are
$$p(m \mid X) = \frac{p(m)\, p(X \mid m)}{p(X)} \propto p(m)\, p(X \mid m) = p(m) \prod_{i=1}^{N} p(x_i \mid m)$$
and the predictive distribution is
$$p(x \mid X) = \sum_{m \in \mathcal{M}} p(x \mid m)\, p(m \mid X)$$
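To make the update concrete, here is a minimal sketch (Python with NumPy assumed; the two candidate Bernoulli models and the uniform prior are illustrative choices, not from the slides):

```python
import numpy as np

# Hypothetical model space: two Bernoulli models with fixed parameters.
models = {"m1": 0.5, "m2": 0.8}      # p(x=1 | m) for each model m
prior = {"m1": 0.5, "m2": 0.5}       # p(m), summing to 1 over the model space

X = np.array([1, 1, 0, 1, 1])        # observed binary dataset

# Likelihood p(X | m) = prod_i p(x_i | m).
def likelihood(phi, X):
    return np.prod(phi ** X * (1 - phi) ** (1 - X))

# Posterior p(m | X) ∝ p(m) p(X | m), normalized over the model space.
unnorm = {m: prior[m] * likelihood(phi, X) for m, phi in models.items()}
Z = sum(unnorm.values())
posterior = {m: v / Z for m, v in unnorm.items()}

# Predictive p(x=1 | X) = sum_m p(x=1 | m) p(m | X).
predictive = sum(models[m] * posterior[m] for m in models)
print(posterior, predictive)
```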

  6. Parametric models. Models are defined as parametric probability distributions, with parameters $\theta$ ranging over a parameter space $\Theta$. A prior parameter distribution $p(\theta \mid m)$ is defined for a model. The prior predictive distribution is then
$$p(x \mid m) = \int_\Theta p(x \mid \theta, m)\, p(\theta \mid m)\, d\theta$$
Posterior parameter distribution: given a model $m \in \mathcal{M}$, Bayes' formula makes it possible to infer the posterior distribution of the parameters, given the dataset $X$:
$$p(\theta \mid X, m) = \frac{p(\theta \mid m)\, p(X \mid \theta, m)}{p(X \mid m)} \propto p(\theta \mid m)\, p(X \mid \theta, m)$$
The posterior predictive distribution, given the model, is
$$p(x \mid X, m) = \int_\Theta p(x \mid \theta, m)\, p(\theta \mid X, m)\, d\theta$$

  7. Bayesian inference. According to the Bayesian approach to inference, parameters are considered as random variables, whose distributions have to be inferred from observed data. The approach relies on Bayes' classic result.
Theorem (Bayes). Let $X$, $Y$ be a pair of (sets of) random variables. Then
$$p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{\int_Z p(X, Z)\, dZ}$$
where
  • $p(Y)$ is the prior probability of $Y$ (with respect to the observation of $X$)
  • $p(Y \mid X)$ is the posterior probability of $Y$
  • $p(X \mid Y)$ is the likelihood of $X$ w.r.t. $Y$
  • $p(X)$ is the evidence of $X$

  8. Point estimate of parameters.
Motivation: given a model $m$, the Bayesian approach aims to derive the posterior distribution of the set of parameters $\theta$. This requires computing
$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int_\Theta p(X \mid \theta)\, p(\theta)\, d\theta}$$
and
$$p(x \mid X) = \int_\Theta p(x \mid \theta)\, p(\theta \mid X)\, d\theta$$
This is usually impossible to do efficiently.
Idea: only an estimate of the ``best'' value $\hat\theta$ in $\Theta$ (according to some measure) is computed. The posterior predictive distribution can then be approximated as follows:
$$p(x \mid X) = \int_\Theta p(x \mid \theta)\, p(\theta \mid X)\, d\theta \approx \int_\Theta p(x \mid \hat\theta)\, p(\theta \mid X)\, d\theta = p(x \mid \hat\theta) \int_\Theta p(\theta \mid X)\, d\theta = p(x \mid \hat\theta)$$

  9. Maximum likelihood estimate.
Approach: frequentist point of view: parameters are deterministic variables, whose value is unknown and must be estimated. Determine the parameter value that maximizes the likelihood
$$L(\theta \mid X) = p(X \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)$$
Log-likelihood: the log-likelihood
$$l(\theta \mid X) = \ln L(\theta \mid X) = \sum_{i=1}^{N} \ln p(x_i \mid \theta)$$
is usually preferable. The maximum occurs at the same point:
$$\operatorname*{argmax}_\theta\, l(\theta \mid X) = \operatorname*{argmax}_\theta\, L(\theta \mid X)$$
Estimate:
$$\hat\theta_{ML} = \operatorname*{argmax}_\theta\, L(\theta \mid X) = \operatorname*{argmax}_\theta \sum_{i=1}^{N} \ln p(x_i \mid \theta)$$
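As an illustrative sketch of numerical ML estimation (NumPy and SciPy assumed; the Gaussian-mean setting is an assumption, not taken from the slides), the log-likelihood can be maximized numerically and compared with the known closed-form answer, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=100)   # iid sample with unknown mean

# Negative log-likelihood -l(mu | X) for a Gaussian with known unit variance.
def neg_log_likelihood(mu):
    return -np.sum(norm.logpdf(X, loc=mu, scale=1.0))

res = minimize_scalar(neg_log_likelihood)
print(res.x, X.mean())   # the numerical argmax matches the sample mean
```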

  10. Maximum likelihood estimate.
Solution: solve the system
$$\frac{\partial\, l(\theta \mid X)}{\partial \theta_i} = 0 \qquad i = 1, \ldots, D$$
or, more concisely,
$$\nabla_\theta\, l(\theta \mid X) = 0$$
Prediction: the probability of a new observation $x$ is
$$p(x \mid X) = \int_\Theta p(x \mid \theta)\, p(\theta \mid X)\, d\theta \approx \int_\Theta p(x \mid \hat\theta_{ML})\, p(\theta \mid X)\, d\theta = p(x \mid \hat\theta_{ML}) \int_\Theta p(\theta \mid X)\, d\theta = p(x \mid \hat\theta_{ML})$$
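The stationarity condition can also be checked symbolically. A minimal sketch with SymPy (an assumed tool, not part of the course material) solves $\partial l / \partial \varphi = 0$ for the Bernoulli log-likelihood derived in the next slide:

```python
import sympy as sp

phi, N0, N1 = sp.symbols('phi N_0 N_1', positive=True)

# Bernoulli log-likelihood: l(phi | X) = N_1 ln(phi) + N_0 ln(1 - phi)
l = N1 * sp.log(phi) + N0 * sp.log(1 - phi)

# Solve dl/dphi = 0 for phi: the unique stationary point.
sol = sp.solve(sp.Eq(sp.diff(l, phi), 0), phi)
print(sol)   # [N_1/(N_0 + N_1)], i.e. the ML estimate N_1 / N
```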

  11. Maximum likelihood estimate — Example. Collection $X$ of $N$ binary events, modeled through a Bernoulli distribution with unknown parameter $\varphi$:
$$p(x \mid \varphi) = \varphi^x (1 - \varphi)^{1 - x}$$
Likelihood:
$$L(\varphi \mid X) = \prod_{i=1}^{N} \varphi^{x_i} (1 - \varphi)^{1 - x_i}$$
Log-likelihood:
$$l(\varphi \mid X) = \sum_{i=1}^{N} \big( x_i \ln \varphi + (1 - x_i) \ln(1 - \varphi) \big) = N_1 \ln \varphi + N_0 \ln(1 - \varphi)$$
where $N_0$ ($N_1$) is the number of events $x \in X$ equal to 0 (1). Setting the derivative to zero:
$$\frac{\partial\, l(\varphi \mid X)}{\partial \varphi} = \frac{N_1}{\varphi} - \frac{N_0}{1 - \varphi} = 0 \implies \hat\varphi_{ML} = \frac{N_1}{N_0 + N_1} = \frac{N_1}{N}$$
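A quick numeric check of this closed form (NumPy assumed; the dataset is illustrative):

```python
import numpy as np

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # illustrative binary dataset
N1, N = X.sum(), X.size

phi_ml = N1 / N                            # closed-form ML estimate
print(phi_ml)                              # 0.75

# Sanity check: the log-likelihood over a grid peaks at phi_ml.
grid = np.linspace(0.01, 0.99, 999)
loglik = N1 * np.log(grid) + (N - N1) * np.log(1 - grid)
print(grid[np.argmax(loglik)])             # ≈ 0.75
```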

  12. ML and overfitting.
Overfitting: maximizing the likelihood of the observed dataset tends to result in an estimate too sensitive to the dataset values, hence in overfitting. The obtained estimates are suitable to model the observed data, but may be too specialized to be used to model different datasets.
Penalty functions: an additional function $P(\theta)$ can be introduced with the aim of limiting overfitting and the overall complexity of the model. This results in the following function to maximize:
$$C(\theta \mid X) = l(\theta \mid X) - P(\theta)$$
As a common case, $P(\theta) = \frac{\gamma}{2} \|\theta\|^2$, with $\gamma$ a tuning parameter.

  13. Maximum a posteriori estimate.
Idea: inference through maximum a posteriori (MAP) is similar to ML, but $\theta$ is now considered as a random variable, whose distribution has to be derived from observations, also taking into account previous knowledge (prior distribution). The parameter value maximizing
$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)}$$
is computed.
Estimate:
$$\hat\theta_{MAP} = \operatorname*{argmax}_\theta\, p(\theta \mid X) = \operatorname*{argmax}_\theta\, p(X \mid \theta)\, p(\theta) = \operatorname*{argmax}_\theta\, L(\theta \mid X)\, p(\theta) = \operatorname*{argmax}_\theta \big( l(\theta \mid X) + \ln p(\theta) \big) = \operatorname*{argmax}_\theta \left( \sum_{i=1}^{N} \ln p(x_i \mid \theta) + \ln p(\theta) \right)$$

  14. MAP and gaussian prior.
Hypothesis: assume $\theta$ is distributed around the origin as a multivariate gaussian with uniform variance and null covariance, that is,
$$p(\theta) = \mathcal{N}(\theta \mid 0, \sigma^2 I) = \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left( -\frac{\|\theta\|^2}{2\sigma^2} \right) \propto \exp\left( -\frac{\|\theta\|^2}{2\sigma^2} \right)$$
Inference: from the hypothesis,
$$\hat\theta_{MAP} = \operatorname*{argmax}_\theta \big( l(\theta \mid X) + \ln p(\theta) \big) = \operatorname*{argmax}_\theta \left( l(\theta \mid X) + \ln \exp\left( -\frac{\|\theta\|^2}{2\sigma^2} \right) \right) = \operatorname*{argmax}_\theta \left( l(\theta \mid X) - \frac{\|\theta\|^2}{2\sigma^2} \right)$$
which coincides with the penalized objective introduced before, with $\gamma = \frac{1}{\sigma^2}$.
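A minimal numeric sketch of this equivalence (NumPy and SciPy assumed; the scalar Gaussian-mean setting is illustrative): the MAP objective under a $\mathcal{N}(0, \sigma^2)$ prior and the L2-penalized log-likelihood with $\gamma = 1/\sigma^2$ differ only by a constant, so they share the same maximizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
X = rng.normal(loc=1.5, scale=1.0, size=20)   # iid data, unit variance known
sigma2 = 0.5                                   # prior variance of theta
gamma = 1.0 / sigma2                           # equivalent penalty weight

def log_lik(t):
    return np.sum(norm.logpdf(X, loc=t, scale=1.0))

# MAP objective: l(theta | X) + ln p(theta), with p(theta) = N(0, sigma2).
map_obj = lambda t: -(log_lik(t) + norm.logpdf(t, loc=0.0, scale=np.sqrt(sigma2)))
# Penalized objective: l(theta | X) - (gamma / 2) * theta^2.
pen_obj = lambda t: -(log_lik(t) - 0.5 * gamma * t ** 2)

print(minimize_scalar(map_obj).x, minimize_scalar(pen_obj).x)  # same maximizer
```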

  15. MAP estimate — Example. Collection $X$ of $N$ binary events, modeled as a Bernoulli distribution with unknown parameter $\varphi$. Initial knowledge of $\varphi$ is modeled as a Beta distribution:
$$p(\varphi \mid \alpha, \beta) = \mathrm{Beta}(\varphi \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \varphi^{\alpha - 1} (1 - \varphi)^{\beta - 1}$$
Log-likelihood:
$$l(\varphi \mid X) = \sum_{i=1}^{N} \big( x_i \ln \varphi + (1 - x_i) \ln(1 - \varphi) \big) = N_1 \ln \varphi + N_0 \ln(1 - \varphi)$$
Setting the derivative of the posterior log-density to zero:
$$\frac{\partial}{\partial \varphi} \big( l(\varphi \mid X) + \ln \mathrm{Beta}(\varphi \mid \alpha, \beta) \big) = \frac{N_1}{\varphi} - \frac{N_0}{1 - \varphi} + \frac{\alpha - 1}{\varphi} - \frac{\beta - 1}{1 - \varphi} = 0$$
$$\implies \hat\varphi_{MAP} = \frac{N_1 + \alpha - 1}{N_0 + N_1 + \alpha + \beta - 2} = \frac{N_1 + \alpha - 1}{N + \alpha + \beta - 2}$$
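A quick numeric check of the MAP formula against the ML estimate (NumPy assumed; data and hyperparameters are illustrative):

```python
import numpy as np

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # illustrative binary dataset
alpha, beta = 2.0, 2.0                     # illustrative Beta hyperparameters

N1, N = X.sum(), X.size

phi_ml = N1 / N
phi_map = (N1 + alpha - 1) / (N + alpha + beta - 2)
print(phi_ml, phi_map)   # 0.75 vs 0.7: the prior pulls the estimate toward 0.5
```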

  16. Note — Gamma function. The function
$$\Gamma(x) = \int_0^\infty t^{x - 1} e^{-t}\, dt$$
is an extension of the factorial to the real numbers field: hence, for any positive integer $x$, $\Gamma(x) = (x - 1)!$.
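The identity is easy to check numerically with the Python standard library:

```python
import math

# Gamma(x) equals (x - 1)! for positive integers x.
for x in range(1, 7):
    print(x, math.gamma(x), math.factorial(x - 1))
```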

  17. Applying Bayesian inference — Mode and mean. Once the posterior distribution
$$p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)} = \frac{p(X \mid \theta)\, p(\theta)}{\int_\Theta p(X \mid \theta)\, p(\theta)\, d\theta}$$
is available, the MAP estimate computes the most probable value (mode) $\theta_{MAP}$ of the distribution. This may lead to inaccurate estimates, as in the figure below:
[Figure: a skewed density $p(x)$ plotted against $x$, whose mode is not representative of the distribution as a whole.]

  18. Applying Bayesian inference — Mode and mean. A better estimate can be obtained by applying a fully Bayesian approach and referring to the whole posterior distribution, for example by deriving the expectation of $\theta$ w.r.t. $p(\theta \mid X)$:
$$\theta^* = E_{p(\theta \mid X)}[\theta] = \int_\Theta \theta\, p(\theta \mid X)\, d\theta$$
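A small sketch contrasting the two point estimates (SciPy assumed; the skewed Beta posterior is an illustrative choice): for asymmetric posteriors the mode (MAP) and the posterior mean can differ noticeably.

```python
from scipy.stats import beta

a, b = 2.0, 6.0                  # a skewed Beta posterior, for illustration
mode = (a - 1) / (a + b - 2)     # argmax of the density (valid for a, b > 1)
mean = beta.mean(a, b)           # posterior expectation, a / (a + b)
print(mode, mean)                # ≈ 0.1667 vs 0.25
```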

  19. Bayesian estimate — Example. Collection $X$ of $N$ binary events, modeled as a Bernoulli distribution with unknown parameter $\varphi$. Initial knowledge of $\varphi$ is modeled as a Beta distribution:
$$p(\varphi \mid \alpha, \beta) = \mathrm{Beta}(\varphi \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \varphi^{\alpha - 1} (1 - \varphi)^{\beta - 1}$$
Posterior distribution:
$$p(\varphi \mid X, \alpha, \beta) = \frac{\prod_{i=1}^{N} \varphi^{x_i} (1 - \varphi)^{1 - x_i}\; p(\varphi \mid \alpha, \beta)}{p(X)} = \frac{1}{Z}\, \varphi^{N_1 + \alpha - 1} (1 - \varphi)^{N_0 + \beta - 1}$$
where $Z = p(X)\, \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$. Since
$$\int_{-\infty}^{+\infty} p(\varphi \mid X, \alpha, \beta)\, d\varphi = 1,$$
$Z$ must be equal to the normalizing coefficient of the distribution $\mathrm{Beta}(\varphi \mid \alpha + N_1, \beta + N_0)$. Hence,
$$p(\varphi \mid X, \alpha, \beta) = \mathrm{Beta}(\varphi \mid \alpha + N_1, \beta + N_0)$$
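A minimal sketch of this conjugate update (NumPy and SciPy assumed; data and hyperparameters are illustrative), reporting the posterior mean as the fully Bayesian point estimate of slide 18:

```python
import numpy as np
from scipy.stats import beta

X = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # illustrative binary dataset
alpha, beta_prior = 2.0, 2.0               # illustrative Beta prior

N1 = int(X.sum())
N0 = int(X.size) - N1

# Conjugacy: the posterior is Beta(alpha + N1, beta + N0).
posterior = beta(alpha + N1, beta_prior + N0)
print(posterior.mean())   # posterior expectation (N1 + alpha) / (N + alpha + beta)
```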
