4CSLL5 Parameter Estimation (Supervised and Unsupervised)
Martin Emms
October 15, 2020

Outline
Unsupervised Maximum Likelihood (re-)Estimation
  Hidden variant of 2nd scenario
  The EM Algorithm
  Numerically worked example
  More realistic run of EM

Hidden variant of 2nd scenario

Hidden variable variant

Suppose you no longer see the outcome of Z; you still see the tosses of the chosen coin, but you can't tell which coin it is. The data now looks like this:

d   Z   X: tosses of chosen coin   H counts
1   ?   H H H H H H H H T T        (8H)
2   ?   T T H T T T H T T T        (2H)
3   ?   H T H H T H H H H T        (7H)
4   ?   H T H H H T H H H H        (8H)
5   ?   T T T T T T H T T T        (1H)
6   ?   H H T H H H H H H H        (9H)
7   ?   T H H T H H H H H T        (7H)
8   ?   H H H H H H T H H H        (9H)
9   ?   H H T T T T T H T T        (3H)

Z is the so-called hidden variable in each case.
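As an aside (not part of the slides), here is a minimal Python sketch of how this hidden-variable data can be represented. Since the probability of a sequence will depend only on its head and tail counts, each datum reduces to a pair (#(d,h), #(d,t)); the variable names are my own choices.

sequences = [          # the nine observed toss sequences; the coin identity Z is unknown
    "HHHHHHHHTT",      # d=1 (8H)
    "TTHTTTHTTT",      # d=2 (2H)
    "HTHHTHHHHT",      # d=3 (7H)
    "HTHHHTHHHH",      # d=4 (8H)
    "TTTTTTHTTT",      # d=5 (1H)
    "HHTHHHHHHH",      # d=6 (9H)
    "THHTHHHHHT",      # d=7 (7H)
    "HHHHHHTHHH",      # d=8 (9H)
    "HHTTTTTHTT",      # d=9 (3H)
]
# reduce each sequence to (#(d,h), #(d,t))
data = [(s.count("H"), s.count("T")) for s in sequences]
# data == [(8,2), (2,8), (7,3), (8,2), (1,9), (9,1), (7,3), (9,1), (3,7)]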
The 'product of sums' problem

We still have the probability model for combinations (Z, X), with the same parameters θ_a, θ_{h|a} and θ_{h|b}. We would still like to find values for θ_a, θ_{h|a} and θ_{h|b} which again maximise the probability of the observed data.

For each d we just know the coin-tosses X_d. Their probability is now a sum:

  p(X_d) = p(Z = a) p(X_d | Z = a) + p(Z = b) p(X_d | Z = b)
         = θ_a θ_{h|a}^{#(d,h)} (1 − θ_{h|a})^{#(d,t)} + (1 − θ_a) θ_{h|b}^{#(d,h)} (1 − θ_{h|b})^{#(d,t)}

and the probability of the entire data set, p(d), is the product:

  p(d) = ∏_d p(X_d)                                                                              (11)
       = ∏_d [ θ_a θ_{h|a}^{#(d,h)} (1 − θ_{h|a})^{#(d,t)} + (1 − θ_a) θ_{h|b}^{#(d,h)} (1 − θ_{h|b})^{#(d,t)} ]   (12)

So can we maximise (12)? The preceding procedure of taking logs runs into a dead-end, because p(d) is no longer purely a product: sums appear inside it. Instead the log is

  ∑_d log [ θ_a θ_{h|a}^{#(d,h)} (1 − θ_{h|a})^{#(d,t)} + (1 − θ_a) θ_{h|b}^{#(d,h)} (1 − θ_{h|b})^{#(d,t)} ]

and there is no known way to cleverly break this down as there was before.

This is essentially the problem we face if we want to do parameter estimation with hidden variables – something done widely in e.g. Machine Translation and Speech Recognition. The EM or 'Expectation Maximisation' algorithm will turn out to be the solution.
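As an illustration (not from the slides), here is a minimal Python sketch of the log of (12): for each datum the probability is a sum over the hidden coin, so the data log-likelihood is a sum of logs of sums. The parameter values passed in at the end are arbitrary guesses.

import math

# (#(d,h), #(d,t)) for the nine observed sequences
data = [(8, 2), (2, 8), (7, 3), (8, 2), (1, 9), (9, 1), (7, 3), (9, 1), (3, 7)]

def log_likelihood(theta_a, theta_h_a, theta_h_b):
    total = 0.0
    for h, t in data:
        p_a = theta_a * theta_h_a**h * (1 - theta_h_a)**t          # Z = a term of the sum
        p_b = (1 - theta_a) * theta_h_b**h * (1 - theta_h_b)**t    # Z = b term of the sum
        total += math.log(p_a + p_b)                               # log of a sum: doesn't decompose
    return total

print(log_likelihood(0.5, 0.8, 0.3))

There is no closed-form maximiser of this function, which is exactly why an iterative procedure such as EM is needed.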
The general hidden variable set-up

Before proceeding let's try to make clear the general case of a hidden variable problem.

You have D data items.

In the fully observed case, each data item d is represented by the values of a set of variables, which we'll split into two sets ⟨z_d, x_d⟩, and you have a probability model – i.e. a formula – spelling out how likely any such fully observed case is, P(⟨z_d, x_d⟩; θ), where θ are all the parameters of the model.

In the hidden case, for each data item d you just have values on a subset of the variables, x_d; the other variables z_d are hidden. If A(z) represents the space of all possible values for the variables z, then the probability of each partial data item is

  P(x_d; θ) = ∑_{k ∈ A(z)} P(z = k, x_d; θ)

The EM Algorithm

Taking stock: what kinds of thing can we calculate?

◮ parameters given visible data: we have seen illustrations where z is known for each datum, and where finding parameter values maximising the data likelihood was easy: it's relative frequencies all the way (scenario 1: one visible variable; scenario 2: two visible variables, one for the coin choice and one for the tosses of whichever coin was chosen). In fact, to do the parameter estimation we really just needed numbers for how often types of outcomes occurred.

◮ posterior probs on hidden vars: if we have all the parameters θ, for datum d we can 'easily' work out P(z = k | x_d; θ). In our third scenario, where the coin choice was hidden, for Z = a the formula is (see also the sketch after this list)

  P(Z = a | X_d; θ_a, θ_{h|a}, θ_{h|b}) = θ_a θ_{h|a}^{#(d,h)} (1 − θ_{h|a})^{#(d,t)} / [ θ_a θ_{h|a}^{#(d,h)} (1 − θ_{h|a})^{#(d,t)} + (1 − θ_a) θ_{h|b}^{#(d,h)} (1 − θ_{h|b})^{#(d,t)} ]

◮ EM methods put those two abilities to use in iterative procedures to re-estimate parameters
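The second of those abilities, computing the posterior on the hidden coin choice, is easy to make concrete. Here is a minimal Python sketch (not from the slides), using the same (#(d,h), #(d,t)) representation of a datum and my own variable names; the parameter values in the example call are arbitrary guesses.

def gamma(h, t, theta_a, theta_h_a, theta_h_b):
    # joint probabilities P(Z = k, X_d) for k = a and k = b
    joint_a = theta_a * theta_h_a**h * (1 - theta_h_a)**t
    joint_b = (1 - theta_a) * theta_h_b**h * (1 - theta_h_b)**t
    marginal = joint_a + joint_b                     # P(X_d) = sum over k in A(Z)
    return joint_a / marginal, joint_b / marginal    # (gamma_d(a), gamma_d(b))

print(gamma(8, 2, 0.5, 0.8, 0.3))   # datum d = 1: 8 heads, 2 tails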
EM sketch

Let's use the notation γ_d(k) for P(z = k | x_d) – which is something of a convention in EM methods.

We will describe EM for the moment as just a kind of procedure or recipe. Later we will consider how to show that the procedure does something sensible.

◮ Viterbi EM: (i) using some values for θ, for each d work out γ_d(k) for each value k ∈ A(z); (ii) pick the best z = k and 'complete' d with this value for z, making a virtual complete corpus; (iii) re-estimate θ on this virtual data. If you go back to (i) and do this over and over again you would be doing what is called Viterbi EM.

◮ real EM: (i) using some values for θ, for each d work out γ_d(k) for each value k ∈ A(z); (ii) pretend these γ_d(k) are counts in a virtual corpus of completions of d; (iii) re-estimate θ on this virtual data. If you go back to (i) and do this over and over again you would be doing what is called EM.

EM sketch specific for scenario 3 (hidden coin choice)

◮ Viterbi EM: (i) using some values for θ_a, θ_{h|a}, θ_{h|b}, for each d work out γ_d(k) for each value k ∈ {a, b}; (ii) pick the best Z = k and 'complete' d with this value for Z, making a virtual complete corpus; (iii) re-estimate θ_a, θ_{h|a}, θ_{h|b} on this virtual data. If you go back to (i) and do this over and over again you would be doing what is called Viterbi EM.

◮ real EM: (i) using some values for θ_a, θ_{h|a}, θ_{h|b}, for each d work out γ_d(k) for each value k ∈ {a, b}; (ii) pretend these γ_d(k) are counts in a virtual corpus of completions of d; (iii) re-estimate θ_a, θ_{h|a}, θ_{h|b} on this virtual data. If you go back to (i) and do this over and over again you would be doing what is called EM.

The EM algorithm

The EM algorithm is a parameter (re-)estimation procedure which, starting from some original setting of parameters θ^0, generates a converging sequence of re-estimates:

  θ^0 → ... → θ^n → θ^{n+1} → ... → θ^{final}

where each θ^n goes to θ^{n+1} by a so-called E step, followed by an M step:

E step: generate a virtual complete data corpus by treating each incomplete data item (x_d) as standing for all possible completions with values for z, (z = k, x_d), weighting each by its conditional probability P(z = k | x_d; θ^n) under the current parameters θ^n; this quantity is often called the responsibility. We use γ_d(k) for P(z = k | x_d).

M step: treating the 'responsibilities' γ_d(k) as if they were counts, apply maximum likelihood estimation to the virtual corpus to derive new estimates θ^{n+1}.

The E step gives weighted guesses, γ_d(k), for each way of completing each data point. These γ_d(k) are then treated as counts of virtual completed data, so each data point x_d is split into a virtual population:

  virtual data    virtual 'count'
  (z = 1, x_d)    γ_d(1)
      :               :
  (z = k, x_d)    γ_d(k)
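Putting the pieces together (not from the slides): a minimal Python sketch of real EM for scenario 3. The E step computes the responsibilities γ_d(k); the M step treats them as counts on the virtual corpus and re-estimates all three parameters by relative frequency. The starting values θ^0 and the iteration count are arbitrary choices; Viterbi EM would differ only in replacing each (γ_d(a), γ_d(b)) with (1, 0) or (0, 1) according to which is larger.

# (#(d,h), #(d,t)) for the nine observed sequences
data = [(8, 2), (2, 8), (7, 3), (8, 2), (1, 9), (9, 1), (7, 3), (9, 1), (3, 7)]

theta_a, theta_h_a, theta_h_b = 0.5, 0.6, 0.4    # theta^0: an arbitrary starting point

for _ in range(50):                              # theta^n -> theta^(n+1), repeated
    # E step: responsibilities gamma_d(a), gamma_d(b) for each datum
    gammas = []
    for h, t in data:
        ja = theta_a * theta_h_a**h * (1 - theta_h_a)**t
        jb = (1 - theta_a) * theta_h_b**h * (1 - theta_h_b)**t
        gammas.append((ja / (ja + jb), jb / (ja + jb)))
    # M step: maximum likelihood estimation on the virtual corpus,
    # with gamma_d(k) playing the role of counts
    theta_a   = sum(ga for ga, gb in gammas) / len(data)
    heads_a   = sum(ga * h for (ga, gb), (h, t) in zip(gammas, data))
    tails_a   = sum(ga * t for (ga, gb), (h, t) in zip(gammas, data))
    heads_b   = sum(gb * h for (ga, gb), (h, t) in zip(gammas, data))
    tails_b   = sum(gb * t for (ga, gb), (h, t) in zip(gammas, data))
    theta_h_a = heads_a / (heads_a + tails_a)
    theta_h_b = heads_b / (heads_b + tails_b)

print(theta_a, theta_h_a, theta_h_b)    # converged re-estimates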