
Language and Document Analysis: Motivating Latent Variable Models



  1. Language and Document Analysis: Motivating Latent Variable Models. Wray Buntine, National ICT Australia (NICTA). MLSS, ANU, January 2009.

  2. Part II: Problems and Methods. Contents: Part-of-Speech with Hidden Markov Models; Topics in Text with Discrete Component Analysis.

  3. Outline. We review some key problems and key algorithms using latent variables. 1. Part-of-Speech with Hidden Markov Models: Markov Model; Hidden Markov Model. 2. Topics in Text with Discrete Component Analysis: Background; Algorithms.

  4. Outline. We look at the Hidden Markov Model because it is an important base algorithm, and we use it to introduce Conditional Random Fields, a recent high-performance algorithm. 1. Part-of-Speech with Hidden Markov Models: Markov Model; Hidden Markov Model. 2. Topics in Text with Discrete Component Analysis.

  5. Parts of Speech: A Useful Example. A set of candidate POS tags exists for each word, taken from a dictionary or lexicon. Which is the right one in this sentence? Let's take some fully tagged data, where the truth is known, and use statistical learning. A standard notation for representing tags, used in this example, is: Fed/NNP raises/VBZ interest/NNS rates/NNS (in effort to control inflation.) 0.5/CD %/% ... We use this to illustrate Markov models and HMMs. Reference: Manning and Schütze, chapters 9 and 10.
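The word/TAG notation above is easy to read programmatically. Below is a minimal Python sketch of turning one tagged line into (word, tag) pairs; the function name is illustrative and not from the slides.

```python
def parse_tagged_sentence(line):
    """Split a whitespace-separated line of word/TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")  # rpartition keeps any "/" inside the word
        pairs.append((word, tag))
    return pairs

print(parse_tagged_sentence("Fed/NNP raises/VBZ interest/NNS rates/NNS"))
# -> [('Fed', 'NNP'), ('raises', 'VBZ'), ('interest', 'NNS'), ('rates', 'NNS')]
```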

  6. Outline. 1. Part-of-Speech with Hidden Markov Models: Markov Model; Hidden Markov Model. 2. Topics in Text with Discrete Component Analysis.

  7. Markov Model with Known Tags. There are I words; w_i is the i-th word and t_i is the tag for the i-th word. Our first-order Markov model in the figure shows which variables depend on which: the (i+1)-th tag depends on the i-th tag, and the i-th word depends on the i-th tag. The resulting formula is
  p(t_1, \ldots, t_I, w_1, \ldots, w_I) = p(t_1) \prod_{i=2}^{I} p(t_i \mid t_{i-1}) \prod_{i=1}^{I} p(w_i \mid t_i).
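As a sketch of how this factorisation is used in practice, the fragment below computes the log of the joint probability above. It assumes dict-based parameters c[k], a[(k_prev, k)] and b[(k, word)]; this layout mirrors the slides' notation but is otherwise an assumption of the sketch.

```python
import math

def log_joint(tags, words, a, b, c):
    """log p(t_1..t_I, w_1..w_I) under the first-order model above."""
    lp = math.log(c[tags[0]])                      # p(t_1)
    for i in range(1, len(tags)):
        lp += math.log(a[(tags[i - 1], tags[i])])  # p(t_i | t_{i-1})
    for t, w in zip(tags, words):
        lp += math.log(b[(t, w)])                  # p(w_i | t_i)
    return lp
```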

  8. Fitting Markov Model with Known Tags. We have
  p(t_1, \ldots, t_I, w_1, \ldots, w_I) = p(t_1) \prod_{i=2}^{I} p(t_i \mid t_{i-1}) \prod_{i=1}^{I} p(w_i \mid t_i).
  We have K distinct tags and J distinct words. Use p(t_i = k_1 \mid t_{i-1} = k_2) = a_{k_2, k_1}, p(t_1 = k) = c_k, and p(w_i = j \mid t_i = k) = b_{k, j}; a and b are probability matrices in which each conditional distribution sums to one. Collecting like terms, the likelihood becomes
  \prod_{k_1, k_2} a_{k_1, k_2}^{T_{k_1, k_2}} \prod_{k, j} b_{k, j}^{W_{k, j}} \prod_{k} c_k^{S_k},
  where T_{k_1, k_2} is the count of times tag k_2 follows tag k_1, W_{k, j} is the count of times tag k is assigned to word j, and S_k is the count of times a sentence starts with tag k.
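The counts T, W and S are simple sufficient statistics. A minimal sketch of collecting them from tagged sentences (each a list of (word, tag) pairs, as in the earlier parsing sketch) might look like this.

```python
from collections import Counter

def count_statistics(sentences):
    """sentences: iterable of [(word, tag), ...]; returns the counts T, W, S."""
    T, W, S = Counter(), Counter(), Counter()
    for sent in sentences:
        tags = [t for _, t in sent]
        S[tags[0]] += 1                      # S_k: sentence starts with tag k
        for k1, k2 in zip(tags, tags[1:]):
            T[(k1, k2)] += 1                 # T_{k1,k2}: tag k2 follows tag k1
        for word, k in sent:
            W[(k, word)] += 1                # W_{k,j}: tag k assigned to word j
    return T, W, S
```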

  9. Fitting Markov Model with Known Tags, cont. Standard maximum likelihood methods apply, so the parameters a and b become their observed proportions: a_{k_1, k_2} is the proportion of tags of type k_2 when the previous tag was k_1, and b_{k, j} is the proportion of words of type j when the tag was k. Thus
  a_{k_1, k_2} = T_{k_1, k_2} / \sum_{k_2} T_{k_1, k_2}, \quad b_{k, j} = W_{k, j} / \sum_{j} W_{k, j}, \quad c_k = S_k / \sum_{k} S_k.
  Note we have many sentences in the training data, and each one has a fresh start, so c_k is estimated from all those initial tags. As is standard when dealing with frequencies, we can smooth these estimates by adding small amounts to the numerator and denominator so that all quantities are non-zero.
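A minimal sketch of these smoothed proportions, using the counts from the previous sketch; the constant eps and the argument names are illustrative choices, not values given on the slide.

```python
def estimate_parameters(T, W, S, tag_set, vocab, eps=0.1):
    """Smoothed maximum-likelihood proportions: eps is added to every count."""
    a, b = {}, {}
    for k1 in tag_set:
        row = sum(T[(k1, k2)] for k2 in tag_set) + eps * len(tag_set)
        for k2 in tag_set:
            a[(k1, k2)] = (T[(k1, k2)] + eps) / row      # a_{k1,k2}
    for k in tag_set:
        row = sum(W[(k, j)] for j in vocab) + eps * len(vocab)
        for j in vocab:
            b[(k, j)] = (W[(k, j)] + eps) / row          # b_{k,j}
    start = sum(S.values()) + eps * len(tag_set)
    c = {k: (S[k] + eps) / start for k in tag_set}       # c_k
    return a, b, c
```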

  10. Comments. In practice, the naive estimation of a and b works poorly because we never have enough data: most words occur infrequently, so we cannot get good tag statistics for them. Kupiec (1992) suggested grouping infrequent words together based on their pattern of candidate POS tags, which overcomes the paucity of data with a reasonable compromise. For example, "red" and "black" can both be NN or JJ, so they belong to the same ambiguity class. Ambiguity classes are not used for frequent words. Unknown words are also a problem; a first approximation is to assign unknown words starting with a capital letter to NP.
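One way to realise Kupiec-style ambiguity classes is sketched below; the frequency threshold, the UNK labels and the lexicon layout are illustrative assumptions rather than details from the slide.

```python
def ambiguity_class(word, lexicon, word_freq, threshold=5):
    """Map an infrequent word to the set of its candidate POS tags."""
    if word_freq.get(word, 0) >= threshold:
        return word                          # frequent words keep their own statistics
    if word not in lexicon:                  # unknown word: crude capitalisation heuristic
        return "UNK-CAP" if word[:1].isupper() else "UNK"
    return frozenset(lexicon[word])          # e.g. "red", "black" -> frozenset({"NN", "JJ"})
```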

  11. Estimating Tags for New Text. We now fix the Markov model parameters a, b and \vec{c}. We have a new sentence with I words w_1, w_2, ..., w_I. How do we estimate its tag sequence? We ignore the lexical constraints for now (e.g., "interest" is VB, VBZ or NNS) and fold them in later. The task so described is
  \hat{t} = \operatorname{argmax}_{\vec{t}} \; p(\vec{t}, \vec{w} \mid a, b, \vec{c}),
  where the probability is as before.

  12. Estimating Tags for New Text, cont. We wish to solve
  \operatorname{argmax}_{\vec{t}} \; p(t_1) \prod_{i=2}^{I} p(t_i \mid t_{i-1}) \prod_{i=1}^{I} p(w_i \mid t_i).
  The task is simplified by the fact that knowing the value of tag t_N splits the problem neatly into parts, so define
  m(t_1) = p(t_1),
  m(t_N) = \max_{t_1, \ldots, t_{N-1} \mid t_N} \; p(t_1) \prod_{i=2}^{N} p(t_i \mid t_{i-1}) \prod_{i=1}^{N-1} p(w_i \mid t_i).
  We get the recursion for m(t_{N+1}):
  m(t_{N+1}) = \max_{t_1, \ldots, t_N \mid t_{N+1}} \; p(t_1) \prod_{i=2}^{N+1} p(t_i \mid t_{i-1}) \prod_{i=1}^{N} p(w_i \mid t_i)
  = \max_{t_N \mid t_{N+1}} \max_{t_1, \ldots, t_{N-1} \mid t_N, t_{N+1}} \; p(t_1) \prod_{i=2}^{N+1} p(t_i \mid t_{i-1}) \prod_{i=1}^{N} p(w_i \mid t_i)
  = \max_{t_N} \; p(t_{N+1} \mid t_N) \, p(w_N \mid t_N) \, m(t_N).

  13. Estimating Tags for New Text, cont. We apply this incrementally, building up a contingent solution from left to right. This is called the Viterbi algorithm, first developed in 1967.
  1. Initialise m(t_1): m(t_1 = k) = c_k.
  2. For i = 2, ..., I, compute m(t_i): m(t_i = k_1) = \max_{k_2} \left( a_{k_2, k_1} \, b_{k_2, w_{i-1}} \, m(t_{i-1} = k_2) \right), then store the backtrace, i.e. the k_2 that achieves the maximum for each t_i = k_1.
  3. At the end, i = I, find the maximum \hat{t}_I = \operatorname{argmax}_k \left( b_{k, w_I} \, m(t_I = k) \right), and chain back through the backtraces to get the maximum sequence for t_1, ..., t_I.
  This technique is an example of dynamic programming.
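A minimal sketch of this Viterbi recursion in log space, reusing the dict parameters assumed in the earlier sketches. It follows the m(t_i) definition on the previous slide, so the emission for the final word is folded in at the last step.

```python
import math

def viterbi(words, tag_set, a, b, c):
    """Most likely tag sequence under the model, following the m(t_i) recursion above."""
    I = len(words)
    m = [{k: math.log(c[k]) for k in tag_set}]        # step 1: m(t_1 = k) = log c_k
    back = [{}]
    for i in range(1, I):                             # step 2: positions 2..I
        m_i, back_i = {}, {}
        for k1 in tag_set:
            # score(k2) = log( a_{k2,k1} * b_{k2, w_{i-1}} * m(t_{i-1} = k2) )
            def score(k2):
                return (m[i - 1][k2] + math.log(a[(k2, k1)])
                        + math.log(b[(k2, words[i - 1])]))
            best = max(tag_set, key=score)
            m_i[k1], back_i[k1] = score(best), best
        m.append(m_i)
        back.append(back_i)
    # step 3: fold in the final word's emission, then follow the backtraces
    last = max(tag_set, key=lambda k: m[-1][k] + math.log(b[(k, words[-1])]))
    path = [last]
    for i in range(I - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path
```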

  14. Comments. What about lexical constraints, e.g., our dictionary tells us that "interest" is either VB, VBZ or NNS? Then p(w_i = 'interest' \mid t_i = 'JJS') = 0, so we would like to enforce zeros in some entries of the b matrix. Likewise, with the ambiguity classes above, and with the individual words, we just assign zeros to b_{k, j} for j the index of the word.
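A sketch of folding such constraints into b: for words listed in a candidate lexicon, emissions from disallowed tags are floored near zero (a tiny floor rather than an exact zero keeps the log-space Viterbi finite). The lexicon layout is an assumption of the sketch, and renormalisation of b is omitted for brevity.

```python
def apply_lexicon(b, tag_set, lexicon, floor=1e-12):
    """Force near-zero emission probability for tags outside a word's candidate set."""
    for word, allowed in lexicon.items():   # e.g. lexicon["interest"] = {"VB", "VBZ", "NNS"}
        for k in tag_set:
            if k not in allowed:
                b[(k, word)] = floor         # p(w_i = word | t_i = k) ~ 0
    return b
```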

  15. Estimating Tag Probabilities. We again fix the Markov model parameters a, b and \vec{c}. We have a new sentence with I words w_1, w_2, ..., w_I, and we have the most likely tag sequence from the Viterbi algorithm. What is the uncertainty here? The task can be described as: find the tag probabilities for each t_N,
  p(t_N \mid \vec{w}) \propto \sum_{\vec{t} \setminus t_N} p(\vec{t}, \vec{w} \mid a, b, \vec{c}),
  where the probability is as before.

  16. Estimating Tag Probabilities, cont. We wish to compute p(t_N \mid \vec{w}), obtained by normalising
  p(t_N, \vec{w}) = \sum_{\vec{t} \setminus t_N} \left( p(t_1) \prod_{i=2}^{I} p(t_i \mid t_{i-1}) \prod_{i=1}^{I} p(w_i \mid t_i) \right).
  Note we have:
  p(t_N, w_1, \ldots, w_{N-1}) = \sum_{t_1, \ldots, t_{N-1}} \left( p(t_1) \prod_{i=2}^{N} p(t_i \mid t_{i-1}) \prod_{i=1}^{N-1} p(w_i \mid t_i) \right),
  p(w_{N+1}, \ldots, w_I \mid t_N) = \sum_{t_{N+1}, \ldots, t_I} \left( \prod_{i=N+1}^{I} p(t_i \mid t_{i-1}) \prod_{i=N+1}^{I} p(w_i \mid t_i) \right).
  Thus
  p(t_N, \vec{w}) = p(t_N, w_1, \ldots, w_{N-1}) \, p(w_{N+1}, \ldots, w_I \mid t_N) \, p(w_N \mid t_N).
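The two sums above are the forward and backward quantities, and both can be computed by recursions analogous to Viterbi's, with max replaced by sum. A minimal sketch (unscaled, so very long sentences may underflow) assuming the same dict parameters as before:

```python
def tag_marginals(words, tag_set, a, b, c):
    """p(t_N | w) for every position N, via the forward/backward sums above."""
    I = len(words)
    # alpha[n][k] ~ p(t_{n+1} = k, w_1..w_n);  beta[n][k] ~ p(w_{n+2}..w_I | t_{n+1} = k)
    alpha = [{k: c[k] for k in tag_set}]
    for n in range(1, I):
        alpha.append({k: sum(alpha[n - 1][k2] * b[(k2, words[n - 1])] * a[(k2, k)]
                             for k2 in tag_set) for k in tag_set})
    beta = [dict() for _ in range(I)]
    beta[I - 1] = {k: 1.0 for k in tag_set}
    for n in range(I - 2, -1, -1):
        beta[n] = {k: sum(a[(k, k2)] * b[(k2, words[n + 1])] * beta[n + 1][k2]
                          for k2 in tag_set) for k in tag_set}
    marginals = []
    for n in range(I):
        joint = {k: alpha[n][k] * b[(k, words[n])] * beta[n][k] for k in tag_set}
        z = sum(joint.values())
        marginals.append({k: v / z for k, v in joint.items()})
    return marginals
```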
