
Course of Pattern Recognition: Stochastic Models Dealing with Sequences
Amaury Habrard
Laboratoire Hubert Curien, UMR CNRS 5516, Université Jean Monnet, Saint-Étienne

Outline: Introduction, Markov Models, Learning Edit Distances with EM.


1. References

1. Apprentissage Artificiel, Cornuéjols, Miclet, Eyrolles, 2002 (in French).
2. Machine Learning, Tom Mitchell, McGraw Hill, 1997.
3. Pattern Recognition, Sergios Theodoridis, Konstantinos Koutroumbas, Academic Press, 2006.

Important notice: we will see an instance of EM, which is a very important algorithm for learning the parameters of probabilistic models. See also "The Expectation Maximization Algorithm - A Short Tutorial", Sean Borman, technical report.

Amaury Habrard, Pattern Recognition & Machine Learning

2. Markov Models

Outline of this part: Observable Markov Models, Hidden Markov Models, Expectation-Maximization (EM) Algorithm.

3. Probability of a sequence of events

Many real-world problems require knowing the probability of a sequence of events: natural language processing, text recognition, weather forecasting, etc.

Definition: given a sequence of events a_1 ... a_k, the joint probability p(a_1 ... a_k) is obtained by the chain rule:

p(a_1 ... a_k) = p(a_1) × p(a_2 | a_1) × p(a_3 | a_1 a_2) × ... × p(a_k | a_1 ... a_{k-1})

where a_1 ... a_{k-1} is the history of symbol a_k.

For example, p(he reads a book) = p(he) × p(reads | he) × p(a | he reads) × p(book | he reads a) × p(# | he reads a book).

4. N-grams

Question: how can we model sequence probabilities? A natural way to simplify the previous computation is to bound the history: this yields n-grams.

Definition: an n-gram is a type of probabilistic model that predicts the next item a_i from the n-1 previously observed elements of the sequence, such that

p(a_i | a_1 ... a_{i-1}) = p(a_i | a_{i-(n-1)} a_{i-(n-2)} ... a_{i-1})

For example, with n = 3, we get the following 3-gram: p(a_i | a_1 ... a_{i-1}) = p(a_i | a_{i-2} a_{i-1}).

5. N-grams

Question: how can we train the n-gram model from a learning sample? Take the following learning sequence: aabaacaab. The conditional probabilities can be estimated by counting:

p(b | aa) = count(aab) / count(aa) = 2/3
p(c | aa) = count(aac) / count(aa) = 1/3
...
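The counting above can be sketched in a few lines of Python (the language and the helper name `ngram_probs` are mine, not the slides'):

```python
from collections import Counter

def ngram_probs(seq, n):
    """Estimate p(symbol | history) from one training sequence by counting:
    p[(history, symbol)] = count(history + symbol) / count(history)."""
    ngrams = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    # Count each (n-1)-symbol history as many times as it is followed by a symbol.
    hists = Counter(g[:-1] for g in ngrams.elements())
    return {(g[:-1], g[-1]): c / hists[g[:-1]] for g, c in ngrams.items()}

p = ngram_probs("aabaacaab", 3)
print(p[("aa", "b")])  # 2/3, as on the slide
print(p[("aa", "c")])  # 1/3
```

Real n-gram models also need smoothing for unseen histories; this sketch only reproduces the maximum-likelihood counts of the example.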

6. N-grams

From the estimated probabilities, we can deduce an automaton modeling the 3-grams. When n = 2 (bigram), the probability of a word (or symbol) depends only on the preceding one: a bigram is a first-order observable Markov model.

7. N-grams - Exercise

Exercise: consider the three following sequences of weather states over the allowed symbols C (Cloudy), S (Sunny) and R (Rainy):

CCRCSSS
SSCRSC
RRCCRS

1. Build the probabilistic automaton modeling the weather forecast with bigrams.
2. Compute the probability of the sequence RRCCRS.

8. Observable Markov Models

9. Formalism of Observable Markov Models

Definition: a stochastic model is a random process whose state s_i can change at each time step; the state at time t is noted q_t. The probability P(q_1, q_2, ..., q_T) of a sequence of states q_1, q_2, ..., q_T is computed as:

P(q_1, q_2, ..., q_T) = P(q_1) · P(q_2 | q_1) · P(q_3 | q_1 q_2) · ... · P(q_T | q_1, ..., q_{T-1})

Definition: a stochastic process fulfills the first-order Markov property if:

∀t, P(q_t = s_i | q_{t-1} = s_j, q_{t-2} = s_k, ...) = P(q_t = s_i | q_{t-1} = s_j)

10. Observable Markov Models

...which means that the global evolution is determined by the initial probability and the successive transitions. Thus,

P(q_1, q_2, ..., q_T) = P(q_1) · P(q_2 | q_1) · ... · P(q_T | q_{T-1}) = P(q_1) · ∏_{k=2}^{T} P(q_k | q_{k-1})

Definition: a Markov model is stationary if and only if

∀t, ∀k, P(q_t = s_i | q_{t-1} = s_j) = P(q_{t+k} = s_i | q_{t+k-1} = s_j)

...which means that, whatever the time t, the probability of a transition between two states s_j and s_i remains the same.

11. Observable Markov Models

Definition: an observable Markov model is a finite-state automaton in which each state is associated with a given observation. Transition probabilities between states are defined by a matrix A = [a_ij], where a_ij expresses the probability of going from state s_i to state s_j. More formally,

a_ij = P(q_t = s_j | q_{t-1} = s_i)

12. Observable Markov Models

Toy example: let us (try to) model the weather (rainy, sunny or cloudy) with an observable Markov model. Each possible observation is associated with a state: s_1 = rainy, s_2 = cloudy, s_3 = sunny. Suppose the transition probabilities of matrix A have been estimated from a training sample:

t \ t+1   rainy   cloudy   sunny
rainy     0.2     0.3      0.5
cloudy    0.6     0.2      0.2
sunny     0.1     0.1      0.8

a_{1,2} = 0.3 means that the probability of observing cloudy weather after a rainy day is 0.3.

13. Observable Markov Models

One can deduce the corresponding observable Markov chain. (Figure: a three-state diagram over Cloudy, Rainy and Sunny, with the transition probabilities of matrix A on its edges.)

14. Observable Markov Models

Exercise: a couple spending a week-end together plans to go to the museum and to the beach during the following two days. It is Friday and the weather is cloudy.

Question: what is the best schedule for that week-end?

15. Example of application

P(CCC) = 1 × P(C|C) × P(C|C) = 1 × 0.2 × 0.2 = 0.04
P(CCR) = 1 × 0.2 × 0.6 = 0.12
P(CCS) = 1 × 0.2 × 0.2 = 0.04
P(CRR) = 1 × 0.6 × 0.2 = 0.12
P(CRC) = 1 × 0.6 × 0.3 = 0.18
P(CRS) = 1 × 0.6 × 0.5 = 0.30
P(CSS) = 1 × 0.2 × 0.8 = 0.16
P(CSC) = 1 × 0.2 × 0.1 = 0.02
P(CSR) = 1 × 0.2 × 0.1 = 0.02

The most probable sequence is CRS, so the best schedule is Saturday = museum and Sunday = beach! Note that the probabilities sum up to 1.
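These hand computations can be checked with a minimal Python sketch of the weather chain (the dictionary encoding and the helper name `seq_prob` are mine):

```python
# Transition probabilities of the weather model from the slides
# (outer key: current day, inner key: next day).
A = {
    "R": {"R": 0.2, "C": 0.3, "S": 0.5},
    "C": {"R": 0.6, "C": 0.2, "S": 0.2},
    "S": {"R": 0.1, "C": 0.1, "S": 0.8},
}

def seq_prob(seq, start_prob=1.0):
    """P(q1,...,qT) = P(q1) * product of the successive transition probabilities."""
    p = start_prob
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

# Friday is cloudy, so P(q1 = C) = 1:
print(seq_prob("CRS"))  # 0.3, the most probable week-end
# Sanity check of the slide's remark: the nine length-3 sequences sum to 1.
print(sum(seq_prob("C" + x + y) for x in "RCS" for y in "RCS"))
```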

16. Property of a Markov Model

Property: given a Markov model M and a constant c, if ∀i, Σ_j a_ij = 1, then M describes a statistical distribution over all the sequences of size c, such that

Σ_{x ∈ {C,R,S}^c} P(x) = 1

17. Limitations of Observable MM: From Observable to Hidden MM

18. From Observable to Hidden Markov Models

Let's take an example from ornithology! Let θ be a theory stating that about 30 swans and 80 geese land every day on a given lake. Let us observe the sequence of arrivals on a given day:

At time t: some birds are arriving (20 swans and 5 geese); the theory seems to be wrong, but...
At time t+1: about 10 birds are landing, equally distributed between the two classes of birds.
At time t+2: many geese arrive, accompanied by some swans.

19. From Observable to Hidden Markov Models

Remark: it would not be relevant to state only that P(G) = 80/110 = 0.73 and P(S) = 30/110 = 0.27. Indeed, the arrival time of the birds is important for characterizing their species. It is important to take the three periods of time into account and wrap them into the model.

Question: how can we model such a phenomenon? → Hidden Markov Models

20. From Observable to Hidden Markov Models

Question: why are Hidden Markov Models important?

A possible observable Markov model would be a two-state automaton: state 1 = Goose, state 2 = Swan, with P(G) = p, P(S) = 1 - p, and transition probabilities p and 1 - p between the states.

21. Limitations of observable Markov models

Each state corresponds to a given observation: Swan or Goose. Observing a goose G (resp. a swan S) means being in state 1 (resp. 2). The probability of a sequence does not depend on the order of appearance of the birds: indeed, P(GGGS) = P(SGGG) = p^3 (1 - p).

Remark: we have seen that the arrival time is important, so a simple observable Markov model is not sufficient to model this phenomenon. There is no longer any reason to assume that the number of states is equal to the number of possible observations.

22. Discovering Hidden Markov Models by example

A possible way to model this phenomenon is a three-state model, one state per period of time:

          P(S)   P(G)
time 1    0.8    0.2
time 2    0.5    0.5
time 3    0.1    0.9

(Figure: the corresponding three-state automaton time 1 → time 2 → time 3, with transition probabilities 25/26, 1/26, 8/9, 1/9, 88/89, 1/89 on its edges.)

Question: how can we learn such a model?

23. Features of such a model

- It is represented in the form of a stochastic finite-state automaton.
- It models a statistical distribution over the sequences of birds landing on that lake.
- It can be viewed as a generative model from which sequences of arrivals can be sampled.
- It can be learned from examples that only belong to the target concept (positive examples).
- It allows us to classify new instances.

24. Hidden Markov model: Formalism

Two sets of random variables are involved: one is observable, the other is hidden. The observable variables are the observations themselves, O_1, O_2, ..., O_T (e.g. GGGSGS). The hidden variables are the states q_1, q_2, ..., q_T in which O_1, O_2, ..., O_T have been observed (e.g. time 1, time 2 or time 3).

Another example: 3 possible states (not revising, revising a little, revising a lot for exams) and 2 possible observations (passing or failing an exam). One can only observe the result (pass or fail), without any information about the level of revision.

25. Hidden Markov model: Formalism

A Hidden Markov Model (HMM) λ = (A, B, π) is defined by:

1. a set S of N states, S = {s_1, s_2, ..., s_N}. The state at a given time t is noted q_t (q_t ∈ S).
2. M observable symbols in each state, V = {v_1, v_2, ..., v_M}. An element O_t of V represents the symbol observed at time t.
3. a matrix A of transition probabilities between states: a_ij = A(i,j) = P(q_{t+1} = s_j | q_t = s_i), ∀i,j ∈ [1..N], ∀t ∈ [1..T].
4. a matrix B of observation probabilities of the symbols: b_j(k) = P(O_t = v_k | q_t = s_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M.
5. a vector π of initial state probabilities: π = {π_i}_{i=1,...,N}, with π_i = P(q_1 = s_i), 1 ≤ i ≤ N.

26. Links between HMMs and probabilistic automata

Let V be a finite alphabet of symbols. An HMM defines a distribution over the strings of size n (V^n). HMMs are equivalent to probabilistic non-deterministic finite automata (PNFA) without final probabilities. To define a distribution over all the possible (finite) strings (V* = ∪_n V^n), one needs to add a termination symbol (#) to V. In this case, HMMs with final probabilities define the same class of distributions as PNFA:

- for any HMM there exists an equivalent PNFA with the same number of states;
- for any PNFA with m states there exists an equivalent HMM with a number of states less than or equal to min(m^2, m × |V|).

Important notice: this slide and the next two are outside the scope of the course.

27. Example: conversion of a HMM into a PNFA

(Figure: transformation of an HMM into an equivalent PNFA.)

28. Example: conversion of a PNFA into a HMM

(Figure: transformation of a PNFA into an equivalent HMM.)

29. What are the problems to solve?

P1. Assessing, from the model λ, the probability of a sequence of observations O = {O_1, ..., O_T}.

Example: given an HMM modeling the English language, what is the probability that the sentence "the cat is sleeping on the bed" is syntactically correct?

Example: given an HMM modeling the arrival of swans and geese on the lake, what is the probability that a given sequence of birds follows this model?

30. What are the problems to solve?

P2. Searching for the most probable path (i.e. estimating the hidden part of the model). Given a sequence of observations O and a model λ, what is the sequence of states followed by O in λ?

Example: what is the most probable sequence of states a nuclear plant went through before a given accident?

31. What are the problems to solve?

P3. Learning the parameters A, B, π of λ. This requires an objective function. In machine learning, several inductive principles are available (you have already studied empirical risk minimization, ERM). Since we only have positive examples, we cannot apply the ERM principle. We instead optimize the likelihood of the learning set as the objective function:

L = ∏_{O ∈ LS} P(O | λ)

Note that we often consider the log-likelihood because of the properties of the log (it avoids numerical problems):

log L = Σ_{O ∈ LS} log P(O | λ)
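The numerical problem the slide alludes to is floating-point underflow, which a tiny Python experiment makes visible (the probability values below are my own toy numbers, not from the slides):

```python
import math

# Hypothetical per-sequence probabilities P(O | lambda): each is tiny, and
# their product underflows to 0.0 in double precision...
probs = [1e-20] * 20
product = 1.0
for p in probs:
    product *= p
print(product)            # 0.0: the likelihood has underflowed

# ...while the sum of logs stays perfectly representable.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)     # about -921.03
```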

32. The HMM and Bayesian classification

Question: how can HMMs be used to deal with a multi-class problem?

1. One HMM λ(k) is learned for each class c_k ∈ C = {c_1, ..., c_K}.
2. Given an observation O_i ∈ O, one selects the model (i.e. the class) that returns the maximum probability P(λ(k) | O_i), where

P(λ(k) | O_i) = P(O_i | λ(k)) · P(λ(k)) / P(O_i)

3. One thus needs to be able to compute P(O_i | λ(k)).

33. P1/ Computation of P(O | λ)

34. Computation of the probability P(O | λ)

Assumption: the learned model λ is available. We aim to compute P(O = O_1, ..., O_T | λ), i.e. the probability of observing O_1 at t = 1, then O_2 at t = 2, ..., and O_T at t = T. If one tests all the possible paths over a set of N states, the algorithmic complexity of computing the probability of a sequence of size T is Θ(N^T). Fortunately, dynamic programming reduces this complexity to Θ(N^2 T).

35. Forward Algorithm

Let α_t(i) = P(O_1 O_2 ... O_t, q_t = s_i | λ) be the probability of being in state s_i while having observed the first t symbols of O.

Initialization: α_1(i) = π_i b_i(O_1), 1 ≤ i ≤ N

Induction: α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

Final step: P(O | λ) = Σ_{i=1}^{N} α_T(i)
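The three steps translate directly into Python. As a sketch, I instantiate the leisure-activity HMM used in the exercises of the following slides (the list encoding and the index constants MUSEUM and BEACH are my own conventions):

```python
# Leisure-activity HMM reconstructed from the worked examples on the slides:
# states (Cloudy, Rainy, Sunny), observations (Museum, Beach).
pi = [1.0, 0.0, 0.0]              # Friday is cloudy
A = [[0.2, 0.6, 0.2],             # transitions, row = current state
     [0.3, 0.2, 0.5],
     [0.1, 0.1, 0.8]]
B = [[0.7, 0.3],                  # Cloudy: Museum 0.7, Beach 0.3
     [1.0, 0.0],                  # Rainy:  Museum 1.0, Beach 0.0
     [0.1, 0.9]]                  # Sunny:  Museum 0.1, Beach 0.9
MUSEUM, BEACH = 0, 1

def forward(O, pi, A, B):
    """P(O | lambda) by the forward recursion, Theta(N^2 T)."""
    N = len(pi)
    alpha = [pi[i] * B[i][O[0]] for i in range(N)]        # initialization
    for o in O[1:]:                                       # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                     # final step

print(forward([MUSEUM, BEACH], pi, A, B))  # ~0.168, the value given later
```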

36. Graphical explanation

(Figure: trellis of the forward recursion over times t = 1, 2, 3; each α_{t+1}(j) sums the contributions α_t(i) a_ij over all states i, weighted by b_j(O_{t+1}).)

37. Exercise

Assume that the following HMM models the leisure activities (Museum or Beach) according to the weather. The states are 1 = Cloudy (Museum: 0.7, Beach: 0.3), 2 = Rainy (Museum: 1, Beach: 0) and 3 = Sunny (Museum: 0.1, Beach: 0.9). The transitions are those of the weather model: a_11 = 0.2, a_12 = 0.6, a_13 = 0.2, a_21 = 0.3, a_22 = 0.2, a_23 = 0.5, a_31 = 0.1, a_32 = 0.1, a_33 = 0.8, and the initial vector is π = (1, 0, 0) since Friday is cloudy.

Recall the Forward Algorithm: α_1(i) = π_i b_i(O_1); induction α_{t+1}(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_{t+1}); final step P(O | λ) = Σ_{i=1}^{N} α_T(i).

What is the probability of going first to the museum and then to the beach?

38. P2/ Computation of the optimal path

39. Computation of the optimal path: The Viterbi algorithm

The Viterbi algorithm (Viterbi, 1967) is a dynamic programming algorithm for finding the most likely sequence of hidden states. The aim is to determine the best path corresponding to the observation O, i.e. argmax_Q P(Q, O | λ). Let δ_t(i) be the probability of the current best path leading to state s_i at time t, given the first t observations:

δ_t(i) = max_{q_1,...,q_{t-1}} P(q_1, q_2, ..., q_t = s_i, O_1, O_2, ..., O_t | λ)

By induction, one computes the next quantity δ_{t+1}(j) while storing the best predecessor states in a matrix φ:

δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1})
φ_{t+1}(j) = argmax_i [δ_t(i) a_ij]

40. Viterbi Algorithm

foreach state i do
    δ_1(i) ← π_i b_i(O_1); φ_1(i) ← 0;
t ← 2;
while t ≤ T do
    j ← 1;
    while j ≤ N do
        δ_t(j) ← max_i [δ_{t-1}(i) a_ij] b_j(O_t);
        φ_t(j) ← argmax_i [δ_{t-1}(i) a_ij];
        j ← j + 1;
    t ← t + 1;
q*_T ← argmax_i [δ_T(i)];
t ← T - 1;
while t ≥ 1 do
    q*_t ← φ_{t+1}(q*_{t+1});
    t ← t - 1;
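The pseudocode above can be sketched in Python; it reproduces the worked examples of the following slides on the leisure-activity HMM (the list encoding and index constants are my own conventions):

```python
# Leisure-activity HMM reconstructed from the worked examples on the slides:
# states 0=Cloudy, 1=Rainy, 2=Sunny; observations 0=Museum, 1=Beach.
pi = [1.0, 0.0, 0.0]
A = [[0.2, 0.6, 0.2],
     [0.3, 0.2, 0.5],
     [0.1, 0.1, 0.8]]
B = [[0.7, 0.3], [1.0, 0.0], [0.1, 0.9]]
MUSEUM, BEACH = 0, 1

def viterbi(O, pi, A, B):
    """Most likely state sequence argmax_Q P(Q, O | lambda)."""
    N = len(pi)
    delta = [pi[i] * B[i][O[0]] for i in range(N)]          # initialization
    phi = []                                                 # back-pointers
    for o in O[1:]:                                          # induction
        best = [max(range(N), key=lambda i: delta[i] * A[i][j])
                for j in range(N)]
        delta = [delta[best[j]] * A[best[j]][j] * B[j][o] for j in range(N)]
        phi.append(best)
    q = [max(range(N), key=lambda i: delta[i])]              # final state
    for back in reversed(phi):                               # backtracking
        q.append(back[q[-1]])
    q.reverse()
    return q

print(viterbi([MUSEUM, BEACH], pi, A, B))          # [0, 2]: (Cloudy, Sunny)
print(viterbi([BEACH, BEACH, MUSEUM], pi, A, B))   # [0, 0, 1]: (Cloudy, Cloudy, Rainy)
```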

41. Example

What is the optimal path for the sequence of activities O = (Museum, Beach)?

δ_1(1) = π_1 b_1(M) = 0.7 → φ_1(1) = 0
δ_1(2) = π_2 b_2(M) = 0.0 → φ_1(2) = 0
δ_1(3) = π_3 b_3(M) = 0.0 → φ_1(3) = 0

δ_2(1) = max(δ_1(1) a_11, δ_1(2) a_21, δ_1(3) a_31) × b_1(B) = max(0.7 × 0.2, 0 × 0.3, 0 × 0.1) × 0.3 = 0.042 → φ_2(1) = 1
δ_2(2) = max(δ_1(1) a_12, δ_1(2) a_22, δ_1(3) a_32) × b_2(B) = max(0.7 × 0.6, 0 × 0.2, 0 × 0.1) × 0 = 0 → φ_2(2) = 1
δ_2(3) = max(δ_1(1) a_13, δ_1(2) a_23, δ_1(3) a_33) × b_3(B) = max(0.7 × 0.2, 0 × 0.5, 0 × 0.8) × 0.9 = 0.126 → φ_2(3) = 1

42. Example

Finally, we get:
φ_1(1) = 0, φ_2(1) = 1
φ_1(2) = 0, φ_2(2) = 1
φ_1(3) = 0, φ_2(3) = 1

max(δ_2(1), δ_2(2), δ_2(3)) = 0.126, so q*_2 = 3 and q*_1 = φ_2(3) = 1.

We deduce that the best sequence of states is (cloudy, sunny).

43. Exercise

What is the optimal path for O = (Beach, Beach, Museum)? (Apply the Viterbi algorithm to the same leisure-activity HMM.)

44. Example

t = 1 (Beach):
δ_1(1) = 1 × b_1(B) = 1 × 0.3 = 0.3
δ_1(2) = 0 × b_2(B) = 0 × 0 = 0
δ_1(3) = 0 × b_3(B) = 0 × 0.9 = 0

t = 2 (Beach):
δ_2(1) = max(0.3 × a_11, 0, 0) × b_1(B) = (0.3 × 0.2) × 0.3 = 0.018
δ_2(2) = max(0.3 × a_12, 0, 0) × b_2(B) = (0.3 × 0.6) × 0 = 0
δ_2(3) = max(0.3 × a_13, 0, 0) × b_3(B) = (0.3 × 0.2) × 0.9 = 0.054

t = 3 (Museum):
δ_3(1) = max(0.018 × a_11, 0, 0.054 × a_31) × b_1(M) = max(0.0036, 0, 0.0054) × 0.7 = 0.00378
δ_3(2) = max(0.018 × a_12, 0, 0.054 × a_32) × b_2(M) = max(0.0108, 0, 0.0054) × 1 = 0.0108
δ_3(3) = max(0.018 × a_13, 0, 0.054 × a_33) × b_3(M) = max(0.0036, 0, 0.0432) × 0.1 = 0.00432

We deduce that the best sequence of states is (cloudy (state 1), cloudy (state 1), rainy (state 2)) (see next slide).

45. Example

Finally, we get:
φ_1(1) = 0, φ_2(1) = 1, φ_3(1) = 3
φ_1(2) = 0, φ_2(2) = 1, φ_3(2) = 1
φ_1(3) = 0, φ_2(3) = 1, φ_3(3) = 3

max(δ_3(1), δ_3(2), δ_3(3)) = max(0.00378, 0.0108, 0.00432) = 0.0108, so q*_3 = 2, q*_2 = φ_3(2) = 1 and q*_1 = φ_2(1) = 1.

We deduce that the best sequence of states is (cloudy, cloudy, rainy).

46. Viterbi Algorithm

Complexity:
Time: O(N^2 T), where N is the number of states and T the size of the observation.
Space: O(NT) (one path per state).

47. P3/ Learning of the model λ

48. Learning of the model λ

Aim: learn the parameters (A, B, π) that maximize the likelihood of a training set of observations O = {O_1, ..., O_m}. In other words, we wish to estimate the model parameters for which the observed data are the most likely.

49. Learning of the model λ

Let O = {O_1, ..., O_m} be a learning set. We aim to maximize the likelihood

P(O | λ) = ∏_{k=1}^{m} P(O_k | λ)

that is, we search for the parameters of λ such that

argmax_λ [P(O | λ)] = argmax_λ [∏_{k=1}^{m} P(O_k | λ)]

50. Baum-Welch Algorithm

The Baum-Welch algorithm (named after Leonard E. Baum and Lloyd R. Welch) is also called the forward-backward algorithm. It computes the probability of a sequence using two procedures, called forward and backward:

1. α_t(i) (already used!) is the probability of being in state s_i while having observed the first t symbols of the current observation O_k.
2. β_t(i) is the probability of observing, from state s_i, the last symbols (from t+1 to T) of O_k.

51. Backward Algorithm

β_t(i) = P(O_{t+1}, ..., O_T | q_t = s_i)

Initialization: β_T(i) = 1, 1 ≤ i ≤ N

Induction: β_t(j) = Σ_{i=1}^{N} a_ji b_i(O_{t+1}) β_{t+1}(i), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N

Final step: P(O | λ) = Σ_{i=1}^{N} π_i b_i(O_1) β_1(i)
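A Python sketch of the backward recursion, on the same leisure-activity HMM, gives the same P(O | λ) as the forward pass (the encoding and the helper name `backward_prob` are mine):

```python
# Leisure-activity HMM reconstructed from the worked examples on the slides:
# states 0=Cloudy, 1=Rainy, 2=Sunny; observations 0=Museum, 1=Beach.
pi = [1.0, 0.0, 0.0]
A = [[0.2, 0.6, 0.2],
     [0.3, 0.2, 0.5],
     [0.1, 0.1, 0.8]]
B = [[0.7, 0.3], [1.0, 0.0], [0.1, 0.9]]
MUSEUM, BEACH = 0, 1

def backward_prob(O, pi, A, B):
    """P(O | lambda) via the backward recursion."""
    N = len(pi)
    beta = [1.0] * N                                       # beta_T(i) = 1
    for o in reversed(O[1:]):                              # induction, t = T-1 .. 1
        beta = [sum(A[j][i] * B[i][o] * beta[i] for i in range(N))
                for j in range(N)]
    return sum(pi[i] * B[i][O[0]] * beta[i] for i in range(N))  # final step

print(backward_prob([MUSEUM, BEACH], pi, A, B))  # ~0.168, as with forward
```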

52. Graphical explanation

(Figure: trellis of the backward recursion over times t = 1 and t = 2. Backward runs right-to-left, forward runs left-to-right; their final values are equal.)

53. Exercise

Consider the hidden Markov model λ = (A, B, π) of the leisure-activity example (Cloudy, Rainy and Sunny states emitting Museum or Beach).

Question: compute P(Museum, Beach | λ) using the backward function. (Answer: 0.168.)

54. Baum-Welch Algorithm

We have two functions (forward and backward) at our disposal to learn A, B and π. The Baum-Welch algorithm is based on the following steps:

1. Choose an initial set of parameters λ_0. Advice: use non-zero values, to prevent parameters from never being used during the learning process. Principle: reinforcement of some paths given the training set and the initialized parameters.
2. Compute λ_1 from λ_0 using the expectation and maximization steps. Goal: increase the likelihood.
3. Repeat this process until convergence, using a threshold or a statistical test.

  55. Baum-Welch Algorithm

Baum-Welch is a generalized expectation-maximization (EM) algorithm.

Definition: The expectation-maximization (EM) algorithm is a powerful computational technique for locating maxima of functions. It is widely used in statistics for maximum-likelihood estimation of parameters in incomplete- or hidden-data models.

Remark: In the HMM context, the parameters are A, B and π, and the hidden variables are the states reached during the induction.

  56. EM Algorithm

Question: Why is this algorithm called expectation-maximization?

Answering this question requires some background in optimization...

  57. EM Algorithm

Let O be an observable sequence of events, and λ = (A, B, π) the parameters we want to learn. The objective is to find λ such that P(O | λ) is maximum.

Rather than optimizing P(O | λ), we maximize L(λ) = ln P(O | λ):
- since ln(x) is a strictly increasing function, the value of λ which maximizes P(O | λ) also maximizes ln P(O | λ);
- ln(x) is a concave function, to which Jensen's inequality can be applied (we will see later why this inequality is useful).
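The first point can be checked numerically: maximizing a function over a grid of candidate parameter values and maximizing its logarithm pick out the same value. A small illustrative sketch with the toy function P(w) = w(1 − w):

```python
import math

# The value maximizing P also maximizes ln P, since ln is strictly increasing.
grid = [i / 100 for i in range(1, 100)]       # candidate parameter values
P = lambda w: w * (1 - w)
best_p = max(grid, key=P)                      # argmax of P
best_logp = max(grid, key=lambda w: math.log(P(w)))  # argmax of ln P
assert best_p == best_logp == 0.5
```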

  58. EM Algorithm

Theorem (Jensen's inequality): Let f be a concave function defined on an interval I. If x_1, x_2, …, x_n ∈ I and γ_1, γ_2, …, γ_n ≥ 0 with Σ_{i=1}^{n} γ_i = 1, then

f(Σ_{i=1}^{n} γ_i x_i) ≥ Σ_{i=1}^{n} γ_i f(x_i)
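A quick numerical sanity check of the inequality for the concave function ln, on random points and random weights summing to 1 (illustrative sketch only):

```python
import math
import random

# Check Jensen's inequality for f = ln:
#   ln(sum_i gamma_i * x_i)  >=  sum_i gamma_i * ln(x_i)
random.seed(0)
for _ in range(1000):
    x = [random.uniform(0.1, 10.0) for _ in range(5)]
    g = [random.random() for _ in range(5)]
    s = sum(g)
    g = [gi / s for gi in g]                  # normalize: weights sum to 1
    lhs = math.log(sum(gi * xi for gi, xi in zip(g, x)))
    rhs = sum(gi * math.log(xi) for gi, xi in zip(g, x))
    assert lhs >= rhs - 1e-12
```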

  59. EM Algorithm

The EM algorithm is an iterative procedure. Assume that after the n-th iteration the current estimate for λ is given by λ_n. Since the objective is to maximize ln P(O | λ), we wish to compute an updated estimate λ that maximizes

L(λ) − L(λ_n) = ln P(O | λ) − ln P(O | λ_n)    (1)

So far, we have not considered any unobserved variables z. To integrate z into the optimization process, we can note that:

P(O | λ) = Σ_z P(O | z, λ) × P(z | λ)

  60. EM Algorithm

Equation 1 can be rewritten as

L(λ) − L(λ_n) = ln ( Σ_z P(O | z, λ) × P(z | λ) ) − ln P(O | λ_n)    (2)

Jensen's inequality can then be applied to Equation 2, where the constants γ_i take the form of P(z | O, λ_n):

L(λ) − L(λ_n)
= ln ( Σ_z P(O | z, λ) × P(z | λ) ) − ln P(O | λ_n)
= ln ( Σ_z P(O | z, λ) × P(z | λ) × P(z | O, λ_n) / P(z | O, λ_n) ) − ln P(O | λ_n)
= ln ( Σ_z P(z | O, λ_n) × [ P(O | z, λ) P(z | λ) / P(z | O, λ_n) ] ) − ln P(O | λ_n)
≥ Σ_z P(z | O, λ_n) ln ( P(O | z, λ) P(z | λ) / P(z | O, λ_n) ) − ln P(O | λ_n)
= Σ_z P(z | O, λ_n) ln ( P(O | z, λ) P(z | λ) / [ P(z | O, λ_n) P(O | λ_n) ] )

  61. EM Algorithm

Therefore, our objective is to choose λ such that

L(λ) ≥ L(λ_n) + Σ_z P(z | O, λ_n) ln ( P(O | z, λ) P(z | λ) / [ P(z | O, λ_n) P(O | λ_n) ] )

is maximized. Let λ_{n+1} be this updated value:

λ_{n+1} = argmax_λ { L(λ_n) + Σ_z P(z | O, λ_n) ln [ P(O | z, λ) P(z | λ) / ( P(z | O, λ_n) P(O | λ_n) ) ] }

Now drop the terms which are constant w.r.t. λ:

λ_{n+1} = argmax_λ { Σ_z P(z | O, λ_n) ln P(O | z, λ) P(z | λ) }
        = argmax_λ { Σ_z P(z | O, λ_n) ln [ P(O, z, λ) / P(z, λ) × P(z, λ) / P(λ) ] }
        = argmax_λ { Σ_z P(z | O, λ_n) ln P(O, z | λ) }
        = argmax_λ { E_{z | O, λ_n} { ln P(O, z | λ) } }

  62. EM Algorithm

Conclusion: In λ_{n+1} = argmax_λ { E_{z | O, λ_n} { ln P(O, z | λ) } }, the expectation and maximization steps are apparent. The EM algorithm thus consists of iterating:

1. E-step: determine the conditional expectation E_{z | O, λ_n} { ln P(O, z | λ) }. This is done using the α_t(i) (forward) and β_t(i) (backward) procedures.
2. M-step: maximize this expression with respect to λ.

Following this principle, one can analytically derive the update rules for the parameters A, B, π.

  63. Baum-Welch Algorithm

Definition: Let p_t^k(i, j) be the probability of using the transition going from state s_i (emitting the symbol O_t^k) to state s_j (emitting O_{t+1}^k) on the k-th observation of the learning set.

p_t^k(i, j) = P(q_t = s_i, q_{t+1} = s_j | O^k, λ)
            = P(q_t = s_i, q_{t+1} = s_j, O^k | λ) / P(O^k | λ)
            = α_t^k(i) · a_ij · b_j(O_{t+1}^k) · β_{t+1}^k(j) / P(O^k | λ)

p_t^k(i, j) will be useful to estimate matrix A.

  64. Baum-Welch Algorithm

Definition: Let γ_t^k(i) be the probability that the t-th symbol of O^k is emitted in s_i.

γ_t^k(i) = P(q_t = s_i | O^k, λ)
         = Σ_{j=1}^{N} P(q_t = s_i, q_{t+1} = s_j | O^k, λ)
         = Σ_{j=1}^{N} P(q_t = s_i, q_{t+1} = s_j, O^k | λ) / P(O^k | λ)
         = Σ_{j=1}^{N} p_t^k(i, j)
         = α_t^k(i) · β_t^k(i) / P(O^k | λ)

γ_t^k(i) will be useful to estimate matrix B.

  65. Baum-Welch Algorithm

Definition: One can now estimate the new parameters of the model:

π_i = (1/m) Σ_{k=1}^{m} γ_1^k(i)

...which means that π_i is the proportion of times that state s_i is used to emit the first symbol of a sequence.

  66. Baum-Welch Algorithm

Definition:

a_ij = P(q_{t+1} = s_j | q_t = s_i) = [ Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} p_t^k(i, j) ] / [ Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ_t^k(i) ]

...which means that a_ij is the proportion of times that the transition from s_i to s_j is used in the learning set.

  67. Baum-Welch Algorithm

Definition:

b_j(l) = P(O_t = v_l | q_t = s_j) = [ Σ_{k=1}^{m} Σ_{t : O_t^k = v_l} γ_t^k(j) ] / [ Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ_t^k(j) ]

...which means that b_j(l) is the proportion of times that the HMM is in state s_j and emits the symbol v_l.

  68. Baum-Welch Algorithm

Definition: All the parameters of the HMM can now be estimated:

π_i = (1/m) Σ_{k=1}^{m} γ_1^k(i)

a_ij = P(q_{t+1} = s_j | q_t = s_i) = [ Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} p_t^k(i, j) ] / [ Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ_t^k(i) ]

b_j(l) = P(O_t = v_l | q_t = s_j) = [ Σ_{k=1}^{m} Σ_{t : O_t^k = v_l} γ_t^k(j) ] / [ Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ_t^k(j) ]
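The three update rules can be sketched as a single re-estimation pass over a training set of sequences. A minimal NumPy sketch (the helper names forward, backward and baum_welch_step are ours; a full trainer would iterate this step until the likelihood stops increasing):

```python
import numpy as np

def forward(A, B, pi, O):
    """alpha[t, i] = P(O_1 ... O_t, q_t = s_i | lambda)."""
    N, T = A.shape[0], len(O)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, O[t]]
    return alpha

def backward(A, B, pi, O):
    """beta[t, i] = P(O_{t+1} ... O_T | q_t = s_i, lambda)."""
    N, T = A.shape[0], len(O)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(A, B, pi, observations):
    """One EM re-estimation of (A, B, pi) from a list of sequences."""
    N, M = B.shape
    num_A = np.zeros((N, N)); den_A = np.zeros(N)
    num_B = np.zeros((N, M)); den_B = np.zeros(N)
    new_pi = np.zeros(N)
    m = len(observations)
    for O in observations:
        T = len(O)
        alpha, beta = forward(A, B, pi, O), backward(A, B, pi, O)
        P = alpha[T - 1].sum()                    # P(O^k | lambda)
        gamma = alpha * beta / P                  # gamma_t(i)
        new_pi += gamma[0]
        for t in range(T - 1):
            # p_t(i, j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P
            xi = (alpha[t][:, None] * A * B[:, O[t + 1]][None, :]
                  * beta[t + 1][None, :]) / P
            num_A += xi
            den_A += gamma[t]
        for t in range(T):
            num_B[:, O[t]] += gamma[t]
            den_B += gamma[t]
    # guard against states that are never visited (zero denominators)
    new_A = np.divide(num_A, den_A[:, None], out=np.zeros_like(num_A),
                      where=den_A[:, None] > 0)
    new_B = np.divide(num_B, den_B[:, None], out=np.zeros_like(num_B),
                      where=den_B[:, None] > 0)
    return new_A, new_B, new_pi / m
```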

  69. Example

Let us consider the following HMM with 6 states, π_0 = {1, 0, 0, 0, 0, 0} and O = {bca#, cca#, bbba#, bcba#, cbba#, ccba#}.

A0 =       1    2    3    4    5    6
      1    0    0.5  0.5  0    0    0
      2    0    0    0    0    1    0
      3    0    0    0    1    0    0
      4    0    0    0    0    1    0
      5    0    0    0    0    0    1
      6    0    0    0    0    0    0

B0 =       a     b     c     #
      1    0.33  0.33  0.33  0
      2    0.33  0.33  0.33  0
      3    0.33  0.33  0.33  0
      4    0.33  0.33  0.33  0
      5    0.33  0.33  0.33  0
      6    0     0     0     1

[State diagram: 1 → 2 and 1 → 3 with probability 0.5 each; 2 → 5, 3 → 4, 4 → 5 and 5 → 6 with probability 1.]

  70. Example

Update of a_12, the probability of the transition between s_1 and s_2:

a_12 = P(q_2 = s_2 | q_1 = s_1) = Σ_{k=1}^{m} p_1^k(1, 2) / Σ_{k=1}^{m} γ_1^k(1)

We know that

p_t^k(i, j) = α_t^k(i) · a_ij · b_j(O_{t+1}^k) · β_{t+1}^k(j) / P(O^k | λ)

Let us consider the first example, O^1 = bca# (probabilities equal to 1 are omitted):

p_1^1(1, 2) = α_1^1(1) · a_12 · b_2(O_2^1) · β_2^1(2) / P(O^1 | λ) = (1/3 × 1/2 × 1/3 × 1/3) / (1/54) = 1
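The worked value p_1^1(1, 2) = 1 can be checked by brute force, enumerating every state path of the slide-69 HMM for O^1 = bca# (a small sketch; the symbol and zero-based state encodings are ours):

```python
from itertools import product

# HMM of slide 69 (states 1..6 encoded as 0..5).
A = [[0, .5, .5, 0, 0, 0], [0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 0]]
B = {s: {'a': 1/3, 'b': 1/3, 'c': 1/3, '#': 0} for s in range(5)}
B[5] = {'a': 0, 'b': 0, 'c': 0, '#': 1}
pi = [1, 0, 0, 0, 0, 0]
O = ['b', 'c', 'a', '#']

def path_prob(path):
    """Joint probability of emitting O along one state path."""
    p = pi[path[0]] * B[path[0]][O[0]]
    for t in range(1, len(O)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][O[t]]
    return p

total = sum(path_prob(p) for p in product(range(6), repeat=len(O)))
# p_1(1, 2): share of the mass on paths in state 1 at t=1 and state 2 at t=2
joint = sum(path_prob(p) for p in product(range(6), repeat=len(O))
            if p[0] == 0 and p[1] == 1)
print(total, joint / total)   # P(O^1 | lambda) = 1/54 and p_1(1, 2) = 1.0
```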

  71. Example

[Trellis for O^1 = bca#: the six states s_1 … s_6 against the symbols b, c, a, #, annotated with the transition and emission probabilities along the paths and the resulting values (1/18, 1/54, …); the overall probability is P(O^1 | λ) = 1/54.]

  72. Conclusion

HMMs are generative models that describe a probability distribution over sequences.

There is no theoretical result on the optimal number of states: the structure of the HMM is provided by the user.

The states do not correspond to real physical phenomena.

  73.

HMMs are models that allow us to deal with strings. Another way to handle such structured data is to rely on specific metrics, which enable us to use standard methods such as the k-nearest-neighbor algorithm.

The edit distance is probably the most widely used metric for comparing two strings. However, it highly depends on the costs assigned to each edit operation.

The next part deals with a new method for learning those parameters.

  74. String Edit Distance

Definition: The Levenshtein distance (or edit distance) between two strings x = x_1 … x_T and y = y_1 … y_V is the minimum number of edit operations needed to transform x into y, where an operation is an insertion, deletion, or substitution of a single character.

Rather than counting the number of edit operations, one can assign a cost to each edit operation and search for the least costly transformation:
- c(x_i, y_j) is the cost of the substitution of x_i by y_j,
- c(x_i, λ) is the cost of the deletion of x_i (x_i into the empty symbol λ),
- c(λ, y_j) is the cost of the insertion of y_j.
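The cost-parameterized definition above translates directly into the classical dynamic program. A minimal sketch, where the three cost parameters mirror c(x_i, y_j), c(x_i, λ) and c(λ, y_j); with unit costs it computes the Levenshtein distance:

```python
def edit_distance(x, y,
                  sub_cost=lambda a, b: 0 if a == b else 1,
                  del_cost=lambda a: 1,
                  ins_cost=lambda b: 1):
    """Cost-parameterized edit distance via dynamic programming."""
    T, V = len(x), len(y)
    # D[i][j] = cheapest transformation of x[:i] into y[:j]
    D = [[0.0] * (V + 1) for _ in range(T + 1)]
    for i in range(1, T + 1):
        D[i][0] = D[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, V + 1):
        D[0][j] = D[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, T + 1):
        for j in range(1, V + 1):
            D[i][j] = min(D[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),
                          D[i - 1][j] + del_cost(x[i - 1]),
                          D[i][j - 1] + ins_cost(y[j - 1]))
    return D[T][V]

print(edit_distance("kitten", "sitting"))   # classical example: 3
```

Passing non-unit cost functions (e.g. a cheaper substitution for keys adjacent on a keyboard) changes the neighborhood structure, which is exactly the sensitivity discussed on the next slide.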

  75. Edit Distance

Remark: We can make the following remarks:
- The impact of the choice of the edit costs is very important.
- Modifying the edit costs changes the neighborhood of a given string, and hence may modify its classification.

Three possible ways to tune the edit costs:
1. Arbitrary choice...
2. Use of background knowledge (e.g. on a QWERTY keyboard, the key "w" is more often changed into a "q" or an "e" than into an "m").
3. Learn the edit parameters from a training set.

  76.

Question: How to learn the edit costs?

Remark: A string can be changed into another one according to different edit scripts. Therefore, the edit scripts can be considered as hidden parameters. The observable parameters are the symbols of a pair of (input, output) strings.

The EM algorithm can be used to learn those parameters!

  77. Learning framework of the edit distance

One can use the EM algorithm and the forward and backward procedures to learn an edit distance. Since EM is a probabilistic method, one first learns the probability P(x, y) of a pair of strings x and y.

A stochastic edit distance (in fact, an edit similarity) can be deduced from the negative logarithm of P(x, y), such that:

DE(x, y) = − log P(x, y)

The symmetry property of the distance can be lost: distance → similarity.
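To make DE(x, y) = −log P(x, y) concrete, here is a sketch of the joint probability P(x, y) under a memoryless stochastic edit model, computed with a forward-style dynamic program over edit operations. This is an illustrative assumption about the model's form (edit-operation probability tables p_sub, p_del, p_ins and a stop probability are our hypothetical parameters), not the slides' exact parameterization:

```python
import math

def joint_prob(x, y, p_sub, p_del, p_ins, p_stop):
    """P(x, y) under a memoryless stochastic edit model.

    F[i][j] accumulates the probability of all edit scripts that
    transform x[:i] into y[:j]; missing table entries count as 0.
    """
    T, V = len(x), len(y)
    F = [[0.0] * (V + 1) for _ in range(T + 1)]
    F[0][0] = 1.0
    for i in range(T + 1):
        for j in range(V + 1):
            if i > 0:                                    # deletion of x[i-1]
                F[i][j] += F[i - 1][j] * p_del.get(x[i - 1], 0.0)
            if j > 0:                                    # insertion of y[j-1]
                F[i][j] += F[i][j - 1] * p_ins.get(y[j - 1], 0.0)
            if i > 0 and j > 0:                          # substitution
                F[i][j] += F[i - 1][j - 1] * p_sub.get((x[i - 1], y[j - 1]), 0.0)
    return F[T][V] * p_stop

def stochastic_edit_distance(x, y, *model):
    """DE(x, y) = -log P(x, y); infinite when the pair has zero probability."""
    p = joint_prob(x, y, *model)
    return float('inf') if p == 0 else -math.log(p)
```

Note that DE(x, y) need not equal DE(y, x) unless the operation probabilities are symmetric, which is exactly the loss of symmetry mentioned above.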
