Discrete HMM
A simple example:
With two states and three frames there are 8 possible state paths. Number them 1: (1,1,1), 2: (1,1,2), 3: (1,2,1), 4: (1,2,2), 5: (2,1,1), 6: (2,1,2), 7: (2,2,1), 8: (2,2,2), write $p$ for the joint likelihood of path $p$ with the data, and let $\text{all} = 1 + 2 + \cdots + 8$ be the sum over all paths (the total likelihood). The expected complete-data log-likelihood then collects as
$$\left[\frac{1+2+3+4}{\text{all}}\right]\log\pi_1 + \left[\frac{5+6+7+8}{\text{all}}\right]\log\pi_2 \;=\; \gamma_1(1)\log\pi_1 + \gamma_1(2)\log\pi_2$$
$$+\;\left[\frac{1+2}{\text{all}} + \frac{1+5}{\text{all}}\right]\log a_{11} + \left[\frac{3+4}{\text{all}} + \frac{2+6}{\text{all}}\right]\log a_{12} \qquad (i=1;\ \text{terms for } t=1 \text{ and } t=2)$$
$$+\;\left[\frac{5+6}{\text{all}} + \frac{3+7}{\text{all}}\right]\log a_{21} + \left[\frac{7+8}{\text{all}} + \frac{4+8}{\text{all}}\right]\log a_{22} \qquad (i=2;\ \text{terms for } t=1 \text{ and } t=2)$$
Discrete HMM
The Forward/Backward Procedure
$$\gamma_t(i) = \frac{P(s_t=i, X \mid \lambda)}{P(X \mid \lambda)} = \frac{P(s_t=i, X \mid \lambda)}{\sum_{j=1}^{N} P(s_t=j, X \mid \lambda)} = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\,\beta_t(j)}$$
$$\xi_t(i,j) = \frac{P(s_t=i, s_{t+1}=j, X \mid \lambda)}{P(X \mid \lambda)} = \frac{P(s_t=i, s_{t+1}=j, X \mid \lambda)}{\sum_{i=1}^{N}\sum_{j=1}^{N} P(s_t=i, s_{t+1}=j, X \mid \lambda)} = \frac{\alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(x_{t+1})\, \beta_{t+1}(j)}$$
(Trellis figure: states $s_1, s_2$ over time frames 1-3 with observations $x_1, x_2, x_3$; each node carries the product $\alpha_t(i)\beta_t(i)$.)
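A minimal numpy sketch of how these posteriors can be computed once the forward and backward variables are available; the function name, argument layout, and the assumption that α and β have already been computed with the usual recursions are illustrative, not from the slides:

```python
import numpy as np

def posteriors(alpha, beta, A, B, obs):
    """Compute gamma_t(i) and xi_t(i,j) from forward/backward variables.

    alpha, beta : (T, N) forward and backward probabilities
    A           : (N, N) transition matrix a_ij
    B           : (N, K) discrete emission matrix b_j(k)
    obs         : length-T sequence of symbol indices
    """
    T, N = alpha.shape
    # gamma_t(i) = alpha_t(i) beta_t(i) / sum_j alpha_t(j) beta_t(j)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)

    # xi_t(i,j) = alpha_t(i) a_ij b_j(x_{t+1}) beta_{t+1}(j) / normaliser
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        num = alpha[t][:, None] * A * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]
        xi[t] = num / num.sum()
    return gamma, xi
```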
Discrete HMM
Q-function:
$$Q(\bar\lambda \mid \lambda) = \sum_{S} \frac{p(X, S \mid \lambda)}{\sum_{S} p(X, S \mid \lambda)} \cdot \log p(X, S \mid \bar\lambda)$$
$$= \sum_{i=1}^{N} \Pr(s_1=i \mid X, \lambda)\,\log \bar\pi_i \;+\; \sum_{i=1}^{N}\sum_{j=1}^{N} \left( \sum_{t=1}^{T-1} \Pr(s_t=i, s_{t+1}=j \mid X, \lambda) \right)\log \bar a_{ij}$$
$$\quad +\; \sum_{j=1}^{N}\sum_{k=1}^{K} \left( \sum_{t:\, x_t \sim v_k} \Pr(s_t=j, x_t \sim v_k \mid X, \lambda) \right)\log \bar b_{jk}$$
The weights are the posteriors $\gamma_1(i)$ and $\xi_t(i,j)$, and for the emission term the counts $\sum_{t=1}^{T} \gamma_t(j)\,\mathbf{1}(x_t = v_k)$ with total occupancy $\sum_{t=1}^{T} \gamma_t(j)$.
Discrete HMM
R-function:
For simplicity, prior independence of π, A and B is assumed. The prior density for λ is then
$$p(\lambda) = p(\pi) \cdot p(A) \cdot p(B)$$
and the component densities take the form of Dirichlet distributions:
$$p(\lambda) = K_c \left[ \prod_{i=1}^{N} \pi_i^{\eta_i - 1} \right]\left[ \prod_{i=1}^{N}\prod_{j=1}^{N} a_{ij}^{\eta_{ij} - 1} \right]\left[ \prod_{i=1}^{N}\prod_{k=1}^{K} b_{ik}^{\nu_{ik} - 1} \right]$$
where $\eta_i, \eta_{ij}, \nu_{ik} > 1$.
$$\lambda_{MAP} = \arg\max_{\lambda} \log p(\lambda \mid X) = \arg\max_{\lambda}\left[ \log p(X \mid \lambda) + \log p(\lambda) \right] = \arg\max_{\lambda}\left[ Q(\lambda \mid \lambda') + \log p(\lambda) \right]$$
We define the auxiliary function $R(\bar\lambda \mid \lambda) = Q(\bar\lambda \mid \lambda) + \log p(\bar\lambda)$.
Discrete HMM
$$\therefore\; R(\bar\lambda \mid \lambda) = \Psi \;(\text{constant}) \;+\; \sum_{i=1}^{N}\left[ \Pr(s_1=i \mid X, \lambda) + \eta_i - 1 \right]\log \bar\pi_i$$
$$+\; \sum_{i=1}^{N}\sum_{j=1}^{N}\left[ \sum_{t=1}^{T-1} \Pr(s_t=i, s_{t+1}=j \mid X, \lambda) + \eta_{ij} - 1 \right]\log \bar a_{ij}$$
$$+\; \sum_{j=1}^{N}\sum_{k=1}^{K}\left[ \sum_{t:\, x_t \sim v_k} \Pr(s_t=j, x_t = v_k \mid X, \lambda) + \nu_{jk} - 1 \right]\log \bar b_{jk}$$
Discrete HMM
So, by maximizing $R(\bar\lambda \mid \lambda)$ under the sum-to-one constraints we obtain
$$\bar\pi_i = \frac{\Pr(s_1=i \mid X, \lambda) + \eta_i - 1}{\sum_{i=1}^{N}\left[ \Pr(s_1=i \mid X, \lambda) + \eta_i - 1 \right]}$$
$$\bar a_{ij} = \frac{\sum_{t=1}^{T-1} \Pr(s_t=i, s_{t+1}=j \mid X, \lambda) + \eta_{ij} - 1}{\sum_{j=1}^{N}\left[ \sum_{t=1}^{T-1} \Pr(s_t=i, s_{t+1}=j \mid X, \lambda) + \eta_{ij} - 1 \right]}$$
$$\bar b_{jk} = \frac{\sum_{t:\, x_t = v_k} \Pr(s_t=j, x_t = v_k \mid X, \lambda) + \nu_{jk} - 1}{\sum_{k=1}^{K}\left[ \sum_{t:\, x_t = v_k} \Pr(s_t=j, x_t = v_k \mid X, \lambda) + \nu_{jk} - 1 \right]}$$
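A small sketch of one MAP re-estimation step for a single observation sequence, assuming the posteriors γ and ξ have already been computed and all Dirichlet hyperparameters exceed 1; the function name and argument layout are hypothetical:

```python
import numpy as np

def map_reestimate(gamma, xi, obs, eta_pi, eta_A, nu_B):
    """One MAP re-estimation step for a discrete HMM with Dirichlet priors.

    gamma  : (T, N) state posteriors gamma_t(i)
    xi     : (T-1, N, N) transition posteriors xi_t(i, j)
    obs    : length-T sequence of symbol indices in 0..K-1
    eta_pi : (N,)   Dirichlet hyperparameters for pi  (all > 1)
    eta_A  : (N, N) Dirichlet hyperparameters for A   (all > 1)
    nu_B   : (N, K) Dirichlet hyperparameters for B   (all > 1)
    """
    T, N = gamma.shape
    K = nu_B.shape[1]

    # pi_i  proportional to  gamma_1(i) + eta_i - 1
    pi_num = gamma[0] + eta_pi - 1.0
    pi = pi_num / pi_num.sum()

    # a_ij  proportional to  sum_t xi_t(i,j) + eta_ij - 1
    A_num = xi.sum(axis=0) + eta_A - 1.0
    A = A_num / A_num.sum(axis=1, keepdims=True)

    # b_jk  proportional to  sum_{t: x_t = v_k} gamma_t(j) + nu_jk - 1
    counts = np.zeros((N, K))
    for t in range(T):
        counts[:, obs[t]] += gamma[t]
    B_num = counts + nu_B - 1.0
    B = B_num / B_num.sum(axis=1, keepdims=True)
    return pi, A, B
```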
Discrete HMM
• How to choose the initial estimates for $\pi_i$, $a_{ij}$ and $b_{jk}$?
• One reasonable choice of the initial estimate is the mode of the prior density:
$$\pi_i^{(0)} = \frac{\eta_i - 1}{\sum_{p=1}^{N}(\eta_p - 1)}, \quad i = 1, \ldots, N$$
$$a_{ij}^{(0)} = \frac{\eta_{ij} - 1}{\sum_{p=1}^{N}(\eta_{ip} - 1)}, \quad i, j = 1, \ldots, N$$
$$b_{jk}^{(0)} = \frac{\nu_{jk} - 1}{\sum_{p=1}^{K}(\nu_{jp} - 1)}, \quad j = 1, \ldots, N \text{ and } k = 1, \ldots, K$$
Discrete HMM
• What's the mode? If $\lambda_{mode}$ is the mode of the prior density, then $\lambda_{mode} = \arg\max_{\lambda} p(\lambda)$.
  – Applying a Lagrange multiplier we can easily derive the modes above.
  – Example: $p(\pi_1, \ldots, \pi_N) \propto \prod_{i=1}^{N} \pi_i^{\eta_i - 1} \Rightarrow \log p(\pi_1, \ldots, \pi_N) = \Psi + \sum_{i=1}^{N}(\eta_i - 1)\log\pi_i$
With the constraint $\sum_{p=1}^{N}\pi_p = 1$ and multiplier $\ell$:
$$\frac{\partial}{\partial \pi_i}\left[ \log p(\pi_1, \ldots, \pi_N) + \ell\left( \sum_{p=1}^{N}\pi_p - 1 \right) \right] = (\eta_i - 1)\frac{1}{\pi_i} + \ell = 0 \;\Rightarrow\; \pi_i = \frac{\eta_i - 1}{-\ell}$$
but $\sum_{p=1}^{N}\pi_p = 1$, so $-\ell = \sum_{p=1}^{N}(\eta_p - 1)$ and therefore
$$\pi_i = \frac{\eta_i - 1}{\sum_{p=1}^{N}(\eta_p - 1)}$$
Discrete HMM
• Another reasonable choice of the initial estimate is the mean of the prior density:
$$\pi_i^{(0)} = \frac{\eta_i}{\sum_{p=1}^{N}\eta_p}, \qquad a_{ij}^{(0)} = \frac{\eta_{ij}}{\sum_{p=1}^{N}\eta_{ip}}, \qquad b_{jk}^{(0)} = \frac{\nu_{jk}}{\sum_{p=1}^{K}\nu_{jp}}$$
for $i, j = 1, \ldots, N$ and $k = 1, \ldots, K$.
• Both are summaries of the information available about the parameters before any data are observed.
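A short illustration of the two choices, computing the mode and the mean of a row-wise Dirichlet prior; the helper names and the example hyperparameter values are made up for the demonstration:

```python
import numpy as np

def dirichlet_mode(hyper):
    """Mode of a Dirichlet density with parameters > 1 (rows normalised)."""
    hyper = np.atleast_2d(np.asarray(hyper, dtype=float))
    num = hyper - 1.0
    return num / num.sum(axis=1, keepdims=True)

def dirichlet_mean(hyper):
    """Mean of a Dirichlet density (rows normalised)."""
    hyper = np.atleast_2d(np.asarray(hyper, dtype=float))
    return hyper / hyper.sum(axis=1, keepdims=True)

# e.g. hypothetical prior counts for the rows of the transition matrix A
eta_A = np.array([[5.0, 2.0], [1.5, 9.0]])
A0_mode = dirichlet_mode(eta_A)   # (eta_ij - 1) / sum_p (eta_ip - 1)
A0_mean = dirichlet_mean(eta_A)   #  eta_ij      / sum_p  eta_ip
```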
SCHMM
Likelihood ⇒ Semi-Continuous HMM
Prior ⇒ Dirichlet + normal-Wishart
Let
$$p(X \mid \Lambda) = \sum_{S} \pi_{s_1}\, b_{s_1}(x_1) \prod_{t=2}^{T} a_{s_{t-1}s_t}\, b_{s_t}(x_t)$$
be the likelihood, where $X = \{x_1, \ldots, x_T\}$ and
$$b_i(x_t) = \sum_{k=1}^{K} w_{ik}\, N(x_t \mid m_k, r_k)$$
$\Lambda = \{\lambda_1, \ldots, \lambda_M, \theta_1, \ldots, \theta_K\}$, where $M$ is the total number of HMMs,
$\lambda_m = \{\pi_i^{(m)}, a_{ij}^{(m)}, w_{ik}^{(m)} \mid i, j = 1, \ldots, N \text{ (state number)},\; k = 1, \ldots, K\}$
and $\theta_k = \{m_k, r_k\}$, $k = 1, \ldots, K$ (mixture number), where
$$N(x \mid m_k, r_k) = (2\pi)^{-D/2}\,|r_k|^{1/2}\, e^{-\frac{1}{2}(x - m_k)^T r_k (x - m_k)}$$
SCHMM
The prior density for Λ is assumed to be (with independent factors)
$$g(\Lambda) = \left[ \prod_{m=1}^{M} g(\lambda_m) \right]\left[ \prod_{k=1}^{K} g(m_k, r_k) \right]$$
where
$$g(\lambda_m) \propto K_c \left[ \prod_{i=1}^{N} \pi_i^{\eta_i - 1} \right]\left[ \prod_{i=1}^{N}\prod_{j=1}^{N} a_{ij}^{\eta_{ij} - 1} \right]\left[ \prod_{i=1}^{N}\prod_{k=1}^{K} w_{ik}^{\nu_{ik} - 1} \right]$$
If $r_k$ is a full precision matrix, then $g(m_k, r_k)$ is assumed to be normal-Wishart:
$$g(m_k, r_k) \propto |r_k|^{\frac{\alpha_k - D}{2}}\, e^{-\frac{\tau_k}{2}(m_k - \mu_k)^T r_k (m_k - \mu_k)}\, e^{-\frac{1}{2}\operatorname{tr}(u_k r_k)}$$
with $\alpha_k > D - 1$, $\tau_k > 0$, $\mu_k$ a vector of dimension $D$, and $u_k$ a $D \times D$ positive definite matrix.
If $r_k$ is a diagonal precision matrix, then $g(m_k, r_k)$ is assumed to be a product of normal-gamma densities:
$$g(m_k, r_k) \propto \prod_{d=1}^{D} r_{kd}^{\alpha_{kd} - 1/2}\, e^{-\frac{\tau_{kd}}{2} r_{kd}(m_{kd} - \mu_{kd})^2}\, e^{-\beta_{kd} r_{kd}}$$
SCHMM
Let $X^{(m,n)}$ denote the $n$-th observation sequence of length $T^{(m,n)}$ associated with model $m$, where each model $m$ has $W_m$ observation sequences. The MAP estimate of Λ can be obtained by
$$\Lambda_{MAP} = \arg\max_{\Lambda}\left[ \prod_{m=1}^{M}\prod_{n=1}^{W_m} f(X^{(m,n)} \mid \lambda_m, \Theta) \right] g(\Lambda)$$
(Training data: model 1 has sequences $X^{(1,1)}, \ldots, X^{(1,W_1)}$; model 2 has $X^{(2,1)}, \ldots, X^{(2,W_2)}$; ...; model $M$ has $X^{(M,1)}, \ldots, X^{(M,W_M)}$.)
SCHMM
Q-function:
Define a Q-function as
$$Q(\bar\Lambda \mid \Lambda) = \sum_{m=1}^{M}\sum_{n=1}^{W_m} E\left[ \log f(X^{(m,n)}, S, L \mid \bar\Lambda) \;\middle|\; X^{(m,n)}, \Lambda \right]$$
$$= \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{S^{(m,n)}}\sum_{L^{(m,n)}} f(S^{(m,n)}, L^{(m,n)} \mid X^{(m,n)}, \Lambda)\, \log f(X^{(m,n)}, S^{(m,n)}, L^{(m,n)} \mid \bar\Lambda)$$
$$= \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{S^{(m,n)}}\sum_{L^{(m,n)}} \frac{f(S^{(m,n)}, L^{(m,n)}, X^{(m,n)} \mid \Lambda)}{f(X^{(m,n)} \mid \Lambda)}\, \log f(X^{(m,n)}, S^{(m,n)}, L^{(m,n)} \mid \bar\Lambda)$$
where
$$f(X^{(m,n)}, S^{(m,n)}, L^{(m,n)} \mid \bar\Lambda) = \pi_{s_1}\, w_{s_1 l_1}\, N(x_1^{(m,n)} \mid m_{l_1}, r_{l_1}) \prod_{t=2}^{T^{(m,n)}} a_{s_{t-1}s_t}\, w_{s_t l_t}\, N(x_t^{(m,n)} \mid m_{l_t}, r_{l_t})$$
SCHMM
Q-function:
The Q-function can therefore be decomposed as
$$Q(\bar\Lambda \mid \Lambda) = \sum_{m=1}^{M}\sum_{i=1}^{N}\left( \sum_{n=1}^{W_m}\gamma_1^{(m,n)}(i) \right)\log\bar\pi_i^{(m)} \;+\; \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{j=1}^{N}\left( \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\gamma_t^{(m,n)}(i,j) \right)\log\bar a_{ij}^{(m)}$$
$$+\; \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{k=1}^{K}\left( \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(i,k) \right)\log\bar w_{ik}^{(m)} \;+\; \sum_{k=1}^{K}\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\,\log N(x_t^{(m,n)} \mid \bar m_k, \bar r_k)$$
where
$$\gamma_t^{(m,n)}(i,j) = \Pr(s_t^{(m,n)}=i, s_{t+1}^{(m,n)}=j \mid X^{(m,n)}, \lambda_m), \qquad \gamma_t^{(m,n)}(i) = \Pr(s_t^{(m,n)}=i \mid X^{(m,n)}, \lambda_m)$$
$$\xi_t^{(m,n)}(i,k) = \Pr(s_t^{(m,n)}=i, l_t^{(m,n)}=k \mid X^{(m,n)}, \lambda_m), \qquad \xi_t^{(m,n)}(k) = \Pr(l_t^{(m,n)}=k \mid X^{(m,n)}, \lambda_m)$$
and
$$\xi_t^{(m,n)}(i,k) = \gamma_t^{(m,n)}(i) \cdot \frac{w_{ik}^{(m)}\, N(x_t^{(m,n)} \mid m_k, r_k)}{\sum_{k=1}^{K} w_{ik}^{(m)}\, N(x_t^{(m,n)} \mid m_k, r_k)}$$
SCHMM
$$\log g(\bar\Lambda) = \sum_{m=1}^{M}\sum_{i=1}^{N}(\eta_i^{(m)} - 1)\log\bar\pi_i^{(m)} + \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{j=1}^{N}(\eta_{ij}^{(m)} - 1)\log\bar a_{ij}^{(m)} + \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{k=1}^{K}(\nu_{ik}^{(m)} - 1)\log\bar w_{ik}^{(m)} + \sum_{k=1}^{K}\log g(\bar m_k, \bar r_k) + \text{Constant}$$
$$R(\bar\Lambda \mid \Lambda) = Q(\bar\Lambda \mid \Lambda) + \log g(\bar\Lambda) = \sum_{m=1}^{M}\sum_{i=1}^{N}\left[ \left( \sum_{n=1}^{W_m}\gamma_1^{(m,n)}(i) \right) + \eta_i^{(m)} - 1 \right]\log\bar\pi_i^{(m)}$$
$$+\; \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{j=1}^{N}\left[ \left( \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\gamma_t^{(m,n)}(i,j) \right) + \eta_{ij}^{(m)} - 1 \right]\log\bar a_{ij}^{(m)}$$
$$+\; \sum_{m=1}^{M}\sum_{i=1}^{N}\sum_{k=1}^{K}\left[ \left( \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(i,k) \right) + \nu_{ik}^{(m)} - 1 \right]\log\bar w_{ik}^{(m)}$$
$$+\; \sum_{k=1}^{K}\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\,\log N(x_t^{(m,n)} \mid \bar m_k, \bar r_k) + \sum_{k=1}^{K}\log g(\bar m_k, \bar r_k) + \text{Constant}$$
SCHMM Initial probability
• Differentiate $R(\bar\Lambda \mid \Lambda)$ w.r.t. $\bar\pi_i^{(m)}$ (adding a Lagrange multiplier $\ell$ for the constraint $\sum_j \bar\pi_j^{(m)} = 1$) and equate to zero:
$$\left[ \left( \sum_{n=1}^{W_m}\gamma_1^{(m,n)}(i) \right) + \eta_i^{(m)} - 1 \right]\frac{1}{\bar\pi_i^{(m)}} + \ell = 0 \;\Rightarrow\; \bar\pi_i^{(m)} = \frac{\left( \sum_{n=1}^{W_m}\gamma_1^{(m,n)}(i) \right) + \eta_i^{(m)} - 1}{-\ell}$$
Since $\sum_{j=1}^{N}\bar\pi_j^{(m)} = 1$, we get $-\ell = \sum_{j=1}^{N}\left[ \left( \sum_{n=1}^{W_m}\gamma_1^{(m,n)}(j) \right) + \eta_j^{(m)} - 1 \right]$, hence
$$\therefore\; \bar\pi_i^{(m)} = \frac{\eta_i^{(m)} - 1 + \sum_{n=1}^{W_m}\gamma_1^{(m,n)}(i)}{\sum_{j=1}^{N}\eta_j^{(m)} - N + \sum_{j=1}^{N}\sum_{n=1}^{W_m}\gamma_1^{(m,n)}(j)}$$
SCHMM Transition probability
• Differentiate $R(\bar\Lambda \mid \Lambda)$ w.r.t. $\bar a_{ij}^{(m)}$ (with multiplier $\ell$ for $\sum_j \bar a_{ij}^{(m)} = 1$) and equate to zero:
$$\left[ \left( \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\gamma_t^{(m,n)}(i,j) \right) + \eta_{ij}^{(m)} - 1 \right]\frac{1}{\bar a_{ij}^{(m)}} + \ell = 0 \;\Rightarrow\; \bar a_{ij}^{(m)} = \frac{\left( \sum_{n}\sum_{t}\gamma_t^{(m,n)}(i,j) \right) + \eta_{ij}^{(m)} - 1}{-\ell}$$
Since $\sum_{j=1}^{N}\bar a_{ij}^{(m)} = 1$, $-\ell = \sum_{j=1}^{N}\left[ \left( \sum_{n}\sum_{t}\gamma_t^{(m,n)}(i,j) \right) + \eta_{ij}^{(m)} - 1 \right]$, hence
$$\therefore\; \bar a_{ij}^{(m)} = \frac{\eta_{ij}^{(m)} - 1 + \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\gamma_t^{(m,n)}(i,j)}{\sum_{j=1}^{N}\eta_{ij}^{(m)} - N + \sum_{j=1}^{N}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\gamma_t^{(m,n)}(i,j)}$$
SCHMM Mixture weight
• Differentiate $R(\bar\Lambda \mid \Lambda)$ w.r.t. $\bar w_{ik}^{(m)}$ (with multiplier $\ell$ for $\sum_k \bar w_{ik}^{(m)} = 1$) and equate to zero:
$$\left[ \left( \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(i,k) \right) + \nu_{ik}^{(m)} - 1 \right]\frac{1}{\bar w_{ik}^{(m)}} + \ell = 0 \;\Rightarrow\; \bar w_{ik}^{(m)} = \frac{\left( \sum_{n}\sum_{t}\xi_t^{(m,n)}(i,k) \right) + \nu_{ik}^{(m)} - 1}{-\ell}$$
Since $\sum_{k=1}^{K}\bar w_{ik}^{(m)} = 1$, $-\ell = \sum_{k=1}^{K}\left[ \left( \sum_{n}\sum_{t}\xi_t^{(m,n)}(i,k) \right) + \nu_{ik}^{(m)} - 1 \right]$, hence
$$\therefore\; \bar w_{ik}^{(m)} = \frac{\nu_{ik}^{(m)} - 1 + \sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(i,k)}{\sum_{k=1}^{K}\nu_{ik}^{(m)} - K + \sum_{k=1}^{K}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(i,k)}$$
SCHMM
• Differentiating $R(\bar\Lambda \mid \Lambda)$ w.r.t. $\bar m_k$ and equating it to zero:
$$\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\,\frac{\partial \log N(x_t^{(m,n)} \mid \bar m_k, \bar r_k)}{\partial \bar m_k} + \frac{\partial \log g(\bar m_k, \bar r_k)}{\partial \bar m_k} = 0 \qquad (55)$$
• Differentiating $R(\bar\Lambda \mid \Lambda)$ w.r.t. $\bar r_k$ and equating it to zero:
$$\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\,\frac{\partial \log N(x_t^{(m,n)} \mid \bar m_k, \bar r_k)}{\partial \bar r_k} + \frac{\partial \log g(\bar m_k, \bar r_k)}{\partial \bar r_k} = 0 \qquad (56)$$
SCHMM Full Covariance
• Full covariance matrix case:
$$\frac{\partial \log N(x_t^{(m,n)} \mid m_k, r_k)}{\partial m_k} = \frac{\partial}{\partial m_k}\left[ -\tfrac{1}{2}(x_t^{(m,n)} - m_k)^T r_k (x_t^{(m,n)} - m_k) \right] = r_k\,(x_t^{(m,n)} - m_k)$$
$$\frac{\partial \log g(m_k, r_k)}{\partial m_k} = \frac{1}{g(m_k, r_k)}\,\frac{\partial}{\partial m_k}\left[ |r_k|^{\frac{\alpha_k - D}{2}}\, e^{-\frac{\tau_k}{2}(m_k - \mu_k)^T r_k (m_k - \mu_k)}\, e^{-\frac{1}{2}\operatorname{tr}(u_k r_k)} \right] = -\tau_k\, r_k\,(m_k - \mu_k)$$
SCHMM Full Covariance
• Full covariance matrix case: substituting into (55),
$$\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\left[ r_k(x_t^{(m,n)} - \bar m_k) \right] - \tau_k\, r_k(\bar m_k - \mu_k) = 0$$
$$\Rightarrow \left[ \sum_{m,n,t}\xi_t^{(m,n)}(k) + \tau_k \right]\bar m_k = \sum_{m,n,t}\xi_t^{(m,n)}(k)\, x_t^{(m,n)} + \tau_k\,\mu_k$$
$$\therefore\; \bar m_k = \frac{\tau_k\,\mu_k + \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\, x_t^{(m,n)}}{\tau_k + \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)}$$
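A minimal sketch of this mean update with all training sequences concatenated along the time axis; the function signature is an assumption, not part of the slides:

```python
import numpy as np

def map_mean_update(xi_k, X, tau_k, mu_k):
    """MAP update of a shared Gaussian mean m_k (full-covariance SCHMM).

    xi_k  : (T_total,) posterior occupancies xi_t(k), all sequences concatenated
    X     : (T_total, D) observation vectors
    tau_k : scalar prior weight tau_k
    mu_k  : (D,) prior mean mu_k
    """
    # m_k = (tau_k * mu_k + sum_t xi_t(k) x_t) / (tau_k + sum_t xi_t(k))
    num = tau_k * mu_k + (xi_k[:, None] * X).sum(axis=0)
    den = tau_k + xi_k.sum()
    return num / den
```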
SCHMM Full Covariance
• Full covariance matrix case:
$$\frac{\partial \log N(x_t^{(m,n)} \mid m_k, r_k)}{\partial r_k} = \frac{\partial}{\partial r_k}\left[ \tfrac{1}{2}\log|r_k| - \tfrac{1}{2}(x_t^{(m,n)} - m_k)^T r_k (x_t^{(m,n)} - m_k) \right] = \tfrac{1}{2}\left[ r_k^{-1} - (x_t^{(m,n)} - m_k)(x_t^{(m,n)} - m_k)^T \right]$$
SCHMM Full Covariance
• Full covariance matrix case: writing $g(m_k, r_k) \propto (1)(2)(3)$ with $(1) = |r_k|^{\frac{\alpha_k - D}{2}}$, $(2) = e^{-\frac{\tau_k}{2}(m_k - \mu_k)^T r_k (m_k - \mu_k)}$ and $(3) = e^{-\frac{1}{2}\operatorname{tr}(u_k r_k)}$, the product rule gives
$$\frac{\partial \log g(m_k, r_k)}{\partial r_k} = \frac{1}{g(m_k, r_k)}\left[ \frac{\alpha_k - D}{2}\,r_k^{-1}\,(1)(2)(3) \;-\; \frac{\tau_k}{2}(m_k - \mu_k)(m_k - \mu_k)^T\,(1)(2)(3) \;-\; \frac{1}{2}u_k\,(1)(2)(3) \right]$$
$$= \frac{\alpha_k - D}{2}\,r_k^{-1} - \frac{\tau_k}{2}(m_k - \mu_k)(m_k - \mu_k)^T - \frac{1}{2}u_k$$
SCHMM Full Covariance
• Full covariance matrix case: substituting into (56),
$$\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\,\tfrac{1}{2}\left[ \bar r_k^{-1} - (x_t^{(m,n)} - \bar m_k)(x_t^{(m,n)} - \bar m_k)^T \right] + \frac{\alpha_k - D}{2}\bar r_k^{-1} - \frac{\tau_k}{2}(\bar m_k - \mu_k)(\bar m_k - \mu_k)^T - \frac{1}{2}u_k = 0$$
$$\Rightarrow \bar r_k^{-1}\left[ \sum_{m,n,t}\xi_t^{(m,n)}(k) + \alpha_k - D \right] = u_k + \tau_k(\bar m_k - \mu_k)(\bar m_k - \mu_k)^T + \sum_{m,n,t}\xi_t^{(m,n)}(k)(x_t^{(m,n)} - \bar m_k)(x_t^{(m,n)} - \bar m_k)^T$$
$$\Rightarrow \bar r_k^{-1} = \frac{u_k + \tau_k(\bar m_k - \mu_k)(\bar m_k - \mu_k)^T + \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)(x_t^{(m,n)} - \bar m_k)(x_t^{(m,n)} - \bar m_k)^T}{\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k) + \alpha_k - D}$$
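A corresponding sketch of the precision (inverse covariance) update; again the signature and the convention of passing concatenated statistics are assumptions:

```python
import numpy as np

def map_precision_update(xi_k, X, m_bar, tau_k, mu_k, alpha_k, u_k):
    """MAP update of the inverse precision (covariance) of Gaussian k.

    xi_k    : (T_total,) posterior occupancies xi_t(k)
    X       : (T_total, D) observations
    m_bar   : (D,) updated mean from the MAP mean formula
    tau_k, alpha_k : scalar prior weights
    mu_k    : (D,) prior mean
    u_k     : (D, D) prior scale matrix
    Returns the updated covariance, i.e. r_k^{-1}.
    """
    D = X.shape[1]
    diff_data = X - m_bar                       # (T, D)
    scatter = (xi_k[:, None, None] *
               diff_data[:, :, None] * diff_data[:, None, :]).sum(axis=0)
    diff_prior = (m_bar - mu_k)[:, None]        # (D, 1)
    num = u_k + tau_k * (diff_prior @ diff_prior.T) + scatter
    den = xi_k.sum() + alpha_k - D
    return num / den
```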
SCHMM Full Covariance
• The initial estimate can be chosen as the mode of the prior PDF:
  $\pi_i^{(m)}, a_{ij}^{(m)}, w_{ik}^{(m)}$: same as for the DHMM, and
  $$m_k = \mu_k, \qquad r_k^{-1} = (\alpha_k - D)^{-1}\, u_k$$
• It can also be chosen as the mean of the prior PDF:
  $\pi_i^{(m)}, a_{ij}^{(m)}, w_{ik}^{(m)}$: same as for the DHMM, and
  $$m_k = \mu_k, \qquad r_k^{-1} = \alpha_k^{-1}\, u_k$$
SCHMM Diagonal Covariance
• Diagonal covariance matrix case: then
$$N(x_t^{(m,n)} \mid m_k, r_k) \propto \left( \prod_{d=1}^{D} r_{kd}^{1/2} \right)\exp\left( -\tfrac{1}{2}\sum_{d=1}^{D} r_{kd}(x_{td}^{(m,n)} - m_{kd})^2 \right)$$
and
$$g(m_k, r_k) \propto \prod_{d=1}^{D} r_{kd}^{\alpha_{kd} - 1/2}\, e^{-\frac{\tau_{kd}}{2} r_{kd}(m_{kd} - \mu_{kd})^2}\, e^{-\beta_{kd} r_{kd}}$$
$$\log g(m_k, r_k) = \sum_{d=1}^{D}\log\left[ r_{kd}^{\alpha_{kd} - 1/2}\, e^{-\frac{\tau_{kd}}{2} r_{kd}(m_{kd} - \mu_{kd})^2}\, e^{-\beta_{kd} r_{kd}} \right] + C$$
SCHMM Diagonal Covariance
• Diagonal covariance matrix case:
$$\frac{\partial \log N(x_t^{(m,n)} \mid m_k, r_k)}{\partial m_{kd}} = r_{kd}\,(x_{td}^{(m,n)} - m_{kd})$$
$$\frac{\partial \log g(m_k, r_k)}{\partial m_{kd}} = -\tau_{kd}\, r_{kd}\,(m_{kd} - \mu_{kd})$$
SCHMM Diagonal Covariance
• Diagonal covariance matrix case:
$$\therefore\; \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\left[ r_{kd}(x_{td}^{(m,n)} - \bar m_{kd}) \right] - \tau_{kd}\, r_{kd}(\bar m_{kd} - \mu_{kd}) = 0$$
$$\Rightarrow \left[ \sum_{m,n,t}\xi_t^{(m,n)}(k) + \tau_{kd} \right]\bar m_{kd} = \sum_{m,n,t}\xi_t^{(m,n)}(k)\, x_{td}^{(m,n)} + \tau_{kd}\,\mu_{kd}$$
$$\Rightarrow \bar m_{kd} = \frac{\tau_{kd}\,\mu_{kd} + \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\, x_{td}^{(m,n)}}{\tau_{kd} + \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)}$$
SCHMM Diagonal Covariance
• Diagonal covariance matrix case:
$$\frac{\partial \log N(x_t^{(m,n)} \mid m_k, r_k)}{\partial r_{kd}} = \tfrac{1}{2}\left[ r_{kd}^{-1} - (x_{td}^{(m,n)} - m_{kd})^2 \right]$$
SCHMM Diagonal Covariance
• Diagonal covariance matrix case:
$$\frac{\partial \log g(m_k, r_k)}{\partial r_{kd}} = \left( \alpha_{kd} - \tfrac{1}{2} \right) r_{kd}^{-1} - \frac{\tau_{kd}}{2}(m_{kd} - \mu_{kd})^2 - \beta_{kd}$$
SCHMM Diagonal Covariance
• Diagonal covariance matrix case:
$$\therefore\; \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\,\tfrac{1}{2}\left[ \bar r_{kd}^{-1} - (x_{td}^{(m,n)} - \bar m_{kd})^2 \right] + \left( \alpha_{kd} - \tfrac{1}{2} \right)\bar r_{kd}^{-1} - \frac{\tau_{kd}}{2}(\bar m_{kd} - \mu_{kd})^2 - \beta_{kd} = 0$$
$$\Rightarrow \bar r_{kd}^{-1}\left[ (2\alpha_{kd} - 1) + \sum_{m,n,t}\xi_t^{(m,n)}(k) \right] = 2\beta_{kd} + \tau_{kd}(\bar m_{kd} - \mu_{kd})^2 + \sum_{m,n,t}\xi_t^{(m,n)}(k)(x_{td}^{(m,n)} - \bar m_{kd})^2$$
$$\Rightarrow \bar r_{kd}^{-1} = \frac{2\beta_{kd} + \tau_{kd}(\bar m_{kd} - \mu_{kd})^2 + \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)(x_{td}^{(m,n)} - \bar m_{kd})^2}{(2\alpha_{kd} - 1) + \sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)}$$
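The diagonal-covariance case reduces to per-dimension scalar updates; a compact sketch under the same assumptions as above, with the normal-gamma hyperparameters passed as per-dimension vectors:

```python
import numpy as np

def map_diagonal_update(xi_k, X, tau, mu, alpha, beta):
    """MAP updates of a diagonal-precision Gaussian (per-dimension formulas).

    xi_k  : (T_total,) posterior occupancies xi_t(k)
    X     : (T_total, D) observations
    tau, mu, alpha, beta : (D,) normal-gamma prior hyperparameters
    Returns (m_bar, var_bar) where var_bar = 1 / r_bar per dimension.
    """
    occ = xi_k.sum()
    # mean: (tau*mu + sum_t xi_t x_t) / (tau + sum_t xi_t)
    m_bar = (tau * mu + (xi_k[:, None] * X).sum(axis=0)) / (tau + occ)
    # variance: (2*beta + tau*(m-mu)^2 + sum_t xi_t (x_t - m)^2) / (2*alpha - 1 + sum_t xi_t)
    sq = (xi_k[:, None] * (X - m_bar) ** 2).sum(axis=0)
    var_bar = (2 * beta + tau * (m_bar - mu) ** 2 + sq) / (2 * alpha - 1 + occ)
    return m_bar, var_bar
```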
SCHMM Diagonal Covariance
• The initial estimate can be chosen as the mode of the prior PDF:
  $\pi_i^{(m)}, a_{ij}^{(m)}, w_{ik}^{(m)}$: same as for the DHMM, and
  $$m_{kd} = \mu_{kd}, \qquad r_{kd} = \frac{\alpha_{kd} - 1/2}{\beta_{kd}}$$
• It can also be chosen as the mean of the prior PDF:
  $\pi_i^{(m)}, a_{ij}^{(m)}, w_{ik}^{(m)}$: same as for the DHMM, and
  $$m_{kd} = \mu_{kd}, \qquad r_{kd} = \frac{\alpha_{kd}}{\beta_{kd}}$$
CDHMM
• Continuous density HMM case: each state now has its own Gaussian parameters, so
$$b_i(x_t) = \sum_{k=1}^{K} w_{ik}\, N(x_t \mid m_k, r_k) \quad\Downarrow\quad b_i(x_t) = \sum_{k=1}^{K} w_{ik}\, N(x_t \mid m_{ik}, r_{ik})$$
and
$$N(x_t \mid m_k, r_k) = (2\pi)^{-D/2}|r_k|^{1/2}\, e^{-\frac{1}{2}(x_t - m_k)^T r_k (x_t - m_k)} \quad\Downarrow\quad N(x_t \mid m_{ik}, r_{ik}) = (2\pi)^{-D/2}|r_{ik}|^{1/2}\, e^{-\frac{1}{2}(x_t - m_{ik})^T r_{ik} (x_t - m_{ik})}$$
CDHMM
In the Q-function:
$$\sum_{k=1}^{K}\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(k)\,\log N(x_t^{(m,n)} \mid m_k, r_k)$$
$$\Downarrow$$
$$\sum_{i=1}^{N}\sum_{k=1}^{K}\sum_{m=1}^{M}\sum_{n=1}^{W_m}\sum_{t=1}^{T^{(m,n)}}\xi_t^{(m,n)}(i,k)\,\log N(x_t^{(m,n)} \mid m_{ik}, r_{ik})$$
CDHMM
In $\log g(\Lambda)$:
$$\sum_{k=1}^{K}\log g(m_k, r_k) \quad\Downarrow\quad \sum_{i=1}^{N}\sum_{k=1}^{K}\log g(m_{ik}, r_{ik})$$
and (full covariance case)
$$g(m_k, r_k) \propto |r_k|^{\frac{\alpha_k - D}{2}}\, e^{-\frac{\tau_k}{2}(m_k - \mu_k)^T r_k (m_k - \mu_k)}\, e^{-\frac{1}{2}\operatorname{tr}(u_k r_k)} \quad\Downarrow\quad g(m_{ik}, r_{ik}) \propto |r_{ik}|^{\frac{\alpha_{ik} - D}{2}}\, e^{-\frac{\tau_{ik}}{2}(m_{ik} - \mu_{ik})^T r_{ik} (m_{ik} - \mu_{ik})}\, e^{-\frac{1}{2}\operatorname{tr}(u_{ik} r_{ik})}$$
and (diagonal covariance case)
$$g(m_k, r_k) \propto \prod_{d=1}^{D} r_{kd}^{\alpha_{kd} - 1/2}\, e^{-\frac{\tau_{kd}}{2} r_{kd}(m_{kd} - \mu_{kd})^2}\, e^{-\beta_{kd} r_{kd}} \quad\Downarrow\quad g(m_{ik}, r_{ik}) \propto \prod_{d=1}^{D} r_{ikd}^{\alpha_{ikd} - 1/2}\, e^{-\frac{\tau_{ikd}}{2} r_{ikd}(m_{ikd} - \mu_{ikd})^2}\, e^{-\beta_{ikd} r_{ikd}}$$
Maximum Likelihood Linear Regression
MLLR Background
• Linear transformation of the original (speaker-independent) model to maximize the likelihood of the adaptation data.
• MLLR is multiplicative; MAP is additive.
• MLLR is much less conservative than MAP
  – a few seconds of data may change the model dramatically.
MLLR
Reference:
  – Leggetter and Woodland, "Speaker Adaptation of HMMs Using Linear Regression," Technical Report, 1994.
  – Leggetter and Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models," Computer Speech and Language, 1995.
  – Hamaker, "MLLR: A Speaker Adaptation Technique for LVCSR."
MLLR Single Gaussian Case
• The regression transform is first derived for a single Gaussian distribution per state, and later extended to the general case of Gaussian mixtures.
• So the p.d.f. for state $j$ is
$$b_j(x) = \frac{1}{(2\pi)^{D/2}|C_j|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_j)^T C_j^{-1}(x - \mu_j)}$$
where $\mu_j$ is the mean and $C_j$ is the covariance matrix.
MLLR Single Gaussian Case
If $\mu$ is the mean, then we define the extended mean vector
$$\xi = \begin{bmatrix} \omega & \mu_1 & \cdots & \mu_D \end{bmatrix}^T$$
where $\omega$ is the offset term for the regression.
The estimate of the adapted mean is
$$\hat\mu = A\mu + b = W\xi$$
where $W = [\,b \;\; A\,]$ is the linear transform (a $D \times (D+1)$ matrix).
If $\omega = 1$, an offset is included in the regression; if $\omega = 0$, the offset is ignored.
So
$$b_j(x) = \frac{1}{(2\pi)^{D/2}|C_j|^{1/2}}\, e^{-\frac{1}{2}(x - W_j\xi_j)^T C_j^{-1}(x - W_j\xi_j)}$$
MLLR Single Gaussian Case
• A more general approach is adopted in which the same transformation matrix is used for several distributions ⇒ a regression class.
• If some of the distributions are not observed in the adaptation data, a transformation may still be applied ⇒ models are updated whether or not corresponding adaptation data were observed.
MLLR Single Gaussian Case
• MLLR estimates the regression matrices so as to maximize the likelihood of the adapted models generating the adaptation data ⇒ maximize the likelihood to obtain the regression matrices.
• Both the full and diagonal covariance cases will be discussed.
MLLR Single Gaussian Case
Assume the adaptation data $X$ is a series of $T$ observations:
$$X = x_1, x_2, \ldots, x_T$$
Denote the current set of model parameters by $\lambda$ and a re-estimated set of model parameters by $\bar\lambda$.
$\xi$: current extended mean. $\hat\mu$: re-estimated mean.
MLLR Single Gaussian Case
The total likelihood is
$$f(X \mid \lambda) = \sum_{S} f(X, S \mid \lambda)$$
where $f(X, S \mid \lambda)$ is the likelihood of generating $X$ using the state sequence $S$ given model $\lambda$.
The quantity $f(X \mid \bar\lambda)$ is the objective function to be maximized during adaptation.
MLLR Single Gaussian Case
We define the auxiliary function
$$Q(\bar\lambda \mid \lambda) = \sum_{S} f(S \mid X, \lambda)\,\log f(X, S \mid \bar\lambda)$$
Since only the transformations $W_j$ are re-estimated, only the output distributions $b_j(x_t)$ are affected, so the auxiliary function can be written as
$$Q(\bar\lambda \mid \lambda) = \text{constant} + \sum_{S}\sum_{t=1}^{T} f(s_t = j \mid X, \lambda)\,\log \bar b_j(x_t)$$
MLLR Single Gaussian Case
We define $\gamma_j(t) = \sum_{S} f(s_t = j \mid X, \lambda)$. So the Q-function can be rewritten as
$$Q(\bar\lambda \mid \lambda) = \text{constant} + \sum_{t=1}^{T} \gamma_j(t)\,\log \bar b_j(x_t)$$
MLLR Single Gaussian Case
Expanding $\log \bar b_j(x_t)$, the auxiliary function is
$$Q(\bar\lambda \mid \lambda) = \text{constant} - \frac{1}{2}\sum_{j=1}^{N}\sum_{t=1}^{T}\gamma_j(t)\left[ D\log(2\pi) + \log|C_j| + h(x_t, j) \right]$$
where
$$h(x_t, j) = (x_t - W_j\xi_j)^T C_j^{-1}(x_t - W_j\xi_j)$$
The differential of $Q(\bar\lambda \mid \lambda)$ w.r.t. $W_s$ is
$$\frac{\partial Q(\bar\lambda \mid \lambda)}{\partial W_s} = -\frac{1}{2}\sum_{j=1}^{N}\sum_{t=1}^{T}\gamma_j(t)\,\frac{\partial}{\partial W_s}\left[ D\log(2\pi) + \log|C_j| + h(x_t, j) \right]$$
MLLR Single Gaussian Case
The differential of $h(x_t, j)$ w.r.t. $W_j$ is
$$\frac{\partial h(x_t, j)}{\partial W_j} = \frac{\partial}{\partial W_j}(x_t - W_j\xi_j)^T C_j^{-1}(x_t - W_j\xi_j)$$
$$= \frac{\partial}{\partial W_j}\left[ x_t^T C_j^{-1} x_t - \xi_j^T W_j^T C_j^{-1} x_t - x_t^T C_j^{-1} W_j\xi_j + \xi_j^T W_j^T C_j^{-1} W_j\xi_j \right]$$
$$= -C_j^{-1} x_t\xi_j^T - C_j^{-1} x_t\xi_j^T + 2\,C_j^{-1} W_j\xi_j\xi_j^T = -2\,C_j^{-1}\left[ x_t - W_j\xi_j \right]\xi_j^T$$
MLLR Single Gaussian Case
Then, completing the differentiation and equating to zero:
$$\frac{\partial Q(\bar\lambda \mid \lambda)}{\partial W_j} = \sum_{t=1}^{T}\gamma_j(t)\, C_j^{-1}\left[ x_t - W_j\xi_j \right]\xi_j^T = 0$$
$$\therefore\; \sum_{t=1}^{T}\gamma_j(t)\, C_j^{-1} x_t\xi_j^T = \sum_{t=1}^{T}\gamma_j(t)\, C_j^{-1} W_j\xi_j\xi_j^T$$
$$\Rightarrow C_j^{-1}\left( \sum_{t=1}^{T}\gamma_j(t)\, x_t \right)\xi_j^T = C_j^{-1} W_j\left( \sum_{t=1}^{T}\gamma_j(t) \right)\xi_j\xi_j^T$$
$$\Rightarrow \sum_{t=1}^{T}\gamma_j(t)\, x_t = W_j\left( \sum_{t=1}^{T}\gamma_j(t) \right)\xi_j \;\Rightarrow\; \hat\mu_j = W_j\xi_j = \frac{\sum_{t=1}^{T}\gamma_j(t)\, x_t}{\sum_{t=1}^{T}\gamma_j(t)}$$
MLLR Tied Regression Matrices
• Regression class tree for MLLR (figure omitted).
MLLR Tied Regression Matrices
Consider the $s$-th regression class $RC^{(s)} = \{s_1, \ldots, s_R\}$.
If $W_s$ is shared by the states in the regression class $RC^{(s)}$, then
$$\sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, C_{s_r}^{-1} x_t\xi_{s_r}^T = \sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, C_{s_r}^{-1} W_s\xi_{s_r}\xi_{s_r}^T$$
The left-hand side is the $D\times(D+1)$ matrix
$$Z = \sum_{r=1}^{R} C_{s_r}^{-1}\left( \sum_{t=1}^{T}\gamma_{s_r}(t)\, x_t \right)\xi_{s_r}^T$$
and the right-hand side can be written as
$$\sum_{r=1}^{R} V^{(r)}\, W_s\, D^{(r)}, \qquad V^{(r)} = \sum_{t=1}^{T}\gamma_{s_r}(t)\, C_{s_r}^{-1} \;\;(D\times D), \qquad D^{(r)} = \xi_{s_r}\xi_{s_r}^T \;\;((D+1)\times(D+1))$$
MLLR Tied Regression Matrices
If the right-hand side is denoted by the $D\times(D+1)$ matrix $Y$, then $Z = Y$, i.e. $[Z]_{ij} = [Y]_{ij}$, where
$$Y = \sum_{r=1}^{R} V^{(r)}\, W_s\, D^{(r)}, \qquad \left[ V^{(r)} W_s \right]_{ij} = \sum_{k=1}^{D}\left[ V^{(r)} \right]_{ik}\left[ W_s \right]_{kj}$$
$$[Y]_{ij} = \sum_{r=1}^{R}\sum_{q=1}^{D+1}\left[ V^{(r)} W_s \right]_{iq}\left[ D^{(r)} \right]_{qj} = \sum_{p=1}^{D}\sum_{q=1}^{D+1}\left[ W_s \right]_{pq}\sum_{r=1}^{R}\left[ V^{(r)} \right]_{ip}\left[ D^{(r)} \right]_{qj}$$
MLLR Tied Regression Matrices
If the covariance matrices are diagonal ⇒ each $V^{(r)}$ is diagonal (and $D^{(r)}$ is symmetric), so
$$\sum_{r=1}^{R}\left[ V^{(r)} \right]_{ip}\left[ D^{(r)} \right]_{qj} = \begin{cases} \sum_{r=1}^{R}\left[ V^{(r)} \right]_{ii}\left[ D^{(r)} \right]_{qj} & i = p \\ 0 & i \neq p \end{cases}$$
Defining $G^{(i)} = \sum_{r=1}^{R}\left[ V^{(r)} \right]_{ii}\, D^{(r)}$,
$$\therefore\; [Z]_{ij} = [Y]_{ij} = \sum_{q=1}^{D+1}\left[ W_s \right]_{iq}\left[ G^{(i)} \right]_{qj}$$
MLLR Tied Regression Matrices
Then we can obtain row $i$ of $W_s$ by solving the linear equations below:
$$\begin{cases}
[G^{(i)}]_{1,1}[W_s]_{i,1} + [G^{(i)}]_{1,2}[W_s]_{i,2} + \cdots + [G^{(i)}]_{1,D+1}[W_s]_{i,D+1} = [Z]_{i,1} & (j = 1)\\
[G^{(i)}]_{2,1}[W_s]_{i,1} + [G^{(i)}]_{2,2}[W_s]_{i,2} + \cdots + [G^{(i)}]_{2,D+1}[W_s]_{i,D+1} = [Z]_{i,2} & (j = 2)\\
\quad\vdots\\
[G^{(i)}]_{D+1,1}[W_s]_{i,1} + [G^{(i)}]_{D+1,2}[W_s]_{i,2} + \cdots + [G^{(i)}]_{D+1,D+1}[W_s]_{i,D+1} = [Z]_{i,D+1} & (j = D+1)
\end{cases}$$
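A sketch of the whole estimation for the diagonal-covariance case, accumulating Z and the G^(i) matrices from per-Gaussian statistics and then solving for each row of W; the statistic layout (occupancies and first-order sums per tied Gaussian) is an assumed convention, not from the slides:

```python
import numpy as np

def estimate_mllr_transform(gammas, stats_x, means, variances):
    """Estimate a shared MLLR mean transform W (diagonal-covariance case).

    gammas    : (R,) total occupancy of each tied Gaussian, sum_t gamma_r(t)
    stats_x   : (R, D) first-order statistics, sum_t gamma_r(t) x_t
    means     : (R, D) current means mu_r
    variances : (R, D) diagonal covariances
    Returns W of shape (D, D+1) with the offset in the first column.
    """
    R, D = means.shape
    # extended mean vectors  xi_r = [1, mu_r]
    xis = np.hstack([np.ones((R, 1)), means])                       # (R, D+1)

    # Z = sum_r C_r^{-1} (sum_t gamma_r(t) x_t) xi_r^T
    Z = np.zeros((D, D + 1))
    # G^(i) = sum_r [V^(r)]_ii D^(r),  V^(r) = gamma_r C_r^{-1},  D^(r) = xi_r xi_r^T
    G = np.zeros((D, D + 1, D + 1))
    for r in range(R):
        inv_var = 1.0 / variances[r]                                # (D,)
        Z += (inv_var * stats_x[r])[:, None] * xis[r][None, :]
        Dr = np.outer(xis[r], xis[r])                               # (D+1, D+1)
        G += (gammas[r] * inv_var)[:, None, None] * Dr[None, :, :]

    # each row i of W solves  G^(i) w_i^T = Z_i^T  (G^(i) is symmetric)
    W = np.vstack([np.linalg.solve(G[i], Z[i]) for i in range(D)])
    return W
```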
MLLR Tied Regression Matrices
If the covariance matrices are still full, we can obtain $W_s$ by solving the larger set of linear equations
$$\begin{cases}
[G^{(1,1)}]_{1,1}[W_s]_{1,1} + \cdots + [G^{(1,1)}]_{D,D+1}[W_s]_{D,D+1} = [Z]_{1,1} & (i = 1,\; j = 1)\\
[G^{(1,2)}]_{1,1}[W_s]_{1,1} + \cdots + [G^{(1,2)}]_{D,D+1}[W_s]_{D,D+1} = [Z]_{1,2} & (i = 1,\; j = 2)\\
\quad\vdots\\
[G^{(D,D+1)}]_{1,1}[W_s]_{1,1} + \cdots + [G^{(D,D+1)}]_{D,D+1}[W_s]_{D,D+1} = [Z]_{D,D+1} & (i = D,\; j = D+1)
\end{cases}$$
with $G^{(i,j)}_{pq} = \sum_{r=1}^{R}[V^{(r)}]_{ip}[D^{(r)}]_{qj}$.
MLLR Mixture Gaussian Case
• Then the p.d.f. for state $j$ is
$$b_j(x) = \sum_{k=1}^{K} w_{jk}\,\frac{1}{(2\pi)^{D/2}|C_{jk}|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_{jk})^T C_{jk}^{-1}(x - \mu_{jk})}$$
where $\mu_{jk}$ is the mean, $C_{jk}$ is the covariance matrix and $w_{jk}$ is the mixture weight.
• Then the likelihood is
$$f(X \mid \lambda) = \sum_{S}\sum_{L} f(X, S, L \mid \lambda)$$
where $S$ is one possible state sequence and $L$ is one possible mixture-component sequence.
MLLR Mixture Gaussian Case
• Then the Q-function is
$$Q(\bar\lambda \mid \lambda) = \sum_{S}\sum_{L} f(S, L \mid X, \lambda)\,\log f(X, S, L \mid \bar\lambda)$$
Considering only the terms that depend on the regression transform,
$$Q_b(\bar\lambda \mid \lambda) = \sum_{S}\sum_{L}\sum_{t=1}^{T} f(s_t = j, l_t = k \mid X, \lambda)\,\log \bar b_{jk}(x_t) = \sum_{t=1}^{T}\gamma_{jk}(t)\,\log \bar b_{jk}(x_t)$$
where $\gamma_{jk}(t) = \sum_{S}\sum_{L} f(s_t = j, l_t = k \mid X, \lambda)$.
• The derivation is the same as in the single Gaussian case; just substitute $\gamma_{jk}(t)$ for $\gamma_j(t)$.
MLLR Least Squares Regression
• If all the covariances of the distributions tied to the same transformation are equal ⇒ a special case of MLLR. Then
$$\sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, C_{s_r}^{-1} x_t\xi_{s_r}^T = \sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, C_{s_r}^{-1} W_s\xi_{s_r}\xi_{s_r}^T$$
can be rewritten as
$$\sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, x_t\xi_{s_r}^T = \sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, W_s\xi_{s_r}\xi_{s_r}^T$$
MLLR Least Squares Regression
• If each frame is assigned to exactly one distribution (Viterbi alignment),
$$\gamma_{s_r}(t) = \begin{cases} 1 & \text{if } x_t \text{ is assigned to state distribution } s_r \\ 0 & \text{otherwise} \end{cases}$$
• Then
$$\sum_{t=1}^{T}\delta_{RC^{(n)}, s_t}\, x_t\xi_{s_t}^T = \sum_{t=1}^{T}\delta_{RC^{(n)}, s_t}\, W_s\xi_{s_t}\xi_{s_t}^T, \qquad \delta_{RC^{(n)}, s_t} = \begin{cases} 1 & s_t \in RC^{(n)} \\ 0 & \text{otherwise} \end{cases}$$
MLLR Least Squares Regression
Define matrices $X$ and $Y$ as
$$X = \left[ \xi_{s_1}, \xi_{s_2}, \ldots, \xi_{s_T} \right], \qquad Y = \left[ \delta_{RC^{(n)}, s_1} x_1,\; \delta_{RC^{(n)}, s_2} x_2,\; \ldots,\; \delta_{RC^{(n)}, s_T} x_T \right]$$
then
$$W_s X X^T = Y X^T \;\Rightarrow\; W_s = Y X^T (X X^T)^{-1}$$
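A minimal least-squares sketch under a Viterbi alignment, where only the frames whose aligned state belongs to the regression class contribute; the masking convention and function interface are assumptions:

```python
import numpy as np

def least_squares_transform(frames, ext_means, in_class):
    """Least-squares estimate of W for one regression class (Viterbi alignment).

    frames    : (T, D)   observation vectors x_t
    ext_means : (T, D+1) extended mean [1, mu] of the state aligned to frame t
    in_class  : (T,)     boolean mask, True if the aligned state is in the class
    Returns W of shape (D, D+1).
    """
    X = ext_means[in_class].T          # (D+1, T')  columns are xi_{s_t}
    Y = frames[in_class].T             # (D,   T')  columns are x_t
    # W = Y X^T (X X^T)^{-1}
    return Y @ X.T @ np.linalg.inv(X @ X.T)
```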
MLLR Single Variable Linear Regression
• If the scaling portion of the regression matrix is assumed to be diagonal, the computation can be vastly reduced. It means that
$$\hat\mu_i = x + y\,\mu_i$$
$$\therefore\; W_s = \begin{bmatrix} w_{1,1} & w_{1,2} & 0 & \cdots & 0 \\ w_{2,1} & 0 & w_{2,3} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{D,1} & 0 & 0 & \cdots & w_{D,D+1} \end{bmatrix}_{D\times(D+1)} \;\Rightarrow\; w_s = \begin{bmatrix} w_{1,1} \\ \vdots \\ w_{D,1} \\ w_{1,2} \\ w_{2,3} \\ \vdots \\ w_{D,D+1} \end{bmatrix}_{2D\times 1}$$
MLLR Single Variable Linear Regression
And define a $D\times 2D$ matrix $D_s$ that pairs the offset $\omega$ and the mean component $\mu_i$ for each dimension:
$$D_s = \begin{bmatrix} \omega & 0 & \cdots & 0 & \mu_1 & 0 & \cdots & 0 \\ 0 & \omega & \cdots & 0 & 0 & \mu_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \omega & 0 & 0 & \cdots & \mu_D \end{bmatrix}$$
so that
$$h(x_t, s) = (x_t - W_s\xi_s)^T C_s^{-1}(x_t - W_s\xi_s) = (x_t - D_s w_s)^T C_s^{-1}(x_t - D_s w_s)$$
MLLR Single Variable Linear Regression
$$\frac{\partial h(x_t, s)}{\partial w_s} = \frac{\partial}{\partial w_s}(x_t - D_s w_s)^T C_s^{-1}(x_t - D_s w_s) = -2\, D_s^T C_s^{-1}(x_t - D_s w_s)$$
$$\therefore\; \frac{\partial Q(\bar\lambda \mid \lambda)}{\partial w_s} = \sum_{t=1}^{T}\gamma_s(t)\, D_s^T C_s^{-1}(x_t - D_s w_s) = 0$$
$$\Rightarrow D_s^T C_s^{-1}\sum_{t=1}^{T}\gamma_s(t)\, x_t = \left[ \sum_{t=1}^{T}\gamma_s(t) \right] D_s^T C_s^{-1} D_s\, w_s$$
$$\Rightarrow w_s = \left[ \sum_{t=1}^{T}\gamma_s(t)\, D_s^T C_s^{-1} D_s \right]^{-1}\left[ D_s^T C_s^{-1}\sum_{t=1}^{T}\gamma_s(t)\, x_t \right]$$
MLLR Single Variable Linear Regression
The extension to the tied regression matrix case is
$$\sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, D_{s_r}^T C_{s_r}^{-1} x_t = \sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, D_{s_r}^T C_{s_r}^{-1} D_{s_r}\, w_s$$
$$\Rightarrow w_s = \left[ \sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, D_{s_r}^T C_{s_r}^{-1} D_{s_r} \right]^{-1}\left[ \sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)\, D_{s_r}^T C_{s_r}^{-1} x_t \right]$$
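A sketch of the single-variable (diagonal-scaling) estimate for a tied class, building D_s = [ωI | diag(μ)] per Gaussian and accumulating the normal equations; the parameter ordering (offsets first, then scales) is one consistent choice, not prescribed by the slides:

```python
import numpy as np

def single_variable_mllr(gammas, stats_x, means, variances, omega=1.0):
    """Diagonal-scaling ("single variable") MLLR for one tied regression class.

    gammas    : (R,) occupancies sum_t gamma_r(t)
    stats_x   : (R, D) first-order statistics sum_t gamma_r(t) x_t
    means     : (R, D) current means
    variances : (R, D) diagonal covariances
    omega     : offset flag (1.0 to include a bias term, 0.0 to ignore it)
    Returns w of length 2D: first the D offsets, then the D scales.
    """
    R, D = means.shape
    lhs = np.zeros((2 * D, 2 * D))
    rhs = np.zeros(2 * D)
    for r in range(R):
        # D_r = [omega * I | diag(mu_r)],  C_r^{-1} = diag(1 / var_r)
        D_r = np.hstack([omega * np.eye(D), np.diag(means[r])])
        inv_C = np.diag(1.0 / variances[r])
        lhs += gammas[r] * D_r.T @ inv_C @ D_r
        rhs += D_r.T @ inv_C @ stats_x[r]
    return np.linalg.solve(lhs, rhs)
```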
MLLR Defining Regression Classes
• Two approaches were considered:
  – 1. Based on broad phonetic classes.
    • Models that represent the same broad phonetic class were placed in the same regression class.
  – 2. Based on clustering of mixture components.
    • The mixture components were compared using a likelihood measure and similar components were placed in the same regression class.
• The data-driven approach was found to be more appropriate for defining large numbers of classes.
MLLR Variance Adaptation
Reference:
  – Gales, "Variance Compensation Within the MLLR Framework for Robust Speech Recognition and Speaker Adaptation," ICSLP, 1996.
  – Gales and Woodland, "Mean and Variance Adaptation Within the MLLR Framework," Computer Speech and Language, 1996.
  – Hamaker, "MLLR: A Speaker Adaptation Technique for LVCSR."
MLLR Variance Adaptation: Single Gaussian Case
• We apply a Cholesky decomposition to the inverse of the covariance matrix:
$$C_s^{-1} = L_s L_s^T$$
where $L_s$ is a lower triangular matrix, so $C_s = L_s^{-T} L_s^{-1}$.
• We can observe that $\left[ C_s^{-1} \right]_{ij} = \sum_{d=1}^{D}\left[ L_s \right]_{id}\left[ L_s \right]_{jd}$.
• Now the inverse covariance matrix is updated by
$$\bar C_s^{-1} = L_s H_s^{-1} L_s^T$$
where $H_s$ is the linear transformation. So
$$\bar C_s = L_s^{-T} H_s L_s^{-1}$$
MLLR Variance Adaptation: Single Gaussian Case
• What does the transformation mean?
  – Original: $\left[ C_s^{-1} \right]_{ij} = \sum_{d=1}^{D}\left[ L_s \right]_{id}\left[ L_s \right]_{jd}$
  – New: $\left[ \bar C_s^{-1} \right]_{ij} = \sum_{d=1}^{D}\sum_{k=1}^{D}\left[ L_s \right]_{ik}\left[ H_s^{-1} \right]_{kd}\left[ L_s \right]_{jd}$
(Figure: row $i$ of $L_s$, the matrix $H_s^{-1}$, and column $j$ of $L_s^T$.)
MLLR Variance Adaptation: Single Gaussian Case
• The auxiliary function can be obtained as
$$Q(\bar\lambda \mid \lambda) = \text{constant} - \frac{1}{2}\sum_{j=1}^{N}\sum_{t=1}^{T}\gamma_j(t)\left[ D\log(2\pi) + \log|\bar C_j| + (x_t - \mu_j)^T\bar C_j^{-1}(x_t - \mu_j) \right]$$
$$= \text{constant} - \frac{1}{2}\sum_{j=1}^{N}\sum_{t=1}^{T}\gamma_j(t)\left[ D\log(2\pi) + \log|L_j^{-T} H_j L_j^{-1}| + (x_t - \mu_j)^T L_j H_j^{-1} L_j^T (x_t - \mu_j) \right]$$
Since $|L_j^{-T}|\cdot|H_j|\cdot|L_j^{-1}| = |L_j^{-T} L_j^{-1}|\cdot|H_j| = |C_j|\cdot|H_j|$,
$$Q(\bar\lambda \mid \lambda) = \text{constant} - \frac{1}{2}\sum_{j=1}^{N}\sum_{t=1}^{T}\gamma_j(t)\left[ D\log(2\pi) + \log|C_j| + \log|H_j| + (L_j^T x_t - L_j^T\mu_j)^T H_j^{-1}(L_j^T x_t - L_j^T\mu_j) \right]$$
MLLR Variance Adaptation: Single Gaussian Case
• Differentiate the Q-function w.r.t. $H_j$ and set it to zero:
$$\frac{\partial}{\partial H_j}\sum_{t=1}^{T}\gamma_j(t)\left[ \log|H_j| + (L_j^T x_t - L_j^T\mu_j)^T H_j^{-1}(L_j^T x_t - L_j^T\mu_j) \right] = 0$$
$$\Rightarrow \sum_{t=1}^{T}\gamma_j(t)\left[ H_j^{-T} - H_j^{-T}(L_j^T x_t - L_j^T\mu_j)(L_j^T x_t - L_j^T\mu_j)^T H_j^{-T} \right] = 0$$
$$\Rightarrow \sum_{t=1}^{T}\gamma_j(t)\, H_j^{T} = \sum_{t=1}^{T}\gamma_j(t)\,(L_j^T x_t - L_j^T\mu_j)(L_j^T x_t - L_j^T\mu_j)^T$$
$$\Rightarrow H_j^T = \frac{\sum_{t=1}^{T}\gamma_j(t)\,(L_j^T x_t - L_j^T\mu_j)(L_j^T x_t - L_j^T\mu_j)^T}{\sum_{t=1}^{T}\gamma_j(t)}$$
MLLR Variance Adaptation: Single Gaussian Case
$$\therefore\; H_j^T = \frac{\sum_{t=1}^{T}\gamma_j(t)\,(L_j^T x_t - L_j^T\mu_j)(L_j^T x_t - L_j^T\mu_j)^T}{\sum_{t=1}^{T}\gamma_j(t)} = \frac{L_j^T\left[ \sum_{t=1}^{T}\gamma_j(t)\,(x_t - \mu_j)(x_t - \mu_j)^T \right] L_j}{\sum_{t=1}^{T}\gamma_j(t)}$$
We can observe that $H_j^T$ is symmetric, so $H_j = H_j^T$.
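A small sketch of the variance transform for one Gaussian, using the Cholesky factor of the inverse covariance; the function interface is hypothetical:

```python
import numpy as np

def variance_transform(gamma_j, X, mu_j, C_j):
    """Estimate the MLLR variance transform H for one Gaussian.

    gamma_j : (T,) occupancies gamma_j(t)
    X       : (T, D) observations
    mu_j    : (D,) (already adapted) mean
    C_j     : (D, D) current covariance
    Returns (H, C_new) with C_new = L^{-T} H L^{-1}.
    """
    # C_j^{-1} = L L^T  with L lower triangular
    L = np.linalg.cholesky(np.linalg.inv(C_j))
    diff = X - mu_j
    scatter = (gamma_j[:, None, None] *
               diff[:, :, None] * diff[:, None, :]).sum(axis=0)
    H = L.T @ scatter @ L / gamma_j.sum()
    L_inv = np.linalg.inv(L)
    C_new = L_inv.T @ H @ L_inv
    return H, C_new
```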
MLLR Variance Adaptation: Tied Regression Matrices Case
If $H_s$ is shared by $R$ states $\{s_1, \ldots, s_R\}$, then
$$H_s = \frac{\sum_{r=1}^{R} L_{s_r}^T\left[ \sum_{t=1}^{T}\gamma_{s_r}(t)\,(x_t - \mu_{s_r})(x_t - \mu_{s_r})^T \right] L_{s_r}}{\sum_{r=1}^{R}\sum_{t=1}^{T}\gamma_{s_r}(t)}$$
MLLR: Another Approach
MLLR: Another Approach
• Reference:
  – Vassilios V. Digalakis, "Speaker Adaptation Using Constrained Estimation of Gaussian Mixtures," IEEE Transactions on Speech and Audio Processing, 1995.
Introduction
• This approach is an extension of model-space MLLR in which the covariances of the Gaussian components are constrained to share the same transform as the means.
• The transformed means and variances are given as a function of the transform parameters:
$$\hat\mu = A\mu + b, \qquad \hat\Sigma = A\Sigma A^T$$
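A trivial sketch of applying such a constrained transform to one Gaussian; the names are illustrative:

```python
import numpy as np

def apply_constrained_transform(A, b, mu0, Sigma0):
    """Apply a constrained (mean + covariance) transform to one Gaussian.

    A      : (D, D) transform matrix
    b      : (D,)   bias vector
    mu0    : (D,)   original (speaker-independent) mean
    Sigma0 : (D, D) original covariance
    Returns the adapted (mu, Sigma) with mu = A mu0 + b and Sigma = A Sigma0 A^T.
    """
    mu = A @ mu0 + b
    Sigma = A @ Sigma0 @ A.T
    return mu, Sigma
```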
Single Gaussian Case
• Assume the adaptation data $X$ is a series of $T$ observations: $X = x_1, x_2, \ldots, x_T$.
• For each state $s$:
  – Denote the initial model by $\lambda_s^{(0)} = (\mu_s^{(0)}, \Sigma_s^{(0)})$.
  – The current set of model parameters, obtained by applying the transformation $(A_s, b_s)$, is
$$\lambda_s = \left( A_s\mu_s^{(0)} + b_s,\; A_s\Sigma_s^{(0)} A_s^T \right)$$
Single Gaussian Case
• The re-estimated set of model parameters, obtained by applying the transformation $(\bar A_s, \bar b_s)$, is
$$\bar\lambda_s = \left( \bar A_s\mu_s^{(0)} + \bar b_s,\; \bar A_s\Sigma_s^{(0)}\bar A_s^T \right)$$
• We denote the parameter sets
$$\Lambda = \{\mu_1^{(0)}, \mu_2^{(0)}, \ldots, \mu_{N_s}^{(0)}, \Sigma_1^{(0)}, \Sigma_2^{(0)}, \ldots, \Sigma_{N_s}^{(0)}\}, \qquad \eta = \{A_1, A_2, \ldots, A_{N_s}, b_1, b_2, \ldots, b_{N_s}\}$$
where $N_s$ is the total number of states.