Towards Characterization of Identifiability of Profile HMMs
Srilakshmi Pattabiraman
University of Illinois, Urbana-Champaign
April 26, 2018
Joint work with Prof. Tandy Warnow.
1/11
Introduction
◮ A statistically consistent estimator $\hat{\theta}_0$ of a parameter $\theta_0$ (an asymptotic property) is one that converges to the correct value $\theta_0$ as the amount of available data grows without bound.
◮ A necessary condition for any estimator to be statistically consistent is that the evolutionary model is identifiable.
◮ Identifiability: given the set of sequence profiles generated on a model tree, together with their probabilities of occurrence, can the underlying evolutionary model be determined uniquely?
◮ In particular, if two distinct models generate the same sequence profiles with the same probabilities, the model is not identifiable!
2/11
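The identifiability bullet can be written as a one-line condition. This is the usual textbook formulation, not notation taken from the slides: the parameter space $\Theta$ and the distribution $P_\theta$ over sequence profiles induced by parameter $\theta$ are symbols assumed here for illustration.

```latex
\[
\{P_\theta : \theta \in \Theta\} \text{ is identifiable}
\quad\Longleftrightarrow\quad
\bigl( P_{\theta} = P_{\theta'} \implies \theta = \theta' \bigr)
\ \text{ for all } \theta, \theta' \in \Theta .
\]
```

Read contrapositively, exhibiting two parameters $\theta \neq \theta'$ with $P_\theta = P_{\theta'}$ is enough to show that a model is not identifiable.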
Central Question Are all profile HMMs identifiable? Figure 1: The standard profile HMM. ◮ φ : 1 path, A : 2 n + 1 paths, AA : n ( n − 1) + (2 n + 1)( n + 1) 3/11 2
Profile HMMs without deletion nodes
Figure 2: Profile HMM with no deletion nodes.
Theorem. The model is identifiable iff no match state has the same distribution as the insertion states.
4/11
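The condition in the theorem can be checked mechanically once emission distributions are given. The following is a minimal sketch, assuming the insertion states share a single emission distribution (uniform 1/4 over A, C, G, T, as the later proof slides suggest). The function name `violating_match_states` and the tolerance `tol` are hypothetical choices for illustration.

```python
ALPHABET = "ACGT"

def violating_match_states(match_emissions, insert_emission, tol=1e-9):
    """Return 1-based indices of match states whose emission distribution
    equals the insertion distribution (up to tol). By the theorem, the
    no-deletion model is identifiable iff this list is empty."""
    bad = []
    for i, dist in enumerate(match_emissions, start=1):
        if all(abs(dist[s] - insert_emission[s]) <= tol for s in ALPHABET):
            bad.append(i)
    return bad

# Toy input: the second match state is uniform, like the insertion states.
uniform = {s: 0.25 for s in ALPHABET}
matches = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(violating_match_states(matches, uniform))
```

An empty result means the theorem's condition holds; any listed index points at a match state that cannot be told apart from an insertion state by its emissions.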
Proof
Theorem. The model is identifiable iff no match state has the same distribution as the insertion states.
◮ The sequence with the minimum length defines the topology.
◮ $z^i_A = \dfrac{p_{?[i-1]\,A\,?[n-i]}}{\sum_{X \in \{A,T,G,C\}} p_{?[i-1]\,X\,?[n-i]}}$
◮ $p_{A*} = x_1 z^1_A + (1 - x_1)\frac{1}{4}$
Figure 3: Finding $x_1$.
5/11
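The two formulas on this slide can be sketched numerically. The code below assumes the probabilities of the minimum-length sequences are available as a dictionary, and that $p_{A*}$ (the probability that a sequence starts with A) is known. The names `match_emission`, `solve_x1`, and the toy numbers are hypothetical; solving for $x_1$ requires $z^1_A \neq 1/4$, which is where the theorem's condition enters.

```python
Q = 0.25  # insertion-state emission probability, from the 1/4 factors on the slide

def match_emission(min_len_probs, i, letter):
    """z^i_letter: probability mass of minimum-length sequences with `letter`
    at position i (1-based), normalised by the mass of all of them."""
    total = sum(min_len_probs.values())
    hit = sum(p for seq, p in min_len_probs.items() if seq[i - 1] == letter)
    return hit / total

def solve_x1(p_first, z1):
    """Invert p_{A*} = x1 * z1 + (1 - x1) * 1/4 for x1."""
    if abs(z1 - Q) < 1e-12:
        raise ValueError("z1 equals 1/4, so x1 cannot be recovered from this letter")
    return (p_first - Q) / (z1 - Q)

# Toy model with two match states; w is a made-up weight of the all-match path.
z1 = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
z2 = {"A": 0.2, "C": 0.2, "G": 0.1, "T": 0.5}
w = 0.3
min_len_probs = {a + b: w * z1[a] * z2[b] for a in "ACGT" for b in "ACGT"}

z1_a = match_emission(min_len_probs, 1, "A")
p_a_star = 0.8 * z1_a + (1 - 0.8) * Q  # forward value for a true x1 of 0.8
print(z1_a, solve_x1(p_a_star, z1_a))
```

If the first match state were uniform, every letter would give $z^1_X = 1/4$ and the inversion would fail for all of them, which matches the theorem's statement.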
Proof
◮ $p_{?A*} = x_1\left(x_2 z^2_A + (1 - x_2)\frac{1}{4}\right) + (1 - x_1)\left(y_1 z^1_A + (1 - y_1)\frac{1}{4}\right)$
◮ $p_{?T*} = x_1\left(x_2 z^2_T + (1 - x_2)\frac{1}{4}\right) + (1 - x_1)\left(y_1 z^1_T + (1 - y_1)\frac{1}{4}\right)$
Figure 4: Finding $x_2$, $y_1$.
◮ $p_{?[m-1]A*} = f_{m,1}\left(x_1^{m-1}, y_1^{m-2}\right) + p(m : m-1)\left(x_m z^m_A + (1 - x_m)\frac{1}{4}\right) + p(i : m-2)\left(y_{m-1} z^{m-1}_A + (1 - y_{m-1})\frac{1}{4}\right)$
◮ $p_{?[m-1]T*} = f_{m,2}\left(x_1^{m-1}, y_1^{m-2}\right) + p(m : m-1)\left(x_m z^m_T + (1 - x_m)\frac{1}{4}\right) + p(i : m-2)\left(y_{m-1} z^{m-1}_T + (1 - y_{m-1})\frac{1}{4}\right)$
6/11
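The first two equations on this slide are linear in $x_2$ and $y_1$ once $x_1$ and the match emissions are known: subtracting 1/4 from each side gives $p_{?X*} - \frac14 = x_1 (z^2_X - \frac14)\,x_2 + (1 - x_1)(z^1_X - \frac14)\,y_1$ for $X \in \{A, T\}$. The sketch below solves that 2×2 system with Cramer's rule. It assumes the system is nonsingular, which the slides do not state; the function names and toy values are hypothetical.

```python
Q = 0.25  # insertion-state emission probability

def second_position_prob(x1, x2, y1, z1, z2):
    """Forward formula: probability that the second emitted letter is X,
    where z1 and z2 are the match emissions z^1_X and z^2_X."""
    return x1 * (x2 * z2 + (1 - x2) * Q) + (1 - x1) * (y1 * z1 + (1 - y1) * Q)

def solve_x2_y1(x1, z1, z2, p_a, p_t):
    """Recover (x2, y1) from p_{?A*} and p_{?T*}; z1 and z2 are dicts of
    match emissions for the first and second match state."""
    a_a, b_a = x1 * (z2["A"] - Q), (1 - x1) * (z1["A"] - Q)
    a_t, b_t = x1 * (z2["T"] - Q), (1 - x1) * (z1["T"] - Q)
    det = a_a * b_t - b_a * a_t
    if abs(det) < 1e-12:
        raise ValueError("singular system: A and T do not separate x2 and y1")
    r_a, r_t = p_a - Q, p_t - Q
    x2 = (r_a * b_t - b_a * r_t) / det
    y1 = (a_a * r_t - a_t * r_a) / det
    return x2, y1

# Toy check: generate the two probabilities from known parameters, then invert.
z1 = {"A": 0.7, "T": 0.1}
z2 = {"A": 0.2, "T": 0.5}
p_a = second_position_prob(0.8, 0.6, 0.5, z1["A"], z2["A"])
p_t = second_position_prob(0.8, 0.6, 0.5, z1["T"], z2["T"])
print(solve_x2_y1(0.8, z1, z2, p_a, p_t))
```

When the determinant vanishes for the pair (A, T), another pair of letters may still separate the two unknowns; the sketch does not attempt that fallback.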
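Proof
Figure 5: Two models that produce the same sequence profiles.
◮ $p_A = x_1 \frac{1}{4} x_2$
◮ $p_{AA} = (1 - x_1)\frac{1}{4}\, y_1 \frac{1}{4}\, x_2 + x_1 \frac{1}{4} (1 - x_2)\frac{1}{4}\, y_2$
◮ $p_{A[n]} = x_1 \frac{1}{4} (1 - x_2)\frac{1}{4} (1 - y_2)^{n-2} \left(\frac{1}{4}\right)^{n-2} y_2 + (1 - x_1)\frac{1}{4} (1 - y_1)^{n-2} \left(\frac{1}{4}\right)^{n-2} y_1 \frac{1}{4}\, x_2 + \sum_{n_1 + n_2 = n - 3} (1 - x_1)\frac{1}{4} (1 - y_1)^{n_1} \left(\frac{1}{4}\right)^{n_1} y_1 \frac{1}{4} (1 - x_2)\frac{1}{4} (1 - y_2)^{n_2} \left(\frac{1}{4}\right)^{n_2} y_2$
7/11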
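The formulas above can be evaluated to see non-identifiability numerically. The sketch below assumes the structure the formulas imply: one match state, an insertion state on each side, and every state emitting uniformly with probability 1/4. Under that assumption every length-$n$ sequence has the same probability as $A[n]$, so comparing $p_{A[n]}$ for all $n$ compares the whole distribution. The parameter swap $(x_1, y_1) \leftrightarrow (x_2, y_2)$ is one pair of distinct models chosen here for illustration; it is not necessarily the pair drawn in Figure 5. The name `prob_all_a` is hypothetical.

```python
import math

Q = 0.25  # uniform emission probability for every state (assumption from the 1/4 factors)

def prob_all_a(n, x1, y1, x2, y2):
    """p_{A[n]}: probability of the sequence consisting of n copies of A."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if n == 1:
        return x1 * Q * x2
    left = (1 - x1) * Q   # enter the insertion state before the match state, emit once
    right = (1 - x2) * Q  # enter the insertion state after the match state, emit once
    # Match state first, then n - 1 insertions after it.
    total = x1 * Q * right * ((1 - y2) * Q) ** (n - 2) * y2
    # n - 1 insertions first, then the match state, then the end.
    total += left * ((1 - y1) * Q) ** (n - 2) * y1 * Q * x2
    # Insertions on both sides: n1 and n2 extra self-loops, n1 + n2 = n - 3.
    for n1 in range(n - 2):
        n2 = n - 3 - n1
        total += (left * ((1 - y1) * Q) ** n1 * y1 * Q
                  * right * ((1 - y2) * Q) ** n2 * y2)
    return total

model_a = (0.3, 0.6, 0.7, 0.2)  # (x1, y1, x2, y2)
model_b = (0.7, 0.2, 0.3, 0.6)  # the same values with the two sides swapped
for n in range(1, 9):
    pa, pb = prob_all_a(n, *model_a), prob_all_a(n, *model_b)
    print(n, math.isclose(pa, pb, rel_tol=1e-9))
```

The two parameter vectors differ, yet the printed comparisons agree for every length tried, which is the situation the theorem excludes by requiring match emissions to differ from the insertion distribution.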
Proof Figure 6: Two models that produce the same sequence profiles. 8/11
What about the standard profile HMMs?
◮ Unfortunately, these methods don’t extend.
◮ Finding the number of match states itself is non-trivial.
◮ Standard ML tricks may not work!
◮ Maybe they are unidentifiable?
9/11
Bad news!
Figure 7: Standard profile HMM with one match state.
◮ If we know that the profile HMM has only one match state, then the model can be completely characterized.
10/11
Thank you! 11/11