  1. Robust Hidden Markov Models Inference in the Presence of Label Noise. Benoît Frénay, 25 August 2014

  2. Machine Learning in a Nutshell

  3. Challenges in Machine Learning: Robust Inference

  8. Overview of the Presentation. Segmentation of electrocardiogram signals: goal: allow automated diagnosis of heart disease; tools: hidden Markov models and wavelet transform; issue: robustness to label noise (i.e. expert errors); solution: modelling of expert behaviour.

  9. Electrocardiogram Signal Segmentation

  10. What is an Electrocardiogram Signal? An ECG is a measure of the electrical activity of the human heart. Patterns of interest: P wave, QRS complex, T wave, baseline.

  11. Where Does it Come from? The ECG results from the superposition of several signals.

  12. What it Looks Like in Real-World Cases. Real ECGs are polluted by various sources of noise.

  13. What is our Goal in ECG Segmentation? Task: split/segment an entire ECG into patterns. Available data: a few manual segmentations from experts. Issue: some of the annotations of the experts are incorrect. Approach: a probabilistic model of sequences with labels, i.e. hidden Markov models (with wavelet transform).

  14. Hidden Markov Models

  16. Hidden Markov Models in a Nutshell. Hidden Markov models (HMMs) are probabilistic models of sequences. S_1, ..., S_T is the sequence of annotations (e.g. the state of the heart), governed by the transition model P(S_t = s_t | S_{t-1} = s_{t-1}). O_1, ..., O_T is the sequence of observations (e.g. the measured voltage), governed by the emission model P(O_t = o_t | S_t = s_t).
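
A minimal sketch (not from the talk) of the generative process just described: hidden states evolve according to P(S_t | S_{t-1}) and each observation is drawn from P(O_t | S_t). The state names, parameter values and the choice of Gaussian emissions are illustrative assumptions.

```python
# Sketch of the HMM generative process; all numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

states = ["baseline", "P wave", "QRS", "T wave"]   # hypothetical ECG-like states
q = np.array([0.7, 0.1, 0.1, 0.1])                 # q_i = P(S_1 = i), prior probabilities
a = np.array([[0.90, 0.05, 0.03, 0.02],            # a_ij = P(S_t = j | S_{t-1} = i)
              [0.10, 0.85, 0.05, 0.00],
              [0.05, 0.00, 0.85, 0.10],
              [0.15, 0.00, 0.00, 0.85]])
b = [(0.0, 0.1), (0.2, 0.1), (1.0, 0.3), (0.4, 0.1)]  # (mean, std) of Gaussian emission b_i

def sample_hmm(T):
    """Draw a state sequence S_1..S_T and an observation sequence O_1..O_T."""
    s = [rng.choice(len(states), p=q)]
    for _ in range(T - 1):
        s.append(rng.choice(len(states), p=a[s[-1]]))
    o = [rng.normal(*b[i]) for i in s]
    return np.array(s), np.array(o)

s, o = sample_hmm(200)
```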

  17. Hypotheses Behind Hidden Markov Models (1). Markov hypothesis: the next state depends only on the current state.

  18. Hypotheses Behind Hidden Markov Models (2). Observations are conditionally independent w.r.t. the hidden states: P(O_1, ..., O_T | S_1, ..., S_T) = ∏_{t=1}^{T} P(O_t | S_t).

  19. Learning Hidden Markov Models. Learning an HMM means estimating probabilities: P(S_1) are the prior probabilities, P(S_t | S_{t-1}) are the transition probabilities, P(O_t | S_t) are the emission probabilities. Parameters Θ = (q, a, b): q_i is the prior probability of state i, a_{ij} is the transition probability from state i to state j, b_i is the observation distribution of state i.

  20. Standard Inference Algorithms for HMMs. Supervised learning: assumes the observed labels are correct; maximises the likelihood P(S, O | Θ); learns the correct concepts; sensitive to label noise. Baum-Welch algorithm: unsupervised, i.e. the observed labels are discarded; iteratively (i) labels the samples and (ii) learns a model; may learn concepts which differ significantly; theoretically insensitive to label noise.

  22. Supervised Learning for Hidden Markov Models. Supervised: uses the annotations, which are assumed to be reliable. Maximises the likelihood
      P(S, O | Θ) = q_{s_1} ∏_{t=2}^{T} a_{s_{t-1} s_t} ∏_{t=1}^{T} b_{s_t}(o_t).
      Transition probabilities P(S_t | S_{t-1}) are estimated by counting: a_{ij} = #(transitions from i to j) / #(transitions from i). Emission probabilities P(O_t | S_t) are obtained by PDF estimation; the standard models in ECG analysis are Gaussian mixture models (GMMs).
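
The counting and PDF-estimation step above can be sketched as follows, under simplifying assumptions: integer state labels, every state observed at least once, and a single Gaussian per state instead of the GMMs mentioned on the slide.

```python
# Sketch of supervised HMM estimation from labelled sequences (simplified assumptions).
import numpy as np

def supervised_fit(state_seqs, obs_seqs, n_states):
    """Estimate priors q, transition matrix a and Gaussian emission parameters from labelled data."""
    q = np.zeros(n_states)
    counts = np.zeros((n_states, n_states))
    per_state = [[] for _ in range(n_states)]
    for s, o in zip(state_seqs, obs_seqs):
        q[s[0]] += 1
        for t in range(1, len(s)):
            counts[s[t - 1], s[t]] += 1              # #(transitions from i to j)
        for state, obs in zip(s, o):
            per_state[state].append(obs)
    q /= q.sum()                                     # q_i = #(sequences starting in i) / #(sequences)
    a = counts / counts.sum(axis=1, keepdims=True)   # a_ij = #(i -> j) / #(transitions from i)
    b = [(np.mean(x), np.std(x)) for x in per_state] # maximum-likelihood Gaussian per state
    return q, a, b
```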

  24. Unsupervised Learning for Hidden Markov Models (1). Unsupervised: uses only the observations and guesses the hidden states. Maximises the likelihood P(O | Θ) = ∑_S P(S, O | Θ). This is a non-convex function to optimise:
      log P(O | Θ) = log ∑_S [ q_{s_1} ∏_{t=2}^{T} a_{s_{t-1} s_t} ∏_{t=1}^{T} b_{s_t}(o_t) ].
      Solution: the expectation-maximisation algorithm (a.k.a. Baum-Welch).

  25. Unsupervised Learning for Hidden Markov Models (2). The log-likelihood is hard to maximise directly, but what about a lower bound? Two steps: find a tractable lower bound; maximise this lower bound w.r.t. Θ. (Figure source: Pattern Recognition and Machine Learning, C. Bishop, 2006.)

  30. Unsupervised Learning for Hidden Markov Models (3). Idea: use Jensen's inequality to find a lower bound on the log-likelihood:
      log P(O | Θ) = log ∑_S P(S, O | Θ)
                   = log ∑_S q(S) P(S, O | Θ) / q(S)
                   ≥ ∑_S q(S) log [ P(S, O | Θ) / q(S) ]
                   = ∑_S q(S) log [ P(S | O, Θ) / q(S) ] + const.
      The best lower bound is obtained with q(S) = P(S | O, Θ).
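
A small numerical check (not part of the talk) of the bound derived above: for any distribution q(S), ∑_S q(S) log[P(S, O)/q(S)] ≤ log ∑_S P(S, O), with equality when q(S) is the posterior P(S | O). The joint values below are made-up numbers for six hypothetical state sequences.

```python
# Numerical illustration of the EM/Jensen lower bound; all values are made up.
import numpy as np

rng = np.random.default_rng(1)
joint = rng.random(6) / 10.0          # hypothetical values of P(S = s, O) for each sequence s
log_evidence = np.log(joint.sum())    # log P(O)

def lower_bound(q):
    return np.sum(q * (np.log(joint) - np.log(q)))

q_random = rng.random(6)
q_random /= q_random.sum()            # an arbitrary distribution over the six sequences
q_posterior = joint / joint.sum()     # P(S | O)

print(lower_bound(q_random) <= log_evidence)                # True: Jensen's inequality
print(np.isclose(lower_bound(q_posterior), log_evidence))   # True: the bound is tight at the posterior
```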

  32. The Expectation-Maximisation / Baum-Welch Algorithm. Expectation step: estimate the posteriors
      γ_t(i) = P(S_t = i | O, Θ_old)   and   ε_t(i, j) = P(S_{t-1} = i, S_t = j | O, Θ_old).
      Maximisation step for q_i and a_{ij}:
      q_i = γ_1(i) / ∑_{i=1}^{|S|} γ_1(i)   and   a_{ij} = ∑_{t=2}^{T} ε_t(i, j) / ∑_{j=1}^{|S|} ∑_{t=2}^{T} ε_t(i, j).
      The hidden states are estimated and used to compute the parameters.
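
The maximisation-step updates above translate directly into code. This is a sketch assuming the posteriors γ_t(i) and ε_t(i, j) have already been computed by the forward-backward recursions of the expectation step, for a single sequence.

```python
# M-step updates for the priors and the transition matrix, given the E-step posteriors.
import numpy as np

def m_step(gamma, eps):
    """gamma: shape (T, K), gamma[t, i] = P(S_t = i | O); eps: shape (T-1, K, K)."""
    q = gamma[0] / gamma[0].sum()              # q_i = gamma_1(i) / sum_i gamma_1(i)
    num = eps.sum(axis=0)                      # sum over t of eps_t(i, j)
    a = num / num.sum(axis=1, keepdims=True)   # normalise over the arrival state j
    return q, a
```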

  33. Wavelet Transform

  35. Why do we Need High-Dimensional Representations? Using HMMs on raw ECG signals gives about 70% accuracy. The Markov and conditional independence hypotheses are strong: transitions do not depend only on the current state, and emissions are not independent even when the states are given. Solution: use a multi-dimensional representation of the ECG signal, for example O(t) → (O(t), O'(t), O''(t)): the observation vector then contains contextual information, but numerical estimates of the derivatives are unstable.
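
A minimal sketch of the derivative-based representation O(t) → (O(t), O'(t), O''(t)) mentioned above, using finite differences; as the slide notes, such numerical derivative estimates are unstable on noisy ECGs, which motivates the wavelet features used instead.

```python
# Stack the raw signal with finite-difference estimates of its first and second derivatives.
import numpy as np

def derivative_features(o, dt=1.0):
    """Return an array of shape (T, 3): (O(t), O'(t), O''(t)) per sample."""
    d1 = np.gradient(o, dt)
    d2 = np.gradient(d1, dt)
    return np.column_stack([o, d1, d2])
```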

  37. Wavelet Transform in a Nutshell. Signals can be studied at different time scales (or frequencies). The Fourier transform only considers the whole signal (no localisation):
      f̂(ω) = ∫_{-∞}^{+∞} f(t) e^{-2πiωt} dt.
      The wavelet transform uses a localised function ψ (a.k.a. a wavelet):
      f_ψ(a, b) = (1 / √|a|) ∫_{-∞}^{+∞} ψ((t − b) / a) f(t) dt,
      where b is the translation factor and a is the scale factor.
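
A direct, unoptimised numerical transcription of the wavelet transform formula above, with a Mexican-hat wavelet as an example choice of ψ (the slide does not fix a particular wavelet).

```python
# One wavelet coefficient f_psi(a, b) computed as a discretised version of the integral above.
import numpy as np

def mexican_hat(t):
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)   # an (unnormalised) example wavelet psi

def cwt_coefficient(f, t, a, b):
    """Wavelet coefficient of the sampled signal f(t) at scale a and translation b."""
    integrand = mexican_hat((t - b) / a) * f
    return np.sum(integrand) * (t[1] - t[0]) / np.sqrt(abs(a))

t = np.linspace(0.0, 1.0, 500)                  # 500 samples over one second (toy setting)
f = np.sin(2 * np.pi * 10.0 * t)                # toy signal
coefficient = cwt_coefficient(f, t, a=0.05, b=0.5)
```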

  38. Example of Time-Frequency Analysis (1). Source: A Wavelet Tour of Signal Processing, Stéphane Mallat, 1999.

  39. Example of Time-Frequency Analysis (2). Source: A Wavelet Tour of Signal Processing, Stéphane Mallat, 1999.

  40. Information Extraction with Wavelet Transform. The ECG signal is: filtered using a 3-30 Hz band-pass filter; transformed using a continuous wavelet transform; dyadic scales from 2^1 to 2^7 are kept and normalised.
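
A hedged sketch of this feature-extraction pipeline, assuming a 250 Hz sampling rate, SciPy for the band-pass filter and PyWavelets (pywt) for the continuous wavelet transform; the Mexican-hat wavelet and the filter order are illustrative choices, not taken from the talk.

```python
# Band-pass filter, CWT at dyadic scales 2^1..2^7, and per-scale normalisation.
import numpy as np
import pywt
from scipy.signal import butter, filtfilt

def ecg_features(ecg, fs=250.0):
    # 1. band-pass filter the ECG between 3 and 30 Hz
    b, a = butter(4, [3.0, 30.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, ecg)
    # 2. continuous wavelet transform at dyadic scales 2^1 .. 2^7
    scales = 2.0 ** np.arange(1, 8)
    coefs, _ = pywt.cwt(filtered, scales, "mexh")    # coefs has shape (7, len(ecg))
    # 3. normalise each scale (zero mean, unit variance)
    coefs = (coefs - coefs.mean(axis=1, keepdims=True)) / coefs.std(axis=1, keepdims=True)
    return coefs.T                                   # one 7-dimensional observation per sample
```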

  41. Label Noise-Tolerant Hidden Markov Models

  42. Motivation. For real datasets, perfect labelling is difficult: subjectivity of the labelling task, lack of information, communication noise. In particular, label noise arises in biomedical applications. Previous works, e.g. by Lawrence et al., incorporated a noise model into a generative model for i.i.d. observations (classification).
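
A minimal, assumption-laden sketch of the kind of noise model mentioned above: the observed label is treated as a noisy copy of the true class through a label-noise (confusion) matrix inside a generative classifier. This illustrates the general idea only; it is not the exact model of Lawrence et al. nor the HMM-specific model of this talk, and all numbers are made up.

```python
# Generative classifier with a label-noise matrix between true and observed labels (toy numbers).
import numpy as np

prior = np.array([0.5, 0.5])              # P(Y = y) for a hypothetical two-class problem
noise = np.array([[0.9, 0.1],             # noise[y, y_obs] = P(observed label y_obs | true class y)
                  [0.2, 0.8]])
means, stds = np.array([-1.0, 1.0]), np.array([1.0, 1.0])   # class-conditional Gaussians P(X | Y)

def gauss(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def likelihood(x, y_obs):
    """P(X = x, observed label = y_obs) = sum_y P(X = x | Y = y) P(Y = y) P(y_obs | Y = y)."""
    return np.sum(gauss(x, means, stds) * prior * noise[:, y_obs])

print(likelihood(0.3, y_obs=1))
```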

  43. Example of Label Noise in ECGs
