  1. Introduction to (Statistical) Machine Learning
     Brown University, CSCI1420 & ENGN2520
     Prof. Erik Sudderth
     Lecture for Nov. 21, 2013: HMMs: Forward-Backward & EM Algorithms, Principal Components Analysis (PCA)
     Many figures courtesy of Kevin Murphy's textbook, Machine Learning: A Probabilistic Perspective

  2. Inference for HMMs
     [Figure: HMM graphical model with hidden states $z_1, \dots, z_5$ and observations $x_1, \dots, x_5$]
     • Assume parameters defining the HMM are fixed and known: distributions of initial state, state transitions, observations
     • Given observation sequence, want to estimate hidden states
     Minimize sequence (word) error rate, $L(z, a) = \mathbb{I}(z \neq a)$:
     $\hat{z} = \arg\max_z p(z \mid x) = \arg\max_z \Big[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \Big] \Big[ \prod_{t=1}^{T} p(x_t \mid z_t) \Big]$
     Minimize state (symbol) error rate, $L(z, a) = \sum_{t=1}^{T} \mathbb{I}(z_t \neq a_t)$:
     $\hat{z}_t = \arg\max_{z_t} p(z_t \mid x) = \arg\max_{z_t} \sum_{z_1} \cdots \sum_{z_{t-1}} \sum_{z_{t+1}} \cdots \sum_{z_T} p(z, x)$
     Problem: Naïve computation of either estimate requires $O(K^T)$ operations.
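
To make the $O(K^T)$ cost concrete, here is a minimal sketch (not from the slides) of the naïve computation of the state marginals $p(z_t \mid x)$ by brute-force enumeration; the toy parameters `pi`, `A`, `B` and the sequence `x` are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical toy HMM: K = 2 states, 2 observation symbols, T = 4 steps.
pi = np.array([0.6, 0.4])               # initial state distribution p(z_1)
A = np.array([[0.7, 0.3], [0.2, 0.8]])  # A[i, j] = p(z_t = j | z_{t-1} = i)
B = np.array([[0.9, 0.1], [0.3, 0.7]])  # B[k, v] = p(x_t = v | z_t = k)
x = [0, 1, 1, 0]                        # observed symbol sequence

K, T = 2, len(x)
marginals = np.zeros((T, K))
# Enumerate all K^T hidden sequences: exponential cost, feasible only for tiny T.
for z in itertools.product(range(K), repeat=T):
    p = pi[z[0]] * B[z[0], x[0]]
    for t in range(1, T):
        p *= A[z[t - 1], z[t]] * B[z[t], x[t]]
    for t in range(T):
        marginals[t, z[t]] += p          # accumulate the joint p(z, x)
marginals /= marginals.sum(axis=1, keepdims=True)  # normalize to p(z_t | x)
print(marginals)
```

The forward-backward sketches after the next slides recover the same marginals with only $O(T K^2)$ work.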

  3. Forward Filtering for HMMs
     $p(z, x) = p(z)\, p(x \mid z) = \Big[ p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}) \Big] \Big[ \prod_{t=1}^{T} p(x_t \mid z_t) \Big]$
     Filtered state estimates: $\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \dots, x_1)$
     • Directly useful for online inference or tracking with HMMs
     • Building block towards finding posterior given all observations
     Initialization: Easy from known HMM parameters; multiply by a proportionality constant so the result sums to one:
     $\alpha_1(z_1) = p(z_1 \mid x_1) \propto p(z_1)\, p(x_1 \mid z_1)$
     Recursion: Derivation will follow from Markov properties; each step costs $O(K^2)$:
     $\alpha_t(z_t) \propto p(x_t \mid z_t) \sum_{z_{t-1}=1}^{K} p(z_t \mid z_{t-1})\, \alpha_{t-1}(z_{t-1})$
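
A minimal numpy sketch of this normalized forward recursion. The array conventions are assumptions, not notation from the slides: `pi[k]` is the initial distribution, `A[i, j]` is $p(z_t = j \mid z_{t-1} = i)$, and `B[k, v]` is $p(x_t = v \mid z_t = k)$, matching the toy snippet above.

```python
import numpy as np

def forward_filter(pi, A, B, x):
    """Return alpha[t, k] = p(z_t = k | x_1, ..., x_t) for a discrete HMM."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    # Initialization: alpha_1 proportional to p(z_1) p(x_1 | z_1).
    alpha[0] = pi * B[:, x[0]]
    alpha[0] /= alpha[0].sum()
    # Recursion: O(K^2) work per time step, normalized to sum to one.
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (A.T @ alpha[t - 1])
        alpha[t] /= alpha[t].sum()
    return alpha
```

On the toy example, the last row of `forward_filter(pi, A, B, x)` matches the last row of the brute-force marginals, since filtering and smoothing coincide at the final time step.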

  4. Forward Filtering for HMMs
     $\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \dots, x_1)$
     $\alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$
     Prediction Step: Given current knowledge, what is the next state?
     $p(z_{t+1} \mid x_t, \dots, x_1) = \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$
     Update Step: What does the latest observation tell us about the state?
     $\alpha_{t+1}(z_{t+1}) = p(z_{t+1} \mid x_{t+1}, x_t, \dots, x_1) \propto p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid x_t, \dots, x_1)$
     Key Markov Identities: From the generative structure of the HMM,
     $p(z_{t+1} \mid z_t, x_t, \dots, x_1) = p(z_{t+1} \mid z_t)$
     $p(x_{t+1} \mid z_{t+1}, x_t, \dots, x_1) = p(x_{t+1} \mid z_{t+1})$
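
The same recursion split into an explicit predict/update pair, a sketch under the same hypothetical array conventions as the snippets above.

```python
import numpy as np

def predict_step(A, alpha_t):
    """Prediction: p(z_{t+1} | x_1, ..., x_t) = sum_{z_t} p(z_{t+1} | z_t) alpha_t(z_t)."""
    return A.T @ alpha_t

def update_step(B, x_next, predicted):
    """Update: multiply in the likelihood p(x_{t+1} | z_{t+1}) and renormalize."""
    unnorm = B[:, x_next] * predicted
    return unnorm / unnorm.sum()

# Composing update_step(B, x[t + 1], predict_step(A, alpha[t])) reproduces one
# iteration of the forward_filter loop sketched above.
```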

  5. Forward-Backward for HMMs
     Forward Recursion: Distribution of State Given Past Data
     $\alpha_t(z_t) = p(z_t \mid x_t, x_{t-1}, \dots, x_1)$
     $\alpha_1(z_1) \propto p(z_1)\, p(x_1 \mid z_1)$
     $\alpha_{t+1}(z_{t+1}) \propto p(x_{t+1} \mid z_{t+1}) \sum_{z_t=1}^{K} p(z_{t+1} \mid z_t)\, \alpha_t(z_t)$
     Backward Recursion: Likelihood of Future Data Given State
     $\beta_t(z_t) \propto p(x_{t+1}, \dots, x_T \mid z_t)$
     $\beta_T(z_T) = 1$
     $\beta_t(z_t) \propto \sum_{z_{t+1}=1}^{K} p(x_{t+1} \mid z_{t+1})\, p(z_{t+1} \mid z_t)\, \beta_{t+1}(z_{t+1})$
     Marginal: Posterior distribution of state given all data
     $p(z_t \mid x_1, \dots, x_T) \propto \alpha_t(z_t)\, \beta_t(z_t)$
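
A sketch of the backward pass and the smoothed marginals, reusing the hypothetical `forward_filter` helper from the earlier snippet; the per-step rescaling of $\beta$ is only for numerical stability and cancels in the final normalization.

```python
import numpy as np

def backward_messages(A, B, x):
    """Return beta[t, k] proportional to p(x_{t+1}, ..., x_T | z_t = k)."""
    T, K = len(x), A.shape[0]
    beta = np.ones((T, K))                # base case: beta_T(z_T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()          # rescale only to avoid underflow
    return beta

def smoothed_marginals(pi, A, B, x):
    """p(z_t | x_1, ..., x_T), proportional to alpha_t(z_t) * beta_t(z_t)."""
    gamma = forward_filter(pi, A, B, x) * backward_messages(A, B, x)
    return gamma / gamma.sum(axis=1, keepdims=True)
```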

  6. EM for Hidden Markov Models
     Parameters (state transition & emission dist.): $\pi, \theta$
     Hidden discrete state sequence: $z_1, \dots, z_N$
     • Initialization: Randomly select starting parameters
     • E-Step: Given parameters, find posterior of hidden states
       • Dynamic programming to efficiently infer state marginals
     • M-Step: Given posterior distributions, find likely parameters
       • Like training of mixture models and Markov chains
     • Iteration: Alternate E-step & M-step until convergence
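
A sketch of the overall EM loop for a discrete-emission HMM. The helpers `e_step` and `m_step` are sketched after the next two slides, and the Dirichlet initialization and the symbol count `L` are assumptions, not details from the lecture.

```python
import numpy as np

def em_hmm(x, K, L, num_iters=50, seed=0):
    """Alternate E- and M-steps for an HMM with K states and L observation symbols."""
    rng = np.random.default_rng(seed)
    # Initialization: random starting parameters, each distribution summing to one.
    pi = rng.dirichlet(np.ones(K))
    A = rng.dirichlet(np.ones(K), size=K)
    B = rng.dirichlet(np.ones(L), size=K)
    for _ in range(num_iters):
        gamma, xi = e_step(pi, A, B, x)      # posterior marginals via forward-backward
        pi, A, B = m_step(gamma, xi, x, L)   # re-estimate parameters from expected counts
    return pi, A, B
```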

  7. E-Step: HMMs
     $q^{(t)}(z) = p(z \mid x, \pi^{(t-1)}, \theta^{(t-1)}) \propto p(z \mid \pi^{(t-1)})\, p(x \mid z, \theta^{(t-1)})$
     Mixture Models:
     $q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)})\, p(x_i \mid z_i, \theta^{(t-1)})$
     • Hidden states are conditionally independent given parameters
     • Naïve representation of full posterior has size $O(KN)$
     HMMs:
     $q^{(t)}(z) \propto \prod_{i=1}^{N} p(z_i \mid \pi^{(t-1)}_{z_{i-1}})\, p(x_i \mid z_i, \theta^{(t-1)})$
     • Hidden states have Markov dependence given parameters
     • Naïve representation of full posterior has size $O(K^N)$
     • But, our forward-backward dynamic programming can quickly find the marginals (at each time) of the posterior distribution
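
A sketch of the E-step quantities the M-step actually needs: the single-state marginals $\gamma$ and the pairwise marginals $\xi$, computed from the hypothetical `forward_filter` and `backward_messages` helpers sketched earlier.

```python
import numpy as np

def e_step(pi, A, B, x):
    """Posterior marginals from forward-backward:
    gamma[t, k] = p(z_t = k | x), xi[t, i, j] = p(z_t = i, z_{t+1} = j | x)."""
    alpha = forward_filter(pi, A, B, x)
    beta = backward_messages(A, B, x)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    T, K = len(x), len(pi)
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        # xi_t(i, j) proportional to alpha_t(i) A(i, j) p(x_{t+1} | j) beta_{t+1}(j)
        m = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :]
        xi[t] = m / m.sum()
    return gamma, xi
```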

  8. M-Step: HMMs
     $\theta^{(t)} = \arg\max_{\theta} \mathcal{L}(q^{(t)}, \theta) = \arg\max_{\theta} \sum_z q(z) \ln p(x, z \mid \theta)$
     • Initial state distribution
     • State transition distribution
     • State emission distribution (observation likelihoods): emissions via weighted moment matching
     Need posterior marginal distributions of single states, $p(z_t \mid x)$, and pairs of sequential states, $p(z_t, z_{t+1} \mid x)$.
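
A sketch of the M-step for discrete emissions, where "weighted moment matching" reduces to normalized expected counts; for Gaussian emissions one would instead match weighted means and covariances. The `gamma`/`xi` conventions follow the hypothetical `e_step` above.

```python
import numpy as np

def m_step(gamma, xi, x, L):
    """Re-estimate (pi, A, B) from posterior marginals, discrete emissions."""
    x = np.asarray(x)
    pi = gamma[0] / gamma[0].sum()            # initial state distribution
    A = xi.sum(axis=0)                        # expected transition counts
    A /= A.sum(axis=1, keepdims=True)
    K = gamma.shape[1]
    B = np.zeros((K, L))
    for sym in range(L):                      # expected emission counts per symbol
        B[:, sym] = gamma[x == sym].sum(axis=0)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```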

  9. Unsupervised Learning
     Supervised vs. unsupervised learning:
     • Discrete: classification or categorization (supervised) vs. clustering (unsupervised)
     • Continuous: regression (supervised) vs. dimensionality reduction (unsupervised)
     • Goal: Infer label/response y given only features x
     • Classical: Find latent variables y good for compression of x
     • Probabilistic learning: Estimate parameters of joint distribution p(x, y) which maximize marginal probability p(x)

  10. Dimensionality Reduction
     [Figure: Isomap Algorithm; Tenenbaum et al., Science 2000]

  11. PCA Objective: Compression
     • Observed feature vectors: $x_n \in \mathbb{R}^D$, $n = 1, 2, \dots, N$
     • Hidden manifold coordinates: $z_n \in \mathbb{R}^M$, $n = 1, 2, \dots, N$
     • Hidden linear mapping: $\tilde{x}_n = W z_n + b$, with $W \in \mathbb{R}^{D \times M}$, $b \in \mathbb{R}^{D \times 1}$
     $J(z, W, b \mid x, M) = \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \sum_{n=1}^{N} \| x_n - W z_n - b \|^2$
     • Unlike clustering objectives like K-means, we can find the global optimum of this objective efficiently:
       $b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$, and construct $W$ from the top eigenvectors of the sample covariance matrix (the directions of largest variance)
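
A minimal sketch of evaluating this compression objective, assuming a hypothetical data matrix `X` with one observation per row and a `W` with orthonormal columns (so the optimal coordinates are the projections); none of these names come from the slides.

```python
import numpy as np

def pca_reconstruction_error(X, W, b):
    """Sum of squared errors J at the optimal coordinates z_n = W^T (x_n - b),
    assuming W has orthonormal columns; X has one observation per row."""
    Z = (X - b) @ W            # hidden coordinates, shape (N, M)
    X_hat = Z @ W.T + b        # reconstructions W z_n + b
    return np.sum((X - X_hat) ** 2)
```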

  12. Principal Components Analysis
     Example: PCA analysis of MNIST images of the digit 3
     $J(z, W, b \mid x, M) = \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \sum_{n=1}^{N} \| x_n - W z_n - b \|^2$
     • PCA models all translations of data equally well (by shifting $b$)
     • PCA models all rotations of data equally well (by rotating $W$)
     • Appropriate when modeling quantities over time, space, etc.

  13. PCA Derivation: One Dimension
     • Observed feature vectors: $x_n \in \mathbb{R}^D$, $n = 1, 2, \dots, N$
     • Hidden manifold coordinates: $z_n \in \mathbb{R}$, $n = 1, 2, \dots, N$
     • Hidden linear mapping: $\tilde{x}_n = w z_n$, with $w \in \mathbb{R}^{D \times 1}$, $w^T w = 1$
     Assume the mean has already been subtracted from the data (centered):
     $J(z, w \mid x) = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \frac{1}{N} \sum_{n=1}^{N} \| x_n - w z_n \|^2$
     • Step 1: Optimal manifold coordinate is always the projection: $\hat{z}_n = w^T x_n$
     • Step 2: Optimal mapping maximizes variance of the projection:
       $J(\hat{z}, w \mid x) = C - \frac{1}{N} \sum_{n=1}^{N} (w^T x_n)(x_n^T w) = C - w^T \Sigma w$, where $\Sigma = \frac{1}{N} \sum_{n=1}^{N} x_n x_n^T$
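
A sketch of these two steps for a single component, assuming a hypothetical data matrix `X` with one observation per row (a convention chosen here, not notation from the slides).

```python
import numpy as np

def first_principal_direction(X):
    """Top eigenvector of the sample covariance of centered data X (rows = samples)."""
    Xc = X - X.mean(axis=0)                # center the data
    Sigma = Xc.T @ Xc / len(Xc)            # sample covariance (1/N) sum x_n x_n^T
    _, eigvecs = np.linalg.eigh(Sigma)     # eigh: Sigma is symmetric, eigenvalues ascending
    w = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue (Step 2)
    z = Xc @ w                             # coordinates are projections w^T x_n (Step 1)
    return w, z
```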

  14. Gaussian Geometry
     • Eigenvalues and eigenvectors: $\Sigma u_i = \lambda_i u_i$, $i = 1, \dots, d$, for $\Sigma \in \mathbb{R}^{d \times d}$; collecting $U = [u_1, \dots, u_d]$ and $\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$ gives $\Sigma U = U \Lambda$
     • For a symmetric matrix: $\lambda_i \in \mathbb{R}$, $u_i^T u_i = 1$, $u_i^T u_j = 0$, and $\Sigma = U \Lambda U^T = \sum_{i=1}^{d} \lambda_i u_i u_i^T$
     • For a positive semidefinite matrix: $\lambda_i \geq 0$
     • For a positive definite matrix: $\lambda_i > 0$, and $\Sigma^{-1} = U \Lambda^{-1} U^T = \sum_{i=1}^{d} \frac{1}{\lambda_i} u_i u_i^T$
     • Quadratic forms: $y_i = u_i^T (x - \mu)$, the projection of the difference from the mean onto eigenvector $i$
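
A small numerical check of these identities on a randomly generated positive definite matrix (a sketch added here, not part of the lecture).

```python
import numpy as np

rng = np.random.default_rng(0)
Amat = rng.standard_normal((4, 4))
Sigma = Amat @ Amat.T + 4 * np.eye(4)        # symmetric positive definite covariance
lam, U = np.linalg.eigh(Sigma)               # Sigma U = U Lambda

assert np.allclose(U @ np.diag(lam) @ U.T, Sigma)                     # Sigma = U Lam U^T
assert np.allclose(U @ np.diag(1 / lam) @ U.T, np.linalg.inv(Sigma))  # inverse via 1/lambda_i
assert np.all(lam > 0)                       # positive definite => all eigenvalues positive
```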

  15. Maximizes Variance & Minimizes Error
     [Figure: projection of data points $x_n$ onto a direction $u$ in the $(x_1, x_2)$ plane, illustrating the projection error; from C. Bishop, Pattern Recognition & Machine Learning]

  16. Principal Components Analysis (PCA)
     [Figure: 3D data, its best 2D projection, and its best 1D projection]

  17. PCA Optimal Solution
     $J(z, W, b \mid x, M) = \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \sum_{n=1}^{N} \| x_n - W z_n - b \|^2$
     $b = \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$, with centered data $X = [x_1 - \bar{x},\; x_2 - \bar{x},\; \dots,\; x_N - \bar{x}]$
     • Option A: Eigendecomposition of the sample covariance matrix
       $\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T = \frac{1}{N} X X^T = U \Lambda U^T$
       Construct $W$ from the eigenvectors with the $M$ largest eigenvalues.
     • Option B: Singular value decomposition (SVD) of the centered data
       $X = U S V^T$
       Construct $W$ from the singular vectors with the $M$ largest singular values.
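
A sketch of both options in numpy; note that this sketch stores observations as rows, the transpose of the slide's $D \times N$ matrix $X$, so the principal directions come from the right singular vectors. Up to sign, the two constructions of $W$ agree.

```python
import numpy as np

def pca_fit(X, M):
    """Fit PCA two ways; rows of X are observations, M components are kept."""
    b = X.mean(axis=0)
    Xc = X - b                                    # centered data, shape (N, D)
    # Option A: eigendecomposition of the sample covariance.
    Sigma = Xc.T @ Xc / len(Xc)
    _, eigvecs = np.linalg.eigh(Sigma)            # eigenvalues in ascending order
    W_eig = eigvecs[:, ::-1][:, :M]               # top-M eigenvectors
    # Option B: SVD of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W_svd = Vt[:M].T                              # top-M right singular vectors
    return b, W_eig, W_svd
```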
