

  1. Augmented Statistical Models: Exploiting Generative Models in Discriminative Classifiers
     Martin Layton & Mark Gales
     Cambridge University Engineering Department
     NIPS 2005, 9 December 2005

  2. Overview
     • Generative models in discriminative classifiers
       – Fisher score-space
       – generative score-space
     • Augmented Statistical Models
       – extension of standard models, e.g. GMMs and HMMs
       – allows additional dependencies to be represented
     • Discriminative training
       – maximum margin
       – conditional maximum likelihood
     • TIMIT results

  3. Generative Models in Discriminative Classifiers

  4. The Hidden Markov Model
     [Figure: (a) standard HMM phone topology with emitting states 2–4 and transition probabilities a_ij; (b) HMM dynamic Bayesian network over states q_t and observations o_t]
     • Observations are conditionally independent of other observations given the state.
     • States are conditionally independent of other states given the previous state.
     • A poor model of the speech process – piecewise-constant state-space.

  5. Fisher Score-spaces
     • Jaakkola & Haussler (1999)
     • A method of incorporating generative models within a discriminative framework
     • Define a base generative model p̂(O; λ)
       – its 1-dimensional log-likelihood alone does not carry enough information for good classification
     • Instead use a score-space φ_F(O; λ) whose tangent space captures the essence of the generative process:
         φ_F(O; λ) = ∇_λ ln p̂(O; λ)
       – dimensionality of the score-space: the number of parameters in λ
       – suitable for discriminative training (SVMs, etc.)
       – has been applied to many tasks, e.g. computational biology and speech recognition
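The Fisher score above is simply the gradient of the base model's log-likelihood with respect to its parameters. A minimal numpy sketch, assuming a hypothetical 1-D Gaussian base model with λ = (μ, σ²) (this toy model is an illustration, not part of the talk):

```python
import numpy as np

def fisher_score(obs, mu, var):
    """Fisher score phi_F(O; lambda) = grad_lambda ln p(O; lambda)
    for a 1-D Gaussian base model with parameters (mu, var),
    treating the observations as i.i.d. under the base model."""
    obs = np.asarray(obs, dtype=float)
    d_mu = np.sum((obs - mu) / var)                      # d/d mu
    d_var = np.sum((obs - mu) ** 2 / (2 * var ** 2)
                   - 1.0 / (2 * var))                    # d/d var
    return np.array([d_mu, d_var])

# At the ML estimate the score vanishes, so informative scores
# measure how an utterance pulls the parameters away from lambda:
print(fisher_score([1.0, 3.0], mu=2.0, var=1.0))   # -> [0. 0.]
```

The dimensionality of the score equals the number of base-model parameters, matching the bullet above.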

  6. Generative Score-spaces
     • Smith & Gales (2002)
     • An extension for supervised binary classification tasks
     • Define class-conditional base models p̂(O; λ⁽¹⁾) and p̂(O; λ⁽²⁾)
       – includes the log-likelihood ratio to improve discrimination
       – avoids wrap-around (different O's mapping to the same point in score-space)
     • Score-space:
         φ_LL(O; λ) = [ ln p̂(O; λ⁽¹⁾) − ln p̂(O; λ⁽²⁾) ;  ∇_λ⁽¹⁾ ln p̂(O; λ⁽¹⁾) ;  −∇_λ⁽²⁾ ln p̂(O; λ⁽²⁾) ]
       – suitable for discriminative training (SVMs)
       – no probabilistic interpretation
       – restricted to binary problems
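As an illustration, the score-space above can be computed in a few lines for hypothetical 1-D Gaussian class-conditional base models; `loglik`, `score_space` and the (μ, σ²) parameterisation are assumptions for this sketch only:

```python
import numpy as np

def loglik(obs, mu, var):
    """Log-likelihood of i.i.d. observations under N(mu, var)."""
    obs = np.asarray(obs, dtype=float)
    return np.sum(-0.5 * np.log(2 * np.pi * var)
                  - (obs - mu) ** 2 / (2 * var))

def score_space(obs, lam1, lam2):
    """Generative score-space phi_LL(O; lambda):
    [log-likelihood ratio; grad wrt lambda(1); -grad wrt lambda(2)]."""
    (mu1, v1), (mu2, v2) = lam1, lam2
    obs = np.asarray(obs, dtype=float)
    llr = loglik(obs, mu1, v1) - loglik(obs, mu2, v2)
    g1 = np.array([np.sum((obs - mu1) / v1),
                   np.sum((obs - mu1) ** 2 / (2 * v1 ** 2) - 1 / (2 * v1))])
    g2 = np.array([np.sum((obs - mu2) / v2),
                   np.sum((obs - mu2) ** 2 / (2 * v2 ** 2) - 1 / (2 * v2))])
    return np.concatenate([[llr], g1, -g2])

# 5-dimensional score vector: 1 ratio element + 2 params per class model
print(score_space([0.5, 1.5], lam1=(1.0, 1.0), lam2=(-1.0, 1.0)))
```

The leading log-likelihood-ratio element is what distinguishes this from the plain Fisher score-space and helps avoid the wrap-around problem noted above.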

  7. Augmented Statistical Models

  8. Dependency Modelling
     • Speech data is dynamic: observations are not of a fixed length
     • Dependency modelling is an essential part of speech recognition:
         p(o₁, …, o_T; λ) = p(o₁; λ) p(o₂ | o₁; λ) ⋯ p(o_T | o₁, …, o_{T−1}; λ)
       – impractical to model directly in this form
       – instead make extensive use of conditional independence
     • Two possible forms of conditional independence:
       – latent (unobserved) variables
       – observed variables
     • Even given a set of dependencies (the form of the Bayesian network), it is still necessary to determine how the dependencies interact

  9. Dependency Modelling (cont.)
     [Figure: Bayesian network linking states q_{t−1}, …, q_{t+2} to observations o_{t−1}, …, o_{t+2}]
     • Commonly use a member (or mixture) of the exponential family:
         p(O; α) = (1/τ(α)) h(O) exp( αᵀ T(O) )
       where h(O) is the reference distribution, α are the natural parameters, τ is the normalisation term, and T(O) are the sufficient statistics
     • What is the appropriate form of the statistics T(O)?
       – for the diagram above, T(O) = Σ_{t=1}^{T−2} o_t o_{t+1} o_{t+2}
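The example statistic in the last bullet can be sketched directly for scalar observations; `triple_stat` is a hypothetical helper name:

```python
import numpy as np

def triple_stat(obs):
    """Sufficient statistic T(O) = sum_{t=1}^{T-2} o_t * o_{t+1} * o_{t+2},
    capturing the third-order dependency in the network diagram
    (scalar observations, for illustration)."""
    o = np.asarray(obs, dtype=float)
    return float(np.sum(o[:-2] * o[1:-1] * o[2:]))

print(triple_stat([1.0, 2.0, 3.0, 4.0]))   # 1*2*3 + 2*3*4 -> 30.0
```

In the exponential-family density above, this single scalar would be one entry of T(O), weighted by its natural parameter α.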

  10. Augmented Statistical Models
     • Augmented statistical models (related to fibre bundles):
         p(O; λ, α) = (1/τ(λ, α)) p̂(O; λ) exp( αᵀ [ ∇_λ ln p̂(O; λ) ;  (1/2!) vec(∇²_λ ln p̂(O; λ)) ;  … ;  (1/ρ!) vec(∇^ρ_λ ln p̂(O; λ)) ] )
     • Two sets of parameters:
       – λ: parameters of the base distribution p̂(O; λ)
       – α: natural parameters of the local exponential model
     • The normalisation term τ(λ, α) ensures a valid PDF:
         p(O; λ, α) = p̄(O; λ, α) / τ(λ, α),   ∫ p(O; λ, α) dO = 1
       – τ(λ, α) can be very difficult to estimate

  11. Example: Augmented GMM
     • Use a GMM as the base distribution: p̂(o; λ) = Σ_{m=1}^M c_m N(o; μ_m, Σ_m)
         p(o; λ, α) = (1/τ) [ Σ_{m=1}^M c_m N(o; μ_m, Σ_m) ] exp( Σ_{n=1}^M P(n | o; λ) αₙᵀ Σₙ⁻¹ (o − μₙ) )
     • Simple two-component one-dimensional example:
     [Figure: augmented densities for α = [0.0, 0.0]ᵀ (the base GMM), α = [−1.0, −1.0]ᵀ and α = [1.0, −1.0]ᵀ]
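The unnormalised augmented-GMM density p̄ (i.e. the expression above with the normaliser τ omitted) can be sketched for the 1-D case; `aug_gmm_unnorm` and its argument names are assumptions for illustration:

```python
import numpy as np

def gauss(o, mu, var):
    """1-D Gaussian density N(o; mu, var), vectorised over components."""
    return np.exp(-(o - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def aug_gmm_unnorm(o, c, mu, var, alpha):
    """Unnormalised augmented GMM p-bar(o; lambda, alpha): the base GMM
    multiplied by a local exponential model built on the component
    posteriors P(n | o; lambda). The normaliser tau is omitted."""
    comp = c * gauss(o, mu, var)                  # c_m N(o; mu_m, var_m)
    base = comp.sum()                             # base GMM density
    post = comp / base                            # P(n | o; lambda)
    expo = np.sum(post * alpha * (o - mu) / var)  # alpha_n Sigma_n^-1 (o - mu_n)
    return base * np.exp(expo)

c = np.array([0.5, 0.5]); mu = np.array([-2.0, 2.0]); var = np.array([1.0, 1.0])
# alpha = 0 recovers the base GMM exactly, as in the first panel above:
print(aug_gmm_unnorm(0.0, c, mu, var, np.array([0.0, 0.0])))
```

Asymmetric α (e.g. [1.0, −1.0]ᵀ) skews mass between the two components, which is what the remaining panels of the figure illustrate.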

  12. Augmented Model Dependencies
     • If the base distribution is a latent-variable model (GMM, HMM, …), the sufficient statistics contain a first-order differential:
         ∇_{μ_jm} ln p̂(O; λ) = Σ_{t=1}^T P(θ_t = {s_j, m} | O; λ) Σ_{jm}⁻¹ (o_t − μ_jm)
       – depends on a posterior
       – a compact representation of the effects of all observations
     • Augmented models of this form:
       – retain the independence assumptions of the base model
       – remove the conditional-independence assumptions of the base model, since the local exponential model depends on a posterior
     • For HMM base models:
       – observations are dependent on all observations and all latent states
       – higher-order derivatives create increasingly powerful models
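The posterior-weighted derivative above, specialised to a plain GMM base model (dropping the state index j), can be sketched as follows; `mean_grad` is a hypothetical helper name:

```python
import numpy as np

def mean_grad(obs, c, mu, var):
    """Gradient of ln p(O; lambda) w.r.t. each component mean mu_m
    for a 1-D GMM base model: a component-posterior-weighted sum over
    ALL observations. Because every observation enters every entry via
    the posterior, the local exponential model built on these statistics
    breaks the base model's conditional-independence assumptions."""
    o = np.asarray(obs, dtype=float)[:, None]             # shape (T, 1)
    comp = c * np.exp(-(o - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    post = comp / comp.sum(axis=1, keepdims=True)         # P(m | o_t; lambda)
    return np.sum(post * (o - mu) / var, axis=0)          # one entry per m

# Single-component sanity check: gradient is sum(o_t - mu) / var
print(mean_grad([0.0, 2.0], np.array([1.0]), np.array([1.0]), np.array([1.0])))   # -> [0.]
```

For an HMM base model the posterior P(θ_t = {s_j, m} | O; λ) additionally couples every observation to every latent state, as the final bullet notes.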

  13. Discriminative Training

  14. Maximum Margin Estimation
     • Consider the simplified two-class problem
     • Bayes' decision rule (with λ fixed): decide ω₁ when
         P(ω₁ | O) / P(ω₂ | O) = [ P(ω₁) τ(λ⁽²⁾, α⁽²⁾) p̄(O; λ⁽¹⁾, α⁽¹⁾) ] / [ P(ω₂) τ(λ⁽¹⁾, α⁽¹⁾) p̄(O; λ⁽²⁾, α⁽²⁾) ] > 1
       – class priors P(ω₁) and P(ω₂)
     • This can be rewritten as a linear decision boundary in a generative score-space: decide ω₁ when
         (1/T) ln[ p̄(O; λ⁽¹⁾, α⁽¹⁾) / p̄(O; λ⁽²⁾, α⁽²⁾) ]  +  (1/T) ln[ P(ω₁) τ(λ⁽²⁾, α⁽²⁾) / ( P(ω₂) τ(λ⁽¹⁾, α⁽¹⁾) ) ]  >  0
       where the first term is wᵀ φ_LL(O; λ) and the second is the bias b
       – no need to explicitly calculate τ(λ⁽¹⁾, α⁽¹⁾) or τ(λ⁽²⁾, α⁽²⁾)
     • Note: restrictions on the α's are required to ensure a valid PDF

  15. Maximum Margin Estimation (cont.)
     • The first-order generative score-space is given by
         φ_LL(O; λ) = (1/T) [ ln p̂(O; λ⁽¹⁾) − ln p̂(O; λ⁽²⁾) ;  ∇_λ⁽¹⁾ ln p̂(O; λ⁽¹⁾) ;  −∇_λ⁽²⁾ ln p̂(O; λ⁽²⁾) ]
       – independent of the augmented parameters α
     • The linear decision boundary is specified by
         wᵀ = [ 1  α⁽¹⁾ᵀ  α⁽²⁾ᵀ ]
       – only a function of the exponential-model parameters α
     • The bias is calculated as a by-product of training and depends on both α and λ
     • Potentially many parameters to estimate:
       – maximum margin estimation (MME) is a good choice: SVM training
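Since the decision boundary is linear in the score-space, any maximum-margin trainer applies. The talk uses standard SVM training; as a self-contained stand-in, this sketch minimises the regularised hinge loss by subgradient descent on fixed score-space features (all names and hyperparameters are hypothetical):

```python
import numpy as np

def train_max_margin(phi, y, lam=0.01, lr=0.1, epochs=200, seed=1):
    """Maximum-margin estimation sketch: fit a linear classifier
    w^T phi(O) + b on fixed score-space features by subgradient
    descent on the regularised hinge loss (an SVM-style objective)."""
    phi = np.asarray(phi, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = phi.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            if y[i] * (phi[i] @ w + b) < 1:   # inside the margin: hinge active
                w += lr * (y[i] * phi[i] - lam * w)
                b += lr * y[i]
            else:
                w -= lr * lam * w             # regulariser only
    return w, b

# Two well-separated clusters of 2-D score-space vectors, labels +/-1:
rng = np.random.default_rng(0)
phi = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)
w, b = train_max_margin(phi, y)
print(np.mean(np.sign(phi @ w + b) == y))   # -> 1.0
```

In the augmented-model setting the learned w would be read back as [1, α⁽¹⁾ᵀ, α⁽²⁾ᵀ], so SVM training directly estimates the exponential-model parameters.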
