Variational Decoding for Statistical Machine Translation

Variational Decoding for Statistical Machine Translation
Zhifei Li, Jason Eisner, and Sanjeev Khudanpur
Center for Language and Speech Processing, Computer Science Department, Johns Hopkins University
Monday, August 17, 2009


  1–5. Maximum A Posteriori (MAP) Decoding. [Figure: three candidate translations of one foreign sentence, each reachable through several derivations. Derivation probabilities: red translation 0.16 + 0.12 = 0.28; blue translation 0.14 + 0.14 = 0.28; green translation 0.13 + 0.11 + 0.10 + 0.10 = 0.44. The string (MAP) probability of each translation is the sum over its derivations, so MAP decoding prefers the green translation.]
  Exact MAP decoding: $y^* = \arg\max_{y \in \mathrm{Trans}(x)} p(y \mid x) = \arg\max_{y \in \mathrm{Trans}(x)} \sum_{d \in \mathrm{D}(x,y)} p(y, d \mid x)$, where $x$ is the foreign sentence, $y$ an English sentence, and $d$ a derivation.

  6–13. Hypergraph as a search space. A hypergraph is a compact structure to encode exponentially many trees. [Figure: probabilistic hypergraph for the source sentence "dianzi shang de mao", built from SCFG rules such as S → ⟨X0, X0⟩; X → ⟨X0 de X1, X1 on X0⟩; X → ⟨X0 de X1, X0 's X1⟩; X → ⟨X0 de X1, X1 of X0⟩; X → ⟨X0 de X1, X0 X1⟩; X → ⟨dianzi shang, the mat⟩; X → ⟨mao, a cat⟩, with nodes covering spans such as "the mat", "a cat", "a … mat", and "the … cat".]
  The probabilistic hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an implicit distribution over strings, i.e. p(y | x).
  Exact MAP decoding: $y^* = \arg\max_{y \in \mathrm{HG}(x)} p(y \mid x) = \arg\max_{y \in \mathrm{HG}(x)} \sum_{d \in \mathrm{D}(x,y)} p(y, d \mid x)$. The sum ranges over exponentially many derivations, and the problem is NP-hard (Sima'an 1996).
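To make the "exponentially many trees" point concrete, here is a minimal Python sketch of a weighted hypergraph. The class names (`Node`, `Hyperedge`) and the rule weights are invented for illustration, not taken from the talk; the sketch enumerates complete derivations explicitly, which is exactly the blow-up a real decoder has to avoid.

```python
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Hyperedge:
    rule: str            # e.g. "X -> <X0 de X1, X1 of X0>"
    tails: list          # antecedent Nodes
    weight: float        # rule probability (invented here)

@dataclass
class Node:
    label: str
    incoming: list = field(default_factory=list)   # Hyperedges that derive this node

def derivations(node):
    """Enumerate (derivation, probability) pairs for a node by expanding every
    incoming hyperedge; the number of results can grow exponentially with the
    depth of the hypergraph, which is why exact MAP decoding over strings is hard."""
    for edge in node.incoming:
        for combo in product(*(list(derivations(t)) for t in edge.tails)):
            tree = "(" + edge.rule + "".join(" " + t for t, _ in combo) + ")"
            prob = edge.weight
            for _, p in combo:
                prob *= p
            yield tree, prob

# Toy fragment of the "dianzi shang de mao" example; rule weights are invented.
mat = Node("the mat", [Hyperedge("X -> <dianzi shang, the mat>", [], 1.0)])
cat = Node("a cat",   [Hyperedge("X -> <mao, a cat>", [], 1.0)])
goal = Node("S", [
    Hyperedge("X -> <X0 de X1, X1 of X0>", [mat, cat], 0.4),
    Hyperedge("X -> <X0 de X1, X0 's X1>", [mat, cat], 0.3),
    Hyperedge("X -> <X0 de X1, X1 on X0>", [mat, cat], 0.3),
])

for tree, prob in derivations(goal):
    print(f"{prob:.2f}  {tree}")
```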

  14. Decoding with spurious ambiguity? • Maximum a posteriori (MAP) decoding • Viterbi approximation • N-best approximation ("crunching", May and Knight 2006)

  15–19. Viterbi Approximation. [Figure: the same example; next to the MAP string probabilities (red 0.28, blue 0.28, green 0.44), each translation's Viterbi score is the probability of its single best derivation: red 0.16, blue 0.14, green 0.13. Viterbi decoding therefore picks the red translation.]
  Viterbi approximation: $y^* = \arg\max_{y \in \mathrm{Trans}(x)} \max_{d \in \mathrm{D}(x,y)} p(y, d \mid x) = Y\bigl(\arg\max_{d \in \mathrm{D}(x)} p(y, d \mid x)\bigr)$, where $Y(\cdot)$ reads the target string off a derivation.
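Why is the Viterbi approximation tractable when exact MAP is not? Because the max decomposes over hyperedges and can be computed bottom-up with one value per node. A hedged sketch follows; it reuses the toy `Node`/`Hyperedge` classes and the `goal` node from the block above and is not the actual decoder code.

```python
def viterbi(node, memo=None):
    """Best derivation for a node, computed bottom-up with one cached value per
    node, so the cost is linear in the number of hyperedges -- in contrast to a
    string's total probability, which sums over exponentially many trees."""
    if memo is None:
        memo = {}
    if id(node) in memo:
        return memo[id(node)]
    best_tree, best_prob = None, float("-inf")
    for edge in node.incoming:
        prob, parts = edge.weight, []
        for tail in edge.tails:
            t, p = viterbi(tail, memo)
            prob *= p
            parts.append(t)
        if prob > best_prob:
            best_prob = prob
            best_tree = "(" + edge.rule + "".join(" " + t for t in parts) + ")"
    memo[id(node)] = (best_tree, best_prob)
    return memo[id(node)]

print(viterbi(goal))   # the single most probable derivation and its probability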

  20–24. N-best Approximation. [Figure: the same example with a "4-best crunching" column: summing only over derivations that fall in the global 4-best derivation list gives red 0.16, blue 0.28, green 0.13, so 4-best crunching picks the blue translation.]
  N-best approximation (crunching, May and Knight 2006): $y^* = \arg\max_{y \in \mathrm{Trans}(x)} \sum_{d \in \mathrm{D}(x,y) \cap \mathrm{ND}(x)} p(y, d \mid x)$, where $\mathrm{ND}(x)$ is the set of the N most probable derivations of $x$.

  25–28. MAP vs. Approximations.

  translation        | MAP (string) | Viterbi | 4-best crunching
  red translation    | 0.28         | 0.16    | 0.16
  blue translation   | 0.28         | 0.14    | 0.28
  green translation  | 0.44         | 0.13    | 0.13

  • Exact MAP decoding under spurious ambiguity is intractable.
  • Viterbi and crunching are efficient, but ignore most derivations.
  • Our goal: develop an approximation that considers all the derivations but still allows tractable decoding. (A numeric check of the table above follows this list.)
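As a sanity check, the three decoding rules can be replayed on the eight derivation probabilities of the running example. The grouping of derivations under the red/blue/green translations is the reconstruction used above, not something spelled out in the transcript.

```python
# Derivation probabilities grouped by the translation string they yield
# (grouping reconstructed from the MAP / Viterbi / crunching columns above).
derivs = {
    "red translation":   [0.16, 0.12],
    "blue translation":  [0.14, 0.14],
    "green translation": [0.13, 0.11, 0.10, 0.10],
}

# Exact MAP: score a string by the total probability of all its derivations.
map_choice = max(derivs, key=lambda y: sum(derivs[y]))          # green, 0.44

# Viterbi: take the string of the single most probable derivation.
viterbi_choice = max(derivs, key=lambda y: max(derivs[y]))      # red, 0.16

# 4-best crunching: sum only over derivations in the global 4-best list.
all_derivs = sorted(((p, y) for y, ps in derivs.items() for p in ps), reverse=True)
nbest = all_derivs[:4]                                          # 0.16, 0.14, 0.14, 0.13
crunch_choice = max(derivs, key=lambda y: sum(p for p, s in nbest if s == y))  # blue, 0.28

print(map_choice, viterbi_choice, crunch_choice)
```

The three rules genuinely disagree on this example, which is the point of the slide.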

  29–31. Variational Decoding: decoding using a variational approximation, i.e., decoding using a sentence-specific approximate distribution.

  32–41. Variational Decoding for MT: an Overview. Sentence-specific decoding, in three steps. Step 1: generate a hypergraph. The SMT system takes the foreign sentence x and produces a hypergraph that encodes p(y, d | x) and, implicitly, p(y | x). [Figure: the "dianzi shang de mao" hypergraph again.] MAP decoding under p is intractable.

  42–50. The three steps, as shown in the sketch after this entry:
  1. Generate a hypergraph, which encodes p(y, d | x).
  2. Estimate a model from the hypergraph: $q^*(y \mid x) \approx \sum_{d \in \mathrm{D}(x,y)} p(y, d \mid x)$; q* is an n-gram model over output strings.
  3. Decode using q* on the hypergraph.
  [Figure: the same hypergraph shown for each of the three steps.]
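A schematic of the pipeline in code, reduced to its core idea: q* is a sentence-specific n-gram model whose conditional probabilities are ratios of expected n-gram counts under p(y, d | x). The helpers `expected_ngram_counts` and `candidate_strings` are assumptions for this sketch (in the paper the expectations come from inside-outside style dynamic programming over the hypergraph, and the final decoding step is itself run on the hypergraph rather than over an explicit candidate list), so treat this as an illustration of what the steps mean, not the authors' implementation.

```python
import math
from collections import defaultdict

def estimate_q(hypergraph, n=2):
    """Step 2 (sketch): fit the sentence-specific n-gram model q*(y | x).
    `expected_ngram_counts` is an assumed helper returning, for each history h,
    the expected count of every following word w under p(y, d | x)."""
    counts = expected_ngram_counts(hypergraph, n)       # {history: {word: E[count]}}
    q = defaultdict(dict)
    for history, nexts in counts.items():
        total = sum(nexts.values())
        for word, c in nexts.items():
            q[history][word] = c / total                # q*(w | h)
    return q

def decode_with_q(hypergraph, q, n=2):
    """Step 3 (sketch): score candidate strings under q* and return the best.
    `candidate_strings` is another assumed helper (e.g. unique k-best strings)."""
    best, best_score = None, float("-inf")
    for y in candidate_strings(hypergraph):
        words = ["<s>"] * (n - 1) + y.split() + ["</s>"]
        score = 0.0
        for i in range(n - 1, len(words)):
            history = tuple(words[i - n + 1:i])
            score += math.log(q[history].get(words[i], 1e-12))
        if score > best_score:
            best, best_score = y, score
    return best
```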

  51–62. Variational Inference.
  • We want to do inference under p, but it is intractable: $y^* = \arg\max_y p(y \mid x)$.
  • Instead, we derive a simpler distribution q*: $q^* = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q)$.
  • Then, we use q* as a surrogate for p in inference: $y^* = \arg\max_y q^*(y \mid x)$.
  [Figure: p lies in the space P of all distributions; q* is its projection onto the tractable family Q.]

  63. Variational Approximation: q* is the approximation having minimum distance to p, $q^* = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q)$, where Q is a family of distributions.
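For completeness, the KL objective has a convenient form when Q is an n-gram family. The reconstruction below uses my own notation and is stated from memory of the paper rather than from these slides, so treat the details as approximate. Minimizing KL(p || q) over q is the same as maximizing the cross term:

```latex
q^{*} \;=\; \arg\min_{q \in Q} \mathrm{KL}\bigl(p(\cdot \mid x)\,\big\|\,q(\cdot \mid x)\bigr)
      \;=\; \arg\max_{q \in Q} \sum_{y} p(y \mid x)\,\log q(y \mid x)
```

If $q(y \mid x) = \prod_i q\bigl(y_i \mid y_{i-n+1}^{\,i-1}\bigr)$ is an n-gram model, the optimum simply normalizes expected n-gram counts taken under the hypergraph distribution:

```latex
q^{*}(w \mid h) \;=\;
\frac{\mathbb{E}_{p(y,d \mid x)}\bigl[\,c_{hw}(y)\,\bigr]}
     {\mathbb{E}_{p(y,d \mid x)}\bigl[\,c_{h}(y)\,\bigr]}
```

where $c_{hw}(y)$ counts occurrences of history $h$ followed by word $w$ in the string $y$.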
