Maximum A Posteriori (MAP) Decoding
Monday, August 17, 2009

Toy example for one foreign sentence x:

  derivation probabilities p(y, d | x): 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10

  translation string     MAP p(y | x)
  red translation        0.28
  blue translation       0.28
  green translation      0.44

• Exact MAP decoding
$$ y^* = \arg\max_{y \in \mathrm{Trans}(x)} p(y \mid x) = \arg\max_{y \in \mathrm{Trans}(x)} \sum_{d \in D(x,y)} p(y, d \mid x) $$
• x: Foreign sentence
• y: English sentence
• d: derivation
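As a worked illustration of the exact MAP rule on the toy example, the sketch below sums derivation probabilities per translation string and picks the largest total. The assignment of derivations to strings is an assumption inferred from the per-string totals (0.16 + 0.12 = 0.28, 0.14 + 0.14 = 0.28, 0.13 + 0.11 + 0.10 + 0.10 = 0.44), and the name `map_decode` is hypothetical.

```python
from collections import defaultdict

# (translation string, derivation probability p(y, d | x)).
# ASSUMPTION: this grouping is inferred from the per-string totals on the slide.
DERIVATIONS = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]


def map_decode(derivations):
    """Sum p(y, d | x) over all derivations of each string, then take the argmax."""
    totals = defaultdict(float)
    for y, p in derivations:
        totals[y] += p
    best = max(totals, key=totals.get)
    return best, dict(totals)


best, totals = map_decode(DERIVATIONS)
print(best, {y: round(p, 2) for y, p in totals.items()})
```

Enumerating every derivation works only because the toy example has eight of them; in a real system the set D(x, y) is far too large to list.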
Hypergraph as a search space

• A hypergraph is a compact structure to encode exponentially many trees.
• The probabilistic hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an implicit distribution over strings, i.e. p(y | x).
• Exact MAP decoding
$$ y^* = \arg\max_{y \in \mathrm{HG}(x)} p(y \mid x) = \arg\max_{y \in \mathrm{HG}(x)} \sum_{d \in D(x,y)} p(y, d \mid x) $$
  The search space has exponential size, and exact MAP decoding is NP-hard (Sima’an 1996).

[Figure: probabilistic hypergraph for the source sentence "dianzi0 shang1 de2 mao3".
  Nodes: S[0,4]; X[0,4] "a ··· mat"; X[0,4] "the ··· cat"; X[0,2] "the ··· mat"; X[3,4] "a ··· cat".
  Hyperedge rules:
    S → ⟨X0, X0⟩ (two hyperedges)
    X → ⟨X0 de X1, X1 on X0⟩
    X → ⟨X0 de X1, X0 ’s X1⟩
    X → ⟨X0 de X1, X1 of X0⟩
    X → ⟨X0 de X1, X0 X1⟩
    X → ⟨dianzi shang, the mat⟩
    X → ⟨mao, a cat⟩]
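To make the "compact structure" point concrete, here is a minimal sketch that stores the example hypergraph as a dictionary and counts its derivation trees bottom-up. The node names, the dictionary layout, and the attachment of each rule to a particular node are assumptions read off the figure, not a data structure defined in the talk.

```python
from functools import lru_cache

# node -> list of incoming hyperedges; each hyperedge is (rule label, tail nodes).
# ASSUMPTION: which rule derives which node is inferred from the node yields.
LEAF_MAT = "X[0,2] the mat"
LEAF_CAT = "X[3,4] a cat"
HYPERGRAPH = {
    LEAF_MAT: [("X -> <dianzi shang, the mat>", ())],
    LEAF_CAT: [("X -> <mao, a cat>", ())],
    "X[0,4] a..mat": [
        ("X -> <X0 de X1, X1 on X0>", (LEAF_MAT, LEAF_CAT)),
        ("X -> <X0 de X1, X1 of X0>", (LEAF_MAT, LEAF_CAT)),
    ],
    "X[0,4] the..cat": [
        ("X -> <X0 de X1, X0 's X1>", (LEAF_MAT, LEAF_CAT)),
        ("X -> <X0 de X1, X0 X1>", (LEAF_MAT, LEAF_CAT)),
    ],
    "S[0,4]": [
        ("S -> <X0, X0>", ("X[0,4] a..mat",)),
        ("S -> <X0, X0>", ("X[0,4] the..cat",)),
    ],
}


@lru_cache(maxsize=None)
def count_derivations(node):
    """Number of derivation trees rooted at `node` (sum over edges, product over tails)."""
    total = 0
    for _rule, tails in HYPERGRAPH[node]:
        ways = 1
        for tail in tails:
            ways *= count_derivations(tail)
        total += ways
    return total


print(count_derivations("S[0,4]"))
```

Each node is visited once thanks to memoization, so the count is obtained in time proportional to the number of hyperedges even when the number of trees grows exponentially with sentence length.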
Decoding with spurious ambiguity?

• Maximum a posteriori (MAP) decoding
• Viterbi approximation
• N-best approximation (crunching) (May and Knight 2006)
Viterbi Approximation

  derivation probabilities p(y, d | x): 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10

  translation string     MAP     Viterbi
  red translation        0.28    0.16
  blue translation       0.28    0.14
  green translation      0.44    0.13

• Viterbi approximation
$$ y^* = \arg\max_{y \in \mathrm{Trans}(x)} \max_{d \in D(x,y)} p(y, d \mid x) = Y\Big(\arg\max_{d \in D(x)} p(y, d \mid x)\Big) $$
  where Y(d) is the translation string yielded by derivation d.
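The Viterbi rule can be sketched on the same toy numbers: take the single most probable derivation and output its string. As before, the grouping of derivations by string is an assumption inferred from the slide totals, and `viterbi_decode` is a hypothetical name.

```python
# (translation string, derivation probability p(y, d | x)).
# ASSUMPTION: grouping inferred from the MAP and Viterbi columns of the slide.
DERIVATIONS = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]


def viterbi_decode(derivations):
    """Return the string of the single best derivation and that derivation's probability."""
    return max(derivations, key=lambda pair: pair[1])


def best_derivation_per_string(derivations):
    """Viterbi score of each string: the max over its derivations."""
    scores = {}
    for y, p in derivations:
        scores[y] = max(p, scores.get(y, 0.0))
    return scores


viterbi_string, viterbi_prob = viterbi_decode(DERIVATIONS)
print(viterbi_string, viterbi_prob, best_derivation_per_string(DERIVATIONS))
```

On this example the Viterbi choice is the red translation, although the green translation has the highest total probability.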
N-best Approximation

  derivation probabilities p(y, d | x): 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10

  translation string     MAP     Viterbi   4-best crunching
  red translation        0.28    0.16      0.16
  blue translation       0.28    0.14      0.28
  green translation      0.44    0.13      0.13

• N-best approximation (crunching) (May and Knight 2006)
$$ y^* = \arg\max_{y \in \mathrm{Trans}(x)} \sum_{d \in D(x,y) \cap \mathrm{ND}(x)} p(y, d \mid x) $$
  where ND(x) is the set of the N most probable derivations of x.
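Crunching can be sketched the same way: keep only the N most probable derivations, then sum within each string. The grouping of derivations by string is again an assumption inferred from the slide totals, and `crunch` is a hypothetical helper name.

```python
import heapq
from collections import defaultdict

# (translation string, derivation probability p(y, d | x)).
# ASSUMPTION: grouping inferred from the per-string columns of the slide.
DERIVATIONS = [
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]


def crunch(derivations, n):
    """Sum probabilities per string over the n best derivations only."""
    top_n = heapq.nlargest(n, derivations, key=lambda pair: pair[1])
    totals = defaultdict(float)
    for y, p in top_n:
        totals[y] += p
    best = max(totals, key=totals.get)
    return best, dict(totals)


best_4, totals_4 = crunch(DERIVATIONS, 4)
best_all, _ = crunch(DERIVATIONS, len(DERIVATIONS))
print(best_4, {y: round(p, 2) for y, p in totals_4.items()}, best_all)
```

With N = 4 the blue translation wins; only when every derivation is included does crunching agree with exact MAP decoding.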
MAP vs. Approximations

  derivation probabilities p(y, d | x): 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10

  translation string     MAP     Viterbi   4-best crunching
  red translation        0.28    0.16      0.16
  blue translation       0.28    0.14      0.28
  green translation      0.44    0.13      0.13

• Exact MAP decoding under spurious ambiguity is intractable
• Viterbi and crunching are efficient, but ignore most derivations
• Our goal: develop an approximation that considers all the derivations but still allows tractable decoding
Variational Decoding

• Decoding using variational approximation
• Decoding using a sentence-specific approximate distribution
Variational Decoding for MT: an Overview

Sentence-specific decoding. MAP decoding under p is intractable.

Three steps:

1. Generate a hypergraph. The SMT system maps the foreign sentence x to a hypergraph that defines p(y, d | x) and, implicitly, p(y | x).
2. Estimate a model from the hypergraph. q* is an n-gram model over output strings:
$$ q^*(y \mid x) \approx \sum_{d \in D(x,y)} p(y, d \mid x) $$
3. Decode using q* on the hypergraph.

[Figure: each step is drawn on the hypergraph for "dianzi0 shang1 de2 mao3" from "Hypergraph as a search space".]
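Steps 2 and 3 can be sketched on a tiny scale. The code below fits a bigram model from expected bigram counts under p and then rescores the candidate strings with it. Several things here are assumptions made for illustration: the hypergraph is replaced by an explicit list of strings with made-up posterior probabilities, the model order is fixed to bigrams, the estimator is the ratio of expected counts, and all function names are hypothetical. A real system would collect the expected counts from the hypergraph itself rather than by enumeration.

```python
from collections import defaultdict

BOS, EOS = "<s>", "</s>"

# ASSUMPTION: a hypothetical posterior p(y | x) standing in for the hypergraph.
POSTERIOR = {
    "a cat on the mat": 0.4,
    "the mat 's a cat": 0.3,
    "the mat a cat": 0.2,
    "a cat of the mat": 0.1,
}


def bigrams(sentence):
    """Bigrams of a sentence padded with start and end symbols."""
    words = [BOS] + sentence.split() + [EOS]
    return list(zip(words, words[1:]))


def estimate_bigram_model(posterior):
    """q(w | h) = expected count of (h, w) under p / expected count of h under p."""
    pair_count = defaultdict(float)
    hist_count = defaultdict(float)
    for y, p in posterior.items():
        for h, w in bigrams(y):
            pair_count[(h, w)] += p
            hist_count[h] += p
    return {hw: c / hist_count[hw[0]] for hw, c in pair_count.items()}


def score(q, sentence):
    """Probability of a sentence under the bigram model q."""
    prob = 1.0
    for hw in bigrams(sentence):
        prob *= q.get(hw, 0.0)
    return prob


q_star = estimate_bigram_model(POSTERIOR)          # step 2
best = max(POSTERIOR, key=lambda y: score(q_star, y))  # step 3
print(best, score(q_star, best))
```

Because q* factors over n-grams, scoring a string needs only local lookups, which is what makes decoding under q* tractable where decoding under p is not.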
Variational Inference

• We want to do inference under p, but it is intractable:
$$ y^* = \arg\max_{y} p(y \mid x) $$
• Instead, we derive a simpler distribution q*:
$$ q^* = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q) $$
• Then, we will use q* as a surrogate for p in inference:
$$ y^* = \arg\max_{y} q^*(y \mid x) $$

[Figure: a space of distributions P containing p, and a family Q containing q*.]
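The two-stage recipe (pick the member of Q closest to p, then decode under it) can be illustrated with a deliberately small family. In this sketch p is the string distribution from the toy example, while the three candidate distributions in `FAMILY` and all names are hypothetical; a real family Q would be parametric rather than a hand-written list.

```python
import math

# p(y | x) from the toy example.
P = {"red": 0.28, "blue": 0.28, "green": 0.44}

# ASSUMPTION: a hypothetical family Q with three hand-picked members.
FAMILY = {
    "uniform": {"red": 1 / 3, "blue": 1 / 3, "green": 1 / 3},
    "green_heavy": {"red": 0.25, "blue": 0.25, "green": 0.50},
    "red_heavy": {"red": 0.50, "blue": 0.25, "green": 0.25},
}


def kl(p, q):
    """KL(p || q) for distributions over the same finite support."""
    return sum(p[y] * math.log(p[y] / q[y]) for y in p if p[y] > 0)


# q* = argmin over the family of KL(p || q).
q_star_name = min(FAMILY, key=lambda name: kl(P, FAMILY[name]))
q_star = FAMILY[q_star_name]

# Decode under the surrogate q* instead of p.
y_star = max(q_star, key=q_star.get)
print(q_star_name, y_star)
```

Here the closest member of the family happens to preserve the argmax of p; in general that is not guaranteed, which is why the choice of family matters.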
Variational Approximation

• q*: an approximation having minimum distance to p
$$ q^* = \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q) $$
  where Q is a family of distributions.
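Expanding the KL divergence shows what the minimization asks of q. This is the standard identity, not a step shown on the slide: the entropy of p does not depend on q, so minimizing the divergence is the same as maximizing the expected log-probability that q assigns to strings drawn from p.

```latex
\begin{align*}
\mathrm{KL}(p \,\|\, q)
  &= \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} \\
  &= -H(p) \;-\; \sum_{y} p(y \mid x) \log q(y \mid x),
  \qquad H(p) = -\sum_{y} p(y \mid x) \log p(y \mid x), \\
q^* &= \arg\min_{q \in Q} \mathrm{KL}(p \,\|\, q)
   \;=\; \arg\max_{q \in Q} \sum_{y} p(y \mid x) \log q(y \mid x).
\end{align*}
```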