A Hierarchical Bayesian Language Model Based on Pitman-Yor Processes
Author: Yee Whye Teh, 2006
Reviewer: Xueqing Liu
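Dirichlet Process (CRP) Recap
Model the sequence of words as a sequence of customers entering a restaurant: $x_1, x_2, \ldots$
Model the vocabulary as a sequence of tables (dishes): $y_1, y_2, \ldots$
With $t$ occupied tables and $c_\cdot = \sum_{k=1}^{t} c_k$ customers already seated, the next customer sits at table $y_k$ with probability proportional to the number of customers at that table, $\frac{c_k}{\theta + c_\cdot}$, or chooses a new table with probability $\frac{\theta}{\theta + c_\cdot}$.
Example seating: tables $y_1, y_2, y_3$ with $c_1 = 3$, $c_2 = 2$, $c_3 = 1$.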
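The seating rule on this slide can be checked numerically. The following is a minimal sketch, assuming the table counts 3, 2, 1 from the slide's picture and an arbitrary concentration value of 1.0; the function name `crp_seating_probs` and the parameter name `theta` are illustrative choices, not part of the original slides.

```python
def crp_seating_probs(counts, theta):
    """Seating probabilities for the next customer in a Chinese restaurant process.

    counts: customers currently at each occupied table (c_k).
    theta:  concentration parameter (assumed > 0).
    Returns (list of probabilities for the existing tables, probability of a new table).
    """
    total = sum(counts)                      # c. = sum_k c_k
    denom = theta + total
    existing = [c / denom for c in counts]   # c_k / (theta + c.)
    new_table = theta / denom                # theta / (theta + c.)
    return existing, new_table


# Table counts taken from the slide's picture; theta = 1.0 is an arbitrary choice.
existing, new_table = crp_seating_probs([3, 2, 1], theta=1.0)
print(existing, new_table)
```

The probabilities over all existing tables plus the new table sum to one, which is a quick sanity check on the rule.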
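Pitman-Yor Process: a generalization
$G \sim \mathrm{PY}(d, \theta, G_0)$, with discount parameter $d$ and strength parameter $\theta$. $G$: customer sequence; $G_0$: table sequence.
Probability that the next customer sits at table $y_k$: $\frac{c_k - d}{\theta + c_\cdot}$; probability of choosing a new table: $\frac{\theta + d\,t}{\theta + c_\cdot}$, where $t$ is the number of occupied tables.
Assumes a finite vocabulary (table set) $W$ of size $V$.
Power-law distribution ("rich get richer"): the number of unique words grows as a power of the number of words $T$, as $O(\theta T^{d})$, not exponentially.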
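To see the effect of the discount on the number of tables, the seating process can be simulated. This is a small sketch under stated assumptions: the function name `simulate_pitman_yor_tables`, the seed, and the parameter values are hypothetical, and the strength parameter is assumed to be strictly positive so the first customer always opens a table.

```python
import random


def simulate_pitman_yor_tables(num_customers, d, theta, seed=0):
    """Seat customers one by one with the Pitman-Yor rule; return customers per table.

    Assumes 0 <= d < 1 and theta > 0.
    """
    rng = random.Random(seed)
    counts = []  # counts[k] = number of customers at table k
    for n in range(num_customers):          # n customers are already seated
        t = len(counts)
        p_new = (theta + d * t) / (theta + n)
        if rng.random() < p_new:
            counts.append(1)                 # open a new table
        else:
            # join table k with probability proportional to c_k - d
            weights = [c - d for c in counts]
            k = rng.choices(range(t), weights=weights)[0]
            counts[k] += 1
    return counts


flat = simulate_pitman_yor_tables(2000, d=0.0, theta=1.0, seed=1)   # d = 0: plain CRP
heavy = simulate_pitman_yor_tables(2000, d=0.8, theta=1.0, seed=1)  # d > 0: power law
print(len(flat), len(heavy))
```

With `d = 0` the rule reduces to the Dirichlet process case, where the number of tables grows only logarithmically; a positive discount yields many more tables for the same number of customers.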
Hierarchical Pitman-Yor Language Model
$G_{\mathbf{u}} \sim \mathrm{PY}(d_{|\mathbf{u}|}, \theta_{|\mathbf{u}|}, G_{\pi(\mathbf{u})})$
Draw the customer sequence $G_{\mathbf{u}}$ from another customer sequence $G_{\pi(\mathbf{u})}$, where $\pi(\mathbf{u}) = u_2 u_3 \ldots u_{|\mathbf{u}|}$ is the context $\mathbf{u}$ with its earliest word dropped.
Consider $W = \{a, b, c\}$:
$G_{\emptyset} \sim \mathrm{PY}(d_0, \theta_0, G_0)$
$G_a, G_b, G_c \sim \mathrm{PY}(d_1, \theta_1, G_{\emptyset})$
$G_{aa}, G_{ba}, G_{ca} \sim \mathrm{PY}(d_2, \theta_2, G_a)$
…
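The back-off structure defined by the parent map can be made concrete with a few lines of code. This is an illustrative sketch, assuming contexts are represented as Python tuples of words; the names `parent` and `backoff_chain` are hypothetical.

```python
def parent(context):
    """pi(u): drop the earliest word of the context."""
    return context[1:]


def backoff_chain(context):
    """List the contexts visited when backing off from `context` to the empty context."""
    chain = [context]
    while context:
        context = parent(context)
        chain.append(context)
    return chain


chain = backoff_chain(("c", "a", "c"))
print(chain)  # [('c', 'a', 'c'), ('a', 'c'), ('c',), ()]
```

A context of length m uses the parameters with index m, so the chain above touches levels 3, 2, 1 and 0 before reaching the base distribution.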
Hierarchical CRP: an Example
$W = \{a, b, c\}$; context $\mathbf{u} = c\,a\,c$.
The sequence $x_{\mathbf{u}1}, x_{\mathbf{u}2}, x_{\mathbf{u}3}, \ldots$ is drawn from $G_{cac}$.
Restaurant hierarchy, each level drawing from its parent: $G_{cac} \rightarrow G_{ac} \rightarrow G_{c} \rightarrow G_{\emptyset} \rightarrow G_0$ (uniform).
First customer: every restaurant is empty, so the customer opens a new table at each level in turn, and $G_0$ returns the dish $a$.
Hierarchical CRP: an Example (second customer)
After the first customer, $x_{\mathbf{u}1} = a$ and each of $G_{cac}$, $G_{ac}$, $G_{c}$, $G_{\emptyset}$ holds one table serving $a$.
In $G_{cac}$ the second customer joins the existing table with probability $\frac{1 - d_3}{\theta_3 + 1}$ or opens a new table with probability $\frac{\theta_3 + d_3}{\theta_3 + 1}$.
A new table in $G_{cac}$ sends a customer to $G_{ac}$, who joins the existing table there with probability $\frac{1 - d_2}{\theta_2 + 1}$ or opens a new table with probability $\frac{\theta_2 + d_2}{\theta_2 + 1}$.
Outcome shown: a new table in $G_{cac}$ and the existing table in $G_{ac}$, so $x_{\mathbf{u}2} = a$. State: $G_{cac}$: $a\;a\;?$; $G_{ac}$: $a\;a$ (one table, two customers); $G_{c}$: $a$; $G_{\emptyset}$: $a$; $G_0$ (uniform): $a$.
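Hierarchical CRP: an Example (third customer)
$G_{cac}$ (two tables serving $a$, one customer each): join either table with probability $\frac{1 - d_3}{\theta_3 + 2}$ each; open a new table with probability $\frac{\theta_3 + 2 d_3}{\theta_3 + 2}$.
$G_{ac}$ (one table with two customers): existing table $\frac{2 - d_2}{\theta_2 + 2}$; new table $\frac{\theta_2 + d_2}{\theta_2 + 2}$.
$G_{c}$ (one table, one customer): existing table $\frac{1 - d_1}{\theta_1 + 1}$; new table $\frac{\theta_1 + d_1}{\theta_1 + 1}$.
$G_{\emptyset}$ (one table, one customer): existing table $\frac{1 - d_0}{\theta_0 + 1}$; new table $\frac{\theta_0 + d_0}{\theta_0 + 1}$.
$G_0$ (uniform) is reached only when a new table is opened at every level; here it returns $b$.
Outcome shown: $x_{\mathbf{u}3} = b$. State: $G_{cac}$: $a\;a\;b$; $G_{ac}$: $a\;a\;b$; $G_{c}$: $a\;b$; $G_{\emptyset}$: $a\;b$; $G_0$ (uniform): $a\;b$.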
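As a worked check on the example, the probability that the third customer receives $b$ follows by multiplying the new-table probabilities along the back-off path. The derivation below assumes the seating reached after the first two customers (two single-customer tables in $G_{cac}$, one two-customer table in $G_{ac}$, one single-customer table in each of $G_{c}$ and $G_{\emptyset}$, all serving $a$). Since no table serves $b$ at any level, the only route to $b$ is a new table everywhere, followed by a uniform draw over the three words:

```latex
\begin{align*}
P(x_{\mathbf{u}3} = b \mid \text{seating})
  &= \underbrace{\frac{\theta_3 + 2 d_3}{\theta_3 + 2}}_{\text{new table in } G_{cac}}
     \cdot
     \underbrace{\frac{\theta_2 + d_2}{\theta_2 + 2}}_{\text{new table in } G_{ac}}
     \cdot
     \underbrace{\frac{\theta_1 + d_1}{\theta_1 + 1}}_{\text{new table in } G_{c}}
     \cdot
     \underbrace{\frac{\theta_0 + d_0}{\theta_0 + 1}}_{\text{new table in } G_{\emptyset}}
     \cdot
     \underbrace{\frac{1}{3}}_{G_0 \text{ uniform over } W}
\end{align*}
```

For instance, with the purely illustrative values $d_m = 0.5$ and $\theta_m = 1$ at every level, the product is $\frac{2}{3} \cdot \frac{1}{2} \cdot \frac{3}{4} \cdot \frac{3}{4} \cdot \frac{1}{3} = \frac{1}{16}$.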
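Inference with Gibbs Sampling
The $G_{\mathbf{u}}$'s are marginalized out; the model is represented by the seating arrangements $S_{\mathbf{u}}$, with $\mathbf{S} = \{S_{\mathbf{v}}\}$ and $\Theta = \{\theta_m, d_m\}$.
$$p(w \mid \mathbf{u}, D) = \int p(w \mid \mathbf{u}, \mathbf{S}, \Theta)\, p(\mathbf{S}, \Theta \mid D)\, d(\mathbf{S}, \Theta)$$
Approximate the integral with $I$ samples:
$$p(w \mid \mathbf{u}, D) \approx \frac{1}{I} \sum_{i=1}^{I} p(w \mid \mathbf{u}, \mathbf{S}^{(i)}, \Theta^{(i)})$$
Recursively compute $p(w \mid \mathbf{u}, \mathbf{S}, \Theta)$:
$$p(w \mid \mathbf{u}, \mathbf{S}, \Theta) = \frac{c_{\mathbf{u}w\cdot} - d_{|\mathbf{u}|}\, t_{\mathbf{u}w\cdot}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot\cdot}} + \frac{\theta_{|\mathbf{u}|} + d_{|\mathbf{u}|}\, t_{\mathbf{u}\cdot\cdot}}{\theta_{|\mathbf{u}|} + c_{\mathbf{u}\cdot\cdot}}\; p(w \mid \pi(\mathbf{u}), \mathbf{S}, \Theta)$$
Here $c_{\mathbf{u}w\cdot}$ is the number of customers in restaurant $\mathbf{u}$ eating dish $w$, $t_{\mathbf{u}w\cdot}$ is the number of tables serving $w$, and $c_{\mathbf{u}\cdot\cdot}$, $t_{\mathbf{u}\cdot\cdot}$ are the totals over all dishes.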
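The recursion for the predictive probability is short enough to sketch directly. The code below is an illustration under assumptions, not the author's implementation: contexts are tuples of words, `customers[u][w]` and `tables[u][w]` are hypothetical dictionaries holding the customer and table counts per restaurant and dish, the base distribution is uniform over the vocabulary, and the parameter values are arbitrary. The counts correspond to the state of the three-word example after its first two customers.

```python
def predictive_prob(word, context, customers, tables, d, theta, vocab_size):
    """p(word | context, S, Theta), computed by backing off to shorter contexts.

    customers[u][w]: customers in restaurant u eating dish w.
    tables[u][w]:    tables in restaurant u serving dish w.
    d[m], theta[m]:  discount and strength for contexts of length m.
    """
    m = len(context)
    c_u = customers.get(context, {})
    t_u = tables.get(context, {})
    c_uw, t_uw = c_u.get(word, 0), t_u.get(word, 0)
    c_total, t_total = sum(c_u.values()), sum(t_u.values())

    if context:
        # parent restaurant: drop the earliest word of the context
        parent_p = predictive_prob(word, context[1:], customers, tables,
                                   d, theta, vocab_size)
    else:
        # parent of the empty context is the uniform base distribution
        parent_p = 1.0 / vocab_size

    denom = theta[m] + c_total
    return (c_uw - d[m] * t_uw) / denom + (theta[m] + d[m] * t_total) / denom * parent_p


# State after two customers in context (c, a, c), both eating "a".
customers = {("c", "a", "c"): {"a": 2}, ("a", "c"): {"a": 2}, ("c",): {"a": 1}, (): {"a": 1}}
tables = {("c", "a", "c"): {"a": 2}, ("a", "c"): {"a": 1}, ("c",): {"a": 1}, (): {"a": 1}}
d = [0.5, 0.5, 0.5, 0.5]        # arbitrary illustrative values
theta = [1.0, 1.0, 1.0, 1.0]    # arbitrary illustrative values

probs = {w: predictive_prob(w, ("c", "a", "c"), customers, tables, d, theta, 3)
         for w in ("a", "b", "c")}
print(probs)
```

Averaging this quantity over several sampled seating arrangements and parameter settings gives the Monte Carlo approximation of the integral.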
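Inference with Gibbs Sampling
Gibbs sampling: remove customer $l$ from restaurant $\mathbf{u}$ (counts with the superscript $-\mathbf{u}l$ exclude that customer) and resample its table index $k_{\mathbf{u}l}$, writing $\theta = \theta_{|\mathbf{u}|}$ and $d = d_{|\mathbf{u}|}$:
$$p(k_{\mathbf{u}l} = k \mid \mathbf{S}^{-\mathbf{u}l}, \Theta) \propto \frac{\max\left(0,\; c^{-\mathbf{u}l}_{\mathbf{u} x_{\mathbf{u}l} k} - d\right)}{\theta + c^{-\mathbf{u}l}_{\mathbf{u}\cdot\cdot}}$$
$$p(k_{\mathbf{u}l} = k^{\text{new}} \text{ with } y_{\mathbf{u}k^{\text{new}}} = x_{\mathbf{u}l} \mid \mathbf{S}^{-\mathbf{u}l}, \Theta) \propto \frac{\theta + d\, t^{-\mathbf{u}l}_{\mathbf{u}\cdot\cdot}}{\theta + c^{-\mathbf{u}l}_{\mathbf{u}\cdot\cdot}}\; p(x_{\mathbf{u}l} \mid \pi(\mathbf{u}), \mathbf{S}^{-\mathbf{u}l}, \Theta)$$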
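One resampling step for a single customer can be sketched as follows. This is a hedged illustration: the function names, argument layout, and numeric values are assumptions made for the example, and the parent probability is passed in as a number rather than computed from a full model.

```python
import random


def table_assignment_weights(counts_for_word, c_total, t_total, d, theta, parent_prob):
    """Unnormalized weights for re-seating one customer, who has already been removed.

    counts_for_word: customers at each table in this restaurant serving the customer's word.
    c_total, t_total: total customers and tables in this restaurant (customer removed).
    parent_prob: probability of the word under the parent context (customer removed).
    Returns (weights for the existing tables, weight for a new table).
    """
    denom = theta + c_total
    existing = [max(0.0, c - d) / denom for c in counts_for_word]
    new_table = (theta + d * t_total) / denom * parent_prob
    return existing, new_table


def sample_table(existing, new_table, rng):
    """Draw a table index; the index len(existing) stands for a new table."""
    weights = existing + [new_table]
    return rng.choices(range(len(weights)), weights=weights)[0]


# Illustrative numbers: two single-customer tables serve the word, d = 0.5, theta = 1.
existing, new_table = table_assignment_weights(
    [1, 1], c_total=2, t_total=2, d=0.5, theta=1.0, parent_prob=0.5)
choice = sample_table(existing, new_table, random.Random(0))
print(existing, new_table, choice)
```

When the sampled index denotes a new table, a full sampler would also send a customer to the parent restaurant; that bookkeeping is omitted here.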
Experimental Results
IKN: interpolated Kneser-Ney
MKN: modified Kneser-Ney
HPYLM: hierarchical Pitman-Yor language model, inferred with the Gibbs sampler
HPYCV: hierarchical Pitman-Yor language model with parameters obtained by cross-validation