Stochastic Beams and Where to Find Them: The Gumb… - PowerPoint PPT Presentation

Wouter Kool, Herke van Hoof, Max Welling - “This is how you randomize a beam search!” - “You will never have duplicate samples again!”


  1. Wouter Kool, Herke van Hoof, Max Welling. “This is how you randomize a beam search!” “You will never have duplicate samples again!” STOCHASTIC BEAMS AND WHERE TO FIND THEM: The Gumbel-Top-$k$ Trick for Sampling Sequences Without Replacement. ICML 2019 BEST PAPER HONORABLE MENTION

  2. TL;DR: Stochastic Beam Search finds a set of unique samples (without replacement) from a sequence model.

  3. Example: Binarese language model. Vocabulary: {Abra, Cadabra}. What if we want a sample from our model? [Figure: tree of sequences with levels A, C; AA, AC, CA, CC; AAA, AAC, ACA, ACC, CAA, CAC, CCA, CCC, annotated with (log-)probabilities $P(C)$, $P(A \mid C)$, $P(C \mid CA)$.] $P(C)\,P(A \mid C)\,P(C \mid CA) = P(CAC)$

  4. “Prof. Gumbeldore”: The Gumbel-Max Trick (Gumbel, 1945; Maddison et al., 2014). Log-probability $\phi_i = \log p_i$ + Gumbel noise $G_i \sim \mathrm{Gumbel}(0)$ = perturbed log-probability $G_{\phi_i} \sim \mathrm{Gumbel}(\phi_i)$.

  5. “Prof. Gumbeldore”: The Gumbel-Max Trick (Gumbel, 1945; Maddison et al., 2014). $\max_i G_{\phi_i} \sim \mathrm{Gumbel}\left(\log \sum_i \exp \phi_i\right)$ and $I = \arg\max_i G_{\phi_i} \sim \mathrm{Categorical}(p_i)$, so $P(I = i) = p_i$. The max and the argmax are independent.

  6. Example: Binarese language model. Vocabulary: {Abra, Cadabra}. What if we want a sample from our model? [Figure: the sequence tree (A, C; AA, AC, CA, CC; AAA, AAC, ACA, ACC, CAA, CAC, CCA, CCC) with (log-)probabilities.] This will be our sample!

  7. What happens if, instead of 1 (one), we take the $k$ largest elements (top $k$)? With $k = 3$: $I_1, \dots, I_k = \underset{i}{\operatorname{arg\,top}\,k}\; G_{\phi_i}$

  8. The ‘Gumbel-Top-$k$’ Trick: with $I_1, \dots, I_k = \underset{i}{\operatorname{arg\,top}\,k}\; G_{\phi_i}$ (Vieira, 2014), $P(I_1 = i_1, \dots, I_k = i_k) = p_{i_1} \cdot \frac{p_{i_2}}{1 - p_{i_1}} \cdot \ldots \cdot \frac{p_{i_k}}{1 - \sum_{\ell=1}^{k-1} p_{i_\ell}} = \prod_{j=1}^{k} \frac{p_{i_j}}{1 - \sum_{\ell=1}^{j-1} p_{i_\ell}}$. Also known as Plackett-Luce. This is equivalent to repeated sampling without replacement!

  9. Example: Binarese language model. Vocabulary: {Abra, Cadabra}. We can get a set of unique samples from our model! [Figure: the sequence tree (A, C; AA, AC, CA, CC; AAA, AAC, ACA, ACC, CAA, CAC, CCA, CCC) with (log-)probabilities.] This will be our set of samples!

  10. PROBLEM: In general, constructing the full tree is not possible… but we don’t have to!

  11. Perturbed log-probability of a partial sequence (“C”): look at the maximum of the perturbed log-probabilities in its subtree, $G_{\phi_S} = \max_{i \in S} G_{\phi_i} \sim \mathrm{Gumbel}\left(\log \sum_{i \in S} \exp \phi_i\right)$, where $\phi_S$ = log-probability of “C”. We can sample $G_{\phi_S} \sim \mathrm{Gumbel}(\phi_S)$ directly. The noise $G_S \sim \mathrm{Gumbel}(0)$ is inferred. [Figure: the sequence tree (A, C; AA, AC, CA, CC; AAA, AAC, ACA, ACC, CAA, CAC, CCA, CCC).]

  12. Top-down sampling (Maddison et al., 2014): Start from the root, sample $G_{\phi_S} \sim \mathrm{Gumbel}(\phi_S)$. Sample the children $\tilde{G}_{\phi_{S'}}$ conditionally on $\max_{S' \in \mathrm{Children}(S)} \tilde{G}_{\phi_{S'}} = G_{\phi_S}$: (1) sample $G_{\phi_{S'}}$ independently and compute $Z = \max_{S'} G_{\phi_{S'}}$; (2) ‘shift’ the Gumbels in (negative) exponential space: $\tilde{G}_{\phi_{S'}} = -\log\left(\exp(-G_{\phi_S}) - \exp(-Z) + \exp(-G_{\phi_{S'}})\right)$. … the result is equivalent to sampling $G_{\phi_i}$ for leaves directly! [Figure: the sequence tree (A, C; AA, AC, CA, CC; AAA, AAC, ACA, ACC, CAA, CAC, CCA, CCC).]

  13. The Key Insight: We only need to expand the top $k$ nodes at each level in the tree. Each top-$k$ node generates (at least) one leaf (its maximum) above the threshold, so at least $k$ leaves will be above the threshold. Other nodes only generate leaves below the threshold: no need to expand them. [Figure: the sequence tree with a threshold line.]

  14. Stochastic Beam Search: We only need to expand the top $k$ nodes at each level in the tree. This is a beam search. Top $k$ according to perturbed log-probability = Gumbel-Top-$k$ trick: sampling (without replacement). [Figure: the sequence tree with the expanded top-$k$ nodes.]

  15. Stochastic Beam Search • A beam search that samples the nodes to expand • But… samples children conditionally on parent (Important!) • The result is a sample without replacement from the full sequence model • Is a generalization of ancestral sampling ($k = 1$)

  16. Experiments

  17. Translation Diversity • Generate $k$ translations • Plot BLEU against diversity • Vary softmax temperature • Compare: Beam Search, Stochastic Beam Search, Sampling, Diverse Beam Search (Vijayakumar et al., 2018)

  18. BLEU Score Estimation • Estimate expected sentence-level BLEU • Plot mean and 95% interval vs. number of samples • Compare: Monte Carlo Sampling; Stochastic Beam Search with (normalized) Importance Weighted estimator; Beam Search with deterministic estimate

  19. Wouter Kool, Herke van Hoof, Max Welling. STOCHASTIC BEAMS AND WHERE TO FIND THE POSTER? #41
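
The procedure on slides 7 to 15 (perturb log-probabilities with Gumbel noise, keep the top $k$, and sample children conditionally on their parent) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy model `toy_next_log_probs`, its probabilities, and all function names are made up for this example, and the shift formula from slide 12 is applied directly without the numerical safeguards a real implementation would likely need.

```python
import math
import random


def sample_gumbel(loc, rng):
    """Draw one sample from Gumbel(loc) by inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc - math.log(-math.log(u))


def gumbel_top_k(log_probs, k, rng):
    """Gumbel-Top-k trick: indices of the k largest perturbed log-probabilities."""
    perturbed = [sample_gumbel(phi, rng) for phi in log_probs]
    order = sorted(range(len(log_probs)), key=lambda i: perturbed[i], reverse=True)
    return order[:k]


def toy_next_log_probs(prefix):
    """Hypothetical 'Binarese' model over tokens A and C (made-up numbers)."""
    p_a = 0.3 if prefix.endswith("A") else 0.6
    return {"A": math.log(p_a), "C": math.log(1.0 - p_a)}


def stochastic_beam_search(next_log_probs, k, length, rng):
    """Return up to k unique sequences as (sequence, log-prob, perturbed log-prob)."""
    # Root: empty sequence with log-probability 0 and G ~ Gumbel(0).
    beam = [("", 0.0, sample_gumbel(0.0, rng))]
    for _ in range(length):
        candidates = []
        for seq, phi, g_parent in beam:
            children = [(seq + tok, phi + lp) for tok, lp in next_log_probs(seq).items()]
            # Step 1: sample the children independently and take their maximum Z.
            raw = [sample_gumbel(child_phi, rng) for _, child_phi in children]
            z = max(raw)
            # Step 2: shift so that the maximum over the children equals g_parent.
            for (child, child_phi), g_child in zip(children, raw):
                shifted = -math.log(
                    math.exp(-g_parent) + (math.exp(-g_child) - math.exp(-z))
                )
                candidates.append((child, child_phi, shifted))
        # Keep the top k by perturbed log-probability (the beam step).
        candidates.sort(key=lambda c: c[2], reverse=True)
        beam = candidates[:k]
    return beam


if __name__ == "__main__":
    rng = random.Random(0)
    for seq, phi, g in stochastic_beam_search(toy_next_log_probs, k=3, length=3, rng=rng):
        print(seq, round(math.exp(phi), 3), round(g, 3))
```

With `k=1` the sketch behaves like ancestral sampling, as slide 15 notes, and with `k` equal to the number of leaves it returns every sequence exactly once.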
