CS11-747 Neural Networks for NLP Advanced Search Algorithms Graham Neubig https://phontron.com/class/nn4nlp2020/ (Some Slides by Daniel Clothiaux)
The Generation Problem • We have a model of P(Y|X), how do we use it to generate a sentence? • Two methods: • Sampling: Try to generate a random sentence according to the probability distribution. • Argmax: Try to generate the sentence with the highest probability.
Which to Use? • We want the best possible single output → Search • We want to observe multiple outputs according to the probability distribution → Sampling • We want to generate diverse outputs so that we are not boring → Sampling? Search?
Sampling
Ancestral Sampling • Randomly generate words one-by-one: while y_{j-1} != “</s>”: y_j ~ P(y_j | X, y_1, …, y_{j-1}) • An exact method for sampling from P(Y|X), no further work needed • Other, non-exact sampling methods are not an appropriate way of visualizing/understanding the underlying distribution
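A minimal sketch of ancestral sampling from the conditional model P(Y|X); `next_word_probs` is a hypothetical interface (not from the slides) assumed to return the model's distribution over the next word as a dict.

```python
import numpy as np

def ancestral_sample(X, next_word_probs, max_len=100):
    """Sample an output word-by-word from P(Y|X) (exact ancestral sampling).

    next_word_probs(X, prefix) is assumed to return a dict mapping each
    candidate next word to its probability P(y_j | X, y_1, ..., y_{j-1}).
    """
    prefix = []
    while len(prefix) < max_len and (not prefix or prefix[-1] != "</s>"):
        probs = next_word_probs(X, prefix)
        words = list(probs.keys())
        # Draw the next word according to the model's distribution
        y_j = np.random.choice(words, p=[probs[w] for w in words])
        prefix.append(str(y_j))
    return prefix
```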
Search Basics
<latexit sha1_base64="KTzU+V/eNKvP2hqb752FtwYbJj8=">ACQ3icbVBNaxsxFNQ6bZK6beKkx15ETbENxeyGQpNDwLSXHh2oExuvMVrtsy0saRfpbYjZbP5bLv0BvfUX9JDWnItRP6gtHYHBMPMPN7TRKkUFn3/u1faevJ0e2f3Wfn5i5d7+5WDw3ObZIZDhycyMd2IWZBCQwcFSuimBpiKJFxE09z/+ISjBWJ/oKzFAaKjbUYCc7QScNKP5wzHsFPaVhpmOXBMxDFDIGpxaOwhXmzIyV0EVxQ0ObqWHeqxW0Xe/VrsNukyAMYkpnPTuz3BjWKn6TX8BukmCFamSFdrDyrcwTnimQCOXzNp+4Kc4cNtRcAlFOcwspIxP2Rj6jmqmwA7yRQkFfeuUmI4S45GulD/nsiZsnamIpdUDCd23ZuL/P6GY6OB7nQaYag+XLRKJMUEzpvlMbCAEc5c4RxI9ytlE+YRxdm2VXQrD+5U3SOWqeNIOz9XWx1Ubu+Q1eUPqJCAfSIt8Jm3SIZzckh/knvz0vnp3i/vYRkteauZV+QfeL8fARbSsto=</latexit> <latexit sha1_base64="KTzU+V/eNKvP2hqb752FtwYbJj8=">ACQ3icbVBNaxsxFNQ6bZK6beKkx15ETbENxeyGQpNDwLSXHh2oExuvMVrtsy0saRfpbYjZbP5bLv0BvfUX9JDWnItRP6gtHYHBMPMPN7TRKkUFn3/u1faevJ0e2f3Wfn5i5d7+5WDw3ObZIZDhycyMd2IWZBCQwcFSuimBpiKJFxE09z/+ISjBWJ/oKzFAaKjbUYCc7QScNKP5wzHsFPaVhpmOXBMxDFDIGpxaOwhXmzIyV0EVxQ0ObqWHeqxW0Xe/VrsNukyAMYkpnPTuz3BjWKn6TX8BukmCFamSFdrDyrcwTnimQCOXzNp+4Kc4cNtRcAlFOcwspIxP2Rj6jmqmwA7yRQkFfeuUmI4S45GulD/nsiZsnamIpdUDCd23ZuL/P6GY6OB7nQaYag+XLRKJMUEzpvlMbCAEc5c4RxI9ytlE+YRxdm2VXQrD+5U3SOWqeNIOz9XWx1Ubu+Q1eUPqJCAfSIt8Jm3SIZzckh/knvz0vnp3i/vYRkteauZV+QfeL8fARbSsto=</latexit> <latexit sha1_base64="KTzU+V/eNKvP2hqb752FtwYbJj8=">ACQ3icbVBNaxsxFNQ6bZK6beKkx15ETbENxeyGQpNDwLSXHh2oExuvMVrtsy0saRfpbYjZbP5bLv0BvfUX9JDWnItRP6gtHYHBMPMPN7TRKkUFn3/u1faevJ0e2f3Wfn5i5d7+5WDw3ObZIZDhycyMd2IWZBCQwcFSuimBpiKJFxE09z/+ISjBWJ/oKzFAaKjbUYCc7QScNKP5wzHsFPaVhpmOXBMxDFDIGpxaOwhXmzIyV0EVxQ0ObqWHeqxW0Xe/VrsNukyAMYkpnPTuz3BjWKn6TX8BukmCFamSFdrDyrcwTnimQCOXzNp+4Kc4cNtRcAlFOcwspIxP2Rj6jmqmwA7yRQkFfeuUmI4S45GulD/nsiZsnamIpdUDCd23ZuL/P6GY6OB7nQaYag+XLRKJMUEzpvlMbCAEc5c4RxI9ytlE+YRxdm2VXQrD+5U3SOWqeNIOz9XWx1Ubu+Q1eUPqJCAfSIt8Jm3SIZzckh/knvz0vnp3i/vYRkteauZV+QfeL8fARbSsto=</latexit> <latexit sha1_base64="daJTM+c0MGzbLqAwsNEmTltAuIU=">ACJnicbVDLSgMxFM34tr6qLt0Ei6CbMiOCulBENy4rWK10SslkbtgJjMkd6RlHL/Gjb/iRvCBuPNTGsRXwcCh3PO5eaeIJHCoOu+OSOjY+MTk1PThZnZufmF4uLSmYlTzaHKYxnrWsAMSKGgigIl1BINLAoknAeXR3/Aq0EbE6xV4CjYi1lWgJztBKzeK+32GYXeR0j/qpCm0SMPNRyBCsmlsKXcyYbkesm+c3lfUvj17T2kazWHL7gD0L/GpESGqDSLj34Y8zQChVwyY+qem2DLkDBJeQFPzWQMH7J2lC3VLEITCMb3JnTNauEtBVr+xTSgfp9ImORMb0osMmIYcf89vrif149xdZOIxMqSREU/1zUSiXFmPZLo6HQwFH2LGFcC/tXyjtM462sItwft98l9S3Szvlr2TrdLB4bCNKbJCVsk68cg2OSDHpEKqhJNbck+eyLNz5zw4L87rZ3TEGc4skx9w3j8AzM6nUg=</latexit> <latexit sha1_base64="daJTM+c0MGzbLqAwsNEmTltAuIU=">ACJnicbVDLSgMxFM34tr6qLt0Ei6CbMiOCulBENy4rWK10SslkbtgJjMkd6RlHL/Gjb/iRvCBuPNTGsRXwcCh3PO5eaeIJHCoOu+OSOjY+MTk1PThZnZufmF4uLSmYlTzaHKYxnrWsAMSKGgigIl1BINLAoknAeXR3/Aq0EbE6xV4CjYi1lWgJztBKzeK+32GYXeR0j/qpCm0SMPNRyBCsmlsKXcyYbkesm+c3lfUvj17T2kazWHL7gD0L/GpESGqDSLj34Y8zQChVwyY+qem2DLkDBJeQFPzWQMH7J2lC3VLEITCMb3JnTNauEtBVr+xTSgfp9ImORMb0osMmIYcf89vrif149xdZOIxMqSREU/1zUSiXFmPZLo6HQwFH2LGFcC/tXyjtM462sItwft98l9S3Szvlr2TrdLB4bCNKbJCVsk68cg2OSDHpEKqhJNbck+eyLNz5zw4L87rZ3TEGc4skx9w3j8AzM6nUg=</latexit> <latexit sha1_base64="daJTM+c0MGzbLqAwsNEmTltAuIU=">ACJnicbVDLSgMxFM34tr6qLt0Ei6CbMiOCulBENy4rWK10SslkbtgJjMkd6RlHL/Gjb/iRvCBuPNTGsRXwcCh3PO5eaeIJHCoOu+OSOjY+MTk1PThZnZufmF4uLSmYlTzaHKYxnrWsAMSKGgigIl1BINLAoknAeXR3/Aq0EbE6xV4CjYi1lWgJztBKzeK+32GYXeR0j/qpCm0SMPNRyBCsmlsKXcyYbkesm+c3lfUvj17T2kazWHL7gD0L/GpESGqDSLj34Y8zQChVwyY+qem2DLkDBJeQFPzWQMH7J2lC3VLEITCMb3JnTNauEtBVr+xTSgfp9ImORMb0osMmIYcf89vrif149xdZOIxMqSREU/1zUSiXFmPZLo6HQwFH2LGFcC/tXyjtM462sItwft98l9S3Szvlr2TrdLB4bCNKbJCVsk68cg2OSDHpEKqhJNbck+eyLNz5zw4L87rZ3TEGc4skx9w3j8AzM6nUg=</latexit> <latexit 
sha1_base64="pazaO1OUOgQ/R/MsnOhbEaj7I3Q=">ACMHicbVDLSgNBEJz1GeMr6tHLYBAiSNgVQT0IQS8eFYwPsiHMznaSwdnZaZXDMv6SV78E/HiQcWrX+HkgWi0YKC6upqeriCRwqDrvjgTk1PTM7OFueL8wuLScml9cLEqeZQ57GM9VXADEihoI4CJVwlGlgUSLgMbo7/ctb0EbE6hx7CTQj1lGiLThDK7VKJ36XYXad0Pqpyq0TsDMRyFDsGpuKdxhxnQnEirP74claB3rvHK9Tb+dW61S2a26A9C/xBuRMhnhtFV68sOYpxEo5JIZ0/DcBJt2FQouIS/6qYGE8RvWgYalikVgmtng4pxuWiWk7Vjbp5AO1J8TGYuM6UWBdUYMu2a81xf/6zVSbO83M6GSFEHx4aJ2KinGtB8fDYUGjrJnCeNa2L9S3mWacbTRFW0I3vjJf0l9p3pQ9c52y7WjURoFsk42SIV4ZI/UyAk5JXCyQN5Jq/kzXl0Xpx352NonXBGM2vkF5zPLylArDg=</latexit> <latexit sha1_base64="pazaO1OUOgQ/R/MsnOhbEaj7I3Q=">ACMHicbVDLSgNBEJz1GeMr6tHLYBAiSNgVQT0IQS8eFYwPsiHMznaSwdnZaZXDMv6SV78E/HiQcWrX+HkgWi0YKC6upqeriCRwqDrvjgTk1PTM7OFueL8wuLScml9cLEqeZQ57GM9VXADEihoI4CJVwlGlgUSLgMbo7/ctb0EbE6hx7CTQj1lGiLThDK7VKJ36XYXad0Pqpyq0TsDMRyFDsGpuKdxhxnQnEirP74claB3rvHK9Tb+dW61S2a26A9C/xBuRMhnhtFV68sOYpxEo5JIZ0/DcBJt2FQouIS/6qYGE8RvWgYalikVgmtng4pxuWiWk7Vjbp5AO1J8TGYuM6UWBdUYMu2a81xf/6zVSbO83M6GSFEHx4aJ2KinGtB8fDYUGjrJnCeNa2L9S3mWacbTRFW0I3vjJf0l9p3pQ9c52y7WjURoFsk42SIV4ZI/UyAk5JXCyQN5Jq/kzXl0Xpx352NonXBGM2vkF5zPLylArDg=</latexit> <latexit sha1_base64="pazaO1OUOgQ/R/MsnOhbEaj7I3Q=">ACMHicbVDLSgNBEJz1GeMr6tHLYBAiSNgVQT0IQS8eFYwPsiHMznaSwdnZaZXDMv6SV78E/HiQcWrX+HkgWi0YKC6upqeriCRwqDrvjgTk1PTM7OFueL8wuLScml9cLEqeZQ57GM9VXADEihoI4CJVwlGlgUSLgMbo7/ctb0EbE6hx7CTQj1lGiLThDK7VKJ36XYXad0Pqpyq0TsDMRyFDsGpuKdxhxnQnEirP74claB3rvHK9Tb+dW61S2a26A9C/xBuRMhnhtFV68sOYpxEo5JIZ0/DcBJt2FQouIS/6qYGE8RvWgYalikVgmtng4pxuWiWk7Vjbp5AO1J8TGYuM6UWBdUYMu2a81xf/6zVSbO83M6GSFEHx4aJ2KinGtB8fDYUGjrJnCeNa2L9S3mWacbTRFW0I3vjJf0l9p3pQ9c52y7WjURoFsk42SIV4ZI/UyAk5JXCyQN5Jq/kzXl0Xpx352NonXBGM2vkF5zPLylArDg=</latexit> Why do we Search? • We want to find the best output • What is "best"? • The most accurate output ˆ error( Y, ˜ Y = argmin Y ) ˜ Y → impossible! we don't know the reference • The most probable output according to the model ˆ P ( ˜ Y = argmax Y | X ) ˜ Y → simple, but not necessarily tied to accuracy • The output with the lowest Bayes risk ˆ P ( Y 0 | X )error( Y 0 , ˜ X Y = argmin Y ) ˜ Y Y 0 → which output looks like it has the lowest error?
Search Errors, Model Errors (example from Neubig (2015)) • Search error: the search algorithm fails to find an output that optimizes its search criterion • Model error: the output that optimizes the search criterion does not optimize accuracy
Searching Probable Outputs
Greedy Search • One by one, pick the single highest-probability word: while y_{j-1} != “</s>”: y_j = argmax P(y_j | X, y_1, …, y_{j-1}) • Not exact, and has real problems: • Will often generate the “easy” words first • Will prefer multiple common words to one rare word
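The same generation loop with the argmax decision, a minimal greedy-search sketch reusing the hypothetical `next_word_probs` interface from above:

```python
def greedy_search(X, next_word_probs, max_len=100):
    """Pick the single most probable next word at each step (not exact search)."""
    prefix = []
    while len(prefix) < max_len and (not prefix or prefix[-1] != "</s>"):
        probs = next_word_probs(X, prefix)     # P(y_j | X, y_1, ..., y_{j-1})
        y_j = max(probs, key=probs.get)        # argmax over the next word only
        prefix.append(y_j)
    return prefix
```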
Why will this Help?
Next word    P(next word)
Pittsburgh   0.4
New York     0.3
New Jersey   0.25
Other        0.05
Decoding word by word, greedy search picks “New” first (combined mass 0.55) even though “Pittsburgh” (0.4) is the single most probable continuation; keeping multiple paths in the search avoids this.
Beam Search • Instead of picking only the single highest-probability/score word, maintain multiple paths • At each time step: • Expand each path • Choose a subset of paths from the expanded set
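A minimal beam-search sketch with histogram pruning (keep the k best paths), again assuming the hypothetical `next_word_probs` interface; hypotheses are scored by summed log probability.

```python
import math

def beam_search(X, next_word_probs, beam_size=5, max_len=100):
    """Keep the `beam_size` highest-scoring partial hypotheses at each step."""
    beam = [(0.0, [])]              # (log probability, words generated so far)
    finished = []
    for _ in range(max_len):
        expanded = []
        for score, prefix in beam:
            if prefix and prefix[-1] == "</s>":   # hypothesis is complete
                finished.append((score, prefix))
                continue
            probs = next_word_probs(X, prefix)
            for word, p in probs.items():
                expanded.append((score + math.log(p), prefix + [word]))
        if not expanded:
            break
        # Histogram pruning: keep exactly the k best expanded hypotheses
        beam = sorted(expanded, key=lambda h: h[0], reverse=True)[:beam_size]
    finished.extend(beam)
    return max(finished, key=lambda h: h[0])[1]
```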
Basic Pruning Methods (Steinbiss et al. 1994) • How to select which paths to keep expanding? • Histogram Pruning: keep exactly k hypotheses at every time step • Score Threshold Pruning: keep all hypotheses whose score is within a threshold α of the best score s_1, i.e. keep hypothesis n if s_n + α > s_1 • Probability Mass Pruning: keep hypotheses, best first, until their cumulative probability mass reaches α
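A sketch of the three pruning criteria applied to a list of (log-probability, hypothesis) pairs; the function and argument names are illustrative, not from Steinbiss et al. (1994).

```python
import math

def prune(hypotheses, k=None, alpha_score=None, alpha_mass=None):
    """Prune a list of (log_prob, hypothesis) pairs, best-first.

    k           : histogram pruning -- keep exactly the top k
    alpha_score : score-threshold pruning -- keep n if s_n + alpha > s_1
    alpha_mass  : probability-mass pruning -- keep until cumulative mass >= alpha
    """
    hyps = sorted(hypotheses, key=lambda h: h[0], reverse=True)
    if k is not None:
        hyps = hyps[:k]
    if alpha_score is not None:
        best = hyps[0][0]
        hyps = [h for h in hyps if h[0] + alpha_score > best]
    if alpha_mass is not None:
        kept, mass = [], 0.0
        for h in hyps:
            kept.append(h)
            mass += math.exp(h[0])
            if mass >= alpha_mass:
                break
        hyps = kept
    return hyps
```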
What beam size should I use? • Larger beam sizes will be slower • May not give better results due to model errors • Sometimes result in shorter sequences • May favor high-frequency words • Mostly decided empirically → experiment (for histogram pruning, somewhere in the range of 5-100?)
Problems w/ Disparate Search Difficulty • Sometimes the output needs to cover specific content, some of it easy and some of it hard: for “I saw the escarpment”, “watashi” (I) and “mita” (saw) are easy, but “escarpment” is hard (dangai? zeppeki? kyushamen? iwa?) • This can cause the search algorithm to select the easy thing first, then the hard thing later: “watashi wa dangai wo mita” (I saw the escarpment) vs. “watashi ga mita dangai” (the escarpment I saw)
Future Cost • Also predict how hard it will be to process the as-yet-unprocessed words, and search for the maximum of the sum f(n) = g(n) + h(n) • g(n): cost to the current point • h(n): estimated cost to the goal • See Koehn (2010, Chapter 6), or Li et al. (2017) for a neural approximation
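A sketch of ranking partial hypotheses by f(n) = g(n) + h(n); `estimate_future_cost` is a hypothetical stand-in for whatever heuristic or learned estimate of the remaining cost is available (e.g., in the spirit of the neural approximation of Li et al. 2017).

```python
def rank_with_future_cost(hypotheses, X, estimate_future_cost):
    """Rank partial hypotheses by f(n) = g(n) + h(n).

    g: log probability accumulated so far (cost to the current point)
    h: estimated log probability of completing the hypothesis
       (estimate_future_cost(X, prefix) is a hypothetical heuristic)
    """
    scored = []
    for g, prefix in hypotheses:
        h = estimate_future_cost(X, prefix)
        scored.append((g + h, g, prefix))
    # Best f-score first; use this ordering when pruning the beam
    return sorted(scored, key=lambda s: s[0], reverse=True)
```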
Search and Problems with Modeling
Better Search can Hurt Results! (Koehn and Knowles 2017) • Better search (=better model score) can result in worse BLEU score! • Why? Model errors!
How to Fix Model Errors? • Train the model to maximize accuracy/minimize risk (best! covered previously) • Change the decision rule to minimize risk (best!) • Heuristically modify the model score post-hoc (OK) • Hobble the search algorithm so it makes more search errors, but the kind of errors you want (meh)
Minimum Bayes Risk Decoding