low rank ensembles


Language Modeling with Power Low Rank Ensembles. Eric Xing, Ankur Parikh, Avneesh Saluja, Chris Dyer. Overview • Model: Framework for language modeling using ensembles of low rank matrices and tensors • Relations: Includes


  1. Problem: Low Rank Methods Operate at Fixed Granularity • If rank is too small… [figure: low rank approximation] (Sections: Introduction, Background, Rank, Power, Ensembles, Experiments)

  2. Problem: Low Rank Methods Operate at Fixed Granularity • If rank is too small… [figure: low rank approximation] • Example: (break, spring)

  3. Problem: Low Rank Methods Operate at Fixed Granularity • If rank is too small… [figure: low rank approximation] • Example: (break, spring) • Probability gets diluted since “break” has many synonyms

  4. Problem: Low Rank Methods Operate at Fixed Granularity • If rank is too large… [figure: low rank approximation]

  5. Problem: Low Rank Methods Operate at Fixed Granularity • If rank is too large… [figure: low rank approximation] • Example: (domicile, dilapidated)

  6. Problem: Low Rank Methods Operate at Fixed Granularity • If rank is too large… [figure: low rank approximation] • Example: (domicile, dilapidated) • Probabilities of rare words are a problem, since the representation is too fine grained

  7. Our Approach

  8. Our Approach • Construct ensembles of low rank matrices/tensors to model language at multiple granularities

  9. Our Approach • Construct ensembles of low rank matrices/tensors to model language at multiple granularities • Includes existing n-gram techniques as special cases: Absolute discounting, Jelinek Mercer (deleted-interpolation), Kneser Ney

  10. Our Approach • Construct ensembles of low rank matrices/tensors to model language at multiple granularities • Includes existing n-gram techniques as special cases: Absolute discounting, Jelinek Mercer (deleted-interpolation), Kneser Ney • Preserves advantages of standard n-gram approaches: effective for short context lengths, fast evaluation at test time

  11. Outline • Introduction • Background on Kneser Ney smoothing • Our Approach: Rank, Power, Constructing the Ensemble • Experiments

  12. Kneser Ney - Intuition • Lower order distribution should be altered

  13. Kneser Ney - Intuition • Lower order distribution should be altered • Consider two words, York and door • York follows very few words, e.g. “New York” • Door can follow many words, e.g. “the door”, “red door”, “my door” • $P(w_i = \text{door} \mid \text{backed-off on } w_{i-1}) > P(w_i = \text{York} \mid \text{backed-off on } w_{i-1})$


  15. Kneser Ney Unigram Distribution • Diversity of $w_i$'s history: $N_{-}(w_i) = |\{w : c(w_i, w) > 0\}|$

  16. Kneser Ney Unigram Distribution • Diversity of $w_i$'s history: $N_{-}(w_i) = |\{w : c(w_i, w) > 0\}|$ • $P_{\text{kn-uni}}(w_i) = \frac{N_{-}(w_i)}{\sum_{w} N_{-}(w)}$
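
The diversity-based unigram can be sketched in a few lines of Python. The toy counts and the helper name `kn_unigram` are assumptions for illustration, not taken from the slides; counts are keyed as `(w_i, w_prev)` to match $c(w_i, w)$.

```python
from collections import defaultdict

def kn_unigram(bigram_counts):
    """Return P_kn-uni(w) = N_-(w) / sum_w' N_-(w') from {(w_i, w_prev): count}."""
    histories = defaultdict(set)
    for (word, prev), count in bigram_counts.items():
        if count > 0:
            histories[word].add(prev)  # distinct histories that precede `word`
    diversity = {word: len(prevs) for word, prevs in histories.items()}  # N_-(w)
    total = sum(diversity.values())
    return {word: n / total for word, n in diversity.items()}

# Hypothetical counts: "York" is frequent but only ever follows "New".
counts = {
    ("York", "New"): 10,
    ("door", "the"): 2,
    ("door", "red"): 1,
    ("door", "my"): 1,
}
p_kn_uni = kn_unigram(counts)
print(p_kn_uni)  # "door" gets 3/4 and "York" 1/4, despite York's higher raw count
```

Only the number of distinct histories matters here, so the raw count of 10 for "New York" has no effect on the result.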

  17. Discounting

  18. Discounting • $P_d(w_i \mid w_{i-1}) = \frac{\max(c(w_i, w_{i-1}) - d,\ 0)}{\sum_{w} c(w, w_{i-1})}$

  19. Discounting • $P_d(w_i \mid w_{i-1}) = \frac{\max(c(w_i, w_{i-1}) - d,\ 0)}{\sum_{w} c(w, w_{i-1})}$ • $P_{\text{kney}}(w_i \mid w_{i-1}) = P_d(w_i \mid w_{i-1}) + \gamma(w_{i-1})\, P_{\text{kn-uni}}(w_i)$

  20. Discounting • $P_d(w_i \mid w_{i-1}) = \frac{\max(c(w_i, w_{i-1}) - d,\ 0)}{\sum_{w} c(w, w_{i-1})}$ • $P_{\text{kney}}(w_i \mid w_{i-1}) = P_d(w_i \mid w_{i-1}) + \gamma(w_{i-1})\, P_{\text{kn-uni}}(w_i)$ • where $\gamma(w_{i-1})$ is the leftover probability
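
A minimal sketch of the discounting step, assuming toy counts and a discount of 0.5. The function names and the stand-in lower order distribution `lower` are hypothetical; $\gamma$ is computed literally as the mass left over after discounting.

```python
def discounted(counts, prev, d=0.5):
    """P_d(w | prev) for every w seen after prev, plus the leftover mass gamma(prev)."""
    after = {w: c for (w, p), c in counts.items() if p == prev}
    total = sum(after.values())
    p_d = {w: max(c - d, 0) / total for w, c in after.items()}
    gamma = 1.0 - sum(p_d.values())  # probability removed by discounting
    return p_d, gamma

def interpolate(p_d, gamma, lower, w):
    """P_kney(w | prev) = P_d(w | prev) + gamma(prev) * lower(w)."""
    return p_d.get(w, 0.0) + gamma * lower[w]

# Hypothetical counts c(w_i, w_prev) for the single history "the".
counts = {("door", "the"): 2, ("car", "the"): 3}
p_d, gamma = discounted(counts, "the")
lower = {"door": 0.75, "car": 0.25}  # stand-in for P_kn-uni
probs = {w: interpolate(p_d, gamma, lower, w) for w in lower}
print(gamma, probs)
```

Because `gamma` is exactly the mass removed from the discounted counts, the interpolated values sum to one whenever `lower` does.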

  21. Lower Order Marginal Aligns! • $P(w_i) = \sum_{w_{i-1}} P_{\text{kney}}(w_i \mid w_{i-1})\, P(w_{i-1})$
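
The marginal constraint can be checked numerically. The sketch below uses exact fractions on hypothetical counts and assumes the standard closed form $\gamma(w_{i-1}) = d \cdot |\{w : c(w, w_{i-1}) > 0\}| / \sum_w c(w, w_{i-1})$, which holds when every observed count is at least $d$. All variable names are illustrative.

```python
from fractions import Fraction

# Hypothetical counts c(w_i, w_prev); d is an assumed discount.
counts = {("York", "New"): 10, ("door", "the"): 2, ("door", "red"): 1,
          ("door", "my"): 1, ("car", "the"): 3}
d = Fraction(1, 2)

words = sorted({w for w, _ in counts})
prevs = sorted({p for _, p in counts})
total = sum(counts.values())
c_word = {w: sum(c for (x, _), c in counts.items() if x == w) for w in words}
c_prev = {p: sum(c for (_, q), c in counts.items() if q == p) for p in prevs}
n_minus = {w: sum(1 for (x, _) in counts if x == w) for w in words}    # distinct histories
n_follow = {p: sum(1 for (_, q) in counts if q == p) for p in prevs}   # distinct continuations
n_types = len(counts)                                                  # distinct bigram types

def p_kney(w, p):
    """Interpolated Kneser Ney bigram probability P_kney(w | p)."""
    discounted = max(counts.get((w, p), 0) - d, Fraction(0)) / c_prev[p]
    gamma = d * n_follow[p] / c_prev[p]
    return discounted + gamma * Fraction(n_minus[w], n_types)

# Marginalize the smoothed bigram over the MLE distribution of histories.
marginal = {w: sum(p_kney(w, p) * Fraction(c_prev[p], total) for p in prevs)
            for w in words}
mle = {w: Fraction(c_word[w], total) for w in words}
print(marginal == mle)  # True
```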

  22. Generalizing KN to PLRE • Kneser Ney • Power Low Rank Ensembles

  23. Generalizing KN to PLRE • Kneser Ney: (1) ensemble composed of unsmoothed n-grams • Power Low Rank Ensembles

  24. Generalizing KN to PLRE • Kneser Ney: (1) ensemble composed of unsmoothed n-grams; (2) alter lower order distributions by using count of unique histories • Power Low Rank Ensembles

  25. Generalizing KN to PLRE • Kneser Ney: (1) ensemble composed of unsmoothed n-grams; (2) alter lower order distributions by using count of unique histories; (3) use absolute discounting to interpolate different n-grams and preserve lower order marginal constraint • Power Low Rank Ensembles

  26. Generalizing KN to PLRE • Kneser Ney: (1) ensemble composed of unsmoothed n-grams; (2) alter lower order distributions by using count of unique histories; (3) use absolute discounting to interpolate different n-grams and preserve lower order marginal constraint • Power Low Rank Ensembles: (1) ensemble composed of ?; (2) alter lower order distributions by ?; (3) use absolute discounting to ?


  28. In General, Bigram is Full Rank

  29. Independence = Rank 1 • If $w_i$ and $w_{i-1}$ are independent: $P(w_i, w_{i-1}) = P(w_i)\, P(w_{i-1})$


  31. Independence = Rank 1 • If $w_i$ and $w_{i-1}$ are independent: $P(w_i, w_{i-1}) = P(w_i)\, P(w_{i-1})$ • Example: $P(\text{house}, \text{old}) = P(\text{old})\, P(\text{house})$

  32. Independence = Rank 1 • If $w_i$ and $w_{i-1}$ are independent: $P(w_i, w_{i-1}) = P(w_i)\, P(w_{i-1})$ • Example: $P(\text{house}, \text{old}) = P(\text{old})\, P(\text{house})$ • But what if $w_i$ and $w_{i-1}$ are not independent? What does the best rank 1 approximation give?
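
To illustrate why independence means rank 1: an independent joint is an outer product of two marginals, so every 2x2 minor vanishes. The probabilities below are hypothetical, and `is_rank_one` is an illustrative helper rather than anything from the slides.

```python
from itertools import combinations

def is_rank_one(matrix, tol=1e-12):
    """True if a nonzero matrix has every 2x2 minor equal to zero (rank 1)."""
    rows, cols = range(len(matrix)), range(len(matrix[0]))
    minors_vanish = all(
        abs(matrix[i][k] * matrix[j][l] - matrix[i][l] * matrix[j][k]) <= tol
        for i, j in combinations(rows, 2)
        for k, l in combinations(cols, 2)
    )
    nonzero = any(abs(x) > tol for row in matrix for x in row)
    return minors_vanish and nonzero

p_word = [0.5, 0.3, 0.2]  # hypothetical P(w_i)
p_prev = [0.6, 0.4]       # hypothetical P(w_{i-1})
independent = [[a * b for b in p_prev] for a in p_word]  # outer product
dependent = [[0.4, 0.0], [0.0, 0.4], [0.1, 0.1]]         # rows are not proportional

print(is_rank_one(independent), is_rank_one(dependent))  # True False
```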

  33. Rank • Let $B$ be the matrix such that $B(w_i, w_{i-1}) = c(w_i, w_{i-1})$ • Let $M_1 = \arg\min_{M :\, M \ge 0,\ \operatorname{rank}(M) = 1} \|B - M\|_{KL}$, where $\|\cdot\|_{KL}$ is the generalized KL divergence [Lee and Seung 2001] • Then $M_1(w_i, w_{i-1}) \propto P(w_i)\, P(w_{i-1})$

  34. Rank • MLE unigram is normalized rank 1 approx. of MLE bigram under KL: $P(w_i) = \frac{M_1(w_i, w_{i-1})}{\sum_{w} M_1(w, w_{i-1})}$

  35. Rank • MLE unigram is normalized rank 1 approx. of MLE bigram under KL: $P(w_i) = \frac{M_1(w_i, w_{i-1})}{\sum_{w} M_1(w, w_{i-1})}$ • Vary rank to obtain quantities between bigram and unigram

  36. Rank • MLE unigram is normalized rank 1 approx. of MLE bigram under KL: $P(w_i) = \frac{M_1(w_i, w_{i-1})}{\sum_{w} M_1(w, w_{i-1})}$ • Vary rank to obtain quantities between bigram and unigram [figure: spectrum from full rank to rank 1]

  37. Rank • MLE unigram is normalized rank 1 approx. of MLE bigram under KL: $P(w_i) = \frac{M_1(w_i, w_{i-1})}{\sum_{w} M_1(w, w_{i-1})}$ • Vary rank to obtain quantities between bigram and unigram [figure: spectrum from full rank through low rank to rank 1]
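
A small numerical check of the rank 1 claim, as a sketch: it runs the multiplicative updates of Lee and Seung for the generalized KL objective with a single factor, on a toy count matrix (rows index $w_i$, columns index $w_{i-1}$). The matrix values and the function name `rank1_kl` are assumptions for illustration, and the code requires every row and column of the input to have a positive sum.

```python
def rank1_kl(B, iters=5):
    """Rank 1 fit M = u v^T of nonnegative B under generalized KL (multiplicative updates)."""
    n, m = len(B), len(B[0])
    u, v = [1.0] * n, [1.0] * m  # strictly positive starting point
    for _ in range(iters):
        # Update u with v fixed, then v with the new u fixed.
        u = [u[i] * sum(B[i][j] / (u[i] * v[j]) * v[j] for j in range(m)) / sum(v)
             for i in range(n)]
        v = [v[j] * sum(u[i] * B[i][j] / (u[i] * v[j]) for i in range(n)) / sum(u)
             for j in range(m)]
    return [[u[i] * v[j] for j in range(m)] for i in range(n)]

B = [[1, 2, 1], [0, 5, 0], [0, 0, 2]]  # hypothetical counts c(w_i, w_{i-1})
M1 = rank1_kl(B)

rows = [sum(r) for r in B]        # c(w_i)
cols = [sum(c) for c in zip(*B)]  # c(w_{i-1})
total = sum(rows)
closed_form = [[rows[i] * cols[j] / total for j in range(3)] for i in range(3)]

# Normalizing any column of M1 over w_i recovers the MLE unigram.
unigram = [M1[i][0] / sum(M1[k][0] for k in range(3)) for i in range(3)]
print(unigram)
```

For a single factor the updates settle on the product of row and column sums divided by the total, which is why the normalized result does not depend on the chosen history column.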

  38. Generalizing KN to PLRE • Kneser Ney: (1) ensemble composed of unsmoothed n-grams; (2) alter lower order distributions by using count of unique histories; (3) use absolute discounting to interpolate different n-grams and preserve lower order marginal constraint • Power Low Rank Ensembles: (1) ensemble composed of unsmoothed n-grams plus other low rank matrices/tensors; (2) alter lower order distributions by ?; (3) use absolute discounting to ?


  40. Consider Elementwise Power

  41. Consider Elementwise Power • $B = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$

  42. Consider Elementwise Power • $B = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$, row sums $(4, 5, 2)$


  44. Consider Elementwise Power • $B = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$, row sums $(4, 5, 2)$ • $B^{0.5} = \begin{bmatrix} 1 & 1.4 & 1 \\ 0 & 2.2 & 0 \\ 0 & 0 & 1.4 \end{bmatrix}$

  45. Consider Elementwise Power • $B = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$, row sums $(4, 5, 2)$ • $B^{0.5} = \begin{bmatrix} 1 & 1.4 & 1 \\ 0 & 2.2 & 0 \\ 0 & 0 & 1.4 \end{bmatrix}$, row sums $(3.4, 2.2, 1.4)$


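  47. Consider Elementwise Power • $B = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$, row sums $(4, 5, 2)$ • $B^{0.5} = \begin{bmatrix} 1 & 1.4 & 1 \\ 0 & 2.2 & 0 \\ 0 & 0 & 1.4 \end{bmatrix}$, row sums $(3.4, 2.2, 1.4)$ • $B^{0} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

  48. Consider Elementwise Power • $B = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$, row sums $(4, 5, 2)$ • $B^{0.5} = \begin{bmatrix} 1 & 1.4 & 1 \\ 0 & 2.2 & 0 \\ 0 & 0 & 1.4 \end{bmatrix}$, row sums $(3.4, 2.2, 1.4)$ • $B^{0} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$, row sums $(3, 1, 1)$

  49. Consider Elementwise Power • $B = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 5 & 0 \\ 0 & 0 & 2 \end{bmatrix}$, row sums $(4, 5, 2)$ • $B^{0.5} = \begin{bmatrix} 1 & 1.4 & 1 \\ 0 & 2.2 & 0 \\ 0 & 0 & 1.4 \end{bmatrix}$, row sums $(3.4, 2.2, 1.4)$ • $B^{0} = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$, row sums $(3, 1, 1)$ • Lower power puts more emphasis on diversity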
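
The row sums for powers 1, 0.5 and 0 can be reproduced with a short sketch. The helper name `elementwise_power` is hypothetical, and zeros are kept at zero explicitly because Python evaluates `0 ** 0` as 1.

```python
def elementwise_power(B, p):
    """Raise each positive entry to the power p; zero entries stay zero."""
    return [[float(x) ** p if x > 0 else 0.0 for x in row] for row in B]

B = [[1, 2, 1], [0, 5, 0], [0, 0, 2]]

row_sums = {}
for p in (1, 0.5, 0):
    row_sums[p] = [round(sum(row), 1) for row in elementwise_power(B, p)]
    print(p, row_sums[p])
# 1 [4.0, 5.0, 2.0]
# 0.5 [3.4, 2.2, 1.4]
# 0 [3.0, 1.0, 1.0]
```

At power 0 each row sum is simply the number of nonzero entries in that row, which is the count of distinct histories.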

  50. Consider Elementwise Power

  51. Consider Elementwise Power • $M_1^{0} = \arg\min_{M :\, M \ge 0,\ \operatorname{rank}(M) = 1} \|B^{0} - M\|_{KL}$

  52. Consider Elementwise Power • $M_1^{0} = \arg\min_{M :\, M \ge 0,\ \operatorname{rank}(M) = 1} \|B^{0} - M\|_{KL}$ • $P_{\text{kn-uni}}(w_i) = \frac{M_1^{0}(w_i, w_{i-1})}{\sum_{w} M_1^{0}(w, w_{i-1})}$

  53. Consider Elementwise Power • $M_1^{0} = \arg\min_{M :\, M \ge 0,\ \operatorname{rank}(M) = 1} \|B^{0} - M\|_{KL}$ • $P_{\text{kn-uni}}(w_i) = \frac{M_1^{0}(w_i, w_{i-1})}{\sum_{w} M_1^{0}(w, w_{i-1})}$ [figure: rank/power diagram with one point labeled power = 1, full rank]

  54. Consider Elementwise Power • $M_1^{0} = \arg\min_{M :\, M \ge 0,\ \operatorname{rank}(M) = 1} \|B^{0} - M\|_{KL}$ • $P_{\text{kn-uni}}(w_i) = \frac{M_1^{0}(w_i, w_{i-1})}{\sum_{w} M_1^{0}(w, w_{i-1})}$ [figure: rank/power diagram with points labeled (power = 1, full rank) and (power = 0, full rank); axis: power]

  55. Consider Elementwise Power • $M_1^{0} = \arg\min_{M :\, M \ge 0,\ \operatorname{rank}(M) = 1} \|B^{0} - M\|_{KL}$ • $P_{\text{kn-uni}}(w_i) = \frac{M_1^{0}(w_i, w_{i-1})}{\sum_{w} M_1^{0}(w, w_{i-1})}$ [figure: rank/power diagram with points labeled (power = 1, full rank), (power = 0, full rank) and (power = 0, rank = 1); axes: power, low rank]
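
As a sketch of the power 0, rank 1 corner: the code below takes a toy count matrix to the elementwise power 0, forms its rank 1 fit as the product of row and column sums divided by the total (the product-of-marginals form that a rank 1 generalized KL fit takes), and normalizes one history column. The matrix and variable names are assumptions for illustration.

```python
B = [[1, 2, 1], [0, 5, 0], [0, 0, 2]]  # rows: w_i, columns: w_{i-1} (hypothetical counts)
B0 = [[1.0 if x > 0 else 0.0 for x in row] for row in B]  # elementwise power 0

row = [sum(r) for r in B0]        # N_-(w_i): number of distinct histories of w_i
col = [sum(c) for c in zip(*B0)]  # number of distinct continuations of each history
total = sum(row)
M = [[row[i] * col[j] / total for j in range(3)] for i in range(3)]  # rank 1 fit of B0

j = 0  # any history column gives the same normalized result
p_kn_uni = [M[i][j] / sum(M[k][j] for k in range(3)) for i in range(3)]
diversity = [n / total for n in row]  # N_-(w_i) / sum_w N_-(w)
print(p_kn_uni, diversity)
```

Both lists come out as roughly 0.6, 0.2, 0.2, matching the diversity-based definition of the Kneser Ney unigram.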

  56. Varying Rank and Power • Construct matrices of varying rank and power [figure: rank/power diagram with corners labeled power = 1, power = 0, full rank, rank = 1]
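
A minimal sketch of the corners of this rank/power grid, assuming a toy count matrix and hypothetical helper names. Only the full rank and rank 1 ends are shown; intermediate ranks would need an iterative low rank solver and are not covered here.

```python
def power(B, p):
    """Elementwise power that keeps zero entries at zero."""
    return [[float(x) ** p if x > 0 else 0.0 for x in row] for row in B]

def full_rank_conditional(Bp, j):
    """P(w_i | history j) from the full rank matrix: normalize column j."""
    col = [row[j] for row in Bp]
    return [x / sum(col) for x in col]

def rank_one_distribution(Bp):
    """Normalized rank 1 fit: proportional to row sums, identical for every history."""
    rows = [sum(r) for r in Bp]
    return [x / sum(rows) for x in rows]

B = [[1, 2, 1], [0, 5, 0], [0, 0, 2]]  # rows: w_i, columns: w_{i-1} (hypothetical counts)
grid = {}
for p in (1, 0.5, 0):
    Bp = power(B, p)
    grid[(p, "full")] = [full_rank_conditional(Bp, j) for j in range(3)]
    grid[(p, "rank1")] = rank_one_distribution(Bp)

print(grid[(1, "full")][1])  # MLE bigram for the second history
print(grid[(1, "rank1")])    # MLE unigram
print(grid[(0, "rank1")])    # Kneser Ney unigram
```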
