Perplexity on Reduced Corpora — Analysis of Cutoff by Power Law
Hayato Kobayashi, Yahoo Japan Corporation
Cutoff
- Cutoff: removing low-frequency words from a corpus, a common practice to save computational costs in learning.
- Language modeling: needed even in a distributed environment, since the feature space of k-grams is quite large [Brants+ 2007].
- Topic modeling: a reduced corpus is enough for roughly analyzing topics, since low-frequency words have little impact on the statistics [Steyvers&Griffiths 2007].
Question
- How many low-frequency words can we remove while maintaining sufficient performance? More generally, how much can we reduce a corpus/model with a given strategy?
- Many experimental studies address this question: [Stolcke 1998], [Buchsbaum+ 1998], [Goodman&Gao 2000], [Gao&Zhang 2002], [Ha+ 2006], [Hirsimäki 2007], [Church+ 2007], all discussing the trade-off between the size of the reduced corpus/model and its performance.
- There is no theoretical study.
This work
- We first address the question from a theoretical standpoint.
- We derive the trade-off formulae of the cutoff strategy for k-gram models and topic models: perplexity vs. reduced vocabulary size.
- We verify the correctness of our theory on synthetic corpora and examine the gap between theory and practice on several real corpora.
Approach
- Assume a corpus follows Zipf's law (a power law), an empirical rule representing the long-tail property of a corpus.
- This is essentially the same approach as in physics: constructing a theory while believing experimentally observed results (e.g., the gravitational acceleration g). Believing g, we can derive the landing point of a ball thrown with initial speed v_0 at angle θ_0 as v_0^2 sin(2θ_0) / g.
- Similarly, we try to clarify the trade-off relationships by believing Zipf's law.
Outline
- Preliminaries: Zipf's law, perplexity (PP), cutoff and restoring
- PP of unigram models
- PP of k-gram models
- PP of topic models
- Conclusion
Zipf's law
- An empirical rule discovered on real corpora [Zipf 1935]: word frequency f(w) is inversely proportional to its frequency ranking r(w), i.e., f(w) = C / r(w), where C is the maximum frequency.
- Real corpora roughly follow Zipf's law: the rank-frequency curve is linear on a log-log graph.
[Figure: frequency f(w) vs. frequency ranking r(w) on a log-log graph for a Zipf-random corpus, showing the linear behavior]
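As an illustration of the Zipf-random corpora used later, here is a minimal sketch (assuming Python with NumPy; the vocabulary and corpus sizes are arbitrary choices of mine, not the paper's settings) that samples words from a Zipf distribution and checks that the empirical rank-frequency curve is roughly linear on a log-log scale.

```python
import numpy as np

def zipf_corpus(vocab_size, corpus_size, exponent=1.0, seed=0):
    """Sample a corpus whose word probabilities follow Zipf's law p(r) ∝ 1/r^exponent."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, vocab_size + 1)
    probs = ranks ** (-exponent)
    probs /= probs.sum()
    return rng.choice(ranks, size=corpus_size, p=probs)

corpus = zipf_corpus(vocab_size=10_000, corpus_size=1_000_000)
freqs = np.sort(np.bincount(corpus)[1:])[::-1]   # frequency of each word, sorted
freqs = freqs[freqs > 0]
ranks = np.arange(1, len(freqs) + 1)

# Fit a line to the log-log rank-frequency curve; the slope should be close to -1.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(f"estimated Zipf exponent: {-slope:.2f}")  # roughly 1.0 for a Zipf-random corpus
```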
Perplexity (PP)
- A widely used evaluation measure of statistical models: the geometric mean of the inverse per-word likelihood on a held-out test corpus, PP = (\prod_{i=1}^{N} 1/p(w_i))^{1/N}, where N is the size of the test corpus w_1 ... w_N.
- PP means how many possibilities one has for estimating the next word; lower perplexity means better generalization performance.
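A minimal sketch of the definition above (helper names are mine): the geometric mean of 1/p(w) over a held-out corpus, computed in log space for numerical stability.

```python
import math

def perplexity(model_probs, test_corpus):
    """PP = (prod_i 1/p(w_i))^(1/N) = exp(-(1/N) * sum_i log p(w_i))."""
    n = len(test_corpus)
    log_likelihood = sum(math.log(model_probs[w]) for w in test_corpus)
    return math.exp(-log_likelihood / n)

# Toy usage: a uniform model over 4 words has perplexity 4 on any test corpus.
uniform = {w: 0.25 for w in "abcd"}
print(perplexity(uniform, list("abacdd")))  # -> 4.0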
Cutoff
- Cutoff removes low-frequency words, so f(remaining word) >= f(removed word) always holds.
- The model is learned from the reduced corpus w'; the probabilities of the removed words are no longer learned and need to be inferred.
[Figure: probability vs. probability ranking r(w); probabilities learned from the reduced corpus w' vs. probabilities that need to be inferred]
Constant restoring
- Infer the probability of each removed word as a single constant λ, approximating the result learned from the original corpus.
[Figure: probability vs. probability ranking; probabilities learned from the reduced corpus w', with the inferred probability of removed words set to the constant λ]
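A sketch of the cutoff-and-restore procedure from the last two slides (the frequency threshold, the value of λ, and the renormalization step are illustrative choices of mine, not the paper's exact scheme): words below a frequency threshold are removed before learning, and at prediction time every removed word receives the same constant probability λ.

```python
from collections import Counter

def cutoff(corpus, min_freq):
    """Remove words whose frequency is below min_freq (the cutoff)."""
    freq = Counter(corpus)
    kept = {w for w, f in freq.items() if f >= min_freq}
    return [w for w in corpus if w in kept], kept

def restored_unigram(reduced_corpus, kept_vocab, full_vocab, lam):
    """Unigram model learned from the reduced corpus, with every removed
    word restored to the constant probability lam."""
    freq = Counter(reduced_corpus)
    n = len(reduced_corpus)
    probs = {w: (freq[w] / n if w in kept_vocab else lam) for w in full_vocab}
    z = sum(probs.values())          # renormalize so the probabilities sum to 1
    return {w: p / z for w, p in probs.items()}

# Toy usage.
corpus = list("aaaaabbbccd")
reduced, kept = cutoff(corpus, min_freq=2)
model = restored_unigram(reduced, kept, full_vocab=set(corpus), lam=0.01)
print(model)
```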
Perplexity of unigram models
- Predictive distribution of a unigram model: p(w) = f(w) / N, where N is the corpus size.
- Optimal restoring constant λ: obtained by substituting the restored probability \hat{p}(w) into PP and minimizing PP with respect to λ.
- Notation: N is the corpus size, N' the reduced corpus size, W the vocabulary size, and W' the reduced vocabulary size.
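The slide obtains the optimal λ analytically; as a hedged illustration, one can also find it numerically by scanning λ and picking the value that minimizes the perplexity of the restored distribution on the original corpus (this grid search is mine, not the paper's derivation, and the toy Zipf-like frequencies are arbitrary).

```python
import numpy as np

def restored_perplexity(orig_freqs, keep, lam):
    """Perplexity on the original corpus of a unigram model learned from the
    reduced corpus (top `keep` words), with removed words set to the constant lam."""
    n_reduced = orig_freqs[:keep].sum()
    probs = np.where(np.arange(len(orig_freqs)) < keep,
                     orig_freqs / n_reduced,   # learned from the reduced corpus
                     lam)                      # restored constant for removed words
    probs = probs / probs.sum()                # keep it a proper distribution
    n = orig_freqs.sum()
    return np.exp(-(orig_freqs * np.log(probs)).sum() / n)

# Zipf-like frequencies for a toy vocabulary, sorted by rank.
freqs = np.floor(1000.0 / np.arange(1, 501))
keep = 100                                     # keep the 100 most frequent words
lams = np.logspace(-6, -2, 200)
best = min(lams, key=lambda l: restored_perplexity(freqs, keep, l))
print(f"best lambda ~ {best:.2e}")
```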
Theorem (PP of unigram models)
- For any reduced vocabulary size W', the perplexity PP_1 of the optimal restored distribution of a unigram model is calculated in closed form, expressed with the harmonic series H(X) := \sum_{x=1}^{X} 1/x and a special form of the Bertrand series B(X) := \sum_{x=1}^{X} (\ln x)/x.
Approximation of PP of unigram models
- H(X) and B(X) can be approximated by definite integrals: H(X) ≈ ln X + γ, where γ is the Euler-Mascheroni constant, and B(X) ≈ (ln X)^2 / 2.
- Substituting these into the theorem yields an approximate formula in which the perplexity is quasi-polynomial (quadratic) in the reduced vocabulary size: it behaves as a quadratic function on a log-log graph.
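A quick numerical check (mine, not from the slides) of the two approximations above: the harmonic series H(X) against ln X + γ, and the special-form Bertrand series B(X) against (ln X)^2 / 2.

```python
import math

def H(X):  # harmonic series: sum of 1/x
    return sum(1.0 / x for x in range(1, X + 1))

def B(X):  # Bertrand series (special form): sum of ln(x)/x
    return sum(math.log(x) / x for x in range(1, X + 1))

gamma = 0.5772156649015329  # Euler-Mascheroni constant
for X in (10**3, 10**5):
    print(X,
          round(H(X), 4), round(math.log(X) + gamma, 4),      # H(X) ≈ ln X + γ
          round(B(X), 4), round(math.log(X) ** 2 / 2, 4))      # B(X) ≈ (ln X)^2 / 2
```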
PP of unigrams vs. reduced vocabulary size
[Figure: perplexity vs. reduced vocabulary size on a log-log graph; theory curve vs. the real Reuters corpus and a Zipf-random corpus of the same size; maximum f(w): 234,705 for the Zipf-random corpus, 136,371 for Reuters]
- Takeaway: our theory is suited for inferring the growth rate of perplexity rather than the perplexity value itself.
Perplexity of k-gram models
- We consider a simple model in which k-grams are generated from a random word sequence drawn according to Zipf's law.
- The model is "stupid": the bigram "is is" is quite frequent, since p("is is") = p("is") p("is"), and the two bigrams "is a" and "a is" have the same frequency, since p("is a") = p("is") p("a") = p("a is").
- A later experiment will show that the model can nevertheless roughly capture the behavior of real corpora.
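A sketch of the "stupid" k-gram model described above, with an illustrative tiny vocabulary of my choosing: bigram probabilities are just products of independent Zipf unigram probabilities, so "is is" is as probable as p("is")^2 and "is a" matches "a is".

```python
import itertools

vocab = ["the", "is", "a", "of", "and"]           # toy vocabulary, Zipf-ranked
unigram = {w: 1.0 / r for r, w in enumerate(vocab, start=1)}
z = sum(unigram.values())
unigram = {w: p / z for w, p in unigram.items()}  # normalize

# In the model, a k-gram is just k independent draws from the unigram distribution.
bigram = {(u, v): unigram[u] * unigram[v] for u, v in itertools.product(vocab, repeat=2)}

print(bigram[("is", "is")] == unigram["is"] * unigram["is"])  # True: "is is" is frequent
print(bigram[("is", "a")] == bigram[("a", "is")])             # True: same frequency
```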
Frequency of a k-gram
- The frequency f_k of a k-gram w^k in this model is defined through a decay function, which maps the k-gram's frequency ranking to its frequency (the analogue of 1/r in Zipf's law).
- The decay function g_2 of bigrams can be written explicitly; in general, the decay function g_k of k-grams is defined through its inverse, using the Piltz divisor function, which represents the number of ways of writing n as an ordered product of k factors (for k = 2, the ordinary number of divisors of n).
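For concreteness, a small sketch of the Piltz divisor function mentioned above: d_k(n) counts the ordered ways of writing n as a product of k positive factors, and d_2 is the ordinary number-of-divisors function. This recursive implementation is mine and is only meant to show what the quantity is.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def piltz(k, n):
    """Number of ordered k-tuples of positive integers whose product is n."""
    if k == 1:
        return 1
    # Choose the first factor d (a divisor of n), then factor n // d into k - 1 parts.
    return sum(piltz(k - 1, n // d) for d in range(1, n + 1) if n % d == 0)

print([piltz(2, n) for n in range(1, 11)])  # ordinary divisor counts: 1,2,2,3,2,4,2,4,3,4
print(piltz(3, 12))                         # 12 as an ordered product of 3 factors -> 18
```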
Exponent of k-gram distributions
- We assume that k-gram frequencies follow a power law: [Ha+ 2006] found that k-gram frequencies roughly follow a power law whose exponent π_k is smaller than 1 for k > 1.
- The optimal exponent in our model under this assumption is obtained by minimizing the sum of squared errors between the gradients of the inverse decay function g_k^{-1}(r) and of r^{1/π_k} on a log-log graph.
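As a hedged illustration of estimating a power-law exponent by least squares on a log-log graph: the slide minimizes a squared error between gradients of the decay function, whereas the sketch below simply fits the empirical k-gram rank-frequency curve directly, a simpler stand-in of mine for the same idea.

```python
import numpy as np
from collections import Counter

def power_law_exponent(counts):
    """Least-squares slope of the rank-frequency curve on a log-log graph."""
    freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope  # exponent pi_k of f ∝ r^(-pi_k)

# Toy usage with bigram counts from an arbitrary tokenized word sequence.
words = ["the", "is", "a", "the", "is", "the", "a", "is", "the", "the"] * 100
bigrams = Counter(zip(words, words[1:]))
print(f"estimated bigram exponent: {power_law_exponent(bigrams):.2f}")
```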
Exponent of k-grams vs. gram size
[Figure: exponent π_k vs. gram size k on a normal (linear) graph; theory vs. the real Reuters corpus]
Corollary (PP of k-gram models)
- For any reduced vocabulary size W', the perplexity of the optimal restored distribution of a k-gram model is calculated analogously, using the hyper-harmonic series H_a(X) := \sum_{x=1}^{X} 1/x^a and another special form of the Bertrand series B_a(X) := \sum_{x=1}^{X} (\ln x)/x^a.
PP of k-grams vs. reduced vocabulary size
[Figure: perplexity vs. reduced vocabulary size on a log-log graph; Theory (Bigram) and Theory (Trigram) vs. Zipf-random corpora (Bigram, Trigram), with the unigram curve for reference; the gap for trigrams is due to sparseness]
- Takeaway: we need to make assumptions that include backoff and smoothing to handle higher-order k-grams.
Additional properties from the power law
- Treat the problem as a variant of the coupon collector's problem: how many trials are needed to collect all coupons whose occurrence probabilities follow a fixed distribution?
- Several results exist for power-law distributions.
- Corpus size needed for collecting all of the k-grams, following [Boneh&Papanicolaou 1996]: when π_k = 1, the required corpus size is on the order of W ln^2 W; otherwise there is a corresponding closed form depending on π_k.
- Lower and upper bounds on the number of distinct k-grams as a function of the corpus size N and vocabulary size W, following [Atsonios+ 2011].
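A simulation sketch (mine, with arbitrary sizes) of the coupon-collector view above: drawing items from a Zipf distribution until every item has appeared at least once, to see how the required corpus size scales with the vocabulary size W. Drawing in batches slightly overcounts the trials, which is fine for a rough scaling check.

```python
import numpy as np

def trials_to_collect_all(vocab_size, exponent=1.0, seed=0):
    """Number of Zipf-distributed draws until all `vocab_size` items have appeared."""
    rng = np.random.default_rng(seed)
    probs = np.arange(1, vocab_size + 1, dtype=float) ** (-exponent)
    probs /= probs.sum()
    seen, trials = np.zeros(vocab_size, dtype=bool), 0
    while not seen.all():
        batch = rng.choice(vocab_size, size=vocab_size, p=probs)  # draw in batches
        seen[batch] = True
        trials += vocab_size
    return trials

for W in (100, 200, 400):
    t = trials_to_collect_all(W)
    print(W, t, round(t / (W * np.log(W) ** 2), 2))  # compare with W * (ln W)^2
```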
Perplexity of topic models
- Latent Dirichlet Allocation (LDA) [Blei+ 2003], learned with collapsed Gibbs sampling [Griffiths&Steyvers 2004]: obtain a "good" topic assignment z_i for each word w_i.
- Posterior distributions of the two hidden parameters:
  - Document-topic distribution \hat{\theta}_d(z) \propto n_d^{(z)} + \alpha: the mixture rate of topic z in document d.
  - Topic-word distribution \hat{\phi}_z(w) \propto n_z^{(w)} + \beta: the occurrence rate of word w in topic z.
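A sketch of the two posterior estimates above, computed from the count matrices that a collapsed Gibbs sampler maintains; the smoothing by α and β follows the standard Griffiths & Steyvers estimator, and the variable names are mine.

```python
import numpy as np

def lda_estimates(n_dz, n_zw, alpha=0.1, beta=0.1):
    """theta_hat[d, z]: mixture rate of topic z in document d.
       phi_hat[z, w]:  occurrence rate of word w in topic z."""
    theta_hat = (n_dz + alpha) / (n_dz + alpha).sum(axis=1, keepdims=True)
    phi_hat = (n_zw + beta) / (n_zw + beta).sum(axis=1, keepdims=True)
    return theta_hat, phi_hat

# Toy usage: 2 documents, 3 topics, 5 words, with random topic/word counts.
rng = np.random.default_rng(0)
n_dz = rng.integers(0, 10, size=(2, 3))   # document-topic counts from Gibbs sampling
n_zw = rng.integers(0, 10, size=(3, 5))   # topic-word counts from Gibbs sampling
theta_hat, phi_hat = lda_estimates(n_dz, n_zw)
print(theta_hat.sum(axis=1), phi_hat.sum(axis=1))  # each row sums to 1
```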
Rough assumptions on ϕ and θ
- Assumption on ϕ: the word distribution ϕ_z of each topic z follows Zipf's law; this is natural if we regard each topic as a corpus.
- Assumptions on θ (two extreme cases):
  - Case All: each document evenly has all topics, i.e., θ_d(z) = 1/T for every topic z.
  - Case One: each document has only one topic, chosen uniformly at random.
- The curve of the actual perplexity is expected to lie between the values of the two cases.
- In Case All, the PP of a topic model equals the PP of a unigram model, since the marginal predictive distribution \sum_z \theta_d(z) \phi_z(w) = (1/T) \sum_z \phi_z(w) is independent of d.
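A quick numerical check (mine) of the Case All claim: when θ_d(z) = 1/T for every document, the marginal predictive distribution does not depend on d, so the topic model collapses to a single word distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
T, W = 20, 1000
phi = rng.dirichlet(np.ones(W), size=T)        # topic-word distributions

# Case All: every document mixes all topics evenly.
theta_all = np.full(T, 1.0 / T)
marginal = theta_all @ phi                      # p(w | d) = (1/T) * sum_z phi_z(w)

# The marginal is the same for every document, i.e., a single unigram distribution.
print(np.allclose(marginal, phi.mean(axis=0)))  # True
```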
Theorem (PP of LDA models: Case One)
- For any reduced vocabulary size W', the perplexity of the optimal restored distribution of a topic model in Case One is calculated in closed form, where T is the number of topics in LDA.
PP of LDA models vs. reduced vocabulary size
[Figure: perplexity vs. reduced vocabulary size on a log-log graph; a synthetic corpus (mix of 20 Zipf corpora) and the real Reuters corpus, compared with Theory (Case All), Theory (Case One), and their average (Case One + Case All)/2; T = 20 topics, collapsed Gibbs sampling with 100 iterations, α = β = 0.1]
Time, memory, and PP of LDA learning
- Results on the Reuters corpus: the memory usage of the (1/10)-corpus is only 60% of that of the original corpus.
- This helps in-memory computing for a larger corpus, although the computational time decreased only a little.