
What Every Computational Linguist Should Know About Type-Token Distributions and Zipf's Law. Tutorial 1, 7 May 2018. Stefan Evert, FAU Erlangen-Nürnberg. http://zipfr.r-forge.r-project.org/lrec2018.html Licensed under CC-by-sa version 3.0


  3. Part 1 Descriptive statistics & notation
Vocabulary growth curve
Our sample: recently, very, not, otherwise, much, very, very, merely, not, now, very, much, merely, not, very
◮ N = 1, V(N) = 1, V1(N) = 1
◮ N = 3, V(N) = 3, V1(N) = 3
◮ N = 7, V(N) = 5, V1(N) = 4
◮ N = 12, V(N) = 7, V1(N) = 4
◮ N = 15, V(N) = 7, V1(N) = 3
[Figure: vocabulary growth curve for the adverb sample, plotting V(N) and V1(N) against N]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 15 / 99
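The toy computation above is easy to reproduce. A hedged sketch (the tutorial itself uses R/zipfR; this is a language-neutral Python illustration, and the function name `vgc` is our own) recomputes V(N) and V1(N) for the adverb sample:

```python
from collections import Counter

def vgc(tokens, checkpoints):
    """Compute V(N) (type count) and V1(N) (hapax count) at given sample sizes N."""
    result = {}
    for n in checkpoints:
        counts = Counter(tokens[:n])          # frequencies in the first n tokens
        v = len(counts)                       # V(N): distinct types seen so far
        v1 = sum(1 for c in counts.values() if c == 1)  # V1(N): types seen exactly once
        result[n] = (v, v1)
    return result

sample = ("recently very not otherwise much very very merely "
          "not now very much merely not very").split()
print(vgc(sample, [1, 3, 7, 12, 15]))
# {1: (1, 1), 3: (3, 3), 7: (5, 4), 12: (7, 4), 15: (7, 3)}
```

Note how V1(N) can decrease (from 4 at N = 12 to 3 at N = 15): a former hapax ("merely") is seen a second time.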

  4. Part 1 Descriptive statistics & notation
A realistic vocabulary growth curve: the Brown corpus
[Figure: vocabulary growth curve for the Brown corpus, plotting V(N) and V1(N) for N up to 1e+06 tokens]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 16 / 99

  5. Part 1 Descriptive statistics & notation
Vocabulary growth in authorship attribution
◮ Authorship attribution by n-gram tracing, applied to the case of the Bixby letter (Grieve et al. submitted)
◮ Word or character n-grams in the disputed text are compared against large "training" corpora from candidate authors
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 17 / 99

  6. Part 1 Descriptive statistics & notation Observing Zipf’s law across languages and different linguistic units Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 18 / 99

  7. Part 1 Descriptive statistics & notation Observing Zipf’s law The Italian prefix ri- in the la Repubblica corpus Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 19 / 99

  8. Part 1 Descriptive statistics & notation Observing Zipf’s law Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 20 / 99


  13. Part 1 Descriptive statistics & notation
Observing Zipf's law
◮ A straight line in double-logarithmic space corresponds to a power law for the original variables
◮ This leads to Zipf's (1949; 1965) famous law: f_r = C / r^a
◮ Taking logarithms on both sides, we obtain: log f_r = log C − a · log r (with y = log f_r and x = log r)
◮ Intuitive interpretation of a and C:
◮ a is the slope, determining how fast the log frequency decreases
◮ log C is the intercept, i.e. the log frequency of the most frequent word (r = 1 ➜ log r = 0)
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 21 / 99
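The log-linear relationship can be checked numerically. In this hedged sketch (Python rather than the tutorial's R; C and a are arbitrary illustration values), frequencies generated exactly by Zipf's law fall on a straight line in log-log space, and an ordinary least-squares fit recovers slope −a and intercept log C:

```python
import math

# Frequencies following an exact Zipf law f_r = C / r^a.
C, a = 60000.0, 1.2
ranks = range(1, 101)
xs = [math.log(r) for r in ranks]            # x = log r
ys = [math.log(C / r**a) for r in ranks]     # y = log f_r

# Least-squares line through (x, y): slope and intercept.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx)**2 for x in xs)
intercept = my - slope * mx

print(round(slope, 6), round(intercept, 4))
# slope ≈ -1.2 = -a, intercept ≈ log(60000) ≈ 11.0021
```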

  14. Part 1 Descriptive statistics & notation Observing Zipf’s law Least-squares fit = linear regression in log-space (Brown corpus) Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 22 / 99

  17. Part 1 Descriptive statistics & notation
Zipf-Mandelbrot law (Mandelbrot 1953, 1962)
◮ Mandelbrot's extra parameter: f_r = C / (r + b)^a
◮ Zipf's law is the special case with b = 0
◮ Assuming a = 1, C = 60,000, b = 1:
◮ for the word with rank 1, Zipf's law predicts a frequency of 60,000; Mandelbrot's variation predicts 30,000
◮ for the word with rank 1,000, Zipf's law predicts a frequency of 60; Mandelbrot's variation predicts 59.94
◮ The Zipf-Mandelbrot law forms the basis of statistical LNRE models
◮ The ZM law can be derived mathematically as the limiting distribution of vocabulary generated by a character-level Markov process
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 23 / 99
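The rank-1 and rank-1,000 numbers on the slide follow directly from the two formulas. A quick check (Python sketch; the function names are our own):

```python
def zipf(r, C=60000.0, a=1.0):
    """Zipf's law: f_r = C / r^a."""
    return C / r**a

def zipf_mandelbrot(r, C=60000.0, a=1.0, b=1.0):
    """Zipf-Mandelbrot law: f_r = C / (r + b)^a."""
    return C / (r + b)**a

print(zipf(1), zipf_mandelbrot(1))                    # 60000.0 30000.0
print(zipf(1000), round(zipf_mandelbrot(1000), 2))    # 60.0 59.94
```

With b = 1 the two laws differ drastically at the top of the ranking but are nearly indistinguishable in the tail, which is exactly what b is for: bending the head of the curve without changing its asymptotic slope.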

  18. Part 1 Descriptive statistics & notation Zipf-Mandelbrot law Non-linear least-squares fit (Brown corpus) Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 24 / 99

  19. Part 1 Some examples (zipfR) Outline Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 25 / 99

  20. Part 1 Some examples (zipfR) zipfR Evert and Baroni (2007) ◮ http://zipfR.R-Forge.R-Project.org/ ◮ Conveniently available from CRAN repository ◮ Package vignette = gentle tutorial introduction Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 26 / 99

  21. Part 1 Some examples (zipfR)
First steps with zipfR
◮ Set up a folder for this course, and make sure it is your working directory in R (preferably as an RStudio project)
◮ Install the most recent version of the zipfR package
◮ Package, handouts, code samples & data sets available from http://zipfr.r-forge.r-project.org/lrec2018.html
> library(zipfR)
> ?zipfR                      # documentation entry point
> vignette("zipfr-tutorial")  # read the zipfR tutorial
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 27 / 99

  24. Part 1 Some examples (zipfR)
Loading type-token data
◮ Most convenient input: sequence of tokens as a text file in vertical format ("one token per line")
☞ mapped to appropriate types: normalized word forms, word pairs, lemmatized forms, semantic classes, n-grams of POS tags, . . .
☞ language data should always be in UTF-8 encoding!
☞ large files can be compressed (.gz, .bz2, .xz)
◮ Sample data: brown_adverbs.txt on the tutorial homepage
◮ lowercased adverb tokens from the Brown corpus (original order)
☞ download and save to your working directory
> adv <- readLines("brown_adverbs.txt", encoding="UTF-8")
> head(adv, 30)  # mathematically, a "vector" of tokens
> length(adv)    # sample size = 52,037 tokens
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 28 / 99

  25. Part 1 Some examples (zipfR)
Descriptive statistics: type-frequency list
> adv.tfl <- vec2tfl(adv)
> adv.tfl
    k     f  type
1   1  4859  not
2   2  2084  n't
3   3  1464  so
4   4  1381  only
5   5  1374  then
6   6  1309  now
7   7  1134  even
8   8  1089  as
. . .
N = 52037, V = 1907
> N(adv.tfl)  # sample size
> V(adv.tfl)  # type count
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 29 / 99

  26. Part 1 Some examples (zipfR)
Descriptive statistics: frequency spectrum
> adv.spc <- tfl2spc(adv.tfl)  # or directly with vec2spc
> adv.spc
    m   Vm
1   1  762
2   2  260
3   3  144
4   4   99
5   5   69
6   6   50
7   7   40
8   8   34
. . .
N = 52037, V = 1907
> N(adv.spc)  # sample size
> V(adv.spc)  # type count
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 30 / 99
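The same descriptive statistics can be sketched outside R. This is a hedged Python analogue of `vec2tfl` / `tfl2spc`, run on the 15-token adverb sample from earlier (`collections.Counter` stands in for zipfR's data structures):

```python
from collections import Counter

tokens = ("recently very not otherwise much very very merely "
          "not now very much merely not very").split()

# Type-frequency list: (type, f) pairs sorted by decreasing frequency.
tfl = Counter(tokens).most_common()

# Frequency spectrum: m -> V_m, the number of types occurring exactly m times.
spectrum = Counter(f for _, f in tfl)

print(tfl[0])                         # ('very', 5) — the top of the Zipf ranking
print(sorted(spectrum.items()))       # [(1, 3), (2, 2), (3, 1), (5, 1)]
print(sum(f for _, f in tfl))         # N = 15
print(len(tfl))                       # V = 7
```

Note that the spectrum need not be contiguous: here no type occurs exactly 4 times, so V_4 = 0 and m = 4 is simply absent.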

  27. Part 1 Some examples (zipfR)
Descriptive statistics: vocabulary growth
◮ A VGC lists the vocabulary size V(N) at different sample sizes N
◮ Optionally also spectrum elements Vm(N) up to m.max
> adv.vgc <- vec2vgc(adv, m.max=2)
◮ Visualize descriptive statistics with the plot method
> plot(adv.tfl)             # Zipf ranking
> plot(adv.tfl, log="xy")   # logarithmic scale recommended
> plot(adv.spc)             # barplot of frequency spectrum
> plot(adv.vgc, add.m=1:2)  # vocabulary growth curve
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 31 / 99

  28. Part 1 Some examples (zipfR)
Further example data sets
?Brown  words from the Brown corpus
?BrownSubsets  various subsets
?Dickens  words from novels by Charles Dickens
?ItaPref  Italian word-formation prefixes
?TigerNP  NP and PP patterns from the German Tiger treebank
?Baayen2001  frequency spectra from Baayen (2001)
?EvertLuedeling2001  German word-formation affixes (manually corrected data from Evert and Lüdeling 2001)
Practice:
◮ Explore these data sets with descriptive statistics
◮ Try different plot options (see help pages ?plot.tfl, ?plot.spc, ?plot.vgc)
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 32 / 99

  29. Part 1 LNRE models: intuition Outline Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 33 / 99

  30. Part 1 LNRE models: intuition
Motivation
◮ We are interested in the productivity of an affix, the vocabulary of an author, . . . ; not in a particular text or sample
☞ statistical inference from sample to population
◮ Discrete frequency counts are difficult to capture with generalizations such as Zipf's law
◮ Zipf's law predicts many impossible types with 1 < f_r < 2
☞ the population does not suffer from such quantization effects
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 34 / 99

  31. Part 1 LNRE models: intuition LNRE models ◮ This tutorial introduces the state-of-the-art LNRE approach proposed by Baayen (2001) ◮ LNRE = Large Number of Rare Events ◮ LNRE uses various approximations and simplifications to obtain a tractable and elegant model ◮ Of course, we could also estimate the precise discrete distributions using MCMC simulations, but . . . 1. LNRE model usually minor component of complex procedure 2. often applied to very large samples ( N > 1 M tokens) Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 35 / 99

  32. Part 1 LNRE models: intuition
The LNRE population
◮ Population: set of S types w_i with occurrence probabilities π_i
◮ S = population diversity; can be finite or infinite (S = ∞)
◮ Not interested in specific types ➜ arrange them by decreasing probability: π_1 ≥ π_2 ≥ π_3 ≥ · · ·
☞ impossible to determine the probabilities of all individual types
◮ Normalization: π_1 + π_2 + . . . + π_S = 1
◮ Need a parametric statistical model to describe the full population (esp. for S = ∞), i.e. a function i ↦ π_i
◮ type probabilities π_i cannot be estimated reliably from a sample, but the parameters of this function can
◮ NB: population index i ≠ Zipf rank r
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 36 / 99

  33. Part 1 LNRE models: intuition
Examples of population models
[Figure: four example population models, each plotting type probabilities π_k against k = 1, . . . , 50]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 37 / 99

  35. Part 1 LNRE models: intuition
The Zipf-Mandelbrot law as a population model
What is the right family of models for lexical frequency distributions?
◮ We have already seen that the Zipf-Mandelbrot law captures the distribution of observed frequencies very well
◮ Re-phrase the law for type probabilities: π_i := C / (i + b)^a
◮ Two free parameters: a > 1 and b ≥ 0
◮ C is not a parameter but a normalization constant, needed to ensure that Σ_i π_i = 1
◮ This is the Zipf-Mandelbrot population model
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 38 / 99
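The role of C as a normalization constant can be made concrete numerically. A hedged sketch (a = 2 and b = 1 are arbitrary illustration values; the infinite sum is approximated by truncation at K types):

```python
import math

# In the ZM population model pi_i = C / (i + b)^a, C is not free:
# it is fixed by the requirement sum_i pi_i = 1 (here approximated
# by truncating the infinite series at K terms).
a, b, K = 2.0, 1.0, 100_000
partial = sum(1.0 / (i + b)**a for i in range(1, K + 1))
C = 1.0 / partial

pis = [C / (i + b)**a for i in range(1, K + 1)]
print(round(sum(pis), 6))   # 1.0 by construction
# For a = 2, b = 1 the exact series sums to pi^2/6 - 1, so C ≈ 1.5505
print(round(C, 4))
```

For a ≤ 1 the series diverges, which is why the population model requires a > 1 even though observed Zipf rankings are often fitted with exponents close to 1.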

  36. Part 1 LNRE models: intuition
The parameters of the Zipf-Mandelbrot model
[Figure: ZM populations π_k against k for four parameter settings: a = 1.2, b = 1.5; a = 2, b = 10; a = 2, b = 15; a = 5, b = 40]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 39 / 99

  37. Part 1 LNRE models: intuition
The parameters of the Zipf-Mandelbrot model
[Figure: the same four ZM populations (a = 1.2, b = 1.5; a = 2, b = 10; a = 2, b = 15; a = 5, b = 40) on double-logarithmic axes]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 40 / 99

  40. Part 1 LNRE models: intuition
The finite Zipf-Mandelbrot model (Evert 2004)
◮ The Zipf-Mandelbrot population model characterizes an infinite type population: there is no upper bound on i, and the type probabilities π_i can become arbitrarily small
◮ π = 10^−6 (once every million words), π = 10^−9 (once every billion words), π = 10^−15 (once on the entire Internet), π = 10^−100 (once in the universe?)
◮ The finite Zipf-Mandelbrot model stops after the first S types
◮ Population diversity S becomes a parameter of the model → the finite Zipf-Mandelbrot model has 3 parameters
Abbreviations:
◮ ZM for the Zipf-Mandelbrot model
◮ fZM for the finite Zipf-Mandelbrot model
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 41 / 99

  41. Part 1 LNRE models: intuition
Sampling from a population model
Assume we believe that the population we are interested in can be described by a Zipf-Mandelbrot model:
[Figure: ZM population with a = 3, b = 50, shown on linear and double-logarithmic axes]
Use computer simulation to generate random samples:
◮ Draw N tokens from the population such that in each step, type w_i has probability π_i of being picked
◮ This allows us to make predictions for samples (= corpora) of arbitrary size N
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 42 / 99
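Such a simulation takes only a few lines in any language. A hedged Python sketch (the infinite population is truncated at S types, an approximation; a = 3 and b = 50 match the slide's figure):

```python
import random
from collections import Counter

# Draw N tokens from a truncated ZM population pi_i ∝ 1 / (i + b)^a;
# random.choices normalizes the weights internally.
random.seed(42)
a, b, S = 3.0, 50.0, 10_000
weights = [1.0 / (i + b)**a for i in range(1, S + 1)]

N = 1000
sample = random.choices(range(1, S + 1), weights=weights, k=N)

counts = Counter(sample)
V = len(counts)                                   # observed vocabulary size
V1 = sum(1 for c in counts.values() if c == 1)    # observed hapax count
print(N, V, V1)
```

Running this repeatedly (with different seeds) produces exactly the kind of sample-to-sample variation shown on the following slides.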

  46. Part 1 LNRE models: intuition
Sampling from a population model
#1: 1 42 34 23 108 18 48 18 1 . . .
    time order room school town course area course time . . .
#2: 286 28 23 36 3 4 7 4 8 . . .
#3: 2 11 105 21 11 17 17 1 16 . . .
#4: 44 3 110 34 223 2 25 20 28 . . .
#5: 24 81 54 11 8 61 1 31 35 . . .
#6: 3 65 9 165 5 42 16 20 7 . . .
#7: 10 21 11 60 164 54 18 16 203 . . .
#8: 11 7 147 5 24 19 15 85 37 . . .
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 43 / 99

  47. Part 1 LNRE models: intuition
Samples: type-frequency list & spectrum (sample #1)
rank r  f_r  type i     m   V_m
1       37   6          1   83
2       36   1          2   22
3       33   3          3   20
4       31   7          4   12
5       31   10         5   10
6       30   5          6   5
7       28   12         7   5
8       27   2          8   3
9       24   4          9   3
10      24   16         10  3
11      23   8          .   .
12      22   14         .   .
. . .
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 44 / 99

  48. Part 1 LNRE models: intuition
Samples: type-frequency list & spectrum (sample #2)
rank r  f_r  type i     m   V_m
1       39   2          1   76
2       34   3          2   27
3       30   5          3   17
4       29   10         4   10
5       28   8          5   6
6       26   1          6   5
7       25   13         7   7
8       24   7          8   3
9       23   6          10  4
10      23   11         11  2
11      20   4          .   .
12      19   17         .   .
. . .
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 45 / 99

  49. Part 1 LNRE models: intuition
Random variation in type-frequency lists
[Figure: Zipf rankings (r ↔ f_r) and population-index plots (i ↔ f_i) for samples #1 and #2, showing random variation between the two samples]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 46 / 99

  50.–53. Part 1 LNRE models: intuition
Random variation: frequency spectrum
[Figures: barplots of the frequency spectrum V_m for samples #1–#4; the spectra vary noticeably from sample to sample]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 47 / 99

  54.–57. Part 1 LNRE models: intuition
Random variation: vocabulary growth curve
[Figures: V(N) and V1(N) growth curves for samples #1–#4, N up to 1000; the curves differ visibly between random samples]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 48 / 99

  58. Part 1 LNRE models: intuition
Expected values
◮ There is no reason why we should choose a particular sample to compare to the real data or make a prediction; each one is equally likely or unlikely
◮ Take the average over a large number of samples, called expected value or expectation in statistics
◮ Notation: E[V(N)] and E[V_m(N)]
◮ indicates that we are referring to expected values for a sample of size N
◮ rather than to the specific values V and V_m observed in a particular sample or a real-world data set
◮ Expected values can be calculated efficiently without generating thousands of random samples
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 49 / 99
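For the vocabulary size there is a simple closed form: each type w_i is observed at least once with probability 1 − (1 − π_i)^N, so E[V(N)] = Σ_i (1 − (1 − π_i)^N). A hedged Python check on a toy three-type population (probabilities chosen arbitrarily) confirms that this matches the average over many simulated samples:

```python
import random

# Exact expectation of the vocabulary size, no simulation needed.
pi = [0.5, 0.3, 0.2]
N = 10
exact = sum(1 - (1 - p)**N for p in pi)

# Cross-check: average V over many simulated samples of size N.
random.seed(1)
runs = 20_000
total = 0
for _ in range(runs):
    sample = random.choices(range(len(pi)), weights=pi, k=N)
    total += len(set(sample))    # V for this sample
simulated = total / runs

print(round(exact, 4))                   # 2.8634
print(abs(exact - simulated) < 0.05)     # simulation agrees with the formula
```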

  59.–62. Part 1 LNRE models: intuition
The expected frequency spectrum
[Figures: observed spectrum V_m vs. expected spectrum E[V_m] for samples #1–#4; the expectation smooths out the sample-to-sample variation]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 50 / 99

  63. Part 1 LNRE models: intuition
The expected vocabulary growth curve
[Figure: observed V(N), V1(N) and expected E[V(N)], E[V1(N)] curves for sample #1]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 51 / 99

  64. Part 1 LNRE models: intuition
Prediction intervals for the expected VGC
[Figure: expected VGC E[V(N)] and E[V1(N)] for sample #1, with prediction bands around each curve]
"Confidence intervals" indicate the predicted sampling distribution:
☞ for 95% of samples generated by the LNRE model, the VGC will fall within the range delimited by the thin red lines
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 52 / 99

  65.–71. Part 1 LNRE models: intuition
Parameter estimation by trial & error
[Figures: observed vs. ZM-model frequency spectrum V_m and vocabulary growth curve V(N), compared for a sequence of hand-picked parameter settings: a = 1.5, b = 7.5; a = 1.3, b = 7.5; a = 1.3, b = 0.2; a = 1.7, b = 7.5; a = 1.7, b = 80; a = 2, b = 550]
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 53 / 99

  72. Part 1 LNRE models: intuition
Automatic parameter estimation
[Figure: observed vs. expected frequency spectrum and VGC for the automatically estimated model, a = 2.39, b = 1968.49]
◮ By trial & error we found a = 2.0 and b = 550
◮ The automatic estimation procedure finds a = 2.39 and b = 1968
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 54 / 99

  73. Part 1 LNRE models: mathematics Outline Part 1 Motivation Descriptive statistics & notation Some examples (zipfR) LNRE models: intuition LNRE models: mathematics Part 2 Applications & examples (zipfR) Limitations Non-randomness Conclusion & outlook Stefan Evert T1: Zipf’s Law 7 May 2018 | CC-by-sa 55 / 99

  75. Part 1 LNRE models: mathematics
The sampling model
◮ Draw a random sample of N tokens from the LNRE population
◮ Sufficient statistic: the set of type frequencies {f_i}
◮ because the tokens of a random sample have no ordering
◮ Joint multinomial distribution of the {f_i}:
Pr({f_i = k_i} | N) = N! / (k_1! · · · k_S!) · π_1^{k_1} · · · π_S^{k_S}
◮ Approximation: do not condition on a fixed sample size N
◮ N is now the average (expected) sample size
◮ The random variables f_i then have independent Poisson distributions:
Pr(f_i = k_i) = e^{−N π_i} (N π_i)^{k_i} / k_i!
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 56 / 99
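The quality of the Poisson approximation is easy to check numerically: for a single type with probability π, the exact marginal frequency distribution is Binomial(N, π), which Poisson(Nπ) approximates well when π is small, as it is for almost all types in an LNRE population. A hedged sketch (N and π are arbitrary illustration values):

```python
import math

N, pi = 10_000, 0.0005
lam = N * pi                       # expected number of occurrences (= 5 here)

def binom_pmf(k):
    """Exact marginal: Binomial(N, pi)."""
    return math.comb(N, k) * pi**k * (1 - pi)**(N - k)

def poisson_pmf(k):
    """Poisson approximation with rate N * pi."""
    return math.exp(-lam) * lam**k / math.factorial(k)

for k in range(0, 11):
    print(k, round(binom_pmf(k), 5), round(poisson_pmf(k), 5))
# the two columns closely agree for every k
```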

  79. Part 1 LNRE models: mathematics
Frequency spectrum
◮ Key problem: we cannot determine f_i in the observed sample
◮ because we don't know which type w_i is
◮ recall that the population ranking of f_i ≠ the Zipf ranking of f_r
◮ Use the spectrum {V_m} and the vocabulary size V as statistics
◮ together they contain all the information we have about the observed sample
◮ Can be expressed in terms of indicator variables:
I[f_i = m] = 1 if f_i = m, 0 otherwise
V_m = Σ_{i=1}^{S} I[f_i = m]
V = Σ_{i=1}^{S} I[f_i > 0] = Σ_{i=1}^{S} (1 − I[f_i = 0])
Stefan Evert T1: Zipf's Law 7 May 2018 | CC-by-sa 57 / 99
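Combining the indicator-variable decomposition with the Poisson approximation gives E[V_m] = Σ_i Pr(f_i = m) with f_i ~ Poisson(N·π_i). A hedged Python sketch evaluates this for a toy truncated ZM-style population, and cross-checks the Poisson version against the exact binomial marginals (all parameter values are arbitrary illustration choices):

```python
import math

# Toy truncated ZM-style population pi_i ∝ 1 / (i + b)^a.
a, b, S, N = 2.0, 5.0, 1000, 1000
w = [1.0 / (i + b)**a for i in range(1, S + 1)]
Z = sum(w)
pi = [x / Z for x in w]

def pois_pmf(m, lam):
    return math.exp(-lam) * lam**m / math.factorial(m)

def binom_pmf(m, n, p):
    return math.comb(n, m) * p**m * (1 - p)**(n - m)

for m in (1, 2, 3):
    e_pois = sum(pois_pmf(m, N * p) for p in pi)      # E[V_m], Poisson approximation
    e_binom = sum(binom_pmf(m, N, p) for p in pi)     # E[V_m], exact marginals
    print(m, round(e_pois, 3), round(e_binom, 3))     # the two estimates closely agree
```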
