Counting Words: Introduction

Marco Baroni & Stefan Evert


  1. Rank/frequency profiles and frequency spectra
     ◮ From rank/frequency profile to spectrum: count occurrences of each f in profile to obtain V_f values of corresponding spectrum elements
     ◮ From spectrum to rank/frequency profile: given highest f (i.e., m) in a spectrum, the ranks 1 to V_f in the corresponding rank/frequency profile will have frequency f, the ranks V_f + 1 to V_f + V_g (where g is the second highest frequency in the spectrum) will have frequency g, etc.
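
The two conversions described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the toy sample a b b c a a b a stands in for a real corpus.

```python
from collections import Counter

def profile_to_spectrum(profile):
    """Count how many ranks share each frequency f, giving {f: V_f}."""
    return dict(Counter(profile))

def spectrum_to_profile(spectrum):
    """Expand {f: V_f} into one frequency per rank, highest frequency first."""
    profile = []
    for f in sorted(spectrum, reverse=True):
        profile.extend([f] * spectrum[f])
    return profile

# Toy sample: a occurs 4 times, b 3 times, c once.
tokens = "a b b c a a b a".split()
profile = sorted(Counter(tokens).values(), reverse=True)
spectrum = profile_to_spectrum(profile)
```

Converting the spectrum back with `spectrum_to_profile` recovers the original profile, since neither representation keeps track of which word holds which rank.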

  2. Frequency spectrum of Brown corpus
     [Figure: V_m (y-axis, 0 to 20000) against m (x-axis, 1 to 15)]

  8. Vocabulary growth curve
     ◮ The sample: a b b c a a b a
     ◮ N: 1, V: 1, V_1: 1
     ◮ N: 3, V: 2, V_1: 1
     ◮ N: 5, V: 3, V_1: 1
     ◮ N: 8, V: 3, V_1: 1
     ◮ (Most VGCs on our slides smoothed with binomial interpolation)
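
As a rough illustration, the N, V and V_1 values on this slide can be reproduced by scanning the sample one token at a time. The function name is hypothetical, and the sketch produces the raw curve only; it does not implement binomial interpolation.

```python
from collections import Counter

def vocabulary_growth(tokens):
    """Return one (N, V, V1) triple per prefix of the token sequence."""
    counts = Counter()
    v1 = 0
    curve = []
    for n, tok in enumerate(tokens, start=1):
        counts[tok] += 1
        if counts[tok] == 1:
            v1 += 1   # a new type starts out as a hapax
        elif counts[tok] == 2:
            v1 -= 1   # seen twice: no longer a hapax
        curve.append((n, len(counts), v1))
    return curve

curve = vocabulary_growth("a b b c a a b a".split())
```

The entries of `curve` at N = 1, 3, 5 and 8 correspond to the four sample points listed on the slide.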

  9. Vocabulary growth curve of Brown corpus, with V_1 growth in red
     [Figure: V and V_1 (y-axis, 0 to 40000) against N (x-axis, 0 to 1e+06)]

  10. Outline
      Roadmap
      Lexical statistics: the basics
      Zipf’s law
      Applications

  11. Typical frequency patterns: Top and bottom ranks in the Brown corpus

      Top frequencies:
      rank  fq     word
      1     62642  the
      2     35971  of
      3     27831  and
      4     25608  to
      5     21883  a
      6     19474  in
      7     10292  that
      8     10026  is
      9     9887   was
      10    8811   for

      Bottom frequencies:
      rank range    fq  randomly selected examples
      7967-8522     10  recordings, undergone, privileges
      8523-9236     9   Leonard, indulge, creativity
      9237-10042    8   unnatural, Lolotte, authenticity
      10043-11185   7   diffraction, Augusta, postpone
      11186-12510   6   uniformly, throttle, agglutinin
      12511-14369   5   Bud, Councilman, immoral
      14370-16938   4   verification, gleamed, groin
      16939-21076   3   Princes, nonspecifically, Arger
      21077-28701   2   blitz, pertinence, arson
      28702-53076   1   Salaries, Evensen, parentheses

  12. Typical frequency patterns: BNC
      [Figure]

  13. Typical frequency patterns: Other corpora
      [Figure]

  14. Typical frequency patterns: Brown bigrams and trigrams
      [Figure]

  15. Typical frequency patterns: The Italian prefix ri- in the la Repubblica corpus
      [Figure]

  20. Zipf’s law
      ◮ Language after language, corpus after corpus, linguistic type after linguistic type...
      ◮ same “few giants, many dwarves” pattern is encountered
      ◮ Similarity of plots suggests that relation between rank and frequency could be captured by a law
      ◮ Nature of relation becomes clearer if we plot log f as a function of log r

  24. Zipf’s law
      ◮ Straight line in double-logarithmic space corresponds to power law for original variables
      ◮ This leads to Zipf’s (1949, 1965) famous law:
        $f(w) = \frac{C}{r(w)^a}$
      ◮ With a = 1 and C = 60,000, Zipf’s law predicts that most frequent word has frequency 60,000; second most frequent word has frequency 30,000; third word has frequency 20,000...
      ◮ and long tail of 80,000 words with frequency between 1.5 and 0.5
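
The arithmetic behind this example can be checked with a short sketch. The function name is illustrative, and the treatment of the tail boundaries (f strictly below 1.5, at least 0.5) is an assumption made here so that the count comes out as a whole number of ranks.

```python
def zipf_frequency(rank, C=60_000, a=1.0):
    """Frequency predicted by Zipf's law, f = C / r**a."""
    return C / rank ** a

top_three = [zipf_frequency(r) for r in (1, 2, 3)]

# With a = 1: f < 1.5 holds for r > C / 1.5, and f >= 0.5 for r <= C / 0.5.
first_tail_rank = 60_000 * 2 // 3 + 1   # rank 40,001
last_tail_rank = 60_000 * 2             # rank 120,000
tail_size = last_tail_rank - first_tail_rank + 1
```

Under these boundary assumptions, `tail_size` comes out at the 80,000 words quoted on the slide.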

  29. Zipf’s law: Logarithmic version
      ◮ Zipf’s power law:
        $f(w) = \frac{C}{r(w)^a}$
      ◮ If we take logarithm of both sides, we obtain:
        $\log f(w) = \log C - a \log r(w)$
      ◮ I.e., Zipf’s law predicts that rank/frequency profiles are straight lines in double logarithmic space, which, we saw, is a reasonable approximation
      ◮ Best fit a and C can be found with least squares method
      ◮ Provides intuitive interpretation of a and C:
          ◮ a determines the slope (the line has slope -a), i.e., how fast log frequency decreases with log rank
          ◮ log C is intercept, i.e., predicted log frequency of word with rank 1 (log rank 0), i.e., most frequent word
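
A minimal sketch of the least squares fit in double logarithmic space, using only the standard library. The function name and the synthetic input are assumptions for illustration; the input is generated from f = 60000 / r, not taken from a corpus.

```python
import math

def fit_zipf(frequencies):
    """Least-squares fit of log f = log C - a * log r; returns (a, C).

    `frequencies` must be sorted so that index 0 holds the rank-1 frequency.
    """
    xs = [math.log(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)   # a is the negated slope, C = exp(intercept)

# Synthetic check: exact Zipfian frequencies with a = 1 and C = 60,000.
a, C = fit_zipf([60_000 / r for r in range(1, 1001)])
```

On exact power-law input the fit returns the generating parameters up to rounding error. On real rank/frequency profiles the fit weights every rank equally, so the many low-frequency ranks dominate the estimate.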

  30. Zipf’s law: Fitting the Brown rank/frequency profile
      [Figure]

  36. Fit of Zipf’s law
      ◮ At right edge (low frequencies):
          ◮ “Bell-bottom” pattern expected as we are fitting continuous model to discrete frequencies
          ◮ More worryingly, in large corpora frequency drops more rapidly than predicted by Zipf’s law
      ◮ At left edge (high frequencies):
          ◮ Highest frequencies lower than predicted → Mandelbrot’s correction

  41. Zipf-Mandelbrot’s law (Mandelbrot 1953)
      ◮ Mandelbrot’s extra parameter:
        $f(w) = \frac{C}{(r(w) + b)^a}$
      ◮ Zipf’s law is special case with b = 0
      ◮ Assuming a = 1, C = 60,000, b = 1:
          ◮ For word with rank 1, Zipf’s law predicts frequency of 60,000; Mandelbrot’s variation predicts frequency of 30,000
          ◮ For word with rank 1,000, Zipf’s law predicts frequency of 60; Mandelbrot’s variation predicts frequency of 59.94
      ◮ No longer a straight line in double logarithmic space; finding best fit harder than least squares
      ◮ Zipf-Mandelbrot’s law is basis of LNRE statistical models we will introduce
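
The comparison on this slide can be reproduced with a small sketch. The function name is illustrative, and the defaults simply mirror the a = 1, C = 60,000, b = 1 example.

```python
def zipf_mandelbrot(rank, C=60_000, a=1.0, b=1.0):
    """f = C / (r + b)**a; setting b = 0 recovers Zipf's law."""
    return C / (rank + b) ** a

for rank in (1, 1000):
    zipf = zipf_mandelbrot(rank, b=0.0)
    zm = zipf_mandelbrot(rank)
    print(f"rank {rank}: Zipf {zipf:.2f}, Zipf-Mandelbrot {zm:.2f}")
```

The extra parameter b halves the prediction at rank 1 but changes it by well under one percent at rank 1,000, which is why it mainly affects the left edge of the profile.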

  42. Mandelbrot’s adjustment: Fitting the Brown rank/frequency profile
      [Figure]

  43. More fits
      [Figure]

  47. A few mildly interesting things about Zipf(-Mandelbrot)’s law
      ◮ a is often close to 1 for word frequency distributions (hence simplified version: f = C / r, and -1 slope in log-log space)
      ◮ Zipf’s law also provides good fit to frequency spectra
      ◮ Monkey languages display Zipf’s law (intuition: few short words have very high chances to be generated; long tail of highly unlikely long words)
      ◮ Zipf’s law is everywhere (Li 2002)
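
The intuition behind monkey languages can be illustrated with a small random-typing simulation. Everything specific here is an assumption chosen for the sketch: a five-letter alphabet, a space key that is as likely as any letter, a fixed seed, and the function name.

```python
import random
from collections import Counter

def monkey_text(n_chars, alphabet="abcde", seed=0):
    """Random typing: every keystroke is a letter or a space, all equally likely."""
    rng = random.Random(seed)
    keys = alphabet + " "
    return "".join(rng.choice(keys) for _ in range(n_chars))

# Split on whitespace to obtain the "words" of the monkey language.
words = monkey_text(200_000).split()
ranked = Counter(words).most_common()   # (word, frequency), most frequent first
hapaxes = sum(1 for _, count in ranked if count == 1)
```

In such a sample the top ranks are occupied by the single-letter words, while long words mostly occur once or not at all, giving the same few-giants, many-dwarves shape.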

  51. Consequences
      ◮ Data sparseness
      ◮ Standard statistics, normal approximation not appropriate for lexical type distributions
      ◮ V is not stable, will grow with sample size, we need special methods to estimate V and related quantities at arbitrary sizes (including V of whole type population)

  52. V, sample size and the Zipfian distribution
      ◮ Significant tail of hapax legomena indicates that chances of encountering new type if we keep sampling are high
      ◮ Zipfian distribution implies vocabulary curve that is still growing at largest sample size

  53. Pronouns in Italian (la Repubblica): Rank/frequency profile
      [Figure: fq (y-axis, logarithmic, 1 to 10000) against rank (x-axis, 0 to 80)]

  54. Pronouns in Italian: Frequency spectrum
      [Figure: V_m (y-axis, 0.6 to 1.4) against m (x-axis, logarithmic, 1 to 10000)]

  55. Pronouns in Italian: Vocabulary growth curve
      [Figure: V and V_1 (y-axis, 0 to 80) against N (x-axis, 0 to 4e+06)]

  56. Pronouns in Italian: Vocabulary growth curve (zooming in)
      [Figure: V and V_1 (y-axis, 0 to 80) against N (x-axis, 0 to 10000)]

  57. ri- in Italian (la Repubblica): Rank/frequency profile
      [Figure]

  58. ri- in Italian: Frequency spectrum
      [Figure: V_m (y-axis, 0 to 350) against m (x-axis, 1 to 15)]

  59. ri- in Italian: Vocabulary growth curve
      [Figure: V and V_1 (y-axis, 0 to 1000) against N (x-axis, 0 to 1000000)]

  60. Outline
      Roadmap
      Lexical statistics: the basics
      Zipf’s law
      Applications

  61. Applications
      ◮ Productivity (in morphology and elsewhere)
      ◮ Lexical richness (in stylometry, language acquisition/pathology and elsewhere)
      ◮ Extrapolation of type counts and type frequency distribution for practical NLP purposes (e.g., estimating proportion of OOV words, typos, etc.)
      ◮ ... (e.g., Good-Turing smoothing, prior distribution for Bayesian language modeling)

  62. Productivity
      ◮ In many linguistic problems, rate of growth of VGC is interesting issue in itself
      ◮ Baayen (1989 and later) makes link between linguistic notion of productivity and vocabulary growth rate

  63. Productivity in morphology: the classic definition
      Schultink (1961), translated by Booij:
      “Productivity as morphological phenomenon is the possibility which language users have to form an in principle uncountable number of new words unintentionally, by means of a morphological process which is the basis of the form-meaning correspondence of some words they know.”

  67. V as a measure of productivity
      ◮ Comparable for same N only!
      ◮ Good first approximation, but it is measuring attestedness, not potential:
          ◮ (According to rough BNC counts) de- verbs have V of 141, un- verbs have V of 119, contra our intuition
          ◮ We want productivity index of pronouns to be 0, not 72!

  68. Baayen’s P
      ◮ Operationalize productivity of a process as probability that the next token created by the process that we sample is a new word
      ◮ This is same as probability that next token in sample is hapax legomenon
      ◮ Thus, we can estimate probability of sampling a new word as relative frequency of hapax legomena in our sample:
        $P = \frac{V_1}{N}$

  69. Baayen’s P
        $P = \frac{V_1}{N}$
      ◮ Probability to sample token representing type we will never encounter again (token labeled “hapax”) at first stage of sampling (when we are at the beginning of N-token-sample) is given by the number of hapaxes in the whole N-token-sample divided by the total number of tokens in the sample
      ◮ Thus, this must also be probability that last token sampled represents new type
      ◮ P as productivity measure matches intuition that productivity should measure potential of process to generate new forms
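
As a small sketch, P can be estimated directly from a list of tokens by counting hapax legomena. The function name is made up for this example, and the toy sample a b b c a a b a is used in place of real data.

```python
from collections import Counter

def baayen_p(tokens):
    """Estimate P = V1 / N: number of hapax legomena over sample size in tokens."""
    counts = Counter(tokens)
    v1 = sum(1 for c in counts.values() if c == 1)
    return v1 / len(tokens)

# In the toy sample only "c" occurs once, so V1 = 1 and N = 8.
p = baayen_p("a b b c a a b a".split())
```

A sample with no hapaxes at all gives P = 0, which is the value we would want for a closed class.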

  70. P as vocabulary growth rate ◮ P measures the potentiality of growth of V in a very literal way, i.e., it is the growth rate of V, the rate at which vocabulary size increases ◮ P is (an approximation to) the derivative of V at N, i.e., the slope of the tangent to the vocabulary growth curve at N (Baayen 2001, pp. 49-50) ◮ Again, “rate of growth” of vocabulary generated by word formation process seems a good match for intuition about productivity of word formation process
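The growth-rate reading can be made explicit with a short worked derivation. It rests on a simplifying assumption that the slide does not state: tokens are drawn independently from a fixed distribution in which type i has probability \pi_i. Under that assumption, the expected one-token increase in V equals the expected proportion of hapaxes:

```latex
\begin{align*}
E[V(N)] &= \sum_i \bigl(1 - (1-\pi_i)^{N}\bigr) \\
E[V(N+1)] - E[V(N)] &= \sum_i \pi_i (1-\pi_i)^{N} \\
E[V_1(N+1)] &= \sum_i (N+1)\,\pi_i (1-\pi_i)^{N} \\
\Rightarrow\quad E[V(N+1)] - E[V(N)] &= \frac{E[V_1(N+1)]}{N+1}
\end{align*}
```

The left-hand side is the slope of the expected vocabulary growth curve over one token, and the right-hand side is the expected value of V_1 / N at sample size N + 1.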

  71. ri- in Italian la Repubblica corpus [Figure: V (y-axis) plotted against N (x-axis)]

  72. Pronouns in Italian la Repubblica corpus [Figure: V (y-axis) plotted against N (x-axis)]

  73. Baayen’s P and intuition
      class          V      V1    N           P
      it. ri-        1098   346   1,399,898   0.00025
      it. pronouns   72     0     4,313,123   0
      en. un-        119    25    7,618       0.00328
      en. de-        141    16    86,130      0.000185
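As a quick check, the P column can be recomputed from the V1 and N columns of the table on slide 73. The sketch below copies those counts verbatim; the helper name `productivity` and the variable names are illustrative only. It ranks the four classes by P rather than by V.

```python
# (class, V, V1, N) copied from the table on slide 73
rows = [
    ("it. ri-", 1098, 346, 1_399_898),
    ("it. pronouns", 72, 0, 4_313_123),
    ("en. un-", 119, 25, 7_618),
    ("en. de-", 141, 16, 86_130),
]


def productivity(v1, n):
    """Baayen's P: hapax legomena divided by sample size in tokens."""
    return v1 / n


# Sort by P, highest first, so the ranking can be compared with the ranking by V.
ranked = sorted(rows, key=lambda r: productivity(r[2], r[3]), reverse=True)
ranking = [name for name, _, _, _ in ranked]

for name, v, v1, n in ranked:
    print(f"{name:<13} V={v:<5} P={productivity(v1, n):.6f}")
```

Ranked by P, en. un- comes out ahead of en. de- and the pronouns come last with P = 0, whereas ranking by V alone would put de- above un-.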

  74. P and sample size ◮ We saw that as N increases, V also increases (for at-least-mildly-productive processes)

  75. P and sample size ◮ We saw that as N increases, V also increases (for at-least-mildly-productive processes) ◮ Thus, V cannot be compared at different Ns

  76. V and N: English re- and mis- [Figure: V (y-axis) plotted against N (x-axis)]

  77. P and sample size ◮ We saw that as N increases, V also increases (for at-least-mildly-productive processes) ◮ Thus, V cannot be compared at different Ns

  78. P and sample size ◮ We saw that as N increases, V also increases (for at-least-mildly-productive processes) ◮ Thus, V cannot be compared at different Ns ◮ However, growth rate is also systematically decreasing as N becomes larger
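The decrease of the growth rate with N can be illustrated with a small sketch. It does not use corpus data: it assumes, purely for illustration, independent sampling from a fixed Zipf-like distribution over a hypothetical 10,000 types, and it computes the expected value of V1 / N analytically. The names `zipf_probs` and `expected_p` are illustrative.

```python
def zipf_probs(num_types, a=1.0):
    """Hypothetical Zipf-like type probabilities: pi_r proportional to 1 / r**a."""
    weights = [1.0 / (r ** a) for r in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_p(probs, n):
    """E[V1(N)] / N = sum_i pi_i * (1 - pi_i)**(N - 1) under independent sampling."""
    return sum(p * (1.0 - p) ** (n - 1) for p in probs)


probs = zipf_probs(10_000)
for n in (1_000, 10_000, 100_000):
    # Each term pi_i * (1 - pi_i)**(N - 1) shrinks as N grows, so the sum does too.
    print(n, expected_p(probs, n))
```

Under these assumptions the expected P shrinks at every step, so a P value measured on a small sample cannot be compared directly with one measured on a large sample.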
