
Language and Stats 11-(7/6)61: Heterogeneity of language, types and tokens

Bhiksha Raj, 11-761

The fiction we maintain: to generate a text, the source randomly chooses a hidden message (the concept to be conveyed). It also …


  1. Heterogeneity • “The Dover mail was in its usual genial position that the guard suspected the passengers, the passengers suspected one another and the guard, they all suspected everybody else, and the coachman was sure of nothing but the horses; as to which cattle he could with a clear conscience have taken his oath on the two Testaments that they were not fit for the journey.” • News or fiction? • Time period of events described? • When was this written? • Nationality of author? • What else?

  5. Who said this? • “Sad”

  8. Heterogeneity in language • What are the various sources of heterogeneity in language? • How do they differ? • Homework coming up on this problem

  9. True or false • The Merriam-Webster dictionary has 470,000 words • The Merriam-Webster dictionary has over 50 million words

  11. Types vs. Tokens • Type: a uniquely identifiable value • Words in a lexicon (the left column of the dictionary) • Notes in music (how many) • Token: an instance of a type • “Number of words in this article” • “12 notes to a bar”

  12. Word types vs. word tokens • How much wood would woodchuck chuck if woodchuck could chuck wood? As much wood as woodchuck could chuck woodchuck would chuck wood. • Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, not to praise him. The evil that men do lives after them; The good is oft interred with their bones
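
As a rough illustration of the type/token distinction, the sketch below counts tokens and types in the two passages on this slide. The tokenizer (lowercase, runs of letters only) and the helper names `tokenize` and `count_types_and_tokens` are assumptions made for this example; the slides do not prescribe a tokenization.

```python
import re

WOODCHUCK = (
    "How much wood would woodchuck chuck if woodchuck could chuck wood? "
    "As much wood as woodchuck could chuck woodchuck would chuck wood."
)
CAESAR = (
    "Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, "
    "not to praise him. The evil that men do lives after them; "
    "The good is oft interred with their bones"
)


def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return re.findall(r"[a-z]+", text.lower())


def count_types_and_tokens(text):
    """Return (number of tokens, number of distinct types)."""
    tokens = tokenize(text)
    return len(tokens), len(set(tokens))


if __name__ == "__main__":
    for name, text in [("woodchuck", WOODCHUCK), ("Caesar", CAESAR)]:
        n_tokens, n_types = count_types_and_tokens(text)
        print(f"{name}: {n_tokens} tokens, {n_types} types")
```

Under this tokenizer the woodchuck passage has 22 tokens but only 9 types, while the Caesar passage has 32 tokens and 30 types.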

  13. Frequency of encountering new types • How frequently do we encounter a new type as we read the following texts? • How much wood would woodchuck chuck if woodchuck could chuck wood? As much wood as woodchuck could chuck woodchuck would chuck wood. • Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, not to praise him. The evil that men do lives after them; The good is oft interred with their bones

  36. Type-token curves • [Figure: typical type-token curve, ntypes against ntokens] • Increases monotonically • Gets flatter all the time • But never gets completely flat • There are always new words you will encounter • Type-token curves will differ for different sub-languages
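
A type-token curve can be computed in a single pass over the tokens. The following is a minimal sketch; the function name `type_token_curve` and the letters-only tokenizer are assumptions of this example, not part of the lecture.

```python
import re


def type_token_curve(tokens):
    """Return a list whose i-th entry is the number of distinct types
    seen among the first i + 1 tokens."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    text = (
        "How much wood would woodchuck chuck if woodchuck could chuck wood? "
        "As much wood as woodchuck could chuck woodchuck would chuck wood."
    )
    tokens = re.findall(r"[a-z]+", text.lower())
    for n_tokens, n_types in enumerate(type_token_curve(tokens), start=1):
        print(n_tokens, n_types)
```

By construction the curve never decreases, and it rises by one exactly when a token is a type that has not been seen before.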

  37. Comparing type-token curves for sub-languages • “Wall Street Journal” Corpus (WSJ): • Newspaper articles, 1988-1992 • Written English, rich vocabulary (leaning towards finance) • “Switchboard” Corpus (SWB): • Transcribed spoken conversations over the telephone • Prescribed topic (one of 70) • 1990s • “Broadcast News” Corpus (BN): • Transcribed TV/Radio News programs • Spoken, but somewhat scripted

  38. Comparing type-token curves for sub-languages

  39. WSJ vs BN vs SWB (log scale) Note: slope << 1

  40. Token-type curves: Bigrams • The number of bigram types is greater than the number of unigram types • The probability of hitting a “new” bigram type is higher • The curve is steeper, but flattens out after a few tens of millions of tokens • Distinctions between sub-languages are starker

  41. Bigram Token Type Curve – BN vs. SWB (log scale) Note: slope closer to 1.0 than for unigrams

  42. Token-type curves: Trigrams • The number of trigram types is greater than the number of bigram types • The curve is steeper than for bigrams • Will flatten out after hundreds of millions of tokens
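
The same counting extends to n-grams by treating each run of n consecutive tokens as one item. This is a small sketch under that assumption; `ngrams` and `count_types` are hypothetical helper names, and the toy text is far too short to show the flattening described on the slides.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_types(items):
    """Number of distinct items."""
    return len(set(items))


if __name__ == "__main__":
    tokens = (
        "how much wood would woodchuck chuck if woodchuck could chuck wood "
        "as much wood as woodchuck could chuck woodchuck would chuck wood"
    ).split()
    for n in (1, 2, 3):
        grams = ngrams(tokens, n)
        print(f"n={n}: {len(grams)} tokens, {count_types(grams)} types")
```

On a real corpus, running this for n = 1, 2, 3 over growing prefixes would give the unigram, bigram and trigram curves being compared here.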

  43. Trigram Token-Type Curve – BN vs. SWB (log scale) Note: slope almost 1.0

  44. Head of word-frequency lists Count unit: 1000 • WSJ vs BN vs Switchboard

  45. Tail of word-frequency lists Singletons (Count = 1) • WSJ vs BN vs Switchboard

  46. Sub-language Example 2 • The Diabetes set includes 9 diabetes-related journals, with a total of 4.5M tokens and 95K types. • The Veterinary science set includes 11 journals, with 3.2M tokens and 87K types. • All journals were extracted from PubMed in Oct 2010 and include everything that was available from those journals up until then. • This example is provided by Dana Movshovitz-Attias.

  47. Diabetes vs. Veterinary: Type-Token Curve

  48. Diabetes vs. Veterinary: Type-Token Curve (log scale)

  49. Head of Word Frequency List (counts per 1,000 tokens)

      Diabetes    Count    Veterinary    Count
      THE         42       THE           57
      OF          35       OF            39
      AND         31       AND           30
      IN          29       IN            29
      TO          16       TO            17
      WITH        13       A             14
      A           13       WERE          11
      FOR         10       WAS           10
      WAS         10       FOR           10
      WERE        9        WITH          9
      DIABETES    7        FROM          7
      THAT        7        THAT          6
      BY          6        IS            6
      IS          6        AS            6
      2           6        BY            6
      AS          5        ON            5
      INSULIN     5        AT            5
      OR          5        1             4
      GLUCOSE     5        BE            4
      1           5        THIS          4

  50. Tail of Word Frequency List: Count=1 (“Singletons”)

      Diabetes                Veterinary
      QUESTIONNAIRE-BASED     MOLARITIES
      CAPACITY-CONSTRAINED    LIDOCAIN
      DND                     MULTIORGAN
      1003500                 MICROGLIA-MEDIATED
      ENZYME-INHIBITOR        NALYSIS
      ALVEOLUS-CAPILLARY      10702
      KUZUYA                  BLUE-DNA
      $6054                   HAIR-LOSS
      SENTENCING              POPULATION-DYNAMICAL
      PAPER-AND-PENCIL        STATE-TRANSITION
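
The head and tail of a word-frequency list can both be read off a single count table. The sketch below is an illustration only: the function name `head_and_singletons` is hypothetical, and the toy sentence stands in for the PubMed journal sets, which are not reproduced here.

```python
from collections import Counter


def head_and_singletons(tokens, head_size=5):
    """Return (head, singletons).

    head: the `head_size` most frequent types with counts per 1,000 tokens.
    singletons: sorted list of types that occur exactly once.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    head = [(w, 1000.0 * c / total) for w, c in counts.most_common(head_size)]
    singletons = sorted(w for w, c in counts.items() if c == 1)
    return head, singletons


if __name__ == "__main__":
    toy = "the cat and the dog and the bird".split()
    head, singletons = head_and_singletons(toy, head_size=2)
    print(head)
    print(singletons)
```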

  51. An interesting feature • In every case, word frequency count goes down very fast • Let's plot the relative frequency of words against their rank...

  52. P(word) vs rank • The probability of a word falls off rapidly with rank! • This is an instance of a more generic principle...

  54. A peculiar phenomenon.. • There are many more rare things than there are common things! • This is true, not just of words.. • In any sufficiently large collection.. • The most frequent event is • ~2 times as frequent as the second most frequent event • ~3 times as frequent as the third most frequent event • ~4 times as frequent as the fourth most frequent event • .. • ~N times as frequent as the N-th most frequent event

  55. Typical behavior • Rank vs relative frequency • [Figure: frequency against rank for ranks 1 to 10; if the top-ranked item has frequency $f$, the following items have frequencies $f/2, f/3, \dots, f/10$]

  56. Typical behavior in Log domain • Rank vs relative frequency • In a log-log plot it's just a line with a slope of approximately -1

  57. Examples: Population of cities • Caveat: Axes are flipped w.r.t. earlier figure • The most populous city is approx. twice as populous as the second most populous city, and so on.

  58. Examples: AOL users vs. sites • AOL visitors to sites • http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html

  59. Examples: Cryptocurrencies • https://steemit.com/steem/@akrid/applying-zipf-s-law-to-the-crypto-market

  60. Examples: UNK • I think this is an example • No clue what it is..

  61. And, of course, words.. • Word counts in Wikipedias of 30 languages • (from Wikipedia)

  62. Zipf’s law • George Kingsley Zipf (1902-1950), linguist • Define the probability of a word in terms of its rank • Zipf’s hypothesis: $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • This is an empirical law

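  63. Zipf’s law • [Figure: $P(\text{word})$ against $\text{rank}(\text{word})$, and $\log P(\text{word})$ against $\log \text{rank}(\text{word})$] • $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • In a log-log plot the relationship is linear: $\log P(\text{word}) = C - \log \text{rank}(\text{word})$, where $C$ is the log of the proportionality constant • The slope of the plot is -1.0.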
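
One way to check the slope claim on real counts is an ordinary least-squares fit of log frequency against log rank. The sketch below is an assumption-laden illustration: `loglog_slope` is a hypothetical helper, and the input is synthetic data built to follow count = 1000 / rank exactly, not a corpus from the lecture.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    `counts` is a sequence of at least two positive counts; ranks are
    assigned after sorting in decreasing order (rank 1 = most frequent).
    """
    freqs = sorted(counts, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


if __name__ == "__main__":
    # Synthetic counts that follow count(rank) = 1000 / rank exactly.
    ideal = [1000.0 / r for r in range(1, 201)]
    print(loglog_slope(ideal))
```

For the synthetic input the fitted slope is -1 up to floating-point error; on corpus counts the fit would only be approximate, and the very rare words tend to dominate an unweighted fit.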

  64. Frequency vs. rank, Brown Corpus • Brown Corpus (1969): 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961 • 15 text categories • Appears to match Zipf

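  65. Is Zipf’s distribution valid? • [Figure: $P(\text{word})$ against $\text{rank}(\text{word})$] • Is the following a valid distribution? $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • Correction: $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})^{1+\epsilon}}$

  66. Adjustment • [Figure: $P(\text{word})$ against $\text{rank}(\text{word})$] • Is the following a valid distribution? $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • Only for finite vocabularies, because the sum of $1/\text{rank}$ over all ranks diverges • Correction for infinite vocabularies: $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})^{1+\epsilon}}$, with $\epsilon > 0$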
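
The validity question comes down to whether the weights can be normalized. The sketch below compares partial sums numerically; `partial_sum` and `zipf_pmf` are hypothetical helper names, and the exponent 2.0 is simply a convenient stand-in for an exponent greater than 1.

```python
def partial_sum(n, exponent):
    """Sum of 1 / r**exponent for r = 1..n."""
    return sum(1.0 / r ** exponent for r in range(1, n + 1))


def zipf_pmf(vocab_size, exponent=1.0):
    """Probabilities proportional to 1 / rank**exponent over a finite vocabulary."""
    weights = [1.0 / r ** exponent for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    for n in (10**2, 10**4, 10**6):
        print(n, partial_sum(n, 1.0), partial_sum(n, 2.0))
    pmf = zipf_pmf(1000)
    print(sum(pmf), pmf[0] / pmf[1])
```

With exponent 1 the partial sums keep growing (roughly like the natural log of n), so no normalizing constant exists for an infinite vocabulary. With exponent 2 they stay below $\pi^2/6 \approx 1.645$. For any finite vocabulary the weights can always be normalized, as `zipf_pmf` does.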

  67. Word unigrams, WSJ corpus • Word frequency vs rank, WSJ corpus

  68. Word bigrams, WSJ corpus • Word bigram frequency vs rank, WSJ corpus

  69. Word trigrams, WSJ corpus • Word trigram frequency vs rank, WSJ corpus

  70. Word trigrams, WSJ corpus • Falls off rapidly with rank as predicted

  71. Zipf’s law • Zipf’s law is seen to apply to a large variety of natural phenomena • Zipf suggested it is the outcome of the “principle of least effort” • We, and nature, spend the least effort most of the time • Short, easy (or easy-to-recall) words are used most frequently. Longer, harder words are less frequent. • But a more mathematical explanation came from Benoit Mandelbrot and George Miller

  72. Zipf’s law: Monkey on a typewriter • A monkey on a keyboard will produce text that follows Zipf’s law • Shorter words are more likely than longer words. If the keyboard has only 26 characters + space: $P(\text{length} = n) = \frac{26^n}{27^{n+1}}$ and $P(\text{word} \mid \text{length} = n) = \frac{1}{26^n}$ • Combining these will give you Zipf’s law (not quite; but we return to this) • Problem: Language is not a monkey on a typewriter
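
The two probabilities on this slide can be combined directly. The sketch below assumes 26 equally likely letter keys plus a space bar, as on the slide; the rank bookkeeping (`last_rank`, `approx_slope`) is an illustration added here, not something the slides spell out.

```python
import math

LETTERS = 26          # letter keys
KEYS = LETTERS + 1    # letter keys plus the space bar


def p_length(n):
    """P(length = n): n letters followed by a space."""
    return LETTERS ** n / KEYS ** (n + 1)


def p_word_given_length(n):
    """Each of the 26**n strings of length n is equally likely."""
    return 1.0 / LETTERS ** n


def p_word(n):
    """Probability of one specific word of length n."""
    return p_length(n) * p_word_given_length(n)


def last_rank(n):
    """Rank of the last word of length n when words are sorted by
    probability (all shorter words come first; ties are arbitrary)."""
    return sum(LETTERS ** k for k in range(1, n + 1))


def approx_slope(n_lo, n_hi):
    """Slope of log P against log rank between two word lengths."""
    d_log_p = math.log(p_word(n_hi)) - math.log(p_word(n_lo))
    d_log_rank = math.log(last_rank(n_hi)) - math.log(last_rank(n_lo))
    return d_log_p / d_log_rank


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, last_rank(n), p_word(n))
    print(approx_slope(1, 5))
```

Every word of a given length has the same probability, so the rank-probability plot is a staircase; the slope between steps comes out close to, but not exactly, -1.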

  73. Don’t try this at home • From Wikipedia: “In 2003, lecturers and students from the University of Plymouth MediaLab Arts course used a £2,000 grant from the Arts Council to study the literary output of real monkeys. They left a computer keyboard in the enclosure of six Celebes crested macaques in Paignton Zoo in Devon in England for a month, with a radio link to broadcast the results on a website… Not only did the monkeys produce nothing but five total pages largely consisting of the letter S, the lead male began by bashing the keyboard with a stone, and the monkeys continued by urinating and defecating on it.”

  74. The Pareto distribution • Pareto type I: $P(x) = \frac{\alpha x_m^{\alpha}}{x^{\alpha+1}}$ for $x \ge x_m$ • Discovered by Vilfredo Pareto (1848-1923) • Real-life phenomena are concentrated into a few frequent types, with a long tail of infrequent ones: 20% of the values account for 80% of the probability mass • A wide range of phenomena approximately follow the Pareto distribution: • The distribution of wealth (most money in the hands of a few) • The sizes of human settlements (few cities, many hamlets/villages) • File size distribution of Internet traffic which uses the TCP protocol (many smaller files, few larger ones) • Hard disk drive error rates • The values of oil reserves in oil fields (a few large fields, many small fields) • The sizes of sand particles
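
The 80/20 statement corresponds to one particular shape parameter. The sketch below uses the standard Pareto type I density and the standard closed form for the share of the total held by the largest fraction of values (valid for shape greater than 1); the names `pareto_pdf` and `top_share` and the symbols `alpha` and `x_min` are labels chosen for this example.

```python
import math


def pareto_pdf(x, alpha, x_min):
    """Pareto type I density: alpha * x_min**alpha / x**(alpha + 1) for x >= x_min."""
    if x < x_min:
        return 0.0
    return alpha * x_min ** alpha / x ** (alpha + 1)


def top_share(fraction, alpha):
    """Share of the total held by the largest `fraction` of values.

    Uses the closed form fraction**(1 - 1/alpha), which requires alpha > 1.
    """
    if alpha <= 1:
        raise ValueError("the mean is infinite for alpha <= 1")
    return fraction ** (1.0 - 1.0 / alpha)


if __name__ == "__main__":
    alpha = math.log(5) / math.log(4)  # about 1.16
    print(alpha, top_share(0.2, alpha))
```

With `alpha = log(5) / log(4)` the top 20% of values account for 80% of the total; larger values of `alpha` give a less extreme concentration.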

  75. Zipf’s law: The Pareto distribution • Matthew 13:12: “For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath.” • When you use a word once, it's easier to recall, so you are more likely to use it again • Over time, you will use familiar words more and more frequently and unfamiliar ones less and less • The same principle applies to other data

  76. Mandelbrot.. • Benoit Mandelbrot (1924-2010) • Fractal theory, chaos theory, Mandelbrot-Zipf law

  77. Mandelbrot-Zipf law • The monkey on the typewriter doesn’t actually produce Zipf’s distribution • Instead you get the Mandelbrot distribution: $P(w) = \frac{C}{(\beta + \text{rank}(w))^{\alpha}}$ • Zipf’s rule is a special case if $\beta = 0$ and $\alpha = 1$

  78. Mandelbrot distribution • $P(w) = \frac{C}{(\beta + \text{rank}(w))^{\alpha}}$ • In logs: $\log P(w) = \log C - \alpha \log(\beta + \text{rank}(w))$ • No longer exactly a line (curves “down”) • $\alpha$ changes the slope of the curve • $\beta$ changes the log transform itself • For low-rank $w$, it allows you to start from a relatively high value of the log term • Better fits most data
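
To see how the two parameters act, the sketch below normalizes the Mandelbrot weights over a finite vocabulary. Writing the exponent as `alpha` and the rank offset as `beta` is a labeling assumed for this example, as are the function names; the parameter values are arbitrary.

```python
import math


def mandelbrot_pmf(vocab_size, alpha, beta):
    """Probabilities proportional to (beta + rank)**(-alpha) for rank = 1..vocab_size."""
    weights = [(beta + r) ** -alpha for r in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def local_slope(pmf, rank):
    """Slope of log P against log rank between `rank` and `rank + 1` (1-based)."""
    d_log_p = math.log(pmf[rank]) - math.log(pmf[rank - 1])
    d_log_rank = math.log(rank + 1) - math.log(rank)
    return d_log_p / d_log_rank


if __name__ == "__main__":
    zipf = mandelbrot_pmf(10000, alpha=1.0, beta=0.0)
    shifted = mandelbrot_pmf(10000, alpha=1.0, beta=2.7)
    for rank in (1, 10, 100, 1000):
        print(rank, local_slope(zipf, rank), local_slope(shifted, rank))
```

With `beta = 0` and `alpha = 1` the local slope is -1 at every rank, which recovers Zipf's rule. A positive `beta` makes the curve shallower at low ranks while it approaches slope `-alpha` at high ranks, which is the downward bend in the log-log plot.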

  79. These curves (Unigram).. • Actually better fit the Mandelbrot distribution for different values of $\alpha$ and $\beta$
