Heterogeneity • The Dover mail was in its usual genial position that the guard suspected the passengers, the passengers suspected one another and the guard, they all suspected everybody else, and the coachman was sure of nothing but the horses; as to which cattle he could with a clear conscience have taken his oath on the two Testaments that they were not fit for the journey. • News or fiction? • Time period of events described? • When was this written? • Nationality of author? • What else.. 11-761 22
Who said this? • “Sad”
Heterogeneity in language • What are the various sources of heterogeneity in language? • How do they differ? • Homework coming up on this problem
True or false • The Merriam Webster dictionary has 470,000 words • The Merriam Webster dictionary has over 50 million words
Types vs. Tokens • Type: Uniquely identifiable value • Words in a lexicon (the left column of the dictionary) • Notes in music (how many) • Token: Instances of types • “Number of words in this article” • “12 notes to a bar”
Word types vs. word tokens • How much wood would woodchuck chuck if woodchuck could chuck wood? As much wood as woodchuck could chuck woodchuck would chuck wood. • Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, not to praise him. The evil that men do lives after them; The good is oft interred with their bones
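The type/token distinction can be checked by counting on the two passages above. This is a minimal sketch: the `tokenize` helper (lowercasing and keeping runs of letters) is an assumption made for illustration, and a different tokenizer would give different counts.

```python
import re

def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())

def count_tokens_and_types(text):
    tokens = tokenize(text)
    # Tokens are all word instances; types are the distinct words.
    return len(tokens), len(set(tokens))

woodchuck = ("How much wood would woodchuck chuck if woodchuck could chuck wood? "
             "As much wood as woodchuck could chuck woodchuck would chuck wood.")
caesar = ("Friends, Romans, countrymen, lend me your ears; "
          "I come to bury Caesar, not to praise him. "
          "The evil that men do lives after them; "
          "The good is oft interred with their bones")

for name, text in [("woodchuck", woodchuck), ("caesar", caesar)]:
    n_tokens, n_types = count_tokens_and_types(text)
    print(f"{name}: {n_tokens} tokens, {n_types} types")
```

Under this tokenizer the tongue twister reuses a handful of types, while nearly every token of the Caesar passage is a new type.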
Frequency of encountering new types • How frequently do we encounter a new type as we read the following texts: • How much wood would woodchuck chuck if woodchuck could chuck wood? As much wood as woodchuck could chuck woodchuck would chuck wood. • Friends, Romans, countrymen, lend me your ears; I come to bury Caesar, not to praise him. The evil that men do lives after them; The good is oft interred with their bones
Type-token curves • [Figure: number of types (ntypes) plotted against number of tokens (ntokens)] • Typical type-token curve • Increases monotonically • Gets flatter all the time • But never gets completely flat • There are always new words you will encounter • Type-token curves will differ for different sub-languages
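A type-token curve can be computed in a single pass over the tokens. The sketch below is illustrative only: the function name `type_token_curve` and the letters-only tokenizer are assumptions, not part of the course material.

```python
import re

def type_token_curve(tokens):
    """Entry i is the number of distinct types seen in the first i+1 tokens."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)            # no effect if the type was seen before
        curve.append(len(seen))  # so the curve never decreases
    return curve

text = "How much wood would woodchuck chuck if woodchuck could chuck wood"
tokens = re.findall(r"[a-z]+", text.lower())
curve = type_token_curve(tokens)
print(curve)
```

The curve rises by one whenever a new type appears and stays flat otherwise, which is why it increases monotonically and flattens as the text repeats itself.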
Comparing type-token curves for sub-languages • “Wall Street Journal” Corpus (WSJ): • Newspaper articles, 1988-1992 • Written English, rich vocabulary (leaning towards finance) • “Switchboard” Corpus (SWB): • Transcribed spoken conversations over the telephone • Prescribed topic (one of 70) • 1990s • “Broadcast News” Corpus (BN): • Transcribed TV/Radio News programs • Spoken, but somewhat scripted
Comparing type-token curves for sub-languages
WSJ vs BN vs SWB (log scale) Note: slope << 1
Token-type curves: Bigrams • The number of bigram types is greater than the number of unigram types • The probability of hitting a “new” bigram type is higher • The curve is steeper, but flattens out after a few tens of millions of tokens • Distinctions between sub-languages are more stark
Bigram Token Type Curve – BN vs. SWB (log scale) Note: slope closer to 1.0 than for unigrams
Token-type curves: Trigrams • The number of trigram types is greater than the number of bigram types • The curve is steeper than for bigrams • Will flatten out after hundreds of millions of tokens
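The steeper curves for higher-order n-grams can be seen even on a tiny text by comparing the fraction of n-gram tokens that are distinct types. This is a rough sketch; `ngrams` and `type_token_ratio` are hypothetical helper names, and n-grams are allowed to cross the sentence boundary for simplicity.

```python
import re

def ngrams(tokens, n):
    # Slide a window of width n over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def type_token_ratio(tokens, n):
    grams = ngrams(tokens, n)
    return len(set(grams)) / len(grams)

text = ("How much wood would woodchuck chuck if woodchuck could chuck wood? "
        "As much wood as woodchuck could chuck woodchuck would chuck wood.")
tokens = re.findall(r"[a-z]+", text.lower())

for n in (1, 2, 3):
    print(n, round(type_token_ratio(tokens, n), 2))
```

Even in this highly repetitive text, the share of new types grows with n, which matches the slopes approaching 1.0 for bigrams and trigrams.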
Trigram Token-Type Curve – BN vs. SWB (log scale) Note: slope almost 1.0
Head of word-frequency lists Count unit: 1000 • WSJ vs BN vs Switchboard
Tail of word-frequency lists Singletons (Count = 1) • WSJ vs BN vs Switchboard
Sub-language Example 2 • The Diabetes set includes 9 diabetes-related journals, with a total of 4.5M tokens and 95K types. • The Veterinary science set includes 11 journals, with 3.2M tokens and 87K types. • All journals were extracted from PubMed in Oct 2010 and include everything available from those journals up to that date. • This example is provided by Dana Movshovitz-Attias.
Diabetes vs. Veterinary: Type-Token Curve
Diabetes vs. Veterinary: Type-Token Curve (log scale)
Head of Word Frequency List (counts per 1,000 tokens)
Diabetes    Count    Veterinary    Count
THE         42       THE           57
OF          35       OF            39
AND         31       AND           30
IN          29       IN            29
TO          16       TO            17
WITH        13       A             14
A           13       WERE          11
FOR         10       WAS           10
WAS         10       FOR           10
WERE        9        WITH          9
DIABETES    7        FROM          7
THAT        7        THAT          6
BY          6        IS            6
IS          6        AS            6
2           6        BY            6
AS          5        ON            5
INSULIN     5        AT            5
OR          5        1             4
GLUCOSE     5        BE            4
1           5        THIS          4
Tail of Word Frequency List: Count=1 (“Singletons”)
Diabetes                Veterinary
QUESTIONNAIRE-BASED     MOLARITIES
CAPACITY-CONSTRAINED    LIDOCAIN
DND                     MULTIORGAN
1003500                 MICROGLIA-MEDIATED
ENZYME-INHIBITOR        NALYSIS
ALVEOLUS-CAPILLARY      10702
KUZUYA                  BLUE-DNA
$6054                   HAIR-LOSS
SENTENCING              POPULATION-DYNAMICAL
PAPER-AND-PENCIL        STATE-TRANSITION
An interesting feature • In every case, word frequency count goes down very fast • Let's plot the relative frequency of words against their rank..
P(word) vs rank • The probability of a word falls off rapidly with rank! • This is an instance of a more generic principle..
A peculiar phenomenon.. • There are many more rare things than there are common things! • This is true, not just of words.. • In any sufficiently large collection.. • The most frequent event is • ~2 times as frequent as the second most frequent event • ~3 times as frequent as the third most frequent event • ~4 times as frequent as the fourth most frequent event • .. • ~N times as frequent as the N-th most frequent event
Typical behavior • [Figure: frequency vs. rank for ranks 1 to 10; the frequency at rank k is the rank-1 frequency divided by k] • Rank vs relative frequency
Typical behavior in Log domain • Rank vs relative frequency • In a log-log plot it's just a line with slope approximately -1
Examples: Population of cities • Caveat: Axes are flipped w.r.t. earlier figure • The most populous city is approx. twice as populous as the second most populous city and so on..
Examples: AOL users vs. sites • AOL visitors to sites • http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
Examples: Cryptocurrencies • https://steemit.com/steem/@akrid/applying-zipf-s-law-to-the-crypto-market
Examples: UNK • I think this is an example • No clue what it is..
And, of course, words.. • Word counts in wikipedias of 30 languages • (from Wikipedia)
Zipf’s law • George Kingsley Zipf (1902-1950), linguist • Define the probability of a word in terms of its rank • Zipf’s hypothesis: $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • This is an empirical law
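Zipf’s law • [Figure: $P(\text{word})$ vs. $\text{rank}(\text{word})$, and $\log P(\text{word})$ vs. $\log \text{rank}(\text{word})$] • $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • In a log-log plot the relationship is linear: $\log P(\text{word}) = C - \log(\text{rank}(\text{word}))$, where $C$ is the log of the proportionality constant • The slope of the plot is -1.0.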
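One way to check the slope claim on a frequency list is a least-squares fit in the log-log domain. The sketch below is an illustration under stated assumptions: `loglog_slope` is a hypothetical helper, and the input is synthetic data that follows Zipf's law exactly rather than counts from any of the corpora in these slides.

```python
import math

def loglog_slope(freqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(freqs, reverse=True)  # rank 1 = most frequent
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic frequencies that follow Zipf's law exactly: f(r) = 1000 / r.
ideal = [1000.0 / r for r in range(1, 51)]
print(round(loglog_slope(ideal), 3))
```

On real word counts the fitted slope is only approximately -1, and the fit is usually poorer at the very top and in the long tail.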
Frequency vs. rank, Brown Corpus • Brown Corpus (1969) • 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961 • 15 text categories • Appears to match Zipf
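Is Zipf’s distribution valid? • [Figure: $P(\text{word})$ vs. $\text{rank}(\text{word})$] • Is the following a valid distribution? $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • Correction: $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})^{1+\epsilon}}$
Adjustment • [Figure: $P(\text{word})$ vs. $\text{rank}(\text{word})$] • Is the following a valid distribution? $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})}$ • Only for finite vocabularies (over infinitely many ranks the sum of $1/\text{rank}$ diverges, so it cannot be normalized) • Correction for infinite vocabularies: $P(\text{word}) \propto \frac{1}{\text{rank}(\text{word})^{1+\epsilon}}$, with $\epsilon > 0$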
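The validity question comes down to whether a normalizing constant exists. A short worked version follows, writing $r$ for the rank and $V$ for the vocabulary size; $H_V$ (the harmonic number) and $\zeta$ (the Riemann zeta function) are standard names introduced here, not symbols from the slides.

```latex
\[
\sum_{r=1}^{V} \frac{1}{r} = H_V \approx \ln V + 0.577
\quad\Rightarrow\quad
P(r) = \frac{1}{r\,H_V} \ \text{is a valid distribution for finite } V,
\]
\[
\lim_{V \to \infty} H_V = \infty
\quad\Rightarrow\quad
\text{no normalizing constant exists for an infinite vocabulary},
\]
\[
\sum_{r=1}^{\infty} \frac{1}{r^{1+\epsilon}} = \zeta(1+\epsilon) < \infty \ \text{for } \epsilon > 0
\quad\Rightarrow\quad
P(r) = \frac{1}{\zeta(1+\epsilon)\, r^{1+\epsilon}}.
\]
```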
Word unigrams, WSJ corpus • Word frequency vs rank, WSJ corpus
Word bigrams, WSJ corpus • Word bigram frequency vs rank, WSJ corpus
Word trigrams, WSJ corpus • Word trigram frequency vs rank, WSJ corpus
Word trigrams, WSJ corpus • Falls off rapidly with rank as predicted
Zipf’s law • Zipf’s law is seen to apply to a large variety of natural phenomena • Zipf suggested it is the outcome of the “principle of least effort” • We, and nature, spend the least effort most of the time • Short, easy (or easy-to-recall) words are used most frequently. Longer, harder words are less frequent. • But a more mathematical explanation came from Benoit Mandelbrot and George Miller
Zipf’s law: Monkey on a typewriter • A monkey on a keyboard will produce text that follows Zipf’s law • Shorter words are more likely than longer words. If the keyboard has only 26 characters + space: $P(\text{length} = n) = \frac{26^n}{27^{n+1}}$ and $P(\text{word} \mid \text{length} = n) = \frac{1}{26^n}$ • Combining these will give you Zipf’s law (not quite; but we return to this) • Problem: Language is not a monkey on a typewriter
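The two probabilities above can be combined exactly, and the monkey can be simulated. This is a minimal sketch assuming all 27 keys are equally likely and independent; the names `p_length`, `p_word` and `monkey_words` are hypothetical, and empty "words" (two spaces in a row, the $n = 0$ case) are dropped from the simulated output.

```python
import random
from collections import Counter
from fractions import Fraction

KEYS = "abcdefghijklmnopqrstuvwxyz "  # 26 letters + space, assumed equally likely

def p_length(n):
    # n letters followed by one space: (26/27)^n * (1/27)
    return Fraction(26 ** n, 27 ** (n + 1))

def p_word(word):
    # P(length = n) * P(word | length = n) = 1 / 27^(n+1)
    n = len(word)
    return p_length(n) * Fraction(1, 26 ** n)

def monkey_words(num_keys, seed=0):
    rng = random.Random(seed)
    typed = "".join(rng.choice(KEYS) for _ in range(num_keys))
    return [w for w in typed.split(" ") if w]  # drop empty words

by_length = Counter(len(w) for w in monkey_words(200_000))
print(p_word("a"), p_word("ab"))
print(sorted(by_length.items())[:5])
```

Every specific word of length n has probability 1/27^(n+1), so each added letter makes a given word 27 times less likely, while the number of distinct words of that length grows by a factor of 26.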
Don’t try this at home • From Wikipedia: In 2003, lecturers and students from the University of Plymouth MediaLab Arts course used a £2,000 grant from the Arts Council to study the literary output of real monkeys. They left a computer keyboard in the enclosure of six Celebes crested macaques in Paignton Zoo in Devon in England for a month, with a radio link to broadcast the results on a website… Not only did the monkeys produce nothing but five total pages largely consisting of the letter S, the lead male began by bashing the keyboard with a stone, and the monkeys continued by urinating and defecating on it.
The Pareto distribution • Pareto type 1: $p(x) = \frac{\alpha\, x_m^{\alpha}}{x^{\alpha+1}}$ for $x \ge x_m$ • Discovered by Vilfredo Pareto (1848-1923) • Real-life phenomena are concentrated into a few frequent types, with a long tail of infrequent ones. Roughly 20% of the values account for 80% of the probability mass. • A surprisingly wide range of phenomena approximately follow the Pareto distribution • The distribution of wealth (most money in the hands of a few) • The sizes of human settlements (few cities, many hamlets/villages) • File size distribution of Internet traffic that uses TCP (many smaller files, few larger ones) • Hard disk drive error rates • The values of oil reserves in oil fields (a few large fields, many small fields) • The sizes of sand particles
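The 80/20 figure corresponds to one particular shape parameter. The sketch below uses the standard closed form for a type 1 Pareto distribution with shape alpha > 1, where the top fraction p of values holds a share p^(1 - 1/alpha) of the total; the function names are hypothetical and the formula is textbook material rather than something stated on the slide.

```python
import math

def pareto_pdf(x, alpha, x_m):
    # Type 1 Pareto density: alpha * x_m^alpha / x^(alpha + 1) for x >= x_m.
    if x < x_m:
        return 0.0
    return alpha * x_m ** alpha / x ** (alpha + 1)

def top_share(p, alpha):
    # Share of the total held by the top fraction p (requires alpha > 1).
    return p ** (1.0 - 1.0 / alpha)

# The shape for which the top 20% hold exactly 80%: alpha = log(5) / log(4).
alpha_8020 = math.log(5) / math.log(4)
print(round(alpha_8020, 3), round(top_share(0.2, alpha_8020), 3))
```

Smaller values of alpha (closer to 1) concentrate even more of the total in the top few values; larger values spread it out.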
Zipf’s law: The Pareto distribution • Matthew 13 : 12 For whosoever hath, to him shall be given, and he shall have more abundance: but whosoever hath not, from him shall be taken away even that he hath. • When you use a word once, it's easier to recall, so you are more likely to use it again • Over time, you will use familiar words more and more frequently and unfamiliar ones less and less • The same principle applies to other data
Mandelbrot.. • Benoit Mandelbrot (1924-2010) • Fractal theory, Chaos theory, Mandelbrot-Zipf law
Mandelbrot-Zipf law • The monkey on the typewriter doesn’t actually produce Zipf’s distribution • Instead you get the Mandelbrot distribution: $P(w) = \frac{C}{(\beta + \text{rank}(w))^{\alpha}}$ • Zipf’s rule is a special case if $\beta = 0$ and $\alpha = 1$
Mandelbrot distribution • $P(w) = \frac{C}{(\beta + \text{rank}(w))^{\alpha}}$ • In logs: $\log P(w) = c - \alpha \log(\beta + \text{rank}(w))$, with $c = \log C$ • No longer exactly a line (curves “down”) • $\alpha$ changes the slope of the curve • $\beta$ changes the log transform itself • For low-rank $w$, it allows you to start from a relatively high log transform • Better fits most data
These curves (Unigram).. • Actually better fit the Mandelbrot distribution for different values of $\alpha$ and $\beta$
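The effect of the two parameters can be explored numerically. This is a minimal sketch over a finite number of ranks; the parameter names `alpha` (exponent) and `beta` (rank offset) are labels chosen here for illustration, and the example values are arbitrary rather than fitted to any corpus.

```python
def zipf_mandelbrot(num_ranks, alpha=1.0, beta=0.0):
    # Unnormalized weight 1 / (beta + rank)^alpha, then normalize to sum to 1.
    weights = [1.0 / (beta + r) ** alpha for r in range(1, num_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

zipf = zipf_mandelbrot(1000)                          # beta = 0, alpha = 1: plain Zipf
shifted = zipf_mandelbrot(1000, alpha=1.1, beta=2.7)  # arbitrary example values

print(round(zipf[0] / zipf[1], 3))        # ratio of rank 1 to rank 2 under Zipf
print(round(shifted[0] / shifted[1], 3))  # the offset flattens the head of the curve
```

With a positive offset, the top-ranked items are closer together in probability than Zipf's rule predicts, which is the downward-curving head seen in the log-log plots.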