Probability for linguists
John A Goldsmith
July 6, 2015

Overall strategy
1. probabilities and distributions
2. unigram probability
3. a word about parametric distributions
4. $-1 \times \log_2$ probability (or plog: positive log probability)
5. bigram probability: conditional probability
6. mutual information: the log of the ratio of the observed to the “expected”
7. average plog → entropy
8. encoding events: compression, optimal compression, and cross-entropy
9. encoding grammars optimally

A distribution
Big point 1
A distribution is a list of numbers that are not negative and that sum to 1.
$\sum_i p_i = 1$
$p_i \ge 0$

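As a minimal sketch of Big point 1, the check below tests whether a list of numbers qualifies as a distribution. The function name `is_distribution` and the tolerance `tol` are illustrative assumptions, not part of the slides; the tolerance is there only because floating-point sums are rarely exactly 1.

```python
import math

def is_distribution(values, tol=1e-9):
    """Return True if no value is negative and the values sum to 1 (within tol)."""
    non_negative = all(v >= 0 for v in values)
    sums_to_one = math.isclose(sum(values), 1.0, abs_tol=tol)
    return non_negative and sums_to_one

print(is_distribution([0.5, 0.25, 0.25]))  # non-negative and sums to 1
print(is_distribution([0.7, 0.5, -0.2]))   # sums to 1 but has a negative entry
print(is_distribution([0.5, 0.25]))        # non-negative but sums to 0.75
```

Both conditions are needed: the second list sums to 1 and still fails, because one entry is negative.
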
A probabilistic grammar
• A probabilistic model, or grammar, is a universe of possibilities (“sample space”) + a distribution.
• A probabilistic grammar is a distribution over all strings of the IPA alphabet.
• It is not a formalism stating which strings are in and which are out.

The purpose of a probabilistic model
Big point 2
The purpose of a probabilistic model is to test the model against the data.
• Suppose we have some well-chosen data D. Then the best grammar is the one that assigns the highest probability to D, all other things being equal.
• The goal is not to test the data!
• Therefore: all grammars must be probabilistic, so they can be tested and evaluated.

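One way to make Big point 2 concrete is to score two toy models against the same data. The sketch below assumes, purely for illustration, that each model is a unigram distribution over symbols and that the probability of the data is the product of the probabilities of its symbols. The data string and both models are invented, and logs (base 2) are summed instead of multiplying many small numbers.

```python
import math

def log2_prob_of_data(data, model):
    """Sum of log2 probabilities of each symbol in data under a unigram model.

    Assumes symbols are independent; returns -inf if the model gives any
    symbol probability 0.
    """
    total = 0.0
    for symbol in data:
        p = model.get(symbol, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log2(p)
    return total

# Hypothetical data and two hypothetical models (each sums to 1).
data = "aabab"
model_1 = {"a": 0.5, "b": 0.5}
model_2 = {"a": 0.6, "b": 0.4}

score_1 = log2_prob_of_data(data, model_1)
score_2 = log2_prob_of_data(data, model_2)
best = "model_2" if score_2 > score_1 else "model_1"
print(score_1, score_2, best)
```

In this toy case the model whose probabilities match the relative frequencies in the data (3/5 and 2/5) gets the higher score.
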
Probability
• The quantitative theory of evidence.
• If we have variable data, then probability is the best model to use.
• If we have categorical (not variable) data, probability is still the best model to use.

Probabilities and frequencies
Probabilities and frequencies are not the same thing.
• Frequencies are observed.
• Probabilities are values in a system that a human being creates and assigns.
• We can choose to assign probabilities as the observed frequencies, but that is not always a good idea.
• This is a good idea only so long as we don’t need to handle yet-unseen (never before seen) data.
• In many cases, this choice maximizes the probability of the data.
• They both deal with distributions (i.e., the observed frequencies and the probability distributions of a model).

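The sketch below illustrates the choice described above: assigning probabilities equal to observed relative frequencies, and what that implies for an item that was never observed. The toy corpus and the function name `relative_frequencies` are assumptions made for illustration only.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Map each observed item to count / total."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}

# Hypothetical toy corpus.
tokens = ["the", "cat", "saw", "the", "dog"]
probs = relative_frequencies(tokens)

print(probs["the"])            # 2 of the 5 tokens
print(probs.get("bird", 0.0))  # never observed, so it gets probability 0
```

A model built this way gives any new text containing an unseen item probability 0, which is the sense in which the choice breaks down on never-before-seen data.
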
Probabilities and frequencies
Probabilities and frequencies are not the same thing.
• Counts are counts: the number of things or events that fall in some category.
• Frequency is ambiguous: it either means count (less often) or it means relative frequency: a ratio between a count of something and the total number of things that fall within the larger category.
• There are 63,147 occurrences of “the” in the Brown Corpus, out of 1,017,904 words; 6.2% of the words in the Brown Corpus are “the”.

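The count and the relative frequency in the Brown Corpus example are related by a single division. The short check below uses only the two numbers given on the slide; the variable names are illustrative.

```python
count_the = 63_147       # occurrences of "the" (count, from the slide)
total_words = 1_017_904  # total number of words (from the slide)

relative_frequency = count_the / total_words
print(f"{relative_frequency:.1%}")  # prints 6.2%
```
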
English, French, Spanish
Let’s take a look at some languages.
And for starters, let’s just look at unigram frequencies: the frequencies at which items appear, not conditioned by the environment.
people.cs.uchicago.edu/jagoldsm/course/class1

Plogs
• We will assign probabilities to every outcome we consider.
• Each of these is typically quite small.
• We therefore use a slightly different way of talking about small numbers: plogs.

Inverse log probabilities, or plogs
A way to describe small numbers... upside down.

A probability        its plog
0.5                  1
0.25                 2
0.125                3
1/16                 4
1/32                 5
1/1024               10
almost 1/1,000,000   20

• The bigger the plog, the smaller the probability.
• It’s a bit like a measure of markedness, if you think of more marked things as being less frequent.
• $\mathrm{plog}(x) = -\log_2(x) = \log_2(1/x)$

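The definition plog(x) = −log2(x) can be tried out directly. The sketch below is a minimal implementation; the function name `plog` follows the slides, while the input check is an added assumption to keep the argument inside (0, 1].

```python
import math

def plog(p):
    """Positive log probability: -log2(p), for 0 < p <= 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)

# Each halving of the probability adds 1 to the plog.
for p in [0.5, 0.25, 0.125, 1 / 16, 1 / 32, 1 / 1024, 1 / 1_000_000]:
    print(p, round(plog(p), 2))
```

Since 2^20 = 1,048,576, a probability of one in a million has a plog just under 20.
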
Plogs
[Figure: plog (vertical axis, 1 to 5) plotted against probability (horizontal axis, 0 to 1).]

[Figure: plog (vertical axis, 1 to 6) of each segment of the word “stations”, with word boundaries marked #. Average is 4.64.]
This diagram is from a visually interactive program displaying phonological complexity at:
http://hum.uchicago.edu/~jagoldsm/PhonologicalComplexi

Most and least frequent phonemes in English

rank   phoneme   frequency   plog
1      #         0.20        2.30
2      @         0.066       3.92
3      n         0.058       4.10
4      t         0.056       4.17
5      s         0.041       4.61
6      r         0.040       4.76
7      d         0.037       4.85
8      l         0.035       4.94
9      k         0.026       5.27
10     ǽ         0.025       5.31
...
45     Óy        0.00078     10.32
46     æ̆         0.00069     10.50
47     ž         0.00054     10.84
48     ăy        0.00038     11.36
49     ă         0.00036     11.42
50     Ŏ         0.00028     11.79

Average plogs

rank   orthography   phonemes   av. plog
1      a             @          3.11
2      an            @n         3.44
3      to            t@         3.47
4      and           @nd        3.80
5      eh            É          3.88
6      the           ð@         3.88
7      can           k@n        3.90
8      an            ǽn         3.91
9      Ann           ǽn         3.91
10     in            Ín         3.91

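The averages in this table can be approximately reproduced from the plogs in the phoneme table. The sketch below assumes that each word's average is taken over its phonemes plus one word boundary #; the slides do not state this, but it matches the tabulated values to within rounding for the rows tried here. The dictionary `PLOG` and the function `average_plog` are illustrative names.

```python
# Plogs copied from the phoneme table; "#" is the word boundary.
PLOG = {"#": 2.30, "@": 3.92, "n": 4.10, "t": 4.17, "d": 4.85, "k": 5.27}

def average_plog(phonemes, boundary="#"):
    """Mean plog over the word's phonemes plus one boundary symbol.

    Counting one boundary per word is an assumption, chosen because it
    comes close to the tabulated averages.
    """
    symbols = list(phonemes) + [boundary]
    return sum(PLOG[s] for s in symbols) / len(symbols)

words = [("a", ["@"]), ("an", ["@", "n"]), ("to", ["t", "@"]),
         ("and", ["@", "n", "d"]), ("can", ["k", "@", "n"])]
for word, phonemes in words:
    print(word, round(average_plog(phonemes), 2))
```

Small differences from the table in the second decimal place are expected, because the inputs are themselves rounded plogs.
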