  1. Probability and Likelihood, a brief introduction in support of a course on molecular evolution (BIOL 3046). Probability: the subject of probability is a branch of mathematics dedicated to building models that describe conditions of uncertainty and provide tools to make decisions or draw conclusions on the basis of such models. In the broad sense, a probability is a measure of the degree to which an occurrence is certain [or uncertain].

  2. A statistical definition of probability: frequentist. Two concepts: 1. Sample space, S, is the collection [sometimes called universe] of all possible outcomes. The sample space is a set where each outcome comprises one element of the set. 2. Relative frequency is the proportion of the sample space on which an event E occurs. In an experiment with 100 outcomes where E occurs 81 times, the relative frequency is 81/100 = 0.81. The statistical definition is derived from statistical regularity: the property of a relative frequency, in the long run over replicates, whereby the cumulative relative frequency (crf) of an event E stabilizes. The crf is simply the relative frequency computed cumulatively over some number of replicates of samples, each with a space S.

  3. Month-by-month example [data for example is after McColl (1995)]:

     Month | Number of subjects (S) | Number controlled (E) | Cumulative S | Cumulative E | crf
       1   |          100           |          80           |      100     |       80     | 0.800
       2   |          100           |          88           |      200     |      168     | 0.840
       3   |          100           |          75           |      300     |      243     | 0.810
       4   |          100           |          77           |      400     |      320     | 0.800
       5   |          100           |          80           |      500     |      400     | 0.800
       6   |          100           |          76           |      600     |      476     | 0.793
       7   |          100           |          82           |      700     |      558     | 0.797
       8   |          100           |          79           |      800     |      637     | 0.796
       9   |          100           |          80           |      900     |      717     | 0.797
      10   |          100           |          76           |     1000     |      793     | 0.793
      11   |          100           |          77           |     1100     |      870     | 0.791
      12   |          100           |          78           |     1200     |      948     | 0.790

     In words, the probability of an event E, written as P(E), is the long-run (cumulative) relative frequency of E:

     P(E) = lim (n → ∞) crf_n(E)

     [Hypothetical plot of the crf of an event, stabilizing as n grows toward 10000]
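Statistical regularity is easy to demonstrate by simulation. A minimal sketch in Python (not part of the course materials; the underlying probability 0.79 is an illustrative choice, not a value from the course data):

```python
import random

random.seed(1)

# Simulate monthly replicates like the table above: each month, 100 subjects,
# each "controlled" with a fixed underlying probability.
p_true = 0.79
cumulative_s = 0
cumulative_e = 0
for month in range(1, 13):
    e = sum(random.random() < p_true for _ in range(100))
    cumulative_s += 100
    cumulative_e += e

# Over replicates, the cumulative relative frequency settles near p_true.
crf = cumulative_e / cumulative_s
print(round(crf, 3))
```

Running the loop for more replicates narrows the wobble further, which is exactly the limiting behaviour the crf equation describes.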

  4. Probability axioms: 1. The probability scale runs from 0 to 1. Hence, 0 ≤ P(E) ≤ 1. 2. Probabilities are derived from the relative frequency of an event (E) in the “space of all possible outcomes” (S), where P(S) = 1. Hence, if the probability of an event (E) is P(E), then the probability that E does not occur is 1 − P(E). 3. When events E and F are disjoint, they cannot occur together. The probability of disjoint events E or F is P(E or F) = P(E) + P(F). 4. Axiom 3 deals with a finite sequence of events; axiom 4 extends it to an infinite sequence of events. Product rule: the product rule applies when two events E1 and E2 are independent. E1 and E2 are independent if the occurrence or non-occurrence of E1 does not change the probability of E2 [and vice versa]. [A formal statistical definition requires the multiplication theorem.] It is important to note that a proof of statistical independence for a specific case using the multiplication theorem is rarely possible; hence, most models incorporate independence as a model assumption. When E1 and E2 occur together they are joint events. The joint probability of the independent events E1 and E2 is P(E1, E2) = P(E1) × P(E2); hence the term “product rule” or “multiplication principle”.

  5. Conditional probability is the probability of event E2 given that event E1 has already occurred. We assume the events E1 and E2 are in a given sample space, S, and that P(E1) > 0. We write it as P(E2|E1); the vertical bar is read as “given”. Example to “jog your memory”: suppose we have two fair dice. For one die: S = {1, 2, 3, 4, 5, 6}, P(S) = 1, and P(1) = … = P(6) = 1/6. For two dice: S is the set of 36 ordered pairs of integers from 1 to 6. You roll die #1; what is the probability that you roll a “5” or a “6”? P(5) = 1/6 and P(6) = 1/6, so P(5 or 6) = P(5) + P(6) = 1/3. You roll both dice; what is the probability that you roll two “5”s? P(5, 5) = P(5) × P(5) = 1/6 × 1/6 = 1/36.
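The sum rule and product rule from the dice example can be checked with exact fractions; a minimal Python sketch (not part of the course materials):

```python
from fractions import Fraction

# Sum rule for disjoint events on one fair die: P(5 or 6) = P(5) + P(6).
p5 = Fraction(1, 6)
p6 = Fraction(1, 6)
p_5_or_6 = p5 + p6

# Product rule for independent rolls of two dice: P(5, 5) = P(5) * P(5).
p_double_5 = p5 * p5

print(p_5_or_6, p_double_5)
```

Using `Fraction` keeps the arithmetic exact, so the results match the hand-derived 1/3 and 1/36 with no rounding.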

  6. Example to “jog your memory”: what about the conditional probability that the second roll is a “5” given that the first was a “5”? We write this as follows:

     P(E2 | E1) = P(E1, E2) / P(E1)

     P(5 | 5) = P(5, 5) / P(5) = (1/36) / (1/6) = 1/6

     There is a logically satisfying result here: since the two rolls are independent, it should not matter what the first roll was, and indeed the outcome of the second roll [conditional on the first roll] is 1/6. Probability model (the binomial):

     P(k) = [n! / (k!(n − k)!)] × p^k × (1 − p)^(n − k)

     Coin toss example: what is the probability of obtaining, say, 5 heads given a fair coin (p = 0.5) and 12 tosses? P(k=5 | p=0.5, n=12).
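The conditional-probability calculation for the two dice can also be verified numerically; a short sketch in Python (not part of the course materials):

```python
from fractions import Fraction

# P(second = 5 | first = 5) = P(5, 5) / P(5) for two independent fair dice.
p_joint = Fraction(1, 36)   # both dice show 5
p_first = Fraction(1, 6)    # first die shows 5
p_cond = p_joint / p_first

# Independence: conditioning on the first roll changes nothing.
print(p_cond)
```

The result equals the unconditional P(5) = 1/6, which is the defining property of independent events.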

  7. Probability model:

     P(k=5 | p=0.5, n=12) = [12! / (5!(12 − 5)!)] × 0.5^5 × (1 − 0.5)^(12 − 5)

     Probability and likelihood are inverted. Probability refers to the occurrence of some future outcome. For example: “If I toss a fair coin 12 times, what is the probability that I will obtain 5 heads and 7 tails?” Likelihood refers to a past event with a known outcome. For example: “What is the probability that my coin is fair if I tossed it 12 times and observed 5 heads and 7 tails?”
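The binomial calculation above is a one-liner in Python; a minimal sketch (not part of the course materials):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(k successes | n trials, per-trial success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# 5 heads in 12 tosses of a fair coin:
prob = binom_pmf(5, 12, 0.5)
print(round(prob, 6))  # 0.193359
```

For p = 0.5 this reduces to C(12, 5) / 2^12 = 792 / 4096 = 0.193359375, the value quoted on the next slide.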

  8. Case 1: probability. The question is the same: “If I toss a fair coin 12 times, what is the probability that I will obtain 5 heads and 7 tails?” The answer comes directly from the formula above with n = 12 and k = 5: the probability of such a future event is 0.193359. By axiom 2, P(S) = 1: the probabilities of all outcomes (i.e., 0 to 12 heads) sum to 1. Case 2: likelihood. The second question is: “What is the probability that my coin is fair if I tossed it 12 times and observed 5 heads and 7 tails?” We have inverted the problem. In case 1, we were interested in the probability of a future outcome given that my coin is fair. In case 2, we are interested in the probability that my coin is fair, given a particular outcome. So, in the likelihood framework we have inverted the question such that the hypothesis (H) is variable, and the outcome (let’s call it the data, D) is constant. We are interested in P(H|D), but we have a problem…

  9. Case 2: likelihood. A problem: what we want to measure is P(H|D). The problem is that we can’t work with the probability of a hypothesis, only with the relative frequencies of outcomes. The solution:

     P(H|D) = α × P(D|H), where α is a constant of proportionality.

     Using the binomial formula P(k) = [n! / (k!(n − k)!)] × p^k × (1 − p)^(n − k), the PROBABILITIES of two datasets under two hypotheses are:

     Hypotheses      | D1: 1H & 1T | D2: 2H
     H1: p(H) = 1/4  |    0.375    | 0.0625
     H2: p(H) = 1/2  |    0.5      | 0.25

  10. Read as LIKELIHOODS, the same numbers now score the hypotheses against fixed data:

      Hypotheses      | D1: 1H & 1T | D2: 2H
      H1: p(H) = 1/4  |    0.375    | 0.0625
      H2: p(H) = 1/2  |    0.5      | 0.25

      For Dataset 1 (D1): H1 is less likely than H2 by a factor of 3/4. The framework here is one of relative support. An example of likelihood in action. Coin toss: what is the likelihood that my coin is “fair” given 12 tosses with 5 heads and 7 tails? Is the hypothesis of “fairness” the best explanation of these data? P(H|D) = α × P(D|H), so L = α × P(D|H); dropping the constant, L = P(D|H), computed from the binomial formula with D = the outcome (n and k) and H = the probability (p).
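The probability and likelihood tables come from the same binomial formula, only read in different directions (data varying vs. hypothesis varying). A minimal sketch in Python (not part of the course materials):

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success prob p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Likelihood reading: the data are fixed, and p varies across hypotheses.
for p in (Fraction(1, 4), Fraction(1, 2)):
    l_d1 = binom_pmf(1, 2, p)  # D1: 1 head and 1 tail (n=2, k=1)
    l_d2 = binom_pmf(2, 2, p)  # D2: 2 heads (n=2, k=2)
    print(float(p), float(l_d1), float(l_d2))
```

For D1 the ratio of likelihoods is 0.375 / 0.5 = 3/4, the "factor of 3/4" of relative support cited above.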

  11. [Plot: likelihood curve over p for 5 heads in 12 tosses; maximum likelihood score = 0.228 at the ML estimate p = 0.42; at p = 0.5, L = 0.193] The likelihood that the coin is fair (p = 0.5) is 0.193. This is less likely than the MLE by a factor of about 0.85 (0.193 / 0.228). Recall Case 1: probability. The question is the same: “If I toss a fair coin 12 times, what is the probability that I will obtain 5 heads and 7 tails?” The answer comes directly from the formula above with n = 12 and k = 5: the probability of such a future event is 0.193359. By axiom 2, P(S) = 1: the probabilities of all outcomes (i.e., 0 to 12 heads) sum to 1.
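The maximum-likelihood estimate can be found numerically by scanning the likelihood curve; a minimal sketch in Python (not part of the course materials; the grid resolution is an arbitrary choice):

```python
from math import comb

def likelihood(p, k=5, n=12):
    """Binomial likelihood of p given k=5 heads in n=12 tosses."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Grid search over p; the maximum lands at the sample proportion k/n = 5/12.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=likelihood)

print(round(p_hat, 2))            # the ML estimate of p
print(likelihood(p_hat))          # the maximum likelihood score
print(likelihood(0.5))            # likelihood of the "fair" hypothesis
```

The maximum sits at p ≈ 5/12 ≈ 0.42 with a likelihood score of about 0.228, while the fair-coin hypothesis scores about 0.193, matching the curve described above.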

  12. Don’t forget: unlike a probability distribution, the area under the likelihood curve does not sum to 1.
