0. Basic Statistics and Probability Theory
Based on “Foundations of Statistical NLP”, C. Manning & H. Schütze, ch. 2, MIT Press, 2002
“Probability theory is nothing but common sense reduced to calculation.” Pierre Simon, Marquis de Laplace (1749-1827)
1. PLAN
1. Elementary Probability Notions:
• Event Space and Probability Function
• Conditional Probability
• Bayes’ Theorem
• Independence of Probabilistic Events
2. Random Variables:
• Discrete Variables and Continuous Variables
• Mean, Variance and Standard Deviation
• Standard Distributions
• Joint, Marginal and Conditional Distributions
• Independence of Random Variables
3. Limit Theorems
4. Estimating the parameters of probability models from data
5. Elementary Information Theory
2. 1. Elementary Probability Notions
• sample space: Ω (either discrete or continuous)
• event: A ⊆ Ω
  – the certain event: Ω
  – the impossible event: ∅
• event space: F = 2^Ω (or a subspace of 2^Ω that contains ∅ and is closed under complement and countable union)
• probability function/distribution: P : F → [0, 1] such that:
  – P(Ω) = 1
  – the “countable additivity” property: for all A_1, ..., A_k disjoint events, P(∪_i A_i) = ∑_i P(A_i)
Consequence: for a uniform distribution in a finite sample space: P(A) = #favorable events / #all events
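A minimal Python sketch (not part of the original slides) of the last consequence, on the finite sample space of two fair dice; the helper names and the example events are our own choices.

# Uniform distribution on a finite sample space: two fair dice (36 outcomes).
# P(A) = #favorable events / #all events.
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))        # the sample space

def prob(event):
    """Probability of an event (a predicate over outcomes) under the uniform distribution."""
    return Fraction(len([w for w in omega if event(w)]), len(omega))

print(prob(lambda w: w[0] + w[1] == 7))   # the sum is 7 -> 1/6
print(prob(lambda w: True))               # the certain event Omega -> 1
print(prob(lambda w: False))              # the impossible event -> 0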
3. Conditional Probability
• P(A | B) = P(A ∩ B) / P(B)
  Note: P(A | B) is called the a posteriori probability of A, given B.
• The “multiplication” rule: P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
• The “chain” rule: P(A_1 ∩ A_2 ∩ ... ∩ A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) ... P(A_n | A_1, A_2, ..., A_{n−1})
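A small Python sketch (ours, not from the slides) checking the definition of P(A | B) and the multiplication rule on the same two-dice space; A and B below are arbitrary illustrative events.

# Conditional probability and the multiplication rule, two fair dice.
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
prob = lambda e: Fraction(len([w for w in omega if e(w)]), len(omega))

def cond(a, b):
    """P(A | B) = P(A intersect B) / P(B)."""
    return prob(lambda w: a(w) and b(w)) / prob(b)

A = lambda w: w[0] + w[1] == 7     # the sum is 7
B = lambda w: w[0] == 3            # the first die shows 3

print(cond(A, B))                  # 1/6
# multiplication rule: P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
assert prob(lambda w: A(w) and B(w)) == cond(A, B) * prob(B) == cond(B, A) * prob(A)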
4. • The “total probability” formula: P(A) = P(A | B) P(B) + P(A | ¬B) P(¬B)
  More generally: if A ⊆ ∪_i B_i and B_i ∩ B_j = ∅ for all i ≠ j, then P(A) = ∑_i P(A | B_i) P(B_i)
• Bayes’ Theorem:
  P(B | A) = P(A | B) P(B) / P(A)
  or
  P(B | A) = P(A | B) P(B) / (P(A | B) P(B) + P(A | ¬B) P(¬B))
  or ...
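A numeric Python sketch of Bayes’ Theorem with the total-probability formula in the denominator; the probabilities below are invented purely for illustration (a spam-filter-like reading of A and B).

# Bayes' Theorem: P(B | A) = P(A | B) P(B) / (P(A | B) P(B) + P(A | ~B) P(~B)).
p_B            = 0.01    # prior P(B), e.g. "the document is spam" (invented)
p_A_given_B    = 0.90    # P(A | B), e.g. "a given word occurs in spam" (invented)
p_A_given_notB = 0.05    # P(A | not B) (invented)

p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)   # total probability
p_B_given_A = p_A_given_B * p_B / p_A                  # Bayes' Theorem

print(round(p_A, 4), round(p_B_given_A, 4))            # 0.0585, ~0.1538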
5. Independence of Probabilistic Events
• Independent events: P(A ∩ B) = P(A) P(B)
  Note: When P(B) ≠ 0, the above definition is equivalent to P(A | B) = P(A).
• Conditionally independent events: P(A ∩ B | C) = P(A | C) P(B | C), assuming, of course, that P(C) ≠ 0.
  Note: When P(B ∩ C) ≠ 0, the above definition is equivalent to P(A | B, C) = P(A | C).
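A Python sketch (ours) testing the definition P(A ∩ B) = P(A) P(B) on the two-dice space: one pair of events turns out independent, the other does not.

# Independence of events: compare P(A and B) with P(A) * P(B).
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))
prob = lambda e: Fraction(len([w for w in omega if e(w)]), len(omega))

A = lambda w: w[0] % 2 == 0          # the first die is even
B = lambda w: w[1] == 6              # the second die shows 6
C = lambda w: w[0] + w[1] >= 10      # the sum is at least 10

print(prob(lambda w: A(w) and B(w)) == prob(A) * prob(B))   # True: independent
print(prob(lambda w: A(w) and C(w)) == prob(A) * prob(C))   # False: dependent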
6. 2. Random Variables
2.1 Basic Definitions
Let Ω be a sample space, and P : 2^Ω → [0, 1] a probability function.
• A random variable of distribution P is a function X : Ω → R^n
  ◦ For now, let us consider n = 1.
  ◦ The cumulative distribution function of X is F : R → [0, 1] defined by F(x) = P(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x})
7. 2.2 Discrete Random Variables
Definition: Let P : 2^Ω → [0, 1] be a probability function, and X be a random variable of distribution P.
• If Image(X) is either finite or countably infinite, then X is called a discrete random variable.
  ◦ For such a variable we define the probability mass function (pmf) p : R → [0, 1] as p(x) = p(X = x) = P({ω ∈ Ω | X(ω) = x}).
  (Obviously, it follows that ∑_{x_i ∈ Image(X)} p(x_i) = 1.)
Mean, Variance, and Standard Deviation:
• Expectation / mean of X: E(X) = E[X] = ∑_x x p(x), if X is a discrete random variable.
• Variance of X: Var(X) = Var[X] = E((X − E(X))^2).
• Standard deviation: σ = √Var(X).
Covariance of X and Y, two random variables of distribution P:
• Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
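A Python sketch computing the mean, variance, standard deviation and a covariance directly from these definitions; the pmf values below are invented for illustration.

# Mean, variance, standard deviation of a discrete random variable,
# and covariance of two such variables, from (invented) pmf tables.
import math

pmf = {0: 0.2, 1: 0.5, 2: 0.3}                            # p(x); sums to 1

mean = sum(x * p for x, p in pmf.items())                 # E(X)   = 1.1
var  = sum((x - mean) ** 2 * p for x, p in pmf.items())   # Var(X) = 0.49
std  = math.sqrt(var)                                     # sigma  = 0.7
print(mean, var, std)

joint = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.6}   # joint pmf p(x, y)
ex  = sum(x * p for (x, y), p in joint.items())
ey  = sum(y * p for (x, y), p in joint.items())
cov = sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())
print(round(cov, 3))                                      # Cov(X, Y) = 0.04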
8. Exemplification:
• the Binomial distribution: b(r; n, p) = C(n, r) p^r (1 − p)^{n − r}, for 0 ≤ r ≤ n
  mean: np, variance: np(1 − p)
  ◦ the Bernoulli distribution: b(r; 1, p)
[Figure: the probability mass function and the cumulative distribution function of the Binomial distribution]
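A short Python sketch (not from the slides) of the Binomial pmf, with a numeric check that the pmf sums to 1 and that the mean and variance equal np and np(1 − p).

# Binomial pmf b(r; n, p) = C(n, r) p^r (1 - p)^(n - r); check mean and variance.
from math import comb

def binom_pmf(r, n, p):
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.3
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]
mean = sum(r * pr for r, pr in enumerate(pmf))
var  = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))

print(abs(sum(pmf) - 1) < 1e-12)           # the pmf sums to 1
print(abs(mean - n * p) < 1e-12)           # mean = np = 3.0
print(abs(var - n * p * (1 - p)) < 1e-12)  # variance = np(1 - p) = 2.1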
9. 2.3 Continuous Random Variables
Definitions: Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R be a random variable of distribution P.
• If Image(X) is an uncountably infinite set, and F, the cumulative distribution function of X, is continuous, then X is called a continuous random variable.
  (It follows, naturally, that P(X = x) = 0, for all x ∈ R.)
  ◦ If there exists p : R → [0, ∞) such that F(x) = ∫_{−∞}^{x} p(t) dt, then X is called absolutely continuous. In such a case, p is called the probability density function (pdf) of X.
  ◦ For B ⊆ R for which ∫_B p(x) dx exists, Pr(B) = P({ω ∈ Ω | X(ω) ∈ B}) = ∫_B p(x) dx.
• In particular, ∫_{−∞}^{+∞} p(x) dx = 1.
• Expectation / mean of X: E(X) = E[X] = ∫ x p(x) dx.
10. Exemplification:
• Normal (Gaussian) distribution: N(x; µ, σ) = 1 / (√(2π) σ) · e^{−(x − µ)^2 / (2σ^2)}
  mean: µ, variance: σ^2
  ◦ Standard Normal distribution: N(x; 0, 1)
• Remark: For n, p such that np(1 − p) > 5, the Binomial distribution can be approximated by a Normal distribution.
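A Python sketch of the Normal density, used to illustrate the remark above: with n = 100 and p = 0.3 (so np(1 − p) = 21 > 5), the Binomial pmf is close to N(x; np, √(np(1 − p))); the chosen n, p and the evaluation points r are our own.

# Normal (Gaussian) density and the normal approximation of the Binomial.
from math import comb, exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def binom_pmf(r, n, p):
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 100, 0.3
mu, sigma = n * p, sqrt(n * p * (1 - p))   # mean np = 30, variance np(1 - p) = 21

for r in (20, 25, 30, 35, 40):
    print(r, round(binom_pmf(r, n, p), 4), round(normal_pdf(r, mu, sigma), 4))
# the two columns agree to roughly two decimal places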
11. The Normal distribution: the probability density function and the cumulative distribution function
12. 2.4 Basic Properties of Random Variables
Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R^n be a discrete/continuous random variable of distribution P.
• If g : R^n → R^m is a function, then g(X) is a random variable.
  If X is discrete, then E(g(X)) = ∑_x g(x) p(x).
  If X is continuous, then E(g(X)) = ∫ g(x) p(x) dx.
• E(aX + b) = aE(X) + b.
  ◦ If g is non-linear, it does not in general follow that E(g(X)) = g(E(X)).
• E(X + Y) = E(X) + E(Y).
• Var(X) = E(X^2) − E^2(X).
  ◦ Var(aX) = a^2 Var(X).
• Cov(X, Y) = E[XY] − E[X] E[Y].
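A quick Monte Carlo sketch (ours) of three of these properties for X ~ Uniform(0, 1): linearity of expectation, the variance identity, and the fact that E(g(X)) ≠ g(E(X)) for the non-linear g(x) = x^2.

# Monte Carlo check of E(aX + b) = aE(X) + b, Var(X) = E(X^2) - E^2(X),
# and E(g(X)) != g(E(X)) for the non-linear g(x) = x^2.
import random

random.seed(0)
xs = [random.uniform(0, 1) for _ in range(200_000)]   # samples of X ~ Uniform(0, 1)
E  = lambda values: sum(values) / len(values)

a, b = 3.0, 2.0
print(E([a * x + b for x in xs]), a * E(xs) + b)       # both close to 3.5
print(E([x * x for x in xs]) - E(xs) ** 2)             # Var(X), close to 1/12
print(E([x * x for x in xs]), E(xs) ** 2)              # E(X^2) ~ 1/3 vs (E(X))^2 ~ 1/4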
13. 2.5 Joint, Marginal and Conditional Distributions
Exemplification for the bi-variate case:
Let Ω be a sample space, P : 2^Ω → [0, 1] a probability function, and V : Ω → R^2 be a random variable of distribution P. One could naturally see V as a pair of two random variables X : Ω → R and Y : Ω → R. (More precisely, V(ω) = (x, y) = (X(ω), Y(ω)).)
• the joint pmf/pdf of X and Y is defined by p(x, y) = p_{X,Y}(x, y) = P(X = x, Y = y) = P({ω ∈ Ω | X(ω) = x, Y(ω) = y}).
• the marginal pmf/pdf functions of X and Y are:
  for the discrete case: p_X(x) = ∑_y p(x, y), p_Y(y) = ∑_x p(x, y)
  for the continuous case: p_X(x) = ∫ p(x, y) dy, p_Y(y) = ∫ p(x, y) dx
• the conditional pmf/pdf of X given Y is: p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)
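A Python sketch computing marginal and conditional pmfs from a joint pmf given as a table; the table entries are invented and sum to 1.

# Marginal and conditional pmfs from a joint pmf p(x, y) (invented values).
from collections import defaultdict

joint = {('a', 0): 0.125, ('a', 1): 0.375,
         ('b', 0): 0.250, ('b', 1): 0.250}

p_x, p_y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_x[x] += p                      # p_X(x) = sum over y of p(x, y)
    p_y[y] += p                      # p_Y(y) = sum over x of p(x, y)

def cond_x_given_y(x, y):
    """p_{X|Y}(x | y) = p_{X,Y}(x, y) / p_Y(y)"""
    return joint[(x, y)] / p_y[y]

print(dict(p_x))                     # {'a': 0.5, 'b': 0.5}
print(dict(p_y))                     # {0: 0.375, 1: 0.625}
print(cond_x_given_y('a', 1))        # 0.375 / 0.625 = 0.6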
14. 2.6 Independence of Random Variables
Definitions:
• Let X, Y be random variables of the same type (i.e. either discrete or continuous), and p_{X,Y} their joint pmf/pdf. X and Y are said to be independent if p_{X,Y}(x, y) = p_X(x) · p_Y(y) for all possible values x and y of X and Y respectively.
• Similarly, let X, Y and Z be random variables of the same type, and p their joint pmf/pdf. X and Y are conditionally independent given Z if p_{X,Y|Z}(x, y | z) = p_{X|Z}(x | z) · p_{Y|Z}(y | z) for all possible values x, y and z of X, Y and Z respectively.
15. Properties of random variables pertaining to independence
• If X, Y are independent, then Var(X + Y) = Var(X) + Var(Y).
• If X, Y are independent, then E(XY) = E(X) E(Y), i.e. Cov(X, Y) = 0.
  ◦ Cov(X, Y) = 0 does not imply that X, Y are independent.
  ◦ The covariance matrix corresponding to a vector of random variables is symmetric and positive semi-definite.
• If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the marginal distributions are independent.
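A concrete Python check (ours) of the remark that zero covariance does not imply independence: X uniform on {−1, 0, 1} and Y = X^2 have Cov(X, Y) = 0, yet Y is a function of X.

# Cov(X, Y) = 0 but X and Y are not independent: X uniform on {-1, 0, 1}, Y = X^2.
from fractions import Fraction

xs = [-1, 0, 1]
p  = Fraction(1, 3)                               # P(X = x) for each x

ex  = sum(x * p for x in xs)                      # E[X]  = 0
ey  = sum(x * x * p for x in xs)                  # E[Y]  = 2/3
exy = sum(x * (x * x) * p for x in xs)            # E[XY] = E[X^3] = 0
print(exy - ex * ey)                              # Cov(X, Y) = 0

# Independence would require P(X = 1, Y = 0) = P(X = 1) * P(Y = 0),
# but X = 1 forces Y = 1, so the joint probability is 0 while the product is 1/9.
print(Fraction(0) == p * Fraction(1, 3))          # False: not independent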
16. 3. Limit Theorems
[Sheldon Ross, A First Course in Probability, 5th ed., 1998]
“The most important results in probability theory are limit theorems. Of these, the most important are...
laws of large numbers, concerned with stating conditions under which the average of a sequence of random variables converges (in some sense) to the expected average;
central limit theorems, concerned with determining the conditions under which the sum of a large number of random variables has a probability distribution that is approximately normal.”
17. Two basic inequalities and the weak law of large numbers
Markov’s inequality: If X is a random variable that takes only non-negative values, then for any value a > 0,
  P(X ≥ a) ≤ E[X] / a
Chebyshev’s inequality: If X is a random variable with finite mean µ and variance σ^2, then for any value k > 0,
  P(|X − µ| ≥ k) ≤ σ^2 / k^2
The weak law of large numbers (Bernoulli; Khintchine): Let X_1, X_2, ..., X_n be a sequence of independent and identically distributed random variables, each having a finite mean E[X_i] = µ. Then, for any value ε > 0,
  P(| (X_1 + ... + X_n)/n − µ | ≥ ε) → 0 as n → ∞
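A simulation sketch (ours) of the weak law of large numbers for rolls of a fair die (µ = 3.5, σ^2 = 35/12): as n grows, the empirical probability that the sample mean deviates from µ by at least ε shrinks, and it stays below Chebyshev’s bound σ^2/(nε^2) applied to the mean.

# Weak law of large numbers by simulation, plus the Chebyshev bound for the mean.
import random

random.seed(1)
mu, var, eps, trials = 3.5, 35 / 12, 0.25, 2000

for n in (10, 100, 1000):
    deviations = 0
    for _ in range(trials):
        sample_mean = sum(random.randint(1, 6) for _ in range(n)) / n
        deviations += abs(sample_mean - mu) >= eps
    empirical = deviations / trials               # estimate of P(|mean - mu| >= eps)
    chebyshev = (var / n) / eps ** 2              # Chebyshev bound (can exceed 1 for small n)
    print(n, empirical, round(chebyshev, 3))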