  0. Basic Statistics and Probability Theory
  Based on “Foundations of Statistical NLP”, C. Manning & H. Schütze, ch. 2, MIT Press, 2002
  “Probability theory is nothing but common sense reduced to calculation.” Pierre Simon, Marquis de Laplace (1749-1827)

  1. PLAN
  1. Elementary Probability Notions:
     • Event Space and Probability Function
     • Conditional Probability
     • Bayes’ Theorem
     • Independence of Probabilistic Events
  2. Random Variables:
     • Discrete Variables and Continuous Variables
     • Mean, Variance and Standard Deviation
     • Standard Distributions
     • Joint, Marginal and Conditional Distributions
     • Independence of Random Variables
  3. Limit Theorems
  4. Estimating the parameters of probability models from data
  5. Elementary Information Theory

  2. 1. Elementary Probability Notions
  • sample space: Ω (either discrete or continuous)
  • event: A ⊆ Ω
    – the certain event: Ω
    – the impossible event: ∅
  • event space: F = 2^Ω (or a subspace of 2^Ω that contains ∅ and is closed under complement and countable union)
  • probability function/distribution: P : F → [0, 1] such that:
    – P(Ω) = 1
    – the “countable additivity” property: for all pairwise disjoint events A_1, ..., A_k, ..., P(∪_i A_i) = Σ_i P(A_i)
  Consequence: for a uniform distribution over a finite sample space:
  P(A) = (# favorable outcomes) / (# all outcomes)
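A minimal sketch (not part of the original slides) of the counting rule for a uniform, finite sample space, using two fair dice as a toy example:

```python
from itertools import product
from fractions import Fraction

# Sample space Ω: all ordered outcomes of rolling two fair dice (uniform distribution).
omega = list(product(range(1, 7), repeat=2))

# Event A ⊆ Ω: the two dice sum to 7.
A = [w for w in omega if sum(w) == 7]

# P(A) = (# favorable outcomes) / (# all outcomes)
p_A = Fraction(len(A), len(omega))
print(p_A)  # 1/6
```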

  3. Conditional Probability
  • P(A|B) = P(A ∩ B) / P(B)
    Note: P(A|B) is called the a posteriori probability of A, given B.
  • The “multiplication” rule: P(A ∩ B) = P(A|B)·P(B) = P(B|A)·P(A)
  • The “chain” rule: P(A_1 ∩ A_2 ∩ ... ∩ A_n) = P(A_1)·P(A_2|A_1)·P(A_3|A_1, A_2) ... P(A_n|A_1, A_2, ..., A_n−1)
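Continuing the dice example above (a sketch, not from the slides), the conditional probability and the multiplication rule can be checked by direct counting:

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=2))   # two fair dice, uniform distribution

def prob(event):
    return Fraction(len(event), len(omega))

A = set(w for w in omega if sum(w) == 7)       # event A: the dice sum to 7
B = set(w for w in omega if w[0] == 3)         # event B: the first die shows 3

p_A_given_B = prob(A & B) / prob(B)            # P(A|B) = P(A ∩ B) / P(B)
assert prob(A & B) == p_A_given_B * prob(B)    # the "multiplication" rule
print(p_A_given_B)                             # 1/6
```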

  4.
  • The “total probability” formula: P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B)
    More generally: if A ⊆ ∪_i B_i and B_i ∩ B_j = ∅ for all i ≠ j, then P(A) = Σ_i P(A|B_i)·P(B_i)
  • Bayes’ Theorem:
    P(B|A) = P(A|B)·P(B) / P(A)
    or P(B|A) = P(A|B)·P(B) / (P(A|B)·P(B) + P(A|¬B)·P(¬B))
    or ...
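A small numeric sketch of the total-probability formula and Bayes' theorem; the probabilities below (a rare condition and an imperfect test) are invented purely for illustration:

```python
# Hypothetical numbers, chosen only to exercise the formulas on this slide.
p_B = 0.01             # P(B): prior probability of the condition
p_A_given_B = 0.95     # P(A|B): probability of a positive test given the condition
p_A_given_notB = 0.05  # P(A|¬B): false-positive rate

# Total probability: P(A) = P(A|B)·P(B) + P(A|¬B)·P(¬B)
p_A = p_A_given_B * p_B + p_A_given_notB * (1 - p_B)

# Bayes' theorem: P(B|A) = P(A|B)·P(B) / P(A)
p_B_given_A = p_A_given_B * p_B / p_A
print(round(p_B_given_A, 3))  # ≈ 0.161
```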

  5. Independence of Probabilistic Events
  • Independent events: P(A ∩ B) = P(A)·P(B)
    Note: When P(B) ≠ 0, the above definition is equivalent to P(A|B) = P(A).
  • Conditionally independent events: P(A ∩ B | C) = P(A|C)·P(B|C), assuming, of course, that P(C) ≠ 0.
    Note: When P(B ∩ C) ≠ 0, the above definition is equivalent to P(A|B, C) = P(A|C).

  6. 2. Random Variables
  2.1 Basic Definitions
  Let Ω be a sample space, and P : 2^Ω → [0, 1] a probability function.
  • A random variable of distribution P is a function X : Ω → R^n.
  ◦ For now, let us consider n = 1.
  ◦ The cumulative distribution function of X is F : R → [0, 1], defined by F(x) = P(X ≤ x) = P({ω ∈ Ω | X(ω) ≤ x}).

  7. 2.2 Discrete Random Variables
  Definition: Let P : 2^Ω → [0, 1] be a probability function, and X be a random variable of distribution P.
  • If Image(X) is either finite or countably infinite, then X is called a discrete random variable.
  ◦ For such a variable we define the probability mass function (pmf) p : R → [0, 1] as p(x) = p(X = x) = P({ω ∈ Ω | X(ω) = x}).
    (Obviously, it follows that Σ_{x_i ∈ Image(X)} p(x_i) = 1.)
  Mean, Variance, and Standard Deviation:
  • Expectation / mean of X: E(X) = E[X] = Σ_x x·p(x) if X is a discrete random variable.
  • Variance of X: Var(X) = Var[X] = E((X − E(X))²).
  • Standard deviation: σ = √Var(X).
  Covariance of X and Y, two random variables of distribution P:
  • Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
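A short sketch (not from the slides) computing the mean, variance and standard deviation of a discrete random variable from an arbitrary, made-up pmf:

```python
import math

# A small discrete pmf, made up for illustration: value x -> p(X = x).
pmf = {0: 0.2, 1: 0.5, 2: 0.3}
assert abs(sum(pmf.values()) - 1.0) < 1e-12             # a pmf must sum to 1

mean = sum(x * p for x, p in pmf.items())               # E[X] = Σ_x x·p(x)
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X) = E[(X − E[X])²]
sigma = math.sqrt(var)                                  # σ = √Var(X)

print(mean, var, sigma)                                 # ≈ 1.1, 0.49, 0.7
```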

  8. Exemplification:
  • the Binomial distribution: b(r; n, p) = C_n^r · p^r · (1 − p)^(n−r), for 0 ≤ r ≤ n
    mean: np, variance: np(1 − p)
  ◦ the Bernoulli distribution: b(r; 1, p)
  [Figure: the probability mass function and the cumulative distribution function of the Binomial distribution]
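A sketch of the Binomial pmf in code (using Python's math.comb; not from the slides), checking numerically that its mean and variance come out as np and np(1 − p):

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) · p^r · (1 − p)^(n − r)"""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

n, p = 10, 0.3                                   # arbitrary example parameters
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]

mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))
print(round(mean, 6), round(var, 6))             # 3.0 2.1, i.e. np and np(1 − p)
```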

  9. 2.3 Continuous Random Variables
  Definitions: Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R be a random variable of distribution P.
  • If Image(X) is an uncountably infinite set, and F, the cumulative distribution function of X, is continuous, then X is called a continuous random variable.
    (It follows, naturally, that P(X = x) = 0, for all x ∈ R.)
  ◦ If there exists p : R → [0, ∞) such that F(x) = ∫_{−∞}^{x} p(t) dt, then X is called absolutely continuous. In such a case, p is called the probability density function (pdf) of X.
  ◦ For B ⊆ R for which ∫_B p(x) dx exists, Pr(B) = P({ω ∈ Ω | X(ω) ∈ B}) = ∫_B p(x) dx.
  • In particular, ∫_{−∞}^{+∞} p(x) dx = 1.
  • Expectation / mean of X: E(X) = E[X] = ∫ x·p(x) dx.
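A numerical sketch (assumed example, not from the slides) for an absolutely continuous variable: with the exponential pdf p(x) = λe^(−λx) on [0, ∞), a plain Riemann sum approximates ∫ p(x) dx = 1 and E[X] = ∫ x·p(x) dx = 1/λ:

```python
import math

lam = 2.0                                      # rate parameter, chosen arbitrarily
p = lambda x: lam * math.exp(-lam * x)         # exponential pdf, x >= 0

dx = 1e-4
xs = [i * dx for i in range(int(20 / dx))]     # [0, 20) holds essentially all the mass

total = sum(p(x) * dx for x in xs)             # ≈ ∫ p(x) dx = 1
mean = sum(x * p(x) * dx for x in xs)          # ≈ E[X] = 1/λ = 0.5
print(round(total, 4), round(mean, 4))         # ≈ 1.0 0.5
```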

  10. Exemplification:
  • Normal (Gaussian) distribution: N(x; µ, σ) = (1 / (√(2π)·σ)) · e^(−(x−µ)² / (2σ²))
    mean: µ, variance: σ²
  ◦ Standard Normal distribution: N(x; 0, 1)
  • Remark: For n, p such that np(1 − p) > 5, the Binomial distribution can be approximated by a Normal distribution.
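A sketch comparing the Binomial pmf with its Normal approximation (parameters picked so that np(1 − p) > 5; not from the slides):

```python
from math import comb, exp, pi, sqrt

def binom_pmf(r, n, p):
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def normal_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

n, p = 50, 0.5                                 # np(1 − p) = 12.5 > 5
mu, sigma = n * p, sqrt(n * p * (1 - p))

for r in (20, 25, 30):
    # The two columns should be close, e.g. ≈ 0.0419 vs ≈ 0.0415 at r = 20.
    print(r, round(binom_pmf(r, n, p), 4), round(normal_pdf(r, mu, sigma), 4))
```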

  11. [Figure: the Normal distribution, its probability density function and its cumulative distribution function]

  12. 2.4 Basic Properties of Random Variables
  Let P : 2^Ω → [0, 1] be a probability function, and X : Ω → R^n a discrete/continuous random variable of distribution P.
  • If g : R^n → R^m is a function, then g(X) is a random variable.
    If g(X) is discrete, then E(g(X)) = Σ_x g(x)·p(x).
    If g(X) is continuous, then E(g(X)) = ∫ g(x)·p(x) dx.
  • E(aX + b) = aE(X) + b.
  ◦ If g is non-linear, then in general E(g(X)) ≠ g(E(X)).
  • E(X + Y) = E(X) + E(Y).
  • Var(X) = E(X²) − E²(X).
  ◦ Var(aX) = a²·Var(X).
  • Cov(X, Y) = E[XY] − E[X]·E[Y].
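A quick numerical check (not from the slides) of two of the identities above, Var(X) = E(X²) − E²(X) and Var(aX) = a²Var(X), on the same made-up pmf used earlier:

```python
pmf = {0: 0.2, 1: 0.5, 2: 0.3}                       # made-up discrete pmf

E = lambda g: sum(g(x) * p for x, p in pmf.items())  # E[g(X)] = Σ_x g(x)·p(x)

mean = E(lambda x: x)
var_def = E(lambda x: (x - mean) ** 2)               # Var(X) = E[(X − E[X])²]
var_alt = E(lambda x: x ** 2) - mean ** 2            # Var(X) = E[X²] − E²(X)
assert abs(var_def - var_alt) < 1e-12

a = 3.0
var_aX = E(lambda x: (a * x - a * mean) ** 2)        # Var(aX)
assert abs(var_aX - a ** 2 * var_def) < 1e-12        # Var(aX) = a²·Var(X)
print(var_def, var_aX)                               # ≈ 0.49 and ≈ 4.41
```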

  13. 2.5 Joint, Marginal and Conditional Distributions
  Exemplification for the bi-variate case:
  Let Ω be a sample space, P : 2^Ω → [0, 1] a probability function, and V : Ω → R² a random variable of distribution P. One could naturally see V as a pair of two random variables X : Ω → R and Y : Ω → R. (More precisely, V(ω) = (x, y) = (X(ω), Y(ω)).)
  • the joint pmf/pdf of X and Y is defined by p(x, y) = p_{X,Y}(x, y) = P(X = x, Y = y) = P({ω ∈ Ω | X(ω) = x, Y(ω) = y})
  • the marginal pmf/pdf functions of X and Y are:
    for the discrete case: p_X(x) = Σ_y p(x, y), p_Y(y) = Σ_x p(x, y)
    for the continuous case: p_X(x) = ∫ p(x, y) dy, p_Y(y) = ∫ p(x, y) dx
  • the conditional pmf/pdf of X given Y is: p_{X|Y}(x|y) = p_{X,Y}(x, y) / p_Y(y)
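A small sketch (not from the slides) that recovers marginal and conditional pmfs from a made-up joint table for two discrete variables:

```python
# Joint pmf p(x, y) for two binary variables, made up for illustration (sums to 1).
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}
xs = sorted({x for x, _ in joint})
ys = sorted({y for _, y in joint})

# Marginals: p_X(x) = Σ_y p(x, y) and p_Y(y) = Σ_x p(x, y)
p_X = {x: sum(joint[(x, y)] for y in ys) for x in xs}
p_Y = {y: sum(joint[(x, y)] for x in xs) for y in ys}

# Conditional: p_{X|Y}(x|y) = p(x, y) / p_Y(y)
p_X_given_Y = {(x, y): joint[(x, y)] / p_Y[y] for x in xs for y in ys}

print(p_X)                    # ≈ {0: 0.3, 1: 0.7}
print(p_X_given_Y[(0, 1)])    # 0.2 / 0.6 ≈ 0.333
```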

  14. 2.6 Independence of Random Variables
  Definitions:
  • Let X, Y be random variables of the same type (i.e. either discrete or continuous), and p_{X,Y} their joint pmf/pdf. X and Y are said to be independent if p_{X,Y}(x, y) = p_X(x) · p_Y(y) for all possible values x and y of X and Y respectively.
  • Similarly, let X, Y and Z be random variables of the same type, and p their joint pmf/pdf. X and Y are conditionally independent given Z if p_{X,Y|Z}(x, y|z) = p_{X|Z}(x|z) · p_{Y|Z}(y|z) for all possible values x, y and z of X, Y and Z respectively.

  15. Properties of random variables pertaining to independence
  • If X, Y are independent, then Var(X + Y) = Var(X) + Var(Y).
  • If X, Y are independent, then E(XY) = E(X)·E(Y), i.e. Cov(X, Y) = 0.
  ◦ Cov(X, Y) = 0 does not imply that X, Y are independent.
  ◦ The covariance matrix corresponding to a vector of random variables is symmetric and positive semi-definite.
  • If the covariance matrix of a multi-variate Gaussian distribution is diagonal, then the marginal distributions are independent.
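A simulation sketch (distributions chosen arbitrarily, not from the slides) illustrating the first two properties: for independent X and Y, Var(X + Y) ≈ Var(X) + Var(Y) and Cov(X, Y) ≈ 0.

```python
import random

random.seed(0)
N = 200_000
X = [random.randint(1, 6) for _ in range(N)]     # a fair die
Y = [random.gauss(0.0, 2.0) for _ in range(N)]   # N(0, 2²), drawn independently of X

mean = lambda v: sum(v) / len(v)
def var(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)
def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

print(round(var([a + b for a, b in zip(X, Y)]), 2))  # ≈ Var(X) + Var(Y)
print(round(var(X) + var(Y), 2))                     # ≈ 35/12 + 4 ≈ 6.92
print(round(cov(X, Y), 3))                           # ≈ 0
```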

  16. 3. Limit Theorems
  [Sheldon Ross, A First Course in Probability, 5th ed., 1998]
  “The most important results in probability theory are limit theorems. Of these, the most important are...
  laws of large numbers, concerned with stating conditions under which the average of a sequence of random variables converges (in some sense) to the expected average;
  central limit theorems, concerned with determining the conditions under which the sum of a large number of random variables has a probability distribution that is approximately normal.”

  17. Two basic inequalities and the weak law of large numbers
  Markov’s inequality: If X is a random variable that takes only non-negative values, then for any value a > 0,
  P(X ≥ a) ≤ E[X] / a
  Chebyshev’s inequality: If X is a random variable with finite mean µ and variance σ², then for any value k > 0,
  P(|X − µ| ≥ k) ≤ σ² / k²
  The weak law of large numbers (Bernoulli; Khintchine): Let X_1, X_2, ..., X_n be a sequence of independent and identically distributed random variables, each having a finite mean E[X_i] = µ. Then, for any value ε > 0,
  P(|(X_1 + ... + X_n)/n − µ| ≥ ε) → 0 as n → ∞
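A simulation sketch of the weak law of large numbers for i.i.d. die rolls (µ = 3.5); the choices of ε, n and the number of trials are arbitrary:

```python
import random

random.seed(0)
mu, eps, trials = 3.5, 0.1, 2000

def deviation_prob(n):
    """Estimate P(|(X_1 + ... + X_n)/n − µ| ≥ ε) by repeated sampling."""
    bad = 0
    for _ in range(trials):
        sample_mean = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(sample_mean - mu) >= eps:
            bad += 1
    return bad / trials

for n in (10, 100, 1000):
    print(n, deviation_prob(n))   # the estimated probability shrinks as n grows
```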
