Properties

Expected value: the aforementioned analysis tells us that the expected number of successes equals $\lambda$. To prove this rigorously:

$$E[X] = \sum_{i=0}^{\infty} i\, e^{-\lambda} \frac{\lambda^i}{i!} = \sum_{i=1}^{\infty} e^{-\lambda} \frac{\lambda^i}{(i-1)!} = \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^j}{j!} = \lambda e^{-\lambda} e^{\lambda} = \lambda$$
Properties

Variance: $\mathrm{Var}(X) = E[X^2] - (E[X])^2$, where $E[X^2] = \sum_{i=0}^{\infty} i^2 e^{-\lambda} \frac{\lambda^i}{i!}$. Detailed proof on the board. Also see here. The result is $\mathrm{Var}(X) = \lambda$.

MGF:

$$\phi_X(t) = E[e^{tX}] = \sum_{i=0}^{\infty} e^{ti}\, e^{-\lambda} \frac{\lambda^i}{i!} = e^{-\lambda} \sum_{i=0}^{\infty} \frac{(\lambda e^t)^i}{i!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda(e^t - 1)}$$
Properties

Mode: $k$ is a mode if $P(X = k) \ge P(X = k+1)$ (which holds iff $k \ge \lambda - 1$) and $P(X = k) \ge P(X = k-1)$ (which holds iff $k \le \lambda$). Thus we seek an integer $k$ that satisfies both these conditions; note that often $\lambda$ is not an integer.
Notice: the mean and variance both increase with lambda.

[Figure: Poisson PMFs for lambda = 1, 5, and 10, plotted over k = 0 to 60.]
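As a small illustration (not from the original slides), here is a sketch that plots the Poisson PMF for lambda = 1, 5 and 10 with matplotlib and scipy; the k-range is chosen to roughly match the figure above.

```python
# Sketch: plot Poisson PMFs for lambda = 1, 5, 10 (assumes scipy and matplotlib are available).
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

ks = np.arange(0, 61)                      # support shown on the x-axis
for lam in [1, 5, 10]:
    plt.plot(ks, poisson.pmf(ks, lam), marker='o', markersize=3, label=f"lambda = {lam}")

plt.xlabel("k")
plt.ylabel("P(X = k)")
plt.legend()
plt.title("Poisson PMF: mean and spread grow with lambda")
plt.show()
```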
Properties

Consider independent Poisson random variables X and Y having parameters $\lambda_1$ and $\lambda_2$ respectively. Then Z = X + Y is also a Poisson random variable with parameter $\lambda_1 + \lambda_2$. Detailed proof on the board and in tutorial 1.

PMF recurrence relation:

$$P(X = i) = e^{-\lambda}\frac{\lambda^i}{i!}, \qquad P(X = i-1) = e^{-\lambda}\frac{\lambda^{i-1}}{(i-1)!}$$

$$\Rightarrow \frac{P(X = i)}{P(X = i-1)} = \frac{\lambda}{i}, \qquad P(X = 0) = e^{-\lambda}$$
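A minimal sketch of how the recurrence can be used in practice to tabulate the PMF without evaluating large factorials; the function name and the test value of lambda are our own, not from the slides.

```python
import math

def poisson_pmf_table(lam, k_max):
    """Tabulate P(X = 0), ..., P(X = k_max) using the recurrence
    P(X = i) = (lam / i) * P(X = i - 1), with P(X = 0) = exp(-lam)."""
    pmf = [math.exp(-lam)]
    for i in range(1, k_max + 1):
        pmf.append(pmf[-1] * lam / i)
    return pmf

# Quick check against the direct formula e^{-lam} * lam^i / i!
table = poisson_pmf_table(4.2, 10)
direct = [math.exp(-4.2) * 4.2**i / math.factorial(i) for i in range(11)]
assert all(abs(a - b) < 1e-12 for a, b in zip(table, direct))
```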
Properties

If $X \sim \mathrm{Poisson}(\lambda)$ and $Y \mid X = l \sim \mathrm{Binomial}(l, p)$, where $\lambda > 0$ and $0 \le p \le 1$, then $Y \sim \mathrm{Poisson}(\lambda p)$. This is called thinning of a Poisson random variable by a Binomial. We will cover this derivation in a tutorial.
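The derivation is deferred to the tutorial, but the statement is easy to check empirically. The sketch below (our own, with assumed parameter values) draws X ~ Poisson(lambda), then Y | X ~ Binomial(X, p), and checks that the sample mean and variance of Y are both close to lambda*p.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, m = 6.0, 0.3, 200_000          # assumed parameters for the demo

x = rng.poisson(lam, size=m)           # X ~ Poisson(lam)
y = rng.binomial(x, p)                 # Y | X = l  ~  Binomial(l, p)

# For Y ~ Poisson(lam * p), both mean and variance should be close to lam * p = 1.8
print(y.mean(), y.var(), lam * p)
```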
Poisson distribution: examples

- The number of misprints in a book (assuming the probability p of a misprint is small, and the number n of letters typed is very large, with np = expected number of misprints remaining constant).
- The number of traffic rule violations in a typical city in the USA (assuming the probability p of a violation is small, and the number of vehicles is very large).

In general, the Poisson distribution is used to model rare events, even though the event has plenty of "opportunities" to occur. (This is sometimes called the law of rare events or the law of small numbers.)
Poisson distribution: examples

- Number of people in a country who live up to 100 years
- Number of wrong numbers dialed in a day
- Number of laptops that fail on the first day of use
- Number of photons of light counted by the detector element of a camera
Multinomial distribution
Definition

Consider a sequence of n independent trials, each of which produces one out of k possible outcomes, where the set of possible outcomes is the same for each trial. Assume that the probability of each of the k outcomes is known and constant, given by $p_1, p_2, \ldots, p_k$.
Definition

Let X be a k-dimensional random variable whose i-th element represents the number of trials that produced the i-th outcome (also known as the number of successes for the i-th category).

E.g.: in 20 throws of a die, you had 2 ones, 4 twos, 7 threes, 4 fours, 1 five and 2 sixes.
Definition

Then the pmf of X is given as follows:

$$P(X = (x_1, x_2, \ldots, x_k)) = P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}\, p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k},$$

$$0 \le p_i \le 1, \quad \sum_{i=1}^{k} p_i = 1, \quad x_1 + x_2 + \cdots + x_k = n.$$

The coefficient $\frac{n!}{x_1!\, x_2! \cdots x_k!}$ is the number of ways to arrange n objects which can be divided into k groups of identical objects: there are $x_1$ objects of type 1, $x_2$ objects of type 2, ..., and $x_k$ objects of type k.

This is called the multinomial pmf.
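A direct translation of this pmf into code, as a sketch; the function name is ours, and the example reuses the die-throw counts from the earlier slide.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(X = (x1, ..., xk)) = n!/(x1!...xk!) * p1^x1 * ... * pk^xk."""
    n = sum(counts)
    coeff = factorial(n) // prod(factorial(x) for x in counts)   # multinomial coefficient (an integer)
    return coeff * prod(p**x for x, p in zip(counts, probs))

# The die example from the slide: 20 throws with counts (2, 4, 7, 4, 1, 2)
print(multinomial_pmf([2, 4, 7, 4, 1, 2], [1/6] * 6))
```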
Definition

The success probabilities for each category, i.e. $p_1, p_2, \ldots, p_k$, are all parameters of the multinomial pmf.

Remember: the multinomial random variable is a vector whose i-th component is the number of successes of the i-th category (i.e. the number of times that the trials produced a result of the i-th category).
Properties

Mean vector (assuming independent trials):

$$E[X] = (np_1, np_2, \ldots, np_k), \qquad E[X_i] = np_i$$

Variance of a component:

$$\mathrm{Var}(X_i) = \mathrm{Var}\left(\sum_{j=1}^{n} X_{ij}\right) = \sum_{j=1}^{n} \mathrm{Var}(X_{ij}) = np_i(1 - p_i)$$

where $X_{ij}$ is a Bernoulli random variable which tells you whether or not there was a success in the i-th category on the j-th trial.
Properties

For vector-valued random variables, the variance is replaced by the covariance matrix. The covariance matrix C in this case will have size k x k, where we have:

$$C(i, j) = E[(X_i - \mu_i)(X_j - \mu_j)] = \mathrm{Cov}(X_i, X_j)$$

$$\mathrm{Cov}(X_i, X_j) = -np_i p_j \ \ (i \ne j), \qquad \mathrm{Cov}(X_i, X_i) = np_i(1 - p_i)$$

Proof: next page.
Proof that $\mathrm{Cov}(X_i, X_j) = -np_i p_j$ for $i \ne j$:

$$X_i = \sum_{k=1}^{n} X_{ik} \ \ (\text{number of successes in category } i), \qquad X_j = \sum_{l=1}^{n} X_{jl} \ \ (\text{number of successes in category } j)$$

Here $X_{ik}$ and $X_{jl}$ are Bernoulli random variables, each representing the outcome of a trial (indexed by k and l).

$$\mathrm{Cov}(X_i, X_j) = \sum_{k=1}^{n} \sum_{l=1}^{n} \mathrm{Cov}(X_{ik}, X_{jl}) \quad \text{(by linearity of covariance)}$$

$$= \sum_{l=1}^{n} \mathrm{Cov}(X_{il}, X_{jl}) \quad \text{(by independence of trials, the terms with } k \ne l \text{ vanish)}$$

$$= \sum_{l=1}^{n} \left( E[X_{il} X_{jl}] - E[X_{il}]\,E[X_{jl}] \right) = \sum_{l=1}^{n} (0 - p_i p_j) = -np_i p_j,$$

since in a single trial, success can be achieved in only one category (so $X_{il} X_{jl} = 0$).
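A quick Monte Carlo sanity check of the covariance formula (our own sketch, with assumed values of n and the category probabilities):

```python
import numpy as np

rng = np.random.default_rng(1)
n, probs = 30, np.array([0.2, 0.5, 0.3])            # assumed trial count and category probabilities

samples = rng.multinomial(n, probs, size=100_000)   # each row is one draw of (X1, X2, X3)

# Empirical covariance of X1 and X2 vs. the formula -n * p1 * p2
emp = np.cov(samples[:, 0], samples[:, 1])[0, 1]
print(emp, -n * probs[0] * probs[1])
```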
MGF for a Multinomial

For k = 2, the multinomial reduces to the binomial. Let us derive the MGF for k = 3 (trinomial). With $X = (X_1, X_2)$ and $x = (x_1, x_2)$:

$$P(X = x) = \frac{n!}{x_1!\, x_2!\, (n - x_1 - x_2)!}\, p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{n - x_1 - x_2}$$

$$\phi_X(t_1, t_2) = E[e^{t_1 X_1 + t_2 X_2}] = \sum_{x_1=0}^{n} \sum_{x_2=0}^{n - x_1} \frac{n!}{x_1!\, x_2!\, (n - x_1 - x_2)!}\, p_1^{x_1} p_2^{x_2} (1 - p_1 - p_2)^{n - x_1 - x_2}\, e^{t_1 x_1} e^{t_2 x_2}$$

$$= \sum_{x_1=0}^{n} \sum_{x_2=0}^{n - x_1} \frac{n!}{x_1!\, x_2!\, (n - x_1 - x_2)!}\, (p_1 e^{t_1})^{x_1} (p_2 e^{t_2})^{x_2} (1 - p_1 - p_2)^{n - x_1 - x_2}$$

$$= \left( p_1 e^{t_1} + p_2 e^{t_2} + 1 - p_1 - p_2 \right)^n$$

The last step follows from the multinomial theorem.
MGF for a Multinomial

Multinomial theorem:

$$(x_1 + x_2 + \cdots + x_m)^n = \sum_{k_1 + k_2 + \cdots + k_m = n} \frac{n!}{k_1!\, k_2! \cdots k_m!} \prod_{i=1}^{m} x_i^{k_i}$$

For arbitrary k, with $X = (X_1, X_2, \ldots, X_{k-1})$ and $x = (x_1, x_2, \ldots, x_{k-1})$:

$$\phi_X(t) = \left( p_1 e^{t_1} + p_2 e^{t_2} + \cdots + p_{k-1} e^{t_{k-1}} + 1 - p_1 - p_2 - \cdots - p_{k-1} \right)^n$$
Hypergeometric distribution
Sampling with and without replacement

Suppose there are k objects, each of a different type. When you sample 2 objects from these with replacement, you pick a particular object with probability 1/k, and you place it back (replace it). The probability of picking an object of another type is again 1/k.

When you sample without replacement, the probability that your first object was of a given type is 1/k. The probability that your second object was of some other given type is now 1/(k-1), because you didn't put the first object back!
Definition

Consider a set of objects of which N are of good quality and M are defective. Suppose you pick n objects out of these without replacement. There are C(N + M, n) ways of doing this. Let X be a random variable denoting the number of good quality objects picked (out of a total of n).
Definition

There are C(N, i) C(M, n - i) ways to pick i good quality objects and n - i bad objects. So we have:

$$P(X = i) = \frac{C(N, i)\, C(M, n - i)}{C(N + M, n)}, \quad 0 \le i \le n$$

with the convention that $C(a, b) = 0$ if $b > a$ or $b < 0$.

Such a random variable X is called a hypergeometric random variable.
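The pmf translates directly into code with binomial coefficients; in this sketch (function name and example numbers are ours) out-of-range cases are handled by guarding i and by math.comb, which returns 0 when asked to choose more items than are available.

```python
from math import comb

def hypergeom_pmf(i, N, M, n):
    """P(X = i): probability of drawing i good objects when n objects are
    drawn without replacement from N good and M defective ones."""
    if i < 0 or i > n:
        return 0.0
    return comb(N, i) * comb(M, n - i) / comb(N + M, n)

# Example (assumed numbers): 5 draws from 10 good and 7 defective objects
print([round(hypergeom_pmf(i, 10, 7, 5), 4) for i in range(6)])
```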
Properties

Consider the random variable $X_i$ which has value 1 if the i-th trial produces a good quality object and 0 otherwise. Now consider the following probabilities:

$$P(X_1 = 1) = \frac{N}{N + M}$$

$$P(X_2 = 1) = P(X_2 = 1 \mid X_1 = 1)\, P(X_1 = 1) + P(X_2 = 1 \mid X_1 = 0)\, P(X_1 = 0)$$

$$= \frac{N - 1}{N + M - 1} \cdot \frac{N}{N + M} + \frac{N}{N + M - 1} \cdot \frac{M}{N + M} = \frac{N}{N + M}$$

In general, $P(X_i = 1) = \dfrac{N}{N + M}$.
Properties

Note that $X = \sum_{i=1}^{n} X_i$, and each $X_i$ is a Bernoulli random variable with parameter $p = N/(N + M)$.

$$E[X] = \sum_{i=1}^{n} E[X_i] = \frac{nN}{N + M}$$

$$\mathrm{Var}(X) = \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{i=1}^{n} \sum_{j > i} \mathrm{Cov}(X_i, X_j)$$

$$\mathrm{Var}(X_i) = P(X_i = 1)\left(1 - P(X_i = 1)\right) = \frac{NM}{(N + M)^2}$$
Properties

Note that:

$$\mathrm{Var}(X) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2 \sum_{i=1}^{n} \sum_{j > i} \mathrm{Cov}(X_i, X_j), \qquad \mathrm{Var}(X_i) = \frac{NM}{(N + M)^2}$$

$$\mathrm{Cov}(X_i, X_j) = E[X_i X_j] - E[X_i]\,E[X_j]$$

$$E[X_i X_j] = P(X_i X_j = 1) = P(X_j = 1, X_i = 1) = P(X_j = 1 \mid X_i = 1)\, P(X_i = 1) = \frac{N - 1}{N + M - 1} \cdot \frac{N}{N + M}$$
Properties

Note that:

$$\mathrm{Cov}(X_i, X_j) = E[X_i X_j] - E[X_i]\,E[X_j] = \frac{N(N - 1)}{(N + M)(N + M - 1)} - \frac{N^2}{(N + M)^2} = \frac{-NM}{(N + M)^2 (N + M - 1)}$$
Properties

Note that:

$$\mathrm{Cov}(X_i, X_j) = \frac{-NM}{(N + M)^2 (N + M - 1)}$$

$$\mathrm{Var}(X) = \frac{nNM}{(N + M)^2} - \frac{n(n - 1)NM}{(N + M)^2 (N + M - 1)} = \frac{nNM}{(N + M)^2} \left[ 1 - \frac{n - 1}{N + M - 1} \right] = np(1 - p)\left[ 1 - \frac{n - 1}{N + M - 1} \right]$$

Recall: each $X_i$ is a Bernoulli random variable with parameter $p = N/(N + M)$.

$$\mathrm{Var}(X) \approx np(1 - p) \quad \text{when N and/or M is/are very large}$$
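The approximation in the last line can be checked numerically. This sketch (ours, with assumed values) holds n and p = N/(N+M) fixed while the population grows, and compares the exact hypergeometric variance with np(1-p):

```python
def hypergeom_var(N, M, n):
    """Exact variance of the hypergeometric count of good objects."""
    p = N / (N + M)
    return n * p * (1 - p) * (1 - (n - 1) / (N + M - 1))

n = 20
for scale in [1, 10, 100, 1000]:
    N, M = 30 * scale, 70 * scale          # keeps p = 0.3 while the population grows
    p = N / (N + M)
    print(N + M, hypergeom_var(N, M, n), n * p * (1 - p))
```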
Gaussian distribution
Definition

A continuous random variable is said to be normally distributed with parameters mean μ and standard deviation σ if it has a probability density function given as:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad \text{denoted as } N(\mu, \sigma^2)$$

This pdf is symmetric about the mean μ and has the shape of the "bell curve".
https://upload.wikimedia.org/wikipedia/commons/7/74/Normal_Distribution_PDF.svg
Definition

If μ = 0 and σ = 1, it is called the standard normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right), \quad \text{denoted as } N(0, 1)$$

To verify that this is a valid pdf:

$$\left( \int_{-\infty}^{\infty} e^{-x^2}\, dx \right)^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)}\, dx\, dy = \int_{0}^{2\pi} \int_{0}^{\infty} e^{-r^2} r\, dr\, d\theta$$

(note the change from (x, y) to polar coordinates (r, θ): x = r cos(θ), y = r sin(θ))

$$= 2\pi \int_{0}^{\infty} e^{-r^2} r\, dr = \pi \int_{0}^{\infty} e^{-s}\, ds = \pi$$

$$\Rightarrow \int_{-\infty}^{\infty} \frac{1}{(1/\sqrt{2})\sqrt{2\pi}}\, e^{-x^2}\, dx = 1$$

This is a Gaussian pdf with mean 0 and standard deviation 1/sqrt(2). Thus we have verified that this particular Gaussian function is a valid pdf. You can verify that Gaussians with arbitrary mean and variance are valid pdfs by a change of variables.
Properties

Mean:

$$E[X] = \int_{-\infty}^{\infty} x\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/(2\sigma^2)}\, dx = \int_{-\infty}^{\infty} (y + \mu)\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-y^2/(2\sigma^2)}\, dy \quad (y = x - \mu)$$

$$= 0 + \mu \quad (\text{why?})$$

$$\Rightarrow E[X] = \mu$$
Properties

Variance:

$$E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2/(2\sigma^2)}\, dx = \int_{-\infty}^{\infty} y^2\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-y^2/(2\sigma^2)}\, dy = \sigma^2 \quad (\text{why?})$$
Properties

If $X \sim N(\mu, \sigma^2)$ and if $Y = aX + b$, then $Y \sim N(a\mu + b,\; a^2\sigma^2)$.

Proof on board. And in the book.
Properties

Median = mean (why?). Because of symmetry of the pdf about the mean.

Mode = mean. This can be checked by setting the first derivative of the pdf to 0 and solving, and checking the sign of the second derivative.

CDF for a 0-mean Gaussian with variance 1 is given by:

$$\Phi(x) = F_X(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$$
Properties

CDF: it is given by:

$$\Phi(x) = F_X(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt$$

It is closely related to the error function erf(x), defined as:

$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_{0}^{x} e^{-t^2}\, dt$$

It follows that (verify for yourself):

$$\Phi(x) = \frac{1}{2}\left[ 1 + \mathrm{erf}\left( \frac{x}{\sqrt{2}} \right) \right]$$
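In code, the standard-normal CDF can be written directly from this identity using math.erf from the Python standard library; a small sketch:

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Phi((x - mu)/sigma) via the error-function identity above."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

print(normal_cdf(0.0))        # 0.5 by symmetry
print(normal_cdf(1.96))       # about 0.975
```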
Properties

For a Gaussian with mean μ and standard deviation σ, it follows that:

$$P(X \le x) = \frac{1}{2}\left[ 1 + \mathrm{erf}\left( \frac{x - \mu}{\sigma\sqrt{2}} \right) \right]$$

The probability that such a Gaussian random variable takes values from μ - nσ to μ + nσ is given by:

$$\Phi(n) - \Phi(-n) = \frac{1}{2}\left[ 1 + \mathrm{erf}\left( \frac{n}{\sqrt{2}} \right) \right] - \frac{1}{2}\left[ 1 + \mathrm{erf}\left( \frac{-n}{\sqrt{2}} \right) \right] = \mathrm{erf}\left( \frac{n}{\sqrt{2}} \right)$$
Properties

The probability that a Gaussian random variable takes values from μ - nσ to μ + nσ is given by:

$$\Phi(n) - \Phi(-n) = \mathrm{erf}\left( \frac{n}{\sqrt{2}} \right)$$

n | Φ(n) - Φ(-n)
1 | 68.2%
2 | 95.4%
3 | 99.7%
4 | 99.99366%
5 | 99.9999%
6 | 99.9999998%

Hence a Gaussian random variable lies within ±3σ of its mean with more than 99% probability.
Properties

MGF:

$$\phi_X(t) = \exp\left( \mu t + \frac{\sigma^2 t^2}{2} \right)$$

Proof here.
A strange phenomenon

Let's say you draw n = 2 values, called $x_1$ and $x_2$, from a [0,1] uniform random distribution and compute:

$$y_j = \sqrt{n}\left( \frac{\sum_{i=1}^{n} x_i}{n} - \mu \right)$$

(where μ is the true mean of the uniform random distribution; i is the sampling index, 1 ≤ i ≤ n, and j is the iteration index, 1 ≤ j ≤ m).

You repeat this process some m = 5000 times (say), and then plot the histogram of the computed values $\{y_j\},\ 1 \le j \le m$.

Now suppose you repeat the earlier two steps with larger and larger n.
A strange phenomenon

Now suppose you repeat the earlier two steps with larger and larger n. It turns out that as n grows larger and larger, the histogram starts resembling a 0-mean Gaussian distribution with variance equal to that of the sampling distribution (i.e. the [0,1] uniform distribution).

Now if you repeat the experiment with samples drawn from any other distribution instead of the [0,1] uniform (i.e. you change the sampling distribution), the phenomenon still occurs, though the resemblance may start showing up at smaller or larger values of n.

This leads us to a very interesting theorem called the central limit theorem.

Demo code: http://www.cse.iitb.ac.in/~ajitvr/CS215_Fall2017/CLT/
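The original demo code is linked above; the following is a compact Python re-creation of the same experiment (a sketch, with the iteration count and sample sizes chosen by us):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
m = 5000                                    # number of repetitions
mu = 0.5                                    # true mean of the Uniform[0, 1] sampling distribution

for n in [2, 5, 30]:                        # sample sizes; the Gaussian shape emerges as n grows
    x = rng.uniform(0.0, 1.0, size=(m, n))
    y = np.sqrt(n) * (x.mean(axis=1) - mu)  # y_j for each of the m iterations
    plt.hist(y, bins=60, density=True, alpha=0.5, label=f"n = {n}")

plt.legend()
plt.title("Histogram of y_j: approaches N(0, var of Uniform[0,1] = 1/12)")
plt.show()
```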
Central limit theorem

Consider $X_1, X_2, \ldots, X_n$ to be a sequence of independent and identically distributed (i.i.d.) random variables, each with mean μ and variance σ² < ∞. Then as n → ∞, the distribution (i.e. CDF) of the following quantity:

$$Y_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n}}$$

converges to that of N(0, σ²). Or, we say $Y_n$ converges in distribution to N(0, σ²). This is called the Lindeberg-Levy central limit theorem.
Central limit theorem: some comments

Note that the random variables $X_1, X_2, \ldots, X_n$ must be independent and identically distributed.

Convergence in distribution means the following:

$$\lim_{n \to \infty} P(Y_n \le z) = \Phi(z / \sigma)$$

There is a version of the central limit theorem that requires only independence, and allows the random variables to belong to different distributions. This extension is called the Lindeberg central limit theorem, and is given on the next slide.
Lindeberg's Central limit theorem

Consider $X_1, X_2, \ldots, X_n$ to be a sequence of independent random variables, each with mean $\mu_i$ and variance $\sigma_i^2 < \infty$. Let $s_n^2 = \sum_{i=1}^{n} \sigma_i^2$. Then as n → ∞, the distribution of the following quantity:

$$Y_n = \frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i)$$

converges to that of N(0, 1), provided for every ε > 0:

$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{i=1}^{n} E\left[ (X_i - \mu_i)^2 \cdot 1_{\{|X_i - \mu_i| > \varepsilon s_n\}} \right] = 0$$

Here $1_A$ is the indicator function: $1_A(x) = 1$ if $x \in A$, and 0 otherwise.
Lindeberg's Central limit theorem

Informally speaking, the take-home message from the previous slide is that the CLT is valid even if the random variables come from different distributions.

This provides a major motivation for the widespread usage of the Gaussian distribution. The errors in experimental observations are often modelled as Gaussian, because these errors often stem from many different independent sources and are modelled as weighted combinations of errors from each such source.
Central limit theorem versus law of large numbers

The law of large numbers says that the empirical mean calculated from a large number of samples is equal to (or very close to) the true mean μ (of the distribution from which the samples were drawn).

The central limit theorem says that the empirical mean calculated from a large number of samples is a random variable drawn from a Gaussian distribution with mean equal to the true mean μ (of the distribution from which the samples were drawn). In other words, the empirical mean can take any of a range of values around μ.
Central limit theorem versus law of large numbers

Is this a contradiction?
Central limit theorem versus law of large numbers

The answer is NO! Go and look back at the central limit theorem.

$$Y_n = \frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n}} \sim N(0, \sigma^2)$$

$$\Rightarrow \frac{\sum_{i=1}^{n} X_i}{n} - \mu \sim N(0, \sigma^2 / n) \quad (\text{why?})$$

$$\Rightarrow \frac{\sum_{i=1}^{n} X_i}{n} \sim N(\mu, \sigma^2 / n)$$

This variance drops to 0 when n is very large! All the probability is now concentrated at the mean!
Proof of Lindeberg-Levy CLT using MGFs

Consider the n i.i.d. random variables $X_1, X_2, \ldots, X_n$ with mean μ and variance σ². Let their sum be $S_n$. Then we have to prove that:

$$\lim_{n \to \infty} P\left( \frac{S_n - n\mu}{\sigma\sqrt{n}} \le x \right) = \int_{-\infty}^{x} \frac{e^{-y^2/2}}{\sqrt{2\pi}}\, dy$$

For that we will prove that the MGF of $Z_n$ equals the MGF of the standard normal distribution (i.e. exp(t²/2)), where

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}$$
Proof of Lindeberg-Levy CLT using MGFs

By properties of the MGF, we have:

$$\phi_{S_n}(t) = \left( \phi_X(t) \right)^n$$

Recall: if $Y = aX + b$, then $\phi_Y(t) = e^{tb}\, \phi_X(at)$. Assuming without loss of generality that μ = 0 (otherwise work with $X_i - \mu$), we have $Z_n = S_n/(\sigma\sqrt{n})$ and hence

$$\phi_{Z_n}(t) = \left( \phi_X\!\left( \frac{t}{\sigma\sqrt{n}} \right) \right)^n$$

We need to prove that:

$$\lim_{n \to \infty} n \log \phi_X\!\left( \frac{t}{\sigma\sqrt{n}} \right) = \frac{t^2}{2}$$

Recall: for a Gaussian r.v. with mean μ and std. dev. σ, $\phi_X(t) = \exp\left( \mu t + \sigma^2 t^2 / 2 \right)$.
Proof of Lindeberg-Levy CLT using MGFs

Labelling $x = 1/\sqrt{n}$, we have:

$$\lim_{x \to 0} \frac{\log \phi_X(tx/\sigma)}{x^2} = \lim_{x \to 0} \frac{t\, \phi_X'(tx/\sigma)}{2\sigma x\, \phi_X(tx/\sigma)} \quad (\text{L'Hospital's rule})$$

$$= \lim_{x \to 0} \frac{t^2\, \phi_X''(tx/\sigma)}{2\sigma^2\, \phi_X(tx/\sigma) + 2\sigma t x\, \phi_X'(tx/\sigma)} \quad (\text{L'Hospital's rule again})$$

$$= \frac{t^2\, \phi_X''(0)}{2\sigma^2\, \phi_X(0)} = \frac{t^2\, E[X^2]}{2\sigma^2} = \frac{t^2 \sigma^2}{2\sigma^2} = \frac{t^2}{2}$$

Recall: $\phi_X^{(r)}(0) = E[X^r]$, so $\phi_X(0) = 1$, $\phi_X'(0) = \mu = 0$, and $\phi_X''(0) = E[X^2] = \sigma^2$.
Application

Your friend tells you that in 10000 successive independent unbiased coin tosses, (s)he counted 5200 heads. Is (s)he serious or joking?

Answer: Let $X_1, X_2, \ldots, X_n$ be the random variables indicating whether or not each coin toss was a success (a heads). These are i.i.d. random variables whose sum is a random variable with mean nμ = 10000(0.5) = 5000 and standard deviation $\sigma\sqrt{n}$ = sqrt(0.5(1 - 0.5)) · sqrt(10000) = 50.
Application

Your friend tells you that in 10000 successive independent unbiased coin tosses, (s)he counted 5200 heads. Is (s)he serious or joking?

Answer: The given number of heads is 5200, which is 4 standard deviations away from the mean. The chance of being that far from the mean is of the order of 0.00001 (see the slide on error functions), since the total number of heads is approximately a Gaussian random variable (as per the central limit theorem). So your friend is (most likely) joking.

Notice that this answer is much more principled than giving an answer purely based on some arbitrary threshold over |X - 5000|. You will study much more of this when you do a topic called hypothesis testing.
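The same calculation can be done in a few lines of code (our sketch), using the normal approximation suggested by the CLT:

```python
from math import erf, sqrt

n, p, observed = 10_000, 0.5, 5200
mean = n * p                               # 5000
sd = sqrt(n * p * (1 - p))                 # 50

z = (observed - mean) / sd                 # 4 standard deviations
# Two-sided tail probability of being at least this far from the mean
tail = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
print(z, tail)                             # roughly 4 and 6e-5
```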
Binomial distribution and Gaussian distribution
Binomial distribution and Gaussian distribution

The binomial distribution begins to resemble a Gaussian distribution with an appropriate mean for large values of n. In fact, this resemblance begins to show up for surprisingly small values of n.

Recall that a binomial random variable is the number of successes in n independent Bernoulli trials:

$$X = \sum_{i=1}^{n} X_i, \qquad X_i = 1 \ (\text{heads on the } i\text{-th trial}), \text{ else } 0$$
Binomial distribution and Gaussian distribution

Each $X_i$ has a mean of p and a variance of p(1 - p). Hence the following random variable is (approximately) a standard normal random variable by the CLT:

$$\frac{X - np}{\sqrt{np(1 - p)}}$$

Watch the animation here.
Binomial distribution and Gaussian distribution

Another way of stating the aforementioned facts is that, as n → ∞, for all a, b with a ≤ b, we have:

$$P\left( a \le \frac{X - np}{\sqrt{np(1 - p)}} \le b \right) \to \Phi(b) - \Phi(a), \qquad \text{where } X \sim \mathrm{Binomial}(n, p)$$

This is called the de Moivre-Laplace theorem and is a special case of the CLT. But its proof was published almost 80 years before that of the CLT!
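A short numerical check of the de Moivre-Laplace approximation (a sketch with assumed n, p and interval), comparing an exact binomial probability against Φ(b) - Φ(a):

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

n, p = 100, 0.3
mean, sd = n * p, sqrt(n * p * (1 - p))

lo, hi = 25, 35                                    # P(25 <= X <= 35) for X ~ Binomial(100, 0.3)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(lo, hi + 1))
approx = phi((hi - mean) / sd) - phi((lo - mean) / sd)
print(exact, approx)
```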
Distribution of the sample mean

Consider independent and identically distributed random variables $X_1, X_2, \ldots, X_n$ with mean μ and standard deviation σ. We know that the sample mean (or empirical mean) is a random variable given by:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Note yet again: the true mean μ is NOT a random variable. The sample mean is, and its value converges to the true mean μ by the law of large numbers.
Distribution of the sample mean

Now we have:

$$E[\bar{X}] = E\left[ \frac{\sum_{i=1}^{n} X_i}{n} \right] = \mu, \qquad \mathrm{Var}(\bar{X}) = \frac{\sum_{i=1}^{n} \mathrm{Var}(X_i)}{n^2} = \frac{\sigma^2}{n}$$

If $X_1, X_2, \ldots, X_n$ were normal random variables, then it can be proved that $\bar{X}$ is also a normal random variable (how?). Otherwise, if $X_1, X_2, \ldots, X_n$ weren't normal random variables, $\bar{X}$ would be only approximately normally distributed, as per the central limit theorem.
Distribution of the sample variance

The sample variance is given by:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}$$

The sample standard deviation is S.
Distribution of the sample variance

The expected value of the sample variance is derived as follows:

$$E[(n - 1)S^2] = E\left[ \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right] = nE[X_1^2] - nE[\bar{X}^2]$$

Using $E[W^2] = \mathrm{Var}(W) + (E[W])^2$:

$$E[(n - 1)S^2] = n\,\mathrm{Var}(X_1) + n(E[X_1])^2 - n\,\mathrm{Var}(\bar{X}) - n(E[\bar{X}])^2$$
Distribution of the sample variance

The expected value of the sample variance is derived as follows:

$$E[(n - 1)S^2] = n\,\mathrm{Var}(X_1) + n(E[X_1])^2 - n\,\mathrm{Var}(\bar{X}) - n(E[\bar{X}])^2$$

$$= n\sigma^2 + n\mu^2 - n(\sigma^2 / n) - n\mu^2 = (n - 1)\sigma^2$$

$$\Rightarrow E[S^2] = \sigma^2$$
Distribution of the sample variance

The expected value of the sample variance is derived as follows:

$$E[(n - 1)S^2] = (n - 1)\sigma^2 \ \Rightarrow\ E[S^2] = \sigma^2$$

If the sample variance were instead defined as

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n},$$

we would have:

$$E[S^2] = \frac{(n - 1)\sigma^2}{n}$$

This is undesirable, as we would like the expected value of the sample variance to equal the true variance! Hence this naive estimate is multiplied by n/(n - 1) (equivalently, the denominator n is replaced by n - 1), giving rise to our strange-looking definition of the sample variance. This correction is called Bessel's correction.
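A simulation illustrating why dividing by n - 1 rather than n makes the estimator unbiased (our own sketch; the sampling distribution and sample size are assumed for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0                                    # sample from N(0, sigma^2) with sigma^2 = 4
n, trials = 5, 200_000

x = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
s2_bessel = x.var(axis=1, ddof=1)                 # divide by n - 1
s2_naive = x.var(axis=1, ddof=0)                  # divide by n

print(s2_bessel.mean(), s2_naive.mean(), (n - 1) / n * true_var)
# s2_bessel.mean() is close to 4, while s2_naive.mean() is close to (n-1)/n * 4 = 3.2
```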
Distribution of the sample variance

But the mean and the variance alone do not determine the distribution of a random variable. So what about the distribution of the sample variance? For that we need to study another distribution first: the chi-squared distribution.
Chi-square distribution

If $Z_1, Z_2, \ldots, Z_n$ are independent standard normal random variables, then the quantity

$$X = Z_1^2 + Z_2^2 + \cdots + Z_n^2$$

is said to have a chi-square distribution with n degrees of freedom, denoted $X \sim \chi_n^2$. The formula for its pdf is as follows:

$$f_X(x) = \frac{x^{n/2 - 1}\, e^{-x/2}}{2^{n/2}\, \Gamma(n/2)}, \quad x > 0$$

where $\Gamma(y) = (y - 1)!$ for integer y, and $\Gamma(y) = \int_{0}^{\infty} x^{y - 1} e^{-x}\, dx$ for real y.
Chi-square distribution

To obtain the expression for the chi-square distribution when n = 1: let $Z_1 \sim N(0, 1)$ and $X = Z_1^2$. Then

$$F_X(x) = P(Z_1^2 \le x) = P(-\sqrt{x} \le Z_1 \le \sqrt{x}) = F_{Z_1}(\sqrt{x}) - F_{Z_1}(-\sqrt{x})$$

$$f_X(x) = \frac{1}{2\sqrt{x}}\left[ f_{Z_1}(\sqrt{x}) + f_{Z_1}(-\sqrt{x}) \right] = \frac{1}{2\sqrt{x}} \cdot \frac{2\, e^{-x/2}}{\sqrt{2\pi}} = \frac{e^{-x/2}}{\sqrt{2\pi x}}$$
Chi-square distribution

MGF of a chi-square distribution with n degrees of freedom:

$$\phi_X(t) = (1 - 2t)^{-n/2}$$

Proof on the board. And here. Please note that the aforementioned MGF is defined only for t < 1/2.
Additive property

If $X_1$ and $X_2$ are independent chi-square random variables with $n_1$ and $n_2$ degrees of freedom respectively, then $X_1 + X_2$ is also a chi-square random variable with $n_1 + n_2$ degrees of freedom. This is called the additive property. It is easy to prove this property by observing that $X_1 + X_2$ is basically the sum of squares of $n_1 + n_2$ independent standard normal random variables.
Chi-square distribution

Tables for the chi-square distribution are available for different numbers of degrees of freedom and for different values of the independent variable.
Back to the distribution of the sample variance

$$(n - 1)S^2 = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2$$

Dividing throughout by σ² and rearranging:

$$\frac{(n - 1)S^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \right)^2 = \sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2$$

The second term on the left is the square of a standard normal random variable; the right-hand side is the sum of squares of n standard normal random variables.
Back to the distribution of the sample variance

$$\frac{(n - 1)S^2}{\sigma^2} + \left( \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \right)^2 = \sum_{i=1}^{n} \left( \frac{X_i - \mu}{\sigma} \right)^2$$

The right-hand side is the sum of squares of n standard normal random variables, and the second term on the left is the square of a standard normal random variable.

It turns out that the two terms on the left are independent random variables. The proof of this requires multivariate statistics and transformation of random variables, and is deferred to a later point in the course. If you are curious, you can browse this link, but it's not on the exam for now.

Given this fact about independence, it then follows that $(n - 1)S^2/\sigma^2$ has a chi-square distribution with n - 1 degrees of freedom.
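A simulation consistent with this result (our own sketch, with assumed parameters): draw repeated normal samples, compute (n-1)S²/σ², and compare its histogram with the chi-square density with n-1 degrees of freedom.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

rng = np.random.default_rng(3)
mu, sigma, n, trials = 10.0, 2.0, 8, 100_000       # assumed parameters

x = rng.normal(mu, sigma, size=(trials, n))
stat = (n - 1) * x.var(axis=1, ddof=1) / sigma**2  # (n-1) S^2 / sigma^2 for each sample

grid = np.linspace(0, 30, 300)
plt.hist(stat, bins=80, density=True, alpha=0.5, label="simulated (n-1)S^2/sigma^2")
plt.plot(grid, chi2.pdf(grid, df=n - 1), label="chi-square, n-1 = 7 d.o.f.")
plt.legend()
plt.show()
```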