HOW MANY TRIALS IT TAKES TO FIND ALL DIFFERENT SPECIES OF A KIND

Vassilis G. Papanicolaou
Department of Mathematics
National Technical University of Athens
Zografou Campus, 157 80 Athens, GREECE
e-mail: papanico@math.ntua.gr
• Consider a population whose members are of N different types (e.g. colors).
• For 1 ≤ j ≤ N we denote by p_j the probability that a member of the population is of type j.
• The members of the population are sampled independently with replacement and their types are recorded.
• Our main object of study is the number T_N of trials it takes until all N types are detected (at least once).
• Of course, P{T_N ≥ k} = 1 if 1 ≤ k ≤ N.
It is convenient to introduce the events A_j^k, 1 ≤ j ≤ N, that type j is not detected up to and including trial k. Then

P{T_N ≥ k} = P(A_1^{k−1} ∪ ··· ∪ A_N^{k−1}),  k = 1, 2, ....

By invoking the inclusion-exclusion principle one gets

P{T_N ≥ k} = P(A_1^{k−1}) + ··· + P(A_N^{k−1})
           − P(A_1^{k−1} A_2^{k−1}) − ··· − P(A_{N−1}^{k−1} A_N^{k−1})
           ...
           + (−1)^{N−1} P(A_1^{k−1} ··· A_N^{k−1}),

or, in more compact notation,

P{T_N ≥ k} = Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} P(∩_{j∈J} A_j^{k−1}),   (1)

where the sum extends over all 2^N − 1 nonempty subsets J of {1,...,N}, while |J| denotes the cardinality of J.
Now

P(A_j^{k−1}) = (1 − p_j)^{k−1},   P(A_j^{k−1} A_i^{k−1}) = [1 − (p_j + p_i)]^{k−1},

and, in general, if J ⊆ {1,...,N}, then

P(∩_{j∈J} A_j^{k−1}) = (1 − Σ_{j∈J} p_j)^{k−1}.   (2)

Combining formulas (2) and (1) yields

P{T_N ≥ k} = Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} (1 − Σ_{j∈J} p_j)^{k−1},  k = 1, 2, ...   (3)

(valid also for the trivial case k = 1 under the convention 0^0 = 1).
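Formula (3) lends itself to a direct numerical check. The sketch below (the probability vector and sample size are arbitrary illustrative choices) evaluates the inclusion-exclusion sum over all nonempty subsets and compares it with a Monte Carlo estimate:

```python
import itertools
import random

def prob_T_at_least(k, p):
    """P{T_N >= k} via formula (3): inclusion-exclusion over nonempty subsets."""
    N = len(p)
    total = 0.0
    for m in range(1, N + 1):
        for J in itertools.combinations(range(N), m):
            s = sum(p[j] for j in J)
            total += (-1) ** (m - 1) * (1.0 - s) ** (k - 1)
    return total

def simulate_T(p, trials=100_000, seed=0):
    """Samples of T_N by direct sampling with replacement."""
    rng = random.Random(seed)
    N = len(p)
    samples = []
    for _ in range(trials):
        seen, t = set(), 0
        while len(seen) < N:
            t += 1
            seen.add(rng.choices(range(N), weights=p)[0])
        samples.append(t)
    return samples

p, k = [0.5, 0.3, 0.2], 5
exact = prob_T_at_least(k, p)
samples = simulate_T(p)
empirical = sum(t >= k for t in samples) / len(samples)
print(exact, empirical)  # exact value is 0.64; the estimate agrees to ~2 decimals
```

Note that Python's `0.0 ** 0` evaluates to `1.0`, which matches the convention 0^0 = 1 used for the case k = 1.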
Remark. A side result of the above analysis is the (somewhat nontrivial) algebraic formula

Σ_{J ⊆ {1,...,N}} (−1)^{|J|} (1 − Σ_{j∈J} p_j)^n = 0   for n = 0, 1, ..., N − 1.
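This identity is easy to verify numerically. The following sketch sums over all 2^N subsets, including J = ∅ (whose term equals 1), for an arbitrary test vector p:

```python
import itertools

def identity_sum(p, n):
    """Sum over ALL subsets J (including the empty set) of (-1)^{|J|} (1 - sum_{j in J} p_j)^n."""
    N = len(p)
    total = 0.0
    for m in range(N + 1):
        for J in itertools.combinations(range(N), m):
            s = sum(p[j] for j in J)
            total += (-1) ** m * (1.0 - s) ** n
    return total

p = [0.1, 0.2, 0.3, 0.4]
for n in range(len(p)):           # n = 0, 1, ..., N-1: the sum vanishes
    print(n, identity_sum(p, n))  # each value is ~0 up to rounding error
```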
Lemma 0. Let X be a random variable taking values in ℕ = {0, 1, 2, ...}. If g : ℕ → ℝ is a function such that E[g(X)] < ∞, then

E[g(X)] = g(0) + Σ_{k=1}^∞ [g(k) − g(k−1)] P{X ≥ k}.   (4)

In particular (g(k) = k),

E[X] = Σ_{k=1}^∞ P{X ≥ k}.   (5)
Proof. We have

E[g(X)] = Σ_{k=0}^∞ g(k) P{X = k}
        = g(0) P{X = 0} + g(1) P{X = 1} + g(2) P{X = 2} + g(3) P{X = 3} + ···.

The above sum can be rewritten as

g(0) [P{X = 0} + P{X = 1} + P{X = 2} + P{X = 3} + ···]
+ [g(1) − g(0)] [P{X = 1} + P{X = 2} + P{X = 3} + ···]
+ [g(2) − g(1)] [P{X = 2} + P{X = 3} + ···]
+ [g(3) − g(2)] [P{X = 3} + ···]
...

which is the right-hand side of (4). ∎

Remark. If g : ℕ → ℝ is increasing, then (4) is valid even if E[g(X)] = ∞ (similarly if g is decreasing).
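The tail-sum formula (4) can be checked on any finite distribution. Below is a small sketch using a uniform distribution on {0,...,9} and g(k) = k² (both arbitrary choices; the function name is ours):

```python
def tail_sum_expectation(pmf, g):
    """E[g(X)] via formula (4): g(0) + sum_k [g(k) - g(k-1)] P{X >= k}, for a finite-support pmf."""
    kmax = max(pmf)
    total = g(0)
    for k in range(1, kmax + 1):
        tail = sum(prob for x, prob in pmf.items() if x >= k)  # P{X >= k}
        total += (g(k) - g(k - 1)) * tail
    return total

# X uniform on {0,...,9}, g(k) = k^2
pmf = {k: 0.1 for k in range(10)}
direct = sum(prob * x**2 for x, prob in pmf.items())
via_tails = tail_sum_expectation(pmf, lambda k: k**2)
print(direct, via_tails)  # both equal 28.5
```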
Combining (3) with (5) yields

E[T_N] = Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} Σ_{k=1}^∞ (1 − Σ_{j∈J} p_j)^{k−1}.

Summation of the geometric series gives

E[T_N] = Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} (Σ_{j∈J} p_j)^{−1},   (6)

or

E[T_N] = Σ_{m=1}^N (−1)^{m−1} Σ_{1 ≤ j_1 < ··· < j_m ≤ N} 1/(p_{j_1} + ··· + p_{j_m}).
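Formula (6) is directly computable when N is small (the sum has 2^N − 1 terms). A minimal sketch, using the equal-probability closed form N·H_N (derived later in these notes) as a sanity check:

```python
import itertools

def expected_T(p):
    """E[T_N] via formula (6): alternating sum of reciprocal subset-probability sums."""
    N = len(p)
    total = 0.0
    for m in range(1, N + 1):
        for J in itertools.combinations(range(N), m):
            total += (-1) ** (m - 1) / sum(p[j] for j in J)
    return total

N = 4
harmonic = sum(1.0 / m for m in range(1, N + 1))
print(expected_T([1.0 / N] * N), N * harmonic)  # both equal 25/3 ≈ 8.3333
print(expected_T([0.5, 0.3, 0.2]))              # unequal probabilities give a smaller N but larger relative spread
```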
We proceed by noticing that

Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} exp(−t Σ_{j∈J} p_j) = 1 − ∏_{j=1}^N (1 − e^{−p_j t}).

Thus,

∫_0^∞ [1 − ∏_{j=1}^N (1 − e^{−p_j t})] dt = Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} / Σ_{j∈J} p_j,

and hence

E[T_N] = ∫_0^∞ [1 − ∏_{j=1}^N (1 − e^{−p_j t})] dt,   (7)

or, by substituting x = e^{−t} in the integral,

E[T_N] = ∫_0^1 [1 − ∏_{j=1}^N (1 − x^{p_j})] dx/x.   (8)

Formulas (6), (7), and (8) are well known.
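The integral representation (7) can be evaluated by elementary quadrature. The sketch below uses a midpoint rule (the truncation point and step count are arbitrary choices) and compares the result against the equal-probability value N·H_N derived later:

```python
import math

def expected_T_integral(p, T=400.0, steps=200_000):
    """E[T_N] via formula (7), midpoint rule on [0, T]; the tail beyond T is
    negligible provided T * min(p) is large."""
    h = T / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        prod = 1.0
        for pj in p:
            prod *= 1.0 - math.exp(-pj * t)
        total += (1.0 - prod) * h
    return total

# for equal probabilities the answer should match the closed form N * H_N
N = 5
approx = expected_T_integral([1.0 / N] * N)
exact = N * sum(1.0 / m for m in range(1, N + 1))
print(approx, exact)  # both ≈ 11.4167
```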
Likewise, for the generating function of T_N,

F(z) := E[z^{−T_N}],

we have the formulas

F(z) = 1 − (z − 1) Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} / (z − 1 + Σ_{j∈J} p_j),

F(z) = 1 − (z − 1) ∫_0^∞ [1 − ∏_{j=1}^N (1 − e^{−p_j t})] e^{−(z−1)t} dt,

and

F(z) = 1 − (z − 1) ∫_0^1 [1 − ∏_{j=1}^N (1 − x^{p_j})] x^{z−2} dx.
And for the second moment of T_N we have

E[T_N(T_N + 1)] = 2 Σ_{∅ ≠ J ⊆ {1,...,N}} (−1)^{|J|−1} (Σ_{j∈J} p_j)^{−2},

E[T_N(T_N + 1)] = 2 ∫_0^∞ [1 − ∏_{j=1}^N (1 − e^{−p_j t})] t dt,

and

E[T_N(T_N + 1)] = −2 ∫_0^1 [1 − ∏_{j=1}^N (1 − x^{p_j})] (ln x / x) dx.
Naturally, the simplest case of the previous formulas occurs when one takes p_1 = ··· = p_N = 1/N. It is easy to check (e.g. by taking logarithms) that, for any fixed t > 0, the maximum of the quantity

∏_{j=1}^N (1 − e^{−p_j t}),

subject to the constraint p_1 + ··· + p_N = 1, occurs when all p_j's are equal. It follows that E[T_N] attains its minimum value when all p_j's are equal (see also M.V. Hildebrand [11]). The same is true for E[T_N(T_N + 1)]. As for F(z) = E[z^{−T_N}], z > 1, it follows that F(z) attains its maximum value when all p_j's are equal.
Let p_1 = ··· = p_N = 1/N. Then

E[T_N] = ∫_0^1 [1 − (1 − x^{1/N})^N] dx/x = N Σ_{m=1}^N (−1)^{m−1} (N choose m) 1/m.

Substituting u = 1 − x^{1/N} in the integral yields

E[T_N] = N ∫_0^1 (1 − u^N)/(1 − u) du = N ∫_0^1 Σ_{m=1}^N u^{m−1} du = N H_N,

where H_N is the N-th harmonic number

H_N = Σ_{m=1}^N 1/m.
In a similar way we get

F(z) = 1 − (z − 1) Σ_{m=1}^N (−1)^{m−1} (N choose m) / (z − 1 + m/N)
     = 1 − (z − 1) ∫_0^1 [1 − (1 − x^{1/N})^N] x^{z−2} dx,

and

E[T_N(T_N + 1)] = 2N² Σ_{m=1}^N H_m/m = N² (H_N² + Σ_{m=1}^N 1/m²),

which also implies

V[T_N] = N² Σ_{m=1}^N 1/m² − N H_N.
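These closed forms for the equal-probability mean and variance are easy to confirm by simulation. A sketch (the sample size, seed, and N = 10 are arbitrary choices):

```python
import random

def harmonic(N, power=1):
    """Generalized harmonic number: sum of 1/m^power for m = 1..N."""
    return sum(1.0 / m**power for m in range(1, N + 1))

def coupon_stats(N, trials=50_000, seed=0):
    """Empirical mean and variance of T_N when p_1 = ... = p_N = 1/N."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        seen, t = set(), 0
        while len(seen) < N:
            t += 1
            seen.add(rng.randrange(N))
        samples.append(t)
    mean = sum(samples) / trials
    var = sum((s - mean) ** 2 for s in samples) / (trials - 1)
    return mean, var

N = 10
mean, var = coupon_stats(N)
print(mean, N * harmonic(N))                         # E[T_N] = N * H_N
print(var, N**2 * harmonic(N, 2) - N * harmonic(N))  # V[T_N] = N^2 * H_N^(2) - N * H_N
```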
Reminder (the Euler-Maclaurin sum formula). If

S(N) = Σ_{m=0}^N f(m),

then, as N → ∞,

S(N) ∼ C + ∫_0^N f(x) dx + (1/2) f(N) + Σ_{k=1}^∞ (−1)^{k+1} B_{k+1}/(k+1)! · f^{(k)}(N),

where C is a constant and B_k is the k-th Bernoulli number defined by the formula

z/(e^z − 1) = Σ_{k=0}^∞ B_k z^k/k!

(e.g. B_0 = 1, B_1 = −1/2, B_2 = 1/6, B_4 = −1/30, and B_k = 0 for all odd k ≥ 3). For example,

H_N = Σ_{m=1}^N 1/m ∼ ln N + γ + 1/(2N) − Σ_{k=2}^∞ B_k/(k N^k),

where γ = 0.5772... is Euler's constant. Also

Σ_{m=1}^N 1/m² ∼ π²/6 − 1/N + 1/(2N²) − Σ_{k=2}^∞ B_k/N^{k+1}.
If p_1 = ··· = p_N = 1/N, then by using the Euler-Maclaurin sum formula one obtains

E[T_N] ∼ N ln N + γN + 1/2 − Σ_{k=2}^∞ B_k/(k N^{k−1}),

E[T_N(T_N + 1)] = N² [(ln N)² + 2γ ln N + γ² + π²/6 + O(ln N / N)],

and

V[T_N] = π²N²/6 − N ln N − (γ + 1)N + O(ln N / N),

as N → ∞.
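The expansion for E[T_N] converges quickly. The sketch below (function names are ours) compares the exact value N·H_N with the expansion truncated after the B_2 correction:

```python
import math

GAMMA = 0.5772156649015329  # Euler's constant

def exact_mean(N):
    """Exact E[T_N] = N * H_N for equal probabilities."""
    return N * sum(1.0 / m for m in range(1, N + 1))

def asymptotic_mean(N):
    """First terms of the expansion: N ln N + gamma*N + 1/2 - B_2/(2N), with B_2 = 1/6."""
    return N * math.log(N) + GAMMA * N + 0.5 - 1.0 / (12.0 * N)

for N in (10, 100, 1000):
    print(N, exact_mean(N), asymptotic_mean(N))  # agreement improves rapidly with N
```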
QUIZ. The town F has population 1825 (= 5 × 365), while the town S has population 2190 (= 6 × 365). Let f and s be the probabilities that all 365 birthdays are represented in F and S respectively. Estimate s/f.
Answer. s/f ≈ 4.8. In fact, f ≈ 0.085 and s ≈ 0.4051.

Hint. It can be shown (see R. Durrett [8]) that, as N → ∞,

P{T_N − N ln N ≤ Nx} → exp(−e^{−x}).
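The Gumbel limit in the hint gives the estimate directly. A sketch of the computation (the function name is ours); it reproduces the numbers above up to the accuracy of the approximation:

```python
import math

def prob_all_types(n_trials, N=365):
    """Gumbel approximation P{T_N <= n} ~ exp(-e^{-x}), with x = (n - N ln N) / N."""
    x = (n_trials - N * math.log(N)) / N
    return math.exp(-math.exp(-x))

f = prob_all_types(5 * 365)  # town F: 1825 people
s = prob_all_types(6 * 365)  # town S: 2190 people
print(f, s, s / f)           # f ≈ 0.086, s ≈ 0.405, ratio close to 4.8
```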
Some Applications. The above formulas are associated with what is usually called the "Coupon Collector Problem" (CCP), where N different coupons (of arbitrary occurrence frequencies) are sampled independently with replacement. We now mention some applications. The first three examples introduce probabilistic computational algorithms which can be modeled by the CCP.

1. Constraint classification in mathematical programming. In 1983, Karwan et al. [14] described a class of randomized algorithms for classifying all the constraints in a mathematical programming problem as necessary or redundant. The basic algorithm, also known as PREDUCE (Probabilistic REDUCE), can be briefly described as follows: given an interior feasible point, each iteration consists of generating a ray in a random direction and recording the first constraint it intersects. Such a constraint is a necessary one. The algorithm generates rays until a stopping rule is satisfied. Then, all the constraints which were not hit at all are classified as redundant (possibly erroneously). Each iteration corresponds to drawing one coupon, with N being the number of necessary constraints. Thus, the CCP model can help to determine an efficient stopping rule.
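As an illustration only (this is not Karwan et al.'s implementation), a toy version of the ray-shooting step might look as follows. All names and the example constraints are hypothetical; constraints are a_i · x ≤ b_i, and the first one hit along a ray from an interior point x0 is the one with the smallest positive step t:

```python
import math
import random

def first_constraint_hit(A, b, x0, d):
    """Index of the first constraint a_i . x <= b_i hit by the ray x0 + t*d, t > 0."""
    best_i, best_t = None, math.inf
    for i, (a, bi) in enumerate(zip(A, b)):
        ad = sum(aj * dj for aj, dj in zip(a, d))
        if ad > 1e-12:  # the ray moves toward this constraint's boundary
            t = (bi - sum(aj * xj for aj, xj in zip(a, x0))) / ad
            if 0 < t < best_t:
                best_i, best_t = i, t
    return best_i

def preduce(A, b, x0, n_rays, seed=0):
    """Sketch of the PREDUCE idea: fire random rays, collect hit constraints as necessary;
    unhit constraints are declared redundant (possibly erroneously)."""
    rng = random.Random(seed)
    necessary = set()
    dim = len(x0)
    for _ in range(n_rays):
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # random direction
        hit = first_constraint_hit(A, b, x0, d)
        if hit is not None:
            necessary.add(hit)
    redundant = set(range(len(A))) - necessary
    return necessary, redundant

# unit square 0 <= x <= 1, 0 <= y <= 1, plus a redundant constraint x + y <= 3
A = [[1, 0], [-1, 0], [0, 1], [0, -1], [1, 1]]
b = [1, 0, 1, 0, 3]
nec, red = preduce(A, b, [0.5, 0.5], n_rays=200)
print(sorted(nec), sorted(red))  # the redundant constraint (index 4) is never hit
```

Each fired ray plays the role of one coupon draw: the four facets of the square are the "types", and the number of rays needed to hit them all is exactly a T_N variable.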