Poisson Approximation for Two Scan Statistics with Rates of Convergence

Xiao Fang (joint work with David Siegmund)
National University of Singapore
May 28, 2015
Outline

- The first scan statistic
- The second scan statistic
- Other scan statistics
A statistical testing problem

Let $X_1, \dots, X_n$ be an independent sequence of random variables. We want to test the hypothesis

  $H_0$: $X_1, \dots, X_n \sim F_{\theta_0}(\cdot)$

against the alternative

  $H_1$: for some $i < j$,
    $X_{i+1}, \dots, X_j \sim F_{\theta_1}(\cdot)$,
    $X_1, \dots, X_i, X_{j+1}, \dots, X_n \sim F_{\theta_0}(\cdot)$.

Here $i$ and $j$ are called change-points; they are not specified in the alternative hypothesis. $\theta_0$ may be given, or may need to be estimated; $\theta_1$ may be given, or may be a nuisance parameter.
The first scan statistic

If $j - i = t$ is given and $F_{\theta_0}(\cdot)$ and $F_{\theta_1}(\cdot)$ have different means, a natural statistic is

  $M_{n;t} = \max_{1 \le i \le n-t+1} T_i$, where $T_i = X_i + \cdots + X_{i+t-1}$.

We are interested in its p-value: assuming $X_1, \dots, X_n \sim F_{\theta_0}(\cdot)$,

  $P(M_{n;t} \ge b) = P(\max_{1 \le i \le n-t+1} T_i \ge b) = \,?$
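The statistic on the slide above can be computed in linear time with a running window sum. A minimal sketch (not from the talk; the function name is ours):

```python
def scan_statistic(x, t):
    """M_{n;t} = max over all windows of length t of the window sum
    T_i = x[i] + ... + x[i+t-1], computed with a sliding running sum."""
    n = len(x)
    if t > n:
        raise ValueError("window longer than sequence")
    s = sum(x[:t])          # sum of the first window, T_1
    best = s
    for i in range(t, n):   # slide the window one step: add x[i], drop x[i-t]
        s += x[i] - x[i - t]
        best = max(best, s)
    return best
```

For example, `scan_statistic([1, -2, 3, 4, -1], 2)` scans the four windows of length 2 and returns the largest window sum, 7.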
Known results

Let $Y_i = I(T_i \ge b)$. Then $\{\max_{1 \le i \le n-t+1} T_i \ge b\} = \{\sum_{i=1}^{n-t+1} Y_i \ge 1\}$.

Dembo and Karlin (1992) proved that if $t$ is fixed and $b, n \to \infty$, then under mild conditions on $F_{\theta_0}(\cdot)$,

  $P(M_{n;t} \ge b) = P\big(\sum_{i=1}^{n-t+1} Y_i \ge 1\big) \to 1 - e^{-\lambda}$,

where $\lambda = (n - t + 1) E(Y_1)$. The mild conditions on $F_{\theta_0}(\cdot)$ ensure that $P(Y_{i+1} = 1 \mid Y_i = 1) \to 0$.
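The Dembo–Karlin approximation $1 - e^{-\lambda}$ with $\lambda = (n-t+1)E(Y_1) = (n-t+1)P(T_1 \ge b)$ is easy to evaluate exactly in the Bernoulli case, where $T_1 \sim \mathrm{Binomial}(t, p)$. A sketch under that assumption (function names are ours; this is the simple first-order $\lambda$, without any declumping correction):

```python
import math

def binom_tail(t, p, b):
    """P(Binomial(t, p) >= b), computed exactly."""
    return sum(math.comb(t, k) * p**k * (1 - p)**(t - k)
               for k in range(b, t + 1))

def poisson_approx(n, t, p, b):
    """Dembo-Karlin style approximation P(M_{n;t} >= b) ~ 1 - exp(-lam),
    with lam = (n - t + 1) * P(T_1 >= b), for i.i.d. Bernoulli(p) inputs."""
    lam = (n - t + 1) * binom_tail(t, p, b)
    return 1 - math.exp(-lam)
```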
$t \to \infty$: If $X_i \sim \mathrm{Bernoulli}(p)$ and $b$ is an integer, Arratia, Gordon and Waterman (1990) proved that

  $|P(M_{n;t} \ge b) - (1 - e^{-\lambda})| \le C \big( e^{-ct} + \tfrac{t}{n} \big)(\lambda \wedge 1)$   (1)

where $\lambda = (n - t + 1) P(T_1 = b)(\tfrac{b}{t} - p)$.

Haiman (2007) derived more accurate approximations using the distribution function of $Z_k := \max\{T_1, \dots, T_{kt+1}\}$ for $k = 1$ and $2$. These distribution functions are only known for Bernoulli and Poisson random variables.

Our objective is to extend (1) to other random variables.
Preparation for the main result

Let $\mu_0 = E(X_1)$. We assume $b = at$, where $a > \mu_0$, so

  $P\big(\max_{1 \le i \le n-t+1} T_i \ge b\big) = P\Big(\max_{1 \le i \le n-t+1} \tfrac{X_i + \cdots + X_{i+t-1}}{t} \ge a\Big)$.

We assume the distribution of $X_1$ can be imbedded in an exponential family of distributions

  $dF_\theta(x) = e^{\theta x - \Psi(\theta)}\, dF(x)$, $\theta \in \Theta$.   (2)

It is known that $F_\theta$ has mean $\Psi'(\theta)$ and variance $\Psi''(\theta)$. Assume $\theta_0 = 0$, i.e., $X_1 \sim F$, and that there exists $\theta_a \in \Theta^o$ such that $\Psi'(\theta_a) = a$.

Example: $X_1 \sim N(0, 1)$, $\Psi(\theta) = \theta^2/2$, $\theta_a = a$, $F_{\theta_a} \sim N(a, 1)$.
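For families other than the normal, $\theta_a$ rarely has the trivial form $\theta_a = a$ and must be found by solving $\Psi'(\theta_a) = a$. A minimal numerical sketch for the Bernoulli$(p)$ family, where $\Psi(\theta) = \log(1 - p + p e^{\theta})$ (the solver and its name are ours; $\Psi'$ is monotone increasing, so bisection suffices):

```python
import math

def theta_a_bernoulli(p, a, lo=-50.0, hi=50.0, tol=1e-12):
    """Solve Psi'(theta) = a for the Bernoulli(p) exponential family,
    where Psi(theta) = log(1 - p + p*exp(theta)), by bisection.
    Psi'(theta) = p*e^theta / (1 - p + p*e^theta) is increasing in theta."""
    def psi_prime(th):
        e = p * math.exp(th)
        return e / (1 - p + e)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi_prime(mid) < a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

As a sanity check, the Bernoulli tilt has the closed form $\theta_a = \log\frac{a(1-p)}{p(1-a)}$; for $p = 0.1$, $a = 0.4$ (the setting of Example 2 below) this is $\log 6$.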
Assumption (2) is used in two places:

1. To obtain an accurate approximation to the marginal probability $P(T_1 \ge at)$ by a change of measure.

2. Local limit theorem (Diaconis and Freedman, 1988):

  $d_{TV}\big(\mathcal{L}(X_1, \dots, X_m \mid T_1 = at),\ \mathcal{L}(X^a_1, \dots, X^a_m)\big) \le C m / t$,

where $X^a_1, \dots, X^a_m$ are i.i.d. and $X^a_1 \sim F_{\theta_a}$.

Let $D_k = \sum_{i=1}^k (X^a_i - X_i)$ and $\sigma^2_a = \Psi''(\theta_a)$.
Theorem. Under assumption (2), for some constant C depending only on the exponential family (2), $\mu_0$, and $a$, we have

  $\big| P(M_{n;t} \ge at) - (1 - e^{-\lambda}) \big| \le C \Big( \frac{(\log t)^2}{t} + \frac{\log t \wedge \log(n - t)}{n - t} \Big)(\lambda \wedge 1)$,

where, if $X_1$ is nonlattice plus mild conditions,

  $\lambda = (n - t + 1)\, e^{-[a\theta_a - \Psi(\theta_a)]t}\, \frac{1}{\theta_a \sigma_a (2\pi t)^{1/2}}\, \exp\Big[ -\sum_{k=1}^\infty \frac{1}{k} E(e^{-\theta_a D_k^+}) \Big]$,

and if $X_1$ is integer-valued with span 1,

  $\lambda = (n - t + 1)\, e^{-[a\theta_a - \Psi(\theta_a)]t}\, e^{-\theta_a(\lceil at \rceil - at)}\, \frac{1}{(1 - e^{-\theta_a}) \sigma_a (2\pi t)^{1/2}}\, \exp\Big[ -\sum_{k=1}^\infty \frac{1}{k} E(e^{-\theta_a D_k^+}) \Big]$.
Remarks:

- We don't have an explicit expression for the constant C.
- The relative error $\to 0$ if $t, n - t \to \infty$.

Let $g(x) = E e^{ixD_1}$ and $\xi(x) = \log\{1/[1 - g(x)]\}$. Woodroofe (1979) proved that for the nonlattice case,

  $\sum_{k=1}^\infty \frac{1}{k} E(e^{-\theta_a D_k^+}) = -\log[(a - \mu_0)\theta_a] - \frac{1}{\pi} \int_0^\infty \frac{\theta_a^2 \big[ \mathcal{I}\xi(x) - \frac{\pi}{2} \big]}{x(\theta_a^2 + x^2)}\, dx + \frac{1}{\pi} \int_0^\infty \frac{\theta_a \{ \mathcal{R}\xi(x) + \log[(a - \mu_0)x] \}}{\theta_a^2 + x^2}\, dx$,

where $\mathcal{R}$ and $\mathcal{I}$ denote real and imaginary parts. Tu and Siegmund (1999) proved that for the arithmetic case,

  $\sum_{k=1}^\infty \frac{1}{k} E(e^{-\theta_a D_k^+}) = -\log(a - \mu_0) + \frac{1}{2\pi} \int_0^{2\pi} \Big[ \frac{\xi(x)\, e^{-\theta_a - ix}}{1 - e^{-\theta_a - ix}} + \frac{\xi(x) + \log[(a - \mu_0)(1 - e^{ix})]}{1 - e^{ix}} \Big]\, dx$.
Example 1: Normal distribution.

  n     t    a    p1      p2
  1000  50   0.2  0.9315  0.9594
  1000  50   0.4  0.2429  0.2624
  1000  50   0.5  0.0331  0.0334
  2000  50   0.5  0.0668  0.0672
Example 2: Bernoulli distribution.

  n      t    mu0  a      p1        p2
  7680   30   0.1  11/30  0.14097   0.14021
  7680   30   0.1  0.4    0.029614  0.029387
  15360  30   0.1  0.4    0.058458  0.058003
Sketch of proof

Let $m = \lfloor C(\log t \wedge \log(n - t)) \rfloor$. Let

  $Y_i = I\big(T_i \ge at;\ T_{i+1} < T_i, \dots, T_{i+m} < T_i;\ T_{i-1} < T_i, \dots, T_{i-m} < T_i\big)$.

Let

  $W = \sum_{i=1}^{n-t+1} Y_i$,  $\lambda_1 = EW = (n - t + 1) E Y_1$.

Then $P(M_{n;t} \ge at) \approx P(W \ge 1)$. From the Poisson approximation theorem of Arratia, Goldstein and Gordon (1990), we have

  $\big| P(W \ge 1) - (1 - e^{-\lambda_1}) \big| \le C \Big( \frac{1}{t} + \frac{1}{n - t} \Big)(\lambda_1 \wedge 1)$.
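The declumped count $W$ above keeps only windows whose sum is a strict local maximum among its $m$ neighbours on each side, so each exceedance clump contributes one. A brute-force illustrative sketch (our own function; threshold written as a plain b rather than at):

```python
def declumped_count(x, t, b, m):
    """W = number of windows with T_i >= b whose sum strictly exceeds
    the sums of the m neighbouring windows on each side (declumping,
    as in the proof sketch). Brute force, for illustration only."""
    n = len(x)
    T = [sum(x[i:i + t]) for i in range(n - t + 1)]
    W = 0
    for i, Ti in enumerate(T):
        if Ti < b:
            continue
        nbrs = T[max(0, i - m): i] + T[i + 1: i + 1 + m]
        if all(Tj < Ti for Tj in nbrs):
            W += 1
    return W
```

On a 0/1 sequence with two well-separated runs of three ones, e.g. `[0,1,1,1,0,0,0,1,1,1,0]` with `t=3, b=3, m=2`, each run is one local maximum, so W = 2 even though several overlapping windows touch each run.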
Approximating $\lambda_1$ by $\lambda$:

  $EY_1 = P\big(T_1 \ge at;\ T_2 < T_1, \dots, T_{1+m} < T_1;\ T_0 < T_1, \dots, T_{1-m} < T_1\big)$
       $\approx P(T_1 \ge at)\, P^2\big(T_1 - T_2 > 0, \dots, T_1 - T_{1+m} > 0 \mid T_1 \approx at\big)$.

Note that $T_1 - T_2 = X_1 - X_{t+1}$ and that, given $T_1 \approx at$, $X_1 \sim F_{\theta_a}$ approximately and $X_{t+1} \sim F$. Thus $\{T_1 - T_2 > 0\} \approx \{D_1 > 0\}$, where $D_1 = X^a_1 - X_1$. Similarly, $\{T_1 - T_{k+1} > 0\} \approx \{D_k > 0\}$, $D_k = \sum_{i=1}^k (X^a_i - X_i)$.

Therefore,

  $EY_1 \approx P(T_1 \ge at)\, P^2(D_k > 0,\ k = 1, 2, \dots)$.

Recall

  $\lambda = (n - t + 1)\, e^{-[a\theta_a - \Psi(\theta_a)]t}\, \frac{1}{\theta_a \sigma_a (2\pi t)^{1/2}}\, \exp\Big[ -\sum_{k=1}^\infty \frac{1}{k} E(e^{-\theta_a D_k^+}) \Big]$.
Corollary. Let $X_1, \dots, X_n$ be i.i.d. random variables with distribution function F that can be imbedded in an exponential family, as in (2). Let $EX_1 = \mu_0$. Assume $X_1$ is integer-valued with span 1. Suppose $a = \sup\{x : p_x := P(X_1 = x) > 0\}$ is finite. Let $b = at$. Then we have, with constants C and c depending only on $p_a$,

  $\big| P(M_{n;t} \ge b) - (1 - e^{-\lambda}) \big| \le C(\lambda \wedge 1)\, e^{-ct}$,

where $\lambda = (n - t)\, p_a^t (1 - p_a) + p_a^t$.
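The corollary's $\lambda$ is explicit: a window reaches b = at only if all t of its entries equal the top support point a, and the two terms count clumps started by a non-a value and the boundary clump. A direct evaluation (our own function names):

```python
import math

def corollary_lambda(n, t, p_a):
    """lam = (n - t) * p_a**t * (1 - p_a) + p_a**t from the corollary,
    where p_a = P(X_1 = a) and a is the top of the (finite) support."""
    return (n - t) * p_a**t * (1 - p_a) + p_a**t

def p_value_approx(n, t, p_a):
    """Approximate P(M_{n;t} >= at) by 1 - exp(-lam)."""
    return 1 - math.exp(-corollary_lambda(n, t, p_a))
```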
The second scan statistic

Recall that we want to test $H_0$: $X_1, \dots, X_n \sim F_{\theta_0}(\cdot)$ against the alternative

  $H_1$: for some $i < j$,
    $X_{i+1}, \dots, X_j \sim F_{\theta_1}(\cdot)$,
    $X_1, \dots, X_i, X_{j+1}, \dots, X_n \sim F_{\theta_0}(\cdot)$.

Now assume $j - i$ is not given, and $F_{\theta_0}$ and $F_{\theta_1}$ are from the same exponential family of distributions

  $dF_\theta(x) = e^{\theta x - \Psi(\theta)}\, dF(x)$, $\theta \in \Theta$.

Then the log likelihood ratio statistic is

  $\max_{0 \le i < j \le n}\ (\theta_1 - \theta_0) \sum_{k=i+1}^{j} \Big( X_k - \frac{\Psi(\theta_1) - \Psi(\theta_0)}{\theta_1 - \theta_0} \Big)$.
It reduces to the following problem: let $X_1, \dots, X_n$ be independent, identically distributed random variables with $EX_1 = \mu_0 < 0$. Let $S_0 = 0$ and $S_i = \sum_{j=1}^{i} X_j$ for $1 \le i \le n$. We are interested in the distribution of

  $M_n := \max_{0 \le i < j \le n} (S_j - S_i)$.

Iglehart (1972) observed that $M_n$ can be interpreted as the maximum waiting time of the first n customers in a single-server queue. Karlin, Dembo and Kawabata (1990) discussed genomic applications.
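$M_n$ is the maximum sum over nonempty contiguous segments, and the queueing interpretation gives a one-pass recursion: the waiting-time (Lindley) update $W_j = \max(0, W_{j-1}) + X_j$, with $M_n = \max_j W_j$. A minimal sketch (function name is ours; this is the same recursion known in computing as Kadane's algorithm):

```python
def max_segment_sum(x):
    """M_n = max over 0 <= i < j <= n of (S_j - S_i): the largest sum of a
    nonempty contiguous segment, via the Lindley recursion
    W_j = max(0, W_{j-1}) + X_j, with M_n = max_j W_j."""
    best = float("-inf")
    w = 0.0
    for xi in x:
        w = max(0.0, w) + xi   # waiting time of the j-th customer
        best = max(best, w)
    return best
```

For example, on `[-1, 2, -1, 3, -5, 1]` the best segment is `[2, -1, 3]` with sum 4; when every increment is negative, the best segment is the single largest element.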
The limiting distribution was derived by Iglehart (1972). Assume the distribution of $X_1$ can be imbedded in an exponential family of distributions

  $dF_\theta(x) = e^{\theta x - \Psi(\theta)}\, dF(x)$, $\theta \in \Theta$.

Assume $EX_1 = \Psi'(0) = \mu_0 < 0$ and there exists a positive $\theta_1 \in \Theta$ such that $\Psi'(\theta_1) = \mu_1$, $\Psi(\theta_1) = 0$. When $X_1$ is nonlattice, we have

  $\lim_{n \to \infty} P\Big( M_n \ge \frac{\log n}{\theta_1} + x \Big) = 1 - \exp(-K^* e^{-\theta_1 x})$.