Local Independence Tests for Point Processes
Learning causality in event models
Nikolaj Thams, University of Copenhagen
November 21st, 2019 · Time to Event Data and Machine Learning Workshop
Joint work with Niels Richard Hansen
Outline: Hawkes Processes · Causality · Local independence test · Experimental results · Conclusion
Learning causality in event models?

[Figure: observed event streams for four processes a, b, c, and h on the time axis from 0 to T.]
Hawkes Processes
Point processes

A point process with marks $V = \{1, \ldots, d\}$ is a collection of random measures $N^k = \sum_i \delta_{T_i^k}$, where $T_i^k$ is the $i$'th event of type $k$. This defines counting processes $t \mapsto N_t^k := N^k(0, t]$.

If the compensator of $N_t^k$ equals $\int_0^t \lambda_s^k \, ds$ for some $\lambda^k$, then $\lambda^k$ is the intensity of $N^k$. Observe that $E[N_t^k] = E\big[\int_0^t \lambda_s^k \, ds\big]$.

Famous examples: the Poisson process ($\lambda_t$ constant) and the Hawkes process (next slide).
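As a quick numerical sanity check of the identity $E[N_T] = E\big[\int_0^T \lambda_s \, ds\big]$ in the simplest example, here is a minimal Python sketch (not from the talk; all names are mine) simulating a homogeneous Poisson process:

```python
# Minimal sketch (not from the talk): a homogeneous Poisson process on (0, T],
# checking that the mean event count matches the integrated intensity.
import numpy as np

rng = np.random.default_rng(0)

def simulate_poisson(lam, T, rng):
    """Event times of a Poisson process with constant intensity lam on (0, T]."""
    n = rng.poisson(lam * T)                # N_T ~ Poisson(lam * T)
    return np.sort(rng.uniform(0.0, T, n))  # given N_T, events are i.i.d. uniform

lam, T = 0.5, 100.0
counts = [simulate_poisson(lam, T, rng).size for _ in range(2000)]
print(np.mean(counts), "vs", lam * T)       # empirical mean close to ∫_0^T λ ds = 50
```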
Hawkes processes

The process with intensity
$$\lambda_t^k = \beta_0^k + \sum_{v \in V} \int_{-\infty}^{t-} g_{vk}(t - s) \, N^v(ds) = \beta_0^k + \sum_{v \in V} \sum_{T_i^v < t} g_{vk}(t - T_i^v)$$
is called the (linear) Hawkes process, with kernels $g_{vk}$ for some integrable functions. E.g. $g_{vk}(x) = \beta_1^{vk} e^{-\beta_2^{vk} x}$.

This motivates using graphs for summarizing dependencies.

[Figure: simulated events and intensities of two Hawkes processes, N1 and N2, over the time interval 0 to 20.]
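To make the definition concrete, the following hypothetical Python sketch (my naming, not the talk's code) simulates the univariate exponential-kernel Hawkes process above with Ogata's thinning algorithm; thinning works here because the intensity only decays between events, so its current value is a valid upper bound:

```python
# Hypothetical illustration: exponential-kernel Hawkes process simulated
# with Ogata's thinning algorithm (univariate case, d = 1).
import numpy as np

def simulate_hawkes(beta0, b1, b2, T, rng):
    """Events of lambda_t = beta0 + sum_{T_i < t} b1 * exp(-b2 * (t - T_i))."""
    events, t = [], 0.0
    while True:
        lam_bar = beta0 + sum(b1 * np.exp(-b2 * (t - ti)) for ti in events)
        t += rng.exponential(1.0 / lam_bar)      # propose next candidate time
        if t >= T:
            break
        lam_t = beta0 + sum(b1 * np.exp(-b2 * (t - ti)) for ti in events)
        if rng.uniform() < lam_t / lam_bar:      # thin: accept w.p. lam_t / lam_bar
            events.append(t)
    return np.array(events)

rng = np.random.default_rng(1)
print(simulate_hawkes(0.3, 0.5, 1.0, 200.0, rng).size)  # subcritical: b1/b2 < 1
```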
Causality
Causal inference

Structural Causal Models (SCMs) consist of functional assignments
$$X_i = f_i(X_{\mathrm{pa}_i}, \epsilon_i), \quad i \in V,$$
summarized by parents in a graph.

Essential assumption: the model also describes the system under interventions $X_i := c$.

A graph $\mathcal{G}$ satisfies, in conjunction with a separation criterion $\perp$:
• The global Markov property if $A \perp B \mid C \implies A \perp_P B \mid C$.
• Faithfulness if $A \perp_P B \mid C \implies A \perp B \mid C$.

The global Markov property and faithfulness are the motivation for developing conditional independence tests in causality. See (Peters et al. 2017) for details.

[Figure: a static system, shown as a graph on $X_1, X_2, X_3$, before and after an intervention.]
Causal inference: Dynamical systems

Causal ideas have been generalized to the dynamical setting, e.g. (Didelez 2008; Mogensen, Malinsky, et al. 2018; Mogensen and Hansen 2018).

[Figure: the static graph on $X^1, X^2, X^3$ unrolled over time points $t_1, t_2, t_3$, with edges running forward in time.]
Local independence

Let $N$ be a marked point process. For subsets $A, B, C \subseteq V$, we say that $B$ is locally independent of $A$ given $C$ if for every $b \in B$ the process
$$\lambda_t^{b, A \cup C} := E\big[\lambda_t^b \mid \mathcal{F}_t^{A \cup C}\big]$$
has a version that is $\mathcal{F}_t^C$-measurable, and we write $A \not\to B \mid C$. Heuristically, the intensity of $b$, when observing $A \cup C$, depends only on events of $C$.

Under faithfulness assumptions, there exist algorithms for learning the causal graph (Meek 2014; Mogensen and Hansen 2018), by removing the edge $a \to b$ if $a \not\to b \mid C$ for some $C$.

In practice, this requires an empirical test for local independence!
Local independence test
Local independence test

We want to test:
$$H_0 : j \not\to k \mid C,$$
equivalently, to test whether $\lambda^{k,C}$ is a version of $\lambda^{k, C \cup \{j\}}$. We propose to fit:
$$\lambda_t^{k, C \cup \{j\}} = \beta_0 + \int_0^t g_{jk}(t - s) \, N^j(ds) + \lambda_t^{k,C}.$$
Then $H_0 : g_{jk} = 0$ will have the right level, if we estimate the true $\lambda^{k,C}$.

Problem: if there are latent variables, the marginalized model may not be a Hawkes process. So how do we estimate $\lambda^C$ generally, to retain level?
Volterra approximations

To develop a non-parametric fit for $\lambda^C$, we prove the following theorem, resembling Volterra series for continuous systems.

Theorem. Suppose that $N$ is a stationary point process. There exists a sequence of functions $h_\alpha^N$ such that, letting
$$\lambda_t^N = h_0^N + \sum_{n=1}^{N} \sum_{|\alpha| = n} \int_{-\infty}^{t} \cdots \int_{-\infty}^{t} h_\alpha^N(t - s_1, \ldots, t - s_n) \, N^{\alpha_1}(ds_1) \cdots N^{\alpha_n}(ds_n),$$
we have $\lambda^N \to \lambda^C$ for $N \to \infty$.
Approximating the intensity

$\lambda^C$ approximations:
A1: Approximate by 2nd-order iterated integrals.
A2: Approximate kernels using tensor splines:
$$h_\alpha(x_1, \ldots, x_n) \approx \sum_{j_1=1}^{d} \cdots \sum_{j_n=1}^{d} \beta^\alpha_{j_1, \ldots, j_n} \, b_{j_1}(x_1) \cdots b_{j_n}(x_n).$$

In vector notation:
$$\lambda_t^C(\beta) = \beta_0 + \sum_{v \in C} \int_{-\infty}^{t-} (\beta^v)^T \Phi_1(t - s) \, N^v(ds) + \sum_{\substack{v_1, v_2 \in C \\ v_2 \ge v_1}} \int_{-\infty}^{t-} \int_{-\infty}^{t-} (\beta^{v_1 v_2})^T \Phi_2(t - s_1, t - s_2) \, N^{(v_1, v_2)}(ds_1, ds_2) =: \beta^T x_t^C.$$

Similarly for $g_{jk}$, such that
$$\lambda_t^{k, C \cup \{j\}} = \beta_0 + \int_0^t g_{jk}(t - s) \, N^j(ds) + \lambda_t^{k,C} = (\beta^j)^T x_t^j + (\beta^C)^T x_t^C =: \beta^T x_t.$$
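As a sketch of assumption A2 in practice, the snippet below (assumed names, Python with SciPy; not the authors' code) builds the first-order part of the design vector $x_t$: each coordinate sums one B-spline basis function over the lags $t - s$ to past events, so that $\lambda_t(\beta) = \beta^T x_t$ is linear in $\beta$:

```python
# Sketch (assumed names): first-order design features from B-spline bases.
import numpy as np
from scipy.interpolate import BSpline

def make_basis(knots, degree=3):
    """List of B-spline basis functions b_j on the given knot vector."""
    n_basis = len(knots) - degree - 1
    return [BSpline.basis_element(knots[j:j + degree + 2], extrapolate=False)
            for j in range(n_basis)]

def design_vector(t, events_by_mark, basis, support):
    """First-order x_t = (1, sum_{s in N^v, s < t} b_j(t - s))_{v, j}."""
    feats = [1.0]                             # intercept column for beta_0
    for times in events_by_mark:              # one event-time array per mark v
        lags = t - times[(times < t) & (t - times < support)]
        feats.extend(float(np.nansum(b(lags))) for b in basis)  # nan = outside support
    return np.array(feats)

knots = np.linspace(0.0, 5.0, 10)             # kernel support on [0, 5]
basis = make_basis(knots)
x_t = design_vector(4.2, [np.array([0.5, 2.0, 3.9])], basis, support=5.0)
```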
Maximum likelihood estimation

$$\log L_T(\beta) = \int_0^T \log\big(\beta^T x_t\big) \, N^k(dt) - \int_0^T \beta^T x_t \, dt.$$

The likelihood is concave for linear intensities! We penalize with a roughness penalty:
$$\max_\beta \; \log L_T(\beta) - \kappa_0 \, \beta^T \Omega \beta \quad \text{s.t.} \quad X\beta \ge 0.$$

The distribution of the maximum likelihood estimate is approximately normal:
$$\hat\beta \overset{\text{approx}}{\sim} \mathcal{N}\Big( \big(I + 2\kappa_0 \hat{J}_T^{-1} \Omega\big)^{-1} \beta_0, \; \hat{J}_T^{-1} \hat{K}_T \hat{J}_T^{-1} \Big)$$
with $\hat{J}_T = \int_0^T x_t x_t^T / (\hat\beta^T x_t) \, dt + 2\kappa_0 \Omega$ and $\hat{K}_T = \hat{J}_T - 2\kappa_0 \Omega$.
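A minimal sketch of the penalized fit, assuming design rows $x_t$ have been precomputed at the observed events and on an integration grid (names and the Riemann discretization are mine, not the talk's; SciPy's generic SLSQP solver stands in for a solver tailored to the concave objective):

```python
# Minimal sketch of the penalized MLE:
# maximize  sum_events log(beta^T x_t) - ∫_0^T beta^T x_t dt - kappa_0 beta^T Omega beta
# subject to X beta >= 0, with the integral discretized on a time grid.
import numpy as np
from scipy.optimize import minimize

def fit_penalized(X_events, X_grid, dt, Omega, kappa0):
    """X_events: design rows x_t at event times; X_grid: rows on an integration grid."""
    def neg_pen_loglik(beta):
        lam_ev = np.maximum(X_events @ beta, 1e-10)   # guard log() at the boundary
        integral = dt * np.sum(X_grid @ beta)         # Riemann approx of ∫ beta^T x_t dt
        penalty = kappa0 * beta @ Omega @ beta        # roughness penalty
        return -(np.sum(np.log(lam_ev)) - integral - penalty)

    beta_init = np.full(X_events.shape[1], 0.1)
    cons = {"type": "ineq", "fun": lambda b: X_grid @ b}  # enforce X beta >= 0 on grid
    res = minimize(neg_pen_loglik, beta_init, constraints=[cons], method="SLSQP")
    return res.x
```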
Local independence test (1)

Given the distribution of $\beta = (\beta^j, \beta^C)$, we can test the hypothesis $H_0 : j \not\to k \mid C$. How do we test $\Phi^T \beta \equiv 0$?

• First idea: $\beta$ is approximately normal, so test $\beta^j = 0$ directly.
• Better idea (see Wood 2012): evaluate the basis $\Phi$ in a grid $G = \{x_1, \ldots, x_M\}$. The fitted function values over the grid are then $\Phi(G)^T \beta^j$. If $\beta^j \sim \mathcal{N}(\mu_j, \Sigma_j)$, the Wald test statistic for the null hypothesis $\Phi(G)^T \mu_j = 0$ is:
$$T_\alpha = (\beta^j)^T \Phi(G) \big[ \Phi(G)^T \Sigma_j \Phi(G) \big]^{-1} \Phi(G)^T \beta^j.$$
This is $\chi^2(M)$-distributed, and we can test for significance of components!
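The grid-evaluation test is easy to sketch once $\hat\beta^j$ and $\Sigma_j$ are available from the fit; the helper below (hypothetical names, Python) computes the Wald statistic above and its $\chi^2(M)$ p-value:

```python
# Sketch of the grid-based Wald test (hypothetical names): given the fitted
# block beta_j and its covariance Sigma_j, test whether Phi(G)^T mu_j = 0.
import numpy as np
from scipy.stats import chi2

def wald_grid_test(beta_j, Sigma_j, Phi_G):
    """Phi_G: (n_basis, M) matrix of basis evaluations over the grid x_1, ..., x_M."""
    f = Phi_G.T @ beta_j                    # fitted values of g_jk over the grid
    V = Phi_G.T @ Sigma_j @ Phi_G           # covariance of those fitted values
    T_alpha = f @ np.linalg.solve(V, f)     # Wald statistic
    M = Phi_G.shape[1]
    return T_alpha, chi2.sf(T_alpha, df=M)  # p-value from the chi^2(M) reference
```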
Summary of test

We summarize our proposed test. To test $j \not\to k \mid C$:
• Approximate $\lambda^C$ by a Volterra expansion of degree 2 with spline kernels.
• Fit $\lambda^{k,C}(\beta)$ within the model class by penalized MLE.
• Test $\Phi^T \beta \equiv 0$ using grid evaluation and the Wald approximation.
• If the test is accepted, conclude local independence.
Experimental results
Experiment 1: Testing various structures

In each of the following structures ($L_1$, $L_2$, $L_3$, $P_1$, $P_2$, $P_3$), we test $a \not\to b \mid \{b, C\}$.

[Figure: the tested graph structures on nodes a, b, c, h, and a bar chart of $H_0$ acceptance rates (accepted vs. rejected test outcomes) for each structure.]
Causal discovery

We evaluate the performance of the test in the CA-algorithm, which estimates the causal graph.

[Figure: event data on the time axis from 0 to T, and a sequence of graphs on nodes a, b, c, d pruned by tests such as $a \not\to b \mid \{b, c, d\}$.]