Local Independence Tests for Point Processes




  1. Local Independence Tests for Point Processes: Learning causality in event models. Nikolaj Thams, University of Copenhagen. November 21st, 2019, Time to Event Data and Machine Learning Workshop. Joint work with Niels Richard Hansen.

  2. Outline: Hawkes processes • Causality • Local independence test • Experimental results • Conclusion

  3. Learning causality in event models? [Figure: event streams of four processes a, b, c, h on the time interval from 0 to T.]


  5. Hawkes Processes

  6. Point processes. A point process with marks V = {1, ..., d} is a collection of random measures N^k = ∑_i δ_{T^k_i}, where T^k_i is the i'th event of type k. This defines processes t ↦ N^k_t := N^k(0, t]. If the compensator A^k_t of N^k_t equals ∫_0^t λ^k_s ds for some λ^k, then λ^k is the intensity of N^k. Observe that E N^k_t = ∫_0^t E λ^k_s ds. Famous examples: the Poisson process (λ_t constant) and the Hawkes process (next slide). [Figure: a sample path with event times T_1, T_2, T_3.]
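As a concrete illustration of the identity E N^k_t = ∫_0^t E λ^k_s ds, here is a minimal Python sketch (not part of the talk; all names and parameter values are mine): for a homogeneous Poisson process with constant intensity λ, the average count on [0, T] should match λT.

import numpy as np

rng = np.random.default_rng(0)
lam, T, n_runs = 2.0, 10.0, 2000   # constant intensity, horizon, replications

def simulate_poisson(lam, T, rng):
    # Homogeneous Poisson process: i.i.d. Exponential(lam) inter-arrival times.
    times, t = [], 0.0
    while True:
        t += rng.exponential(1.0 / lam)
        if t > T:
            return np.array(times)
        times.append(t)

counts = [simulate_poisson(lam, T, rng).size for _ in range(n_runs)]
print(np.mean(counts), lam * T)   # empirical E[N_T] vs. ∫_0^T λ ds = λT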

  7. Hawkes processes. The process with intensity

     λ^k_t = β^k_0 + ∑_{v∈V} ∫_{−∞}^{t−} g_{vk}(t − s) N^v(ds) = β^k_0 + ∑_{v∈V} ∑_{s<t} g_{vk}(t − s)

     is called the (linear) Hawkes process, with kernels g_{vk} for some integrable functions, e.g. g_{vk}(x) = β^{vk}_1 e^{−β^{vk}_2 x}. This motivates using graphs for summarizing dependencies. [Figure: simulated events and intensities of a two-dimensional Hawkes process (N1, N2) on the time interval 0 to 20.]
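A minimal sketch (mine, not from the slides) of simulating a one-dimensional linear Hawkes process with the exponential kernel g(x) = b1·e^(−b2·x) by Ogata's thinning; parameter values are illustrative.

import numpy as np

rng = np.random.default_rng(1)
beta0, b1, b2, T = 0.3, 0.5, 1.0, 20.0   # baseline β0 and kernel g(x) = b1·exp(−b2·x)

def intensity(t, events):
    # λ_t = β0 + Σ_{s < t} b1 · exp(−b2 · (t − s))
    past = events[events < t]
    return beta0 + np.sum(b1 * np.exp(-b2 * (t - past)))

def simulate_hawkes(rng):
    # Ogata's thinning: between events the intensity only decays, so its value
    # just after the current time dominates; propose at that rate and thin.
    events, t = np.array([]), 0.0
    while True:
        lam_bar = intensity(t, events) + b1   # + b1 covers a jump exactly at t
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            return events
        if rng.uniform() < intensity(t, events) / lam_bar:
            events = np.append(events, t)

print(simulate_hawkes(rng))

Since b1/b2 = 0.5 < 1, the process is stable and the simulation terminates with a stationary-regime sample path.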


  9. Causality

  10. Causal inference: static system. Structural Causal Models (SCMs) consist of functional assignments X_i = f_i(X_{pa_i}, ε_i), i ∈ V, summarized by parents in a graph. Essential assumption: the model also describes the system under interventions X_i := c. A graph satisfies, in conjunction with a separation criterion ⊥: • The global Markov property if A ⊥ B | C ⟹ A ⊥_P B | C. • Faithfulness if A ⊥_P B | C ⟹ A ⊥ B | C. The global Markov property and faithfulness are the motivation for developing conditional independence tests in causality. See (Peters et al. 2017) for details. [Figure: graphs on X_1, X_2, X_3 before and after an intervention.]
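A toy illustration (mine, not from the slides) of the global Markov property in a linear Gaussian SCM: in the chain X1 → X2 → X3, X2 d-separates X1 from X3, so the partial correlation of X1 and X3 given X2 vanishes even though X1 and X3 are marginally dependent.

import numpy as np

rng = np.random.default_rng(2)
n = 100_000
# SCM for the chain X1 -> X2 -> X3 (coefficients are illustrative):
x1 = rng.normal(size=n)                # X1 := ε1
x2 = 0.8 * x1 + rng.normal(size=n)     # X2 := f2(X1, ε2)
x3 = 0.5 * x2 + rng.normal(size=n)     # X3 := f3(X2, ε3)

# Partial correlation of X1 and X3 given X2: correlate the residuals of
# regressing each variable on X2.
r1 = x1 - np.polyval(np.polyfit(x2, x1, 1), x2)
r3 = x3 - np.polyval(np.polyfit(x2, x3, 1), x2)
print(np.corrcoef(x1, x3)[0, 1])   # clearly nonzero: X1 and X3 are dependent
print(np.corrcoef(r1, r3)[0, 1])   # ≈ 0: X1 ⊥ X3 | X2, as d-separation predicts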



  13. Causal inference: dynamical system. Causal ideas have been generalized to the dynamical setting, e.g. (Didelez 2008; Mogensen, Malinsky, et al. 2018; Mogensen and Hansen 2018). [Figure: the static graph on X_1, X_2, X_3 unrolled over time points t_1, t_2, t_3, ...]

  14. Local independence. Let N be a marked point process. For subsets A, B, C ⊆ V, we say that B is locally independent of A given C if for every b ∈ B, the process λ^{b,A∪C}_t := E[λ^b_t | F^{A∪C}_t] has an F^C_t-adapted version, and we write A ̸→ B | C. Heuristically, the intensity of b, when observing A ∪ C, depends only on events of C. Under faithfulness assumptions, there exist algorithms for learning the causal graph (Meek 2014; Mogensen and Hansen 2018), by removing the edge a → b if a ̸→ b | C for some C. In practice, this requires an empirical test for independence! (See the sketch below.)
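The edge-removal scheme, sketched as Python (a sketch under my own naming: `locally_independent` stands in for the empirical test developed in the rest of the talk, and the oracle below hard-codes a toy ground truth):

from itertools import combinations

def learn_graph(nodes, locally_independent):
    # Start from the complete directed graph and remove a -> b whenever
    # a is locally independent of b given some C (b's own history is always
    # included in the conditioning set, as in the experiments later).
    edges = {(a, b) for a in nodes for b in nodes if a != b}
    for a, b in sorted(edges):
        rest = [c for c in nodes if c not in (a, b)]
        for size in range(len(rest) + 1):
            if any(locally_independent(a, b, {b, *C})
                   for C in combinations(rest, size)):
                edges.discard((a, b))
                break
    return edges

# Toy oracle for the chain a -> b -> c: a is locally independent of c
# exactly when b is in the conditioning set.
def oracle(a, b, C):
    direct = {("a", "b"), ("b", "c")}
    return (a, b) not in direct and ((a, b) != ("a", "c") or "b" in C)

print(sorted(learn_graph(["a", "b", "c"], oracle)))   # [('a', 'b'), ('b', 'c')]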

  15. Local independence test

  16. Local independence test. We want to test H_0: j ̸→ k | C, equivalently that λ^{k,C} is a version of λ^{k,C∪{j}}. We propose to fit

     λ^{k,C∪{j}}_t = β_0 + ∫_0^t g_{jk}(t − s) N^j(ds) + λ^{k,C}_t

     Then H_0: g_{jk} = 0 will have the right level, if we estimate the true λ^{k,C}. Problem: if there are latent variables, the marginalized model may not be a Hawkes process. So how do we estimate λ^C generally, to retain level? [Figure: graph with observed nodes j, c, k and a latent node h.]


  18. Volterra approximations. To develop a non-parametric fit for λ^C, we prove the following theorem, resembling Volterra series for continuous systems. Theorem: Suppose that N is a stationary point process. There exists a sequence of functions h^N_α such that, letting

     λ^N_t = h^N_0 + ∑_{n=1}^{N} ∑_{|α|=n} ∫_{−∞}^t ⋯ ∫_{−∞}^t h^N_α(t − s_1, ⋯, t − s_n) N^{α_1}(ds_1) ⋯ N^{α_n}(ds_n),

     we have λ^N → λ^C for N → ∞.
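For concreteness, the truncation at N = 2, which is the order used on the next slide, reads (my transcription in LaTeX):

\lambda^{2}_t = h_0 + \sum_{v \in V} \int_{-\infty}^{t} h_v(t - s)\, N^v(\mathrm{d}s)
  + \sum_{v_1, v_2 \in V} \int_{-\infty}^{t}\!\int_{-\infty}^{t} h_{v_1 v_2}(t - s_1, t - s_2)\, N^{v_1}(\mathrm{d}s_1)\, N^{v_2}(\mathrm{d}s_2)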

  19. Approximating the intensity: λ^C approximations. A1: Approximate by 2nd-order iterated integrals. A2: Approximate kernels using tensor splines, h_α(x_1, ..., x_n) ≈ ∑_{j_1,...,j_n} β^α_{j_1⋯j_n} b_{j_1}(x_1) ⋯ b_{j_n}(x_n). In vector notation:

     λ^C_t(β) = β_0 + ∑_{v∈C} ∫_{−∞}^{t−} (β^v)^T Φ_1(t − s) N^v(ds) + ∑_{v_1,v_2∈C, v_2≥v_1} ∫_{−∞}^{t−} (β^{v_1 v_2})^T Φ_2(t − s_1, t − s_2) N^{(v_1,v_2)}(ds_1, ds_2) =: β^T x^C_t

     Similarly for g_{jk}, such that λ^{k,C∪{j}}_t = β_0 + ∫_0^t g_{jk}(t − s) N^j(ds) + λ^{k,C}_t =: β^T x_t.
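A numpy/scipy sketch (mine; knots, basis, and event times are made up) of the first-order block of the design vector x_t from A2: each coordinate process in C contributes ∑_{s<t} Φ_1(t − s), where Φ_1 stacks B-spline basis functions. The second-order block would do the same with tensor products Φ_2(t − s_1, t − s_2).

import numpy as np
from scipy.interpolate import BSpline

degree = 3
knots = np.concatenate(([0.0] * degree, np.linspace(0.0, 5.0, 8), [5.0] * degree))
d = len(knots) - degree - 1          # number of B-spline basis functions b_1..b_d

def phi1(lags):
    # Φ1(x): all d basis functions evaluated at the lags x (zero off-support).
    lags = np.atleast_1d(lags)
    vals = [np.nan_to_num(BSpline.basis_element(knots[j:j + degree + 2],
                                                extrapolate=False)(lags))
            for j in range(d)]
    return np.array(vals)            # shape (d, len(lags))

def first_order_features(t, events_by_coord):
    # One d-block per coordinate process v in C: Σ_{s < t} Φ1(t − s).
    blocks = []
    for events in events_by_coord:
        lags = t - events[events < t]
        blocks.append(phi1(lags).sum(axis=1) if lags.size else np.zeros(d))
    return np.concatenate(blocks)    # the first-order part of x_t

events = [np.array([0.5, 2.1, 3.3]), np.array([1.2, 4.0])]  # two processes in C
print(first_order_features(4.5, events))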

  20. Maximum likelihood estimation. The log-likelihood is

     log L_T(β) = ∫_0^T log(β^T x_t) N^k(dt) − ∫_0^T β^T x_t dt

     The likelihood is concave for linear intensities! We penalize with a roughness penalty:

     max_β log L_T(β) − κ_0 β^T Ω β   s.t. Xβ ≥ 0

     The distribution of the maximum likelihood estimate is approximately normal:

     β̂ ∼ N((I + 2κ_0 Ĵ_T^{−1} Ω) β_0, Ĵ_T^{−1} K̂_T Ĵ_T^{−1})

     with K̂_T = ∫_0^T (x_t x_t^T)/(β̂^T x_t) dt and Ĵ_T = K̂_T − 2κ_0 Ω.
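A sketch of the penalized fit with scipy (illustrative only: the design matrices are random stand-ins, ∫ λ dt is approximated by a Riemann sum on a time grid, and Ω is a placeholder penalty; this is my stand-in for the talk's solver, not the authors' code).

import numpy as np
from scipy.optimize import minimize, LinearConstraint

rng = np.random.default_rng(3)
p, n_grid, n_ev, dt, kappa0 = 4, 200, 30, 0.05, 1.0
X_grid = rng.uniform(0.0, 1.0, (n_grid, p))   # x_t on a time grid (stand-in)
X_ev = rng.uniform(0.0, 1.0, (n_ev, p))       # x_t at the observed event times
Omega = np.eye(p)                              # roughness penalty matrix (placeholder)

def neg_penalized_loglik(beta):
    # −[ ∫ log(βᵀx_t) N^k(dt) − ∫ βᵀx_t dt − κ0 βᵀΩβ ]
    loglik = np.sum(np.log(X_ev @ beta)) - np.sum(X_grid @ beta) * dt
    return -(loglik - kappa0 * beta @ Omega @ beta)

# Keep the fitted intensity nonnegative: Xβ ≥ 0 on grid and event times.
cons = LinearConstraint(np.vstack([X_grid, X_ev]), 1e-8, np.inf)
fit = minimize(neg_penalized_loglik, np.full(p, 0.5),
               method="trust-constr", constraints=[cons])
print(fit.x)

Concavity of the log-likelihood in β (for linear intensities) plus the quadratic penalty makes this a convex problem, so a local optimum of the solver is global.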

  21. Local Independence Test (1). Given the distribution of β̂ = (β̂_j, β̂_C), we can test the hypothesis H_0: j ̸→ k | C. How do we test Φ^T β ≡ 0? • First idea: β̂ is approximately normal, so test β_j = 0 directly. • Better idea (see Wood 2012): evaluate the basis Φ in a grid G = {x_1, ..., x_M}. The fitted function values over the grid are then Φ(G)^T β̂. If β̂ ∼ N(μ_j, Σ_j), the Wald test statistic for the null hypothesis Φ(G)^T μ_j = 0 is

     T_α = β̂^T Φ(G) [Φ(G)^T Σ_j Φ(G)]^{−1} Φ(G)^T β̂

     This is χ²(M)-distributed, and we can test for significance of components!
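The grid-based Wald test in a few lines (a sketch: β̂, Σ_j, and the basis below are placeholders rather than the fitted quantities from the previous slides).

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(4)
d, M = 5, 20
beta_hat = rng.normal(scale=0.1, size=d)   # placeholder fitted g_jk coefficients
Sigma = 0.01 * np.eye(d)                   # placeholder approximate covariance
G = np.linspace(0.0, 5.0, M)               # evaluation grid for the fitted function

Phi_G = np.array([G**j for j in range(d)])      # stand-in basis, shape (d, M)
f_hat = Phi_G.T @ beta_hat                       # fitted values Φ(G)ᵀβ̂ on the grid
V = Phi_G.T @ Sigma @ Phi_G                      # their covariance Φ(G)ᵀΣΦ(G)
T_stat = f_hat @ np.linalg.pinv(V) @ f_hat       # Wald statistic
print(T_stat, chi2.sf(T_stat, df=M))             # compare to χ²(M) per the slide

With M > d the matrix Φ(G)ᵀΣΦ(G) has rank at most d, which is why a pseudo-inverse is used above (Wood 2012 works with a rank-adjusted version of this statistic).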

  22. Summary of test. We summarize our proposed test. To test j ̸→ k | C: • Approximate λ^C by a Volterra expansion of degree 2 with spline kernels. • Fit λ^{k,C}(β) within the model class by penalized MLE. • Test Φ^T β ≡ 0 using grid evaluation and the Wald approximation. • If the test is accepted, conclude local independence.

  23. Experimental results

  24. Experiment 1: testing various structures. In each of the following structures, we test a ̸→ b | b, C: L_1, L_2, L_3 (containing a latent process h) and P_1, P_2, P_3. We obtain acceptance rates: [Figure: bar chart of H_0 acceptance rates, split into accepted and rejected test outcomes, for each structure.]


  26. Causal discovery. We evaluate the performance of the test in the CA algorithm, which estimates the causal graph by repeatedly testing and removing edges, e.g. a ̸→ b | {b, c, d}. [Figure: event data on the interval 0 to T and the sequence of graphs on a, b, c, d as edges are removed.]
