Gradient-Based Neural DAG Learning


  1. Gradient-Based Neural DAG Learning for Causal Discovery. Sébastien Lachapelle¹, Philippe Brouillard¹, Tristan Deleu¹, Simon Lacoste-Julien¹,² (¹Mila, Université de Montréal; ²Canada CIFAR AI Chair). MAIS 2019, September 6th, 2019. Outline: Background, GraN-DAG, Experiments, Conclusion.

  2. Causal graphical model (CGM). Random vector X ∈ R^d (d variables); running example with d = 3. Let G be a directed acyclic graph (DAG) and assume p(x) = ∏_{i=1}^{d} p(x_i | x_{π_i^G}), where π_i^G denotes the parents of node i in G. This factorization encodes statistical independences. A CGM is almost identical to a Bayesian network, e.g. p(x) = p(x_1 | x_2) p(x_2) p(x_3 | x_1, x_2), except that the arrows are given a causal meaning.
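To make the factorization concrete, here is a minimal sketch of ancestral sampling from the 3-variable example above. The linear-Gaussian mechanisms and their coefficients are illustrative assumptions; the slide only fixes the graph (X2 → X1, X1 → X3, X2 → X3).

```python
# Minimal sketch: ancestral sampling from the 3-variable CGM
# p(x) = p(x2) p(x1 | x2) p(x3 | x1, x2).
# The linear-Gaussian mechanisms and coefficients are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1000

x2 = rng.normal(0.0, 1.0, size=n)                          # root node: p(x2)
x1 = 0.8 * x2 + rng.normal(0.0, 0.5, size=n)               # p(x1 | x2)
x3 = 1.5 * x1 - 0.7 * x2 + rng.normal(0.0, 0.5, size=n)    # p(x3 | x1, x2)

X = np.column_stack([x1, x2, x3])  # n samples of (X1, X2, X3)
print(X[:3])
```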

  3-4. Structure learning. We observe n samples of X = (X_1, X_2, X_3):

              X_1     X_2     X_3
  sample 1    1.76    10.46   0.002
  sample 2    3.42    78.6    0.011
  ...         ...     ...     ...
  sample n    4.56    9.35    1.96

  and want to recover the graph. Score-based algorithms estimate Ĝ = argmax_{G ∈ DAG} Score(G), where Score(G) is often a regularized maximum likelihood of the data under G.
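To make the score-based formulation concrete, here is a minimal sketch that enumerates the 25 DAGs on 3 nodes and scores each with a penalized linear-Gaussian log-likelihood. The linear-Gaussian score, the edge penalty lam, and the synthetic data are illustrative assumptions, not the exact score or data used in the talk.

```python
# Minimal sketch of score-based structure learning by exhaustive search
# over all DAGs on d = 3 nodes (there are 25 of them).
import itertools
import numpy as np
from scipy.linalg import expm

def is_acyclic(A):
    # Same criterion as later in the talk: Tr(e^A) - d = 0 for a binary adjacency A.
    return abs(np.trace(expm(A)) - A.shape[0]) < 1e-8

def node_loglik(x, parents_data, n):
    # Gaussian log-likelihood of x given a least-squares fit on its parents.
    if parents_data.shape[1] == 0:
        resid = x - x.mean()
    else:
        coef, *_ = np.linalg.lstsq(parents_data, x, rcond=None)
        resid = x - parents_data @ coef
    var = resid.var() + 1e-12
    return -0.5 * n * (np.log(2 * np.pi * var) + 1)

def score(A, X, lam=2.0):
    n, d = X.shape
    ll = sum(node_loglik(X[:, i], X[:, A[:, i] == 1], n) for i in range(d))
    return ll - lam * A.sum()  # regularized maximum likelihood

d = 3
edges = [(i, j) for i in range(d) for j in range(d) if i != j]
X = np.random.default_rng(0).normal(size=(500, d))
X[:, 0] = 0.8 * X[:, 1] + 0.3 * X[:, 0]             # X2 -> X1
X[:, 2] = 1.5 * X[:, 0] - 0.7 * X[:, 1] + X[:, 2]   # X1 -> X3, X2 -> X3

best = None
for mask in itertools.product([0, 1], repeat=len(edges)):
    A = np.zeros((d, d))
    for keep, (i, j) in zip(mask, edges):
        A[i, j] = keep
    if is_acyclic(A) and (best is None or score(A, X) > score(best, X)):
        best = A
print(best)  # adjacency matrix of the highest-scoring DAG
```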

  5. Structure learning: a non-exhaustive taxonomy of score-based algorithms.

                Discrete optim.                Continuous optim.
  Linear        GES [Chickering, 2003]         NOTEARS [Zheng et al., 2018]
  Nonlinear     CAM [Bühlmann et al., 2014]    GraN-DAG [our contribution]

  6-8. NOTEARS: continuous optimization for structure learning. Encode the graph as a weighted adjacency matrix U = [u_1 | ... | u_d] ∈ R^{d×d}, e.g.

      A = [[ 0, 0, 1 ],        U = [[  0,   0,  4.8 ],
           [ 1, 0, 1 ],             [ -1.7, 0,  0.2 ],
           [ 0, 0, 0 ]]             [  0,   0,  0   ]]

  where A is the (binary) adjacency matrix and U the weighted adjacency matrix. U represents the coefficients of a linear model: X_i := u_i^T X + noise_i for all i. For an arbitrary U, the associated graph might be cyclic, so an acyclicity constraint is needed. NOTEARS [Zheng et al., 2018] uses the differentiable acyclicity constraint Tr(e^{U ⊙ U}) − d = 0, where e^M = Σ_{k=0}^{∞} M^k / k! is the matrix exponential and ⊙ is the elementwise product.
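A minimal sketch evaluating the NOTEARS constraint on the example above; the weights are taken from the reconstructed slide, and the extra edge used to create a cycle is an illustrative choice.

```python
# Minimal sketch: evaluate the NOTEARS acyclicity term h(U) = Tr(e^{U*U}) - d.
# h(U) = 0 for an acyclic weighted adjacency matrix, and > 0 otherwise.
import numpy as np
from scipy.linalg import expm

def h(U):
    d = U.shape[0]
    return np.trace(expm(U * U)) - d  # U * U is the elementwise (Hadamard) square

U_acyclic = np.array([[ 0.0, 0.0, 4.8],
                      [-1.7, 0.0, 0.2],
                      [ 0.0, 0.0, 0.0]])

U_cyclic = U_acyclic.copy()
U_cyclic[2, 1] = 0.5   # adding edge 3 -> 2 creates the cycle 2 -> 3 -> 2

print(h(U_acyclic))  # ~0 (acyclic)
print(h(U_cyclic))   # > 0 (cyclic)
```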

  9-10. NOTEARS: continuous optimization for structure learning. NOTEARS [Zheng et al., 2018] solves the continuous constrained optimization problem

      max_U  −‖X − XU‖²_F − λ‖U‖₁     s.t.   Tr(e^{U ⊙ U}) − d = 0,

  where the objective is the score and X ∈ R^{n×d} is the design matrix containing all n samples. The problem is solved approximately with an augmented Lagrangian method, which amounts to maximizing (with gradient ascent)

      −‖X − XU‖²_F − λ‖U‖₁ − α_t (Tr(e^{U ⊙ U}) − d) − (μ_t / 2) (Tr(e^{U ⊙ U}) − d)²

  while gradually increasing α_t and μ_t.
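A minimal sketch of the augmented Lagrangian loop for the linear NOTEARS objective, using the known gradients of the Frobenius term and of h(U) = Tr(e^{U ⊙ U}) − d, namely ∇h(U) = (e^{U ⊙ U})ᵀ ⊙ 2U. The 1/n scaling of the data-fit term, the step size, the penalty schedule, and the sign-based ℓ₁ subgradient are illustrative choices, not the exact settings of the paper.

```python
# Minimal sketch of the NOTEARS augmented Lagrangian loop (linear case).
# Minimizes (1/n)||X - XU||_F^2 + lam*||U||_1 + alpha*h(U) + (mu/2)*h(U)^2
# by plain gradient descent while gradually increasing alpha and mu.
# Known gradients used below:
#   d/dU ||X - XU||_F^2 = -2 X^T (X - XU)
#   grad h(U) = (e^{U*U})^T * 2U
# Step size, penalty schedule and L1 handling are illustrative choices.
import numpy as np
from scipy.linalg import expm

def h(U):
    return np.trace(expm(U * U)) - U.shape[0]

def grad_h(U):
    return expm(U * U).T * (2.0 * U)

def notears_sketch(X, lam=0.05, lr=1e-2, n_outer=8, n_inner=300):
    n, d = X.shape
    U = np.zeros((d, d))
    alpha, mu, h_prev = 0.0, 1.0, np.inf
    for _ in range(n_outer):
        for _ in range(n_inner):
            grad = (-(2.0 / n) * X.T @ (X - X @ U)       # data-fit term
                    + lam * np.sign(U)                   # L1 subgradient
                    + (alpha + mu * h(U)) * grad_h(U))   # constraint terms
            U -= lr * grad
        h_val = h(U)
        alpha += mu * h_val              # dual variable update
        if h_val > 0.25 * h_prev:        # strengthen penalty if h did not shrink enough
            mu *= 10.0
        h_prev = h_val
    return U

# Usage on toy data (X2 -> X1, X1 -> X3, X2 -> X3):
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[:, 0] = 0.8 * X[:, 1] + 0.3 * X[:, 0]
X[:, 2] = 1.5 * X[:, 0] - 0.7 * X[:, 1] + X[:, 2]
print(np.round(notears_sketch(X), 2))
```

The actual method handles the ℓ₁ term and a final thresholding of small entries of U more carefully; this sketch only illustrates the structure of the inner gradient steps and the outer penalty updates.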

  11-17. NOTEARS: the acyclicity constraint. The constraint is Tr(e^{U ⊙ U}) − d = 0, where e^M = Σ_{k=0}^{∞} M^k / k! is the matrix exponential. To see why it characterizes acyclicity, suppose A ∈ {0, 1}^{d×d} is the adjacency matrix of a directed graph. Then (A^k)_{ii} is the number of cycles of length k passing through node i, and:

      Graph acyclic  ⇔  (A^k)_{ii} = 0 for all i and all k ≥ 1
                     ⇔  Tr(Σ_{k=1}^{∞} A^k / k!) = 0        (every entry of A^k is nonnegative, so the sum vanishes only if each term does)
                     ⇔  Tr(Σ_{k=0}^{∞} A^k / k!) − d = 0    (since Tr(A^0) = Tr(I) = d)
                     ⇔  Tr(e^A) − d = 0.

  The argument is almost identical when using the weighted adjacency matrix U instead of A: the entries of U ⊙ U are nonnegative and are nonzero exactly where U is.
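A minimal sketch verifying the two facts used in the derivation on a small graph: diagonal entries of A^k count length-k cycles (closed walks) through each node, and Tr(e^A) − d is zero exactly for acyclic graphs. The graph size and example edges are arbitrary.

```python
# Minimal sketch checking the two facts behind the constraint:
#  (1) (A^k)_{ii} counts length-k cycles (closed walks) through node i,
#  (2) Tr(e^A) - d = 0  iff  the graph is acyclic.
import numpy as np
from scipy.linalg import expm

d = 4
A_acyclic = np.zeros((d, d))
A_acyclic[0, 1] = A_acyclic[1, 2] = A_acyclic[2, 3] = 1  # chain 1 -> 2 -> 3 -> 4

A_cyclic = A_acyclic.copy()
A_cyclic[3, 0] = 1  # closing edge 4 -> 1 creates one length-4 cycle

print(np.diag(np.linalg.matrix_power(A_cyclic, 4)))  # [1 1 1 1]: one 4-cycle through each node
print(np.trace(expm(A_acyclic)) - d)                 # ~0: acyclic
print(np.trace(expm(A_cyclic)) - d)                  # > 0: cyclic
```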

  18. Gradient-Based Neural DAG Learning: can we go nonlinear?

  19-20. Gradient-Based Neural DAG Learning. Learn one neural network per variable: φ^(i) ≜ {W^(1)_(i), ..., W^(L+1)_(i)}, where W^(ℓ)_(i) is the ℓ-th weight matrix of network φ^(i), and φ ≜ {φ^(i)}_{i=1}^{d}. The product ∏_{i=1}^{d} p(x_i | x_{−i}; θ^(i)) does not decompose according to a DAG! We need to constrain the networks to be acyclic. How?

  21. Gradient-Based Neural DAG Learning. Key idea: construct from the network weights a weighted adjacency matrix A_φ (analogous to U in the linear case) that can be used in the acyclicity constraint. Then maximize the likelihood under the acyclicity constraint via the same augmented Lagrangian scheme as in NOTEARS:

      max_φ  Σ_{i=1}^{d} log p_φ(x_i | x_{−i}) − α_t (Tr(e^{A_φ}) − d) − (μ_t / 2) (Tr(e^{A_φ}) − d)².
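The sketch below illustrates one way to build A_φ from the weights of d per-variable MLPs, in the spirit of the slide: entry (j, i) aggregates, over all paths of network i, the products of absolute weight values from input j to the outputs, so it is zero exactly when the output of network i cannot depend on input j. The hidden sizes, random weights, and other details are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch: build a weighted adjacency matrix A_phi from the weights of
# d per-variable MLPs and evaluate the acyclicity term Tr(e^{A_phi}) - d.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d, hidden, n_out = 3, 8, 2   # n_out = number of distribution parameters per variable

# One small MLP per variable i, with weight matrices W^(1)_(i), W^(2)_(i).
# The variable's own input is masked (weights set to 0) so no variable predicts itself.
networks = []
for i in range(d):
    W1 = rng.normal(size=(hidden, d))
    W1[:, i] = 0.0                      # mask input i in network i
    W2 = rng.normal(size=(n_out, hidden))
    networks.append([W1, W2])

def adjacency(networks):
    d = len(networks)
    A = np.zeros((d, d))
    for i, weights in enumerate(networks):
        path_prod = np.abs(weights[0])
        for W in weights[1:]:
            path_prod = np.abs(W) @ path_prod   # accumulate |W^(l)| ... |W^(1)|
        A[:, i] = path_prod.sum(axis=0)         # (A)_{j,i}: total path weight from input j
    return A

A_phi = adjacency(networks)
print(np.round(A_phi, 2))
print(np.trace(expm(A_phi)) - d)   # > 0 here: these randomly initialized networks are not acyclic
```

During training, this A_φ plays the role of U in the augmented Lagrangian above: the constraint terms push the path weights toward a configuration whose implied graph is a DAG.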
