
Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks (PowerPoint PPT Presentation)


  1. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks. Patrick Schwab¹ (@schwabpa), Djordje Miladinovic², and Walter Karlen¹. ¹Institute of Robotics and Intelligent Systems, ETH Zurich; ²Department of Computer Science, ETH Zurich

  2. Motivation

  3. Motivation

  4. Motivation: inputs are Age, Weight, and Blood Pressure

  5. Motivation: inputs (Age, Weight, Blood Pressure) → model → output (Heart Failure Risk)

  6. Motivation: inputs (Age, Weight, Blood Pressure) → model → output (Heart Failure Risk)

  7. Motivation: What was the decision based on?

  8. Motivation: the model is a black box

  9. Motivation: We desire an explanation.

  10. The Idea: Can we train a neural network to output both (1) accurate predictions and (2) feature importance scores?

  11. Use Cases
  • Model understanding: Why was this decision made? Does this decision make sense? Are my model's decisions justifiable? What patterns has my model discovered?
  • Human-ML cooperation
  Schwab et al. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks

  12. Approach

  13. Attentive Mixture of Experts (AME): architecture diagram. Experts E1, E2, E3 produce hidden states h1, h2, h3 and contributions c1, c2, c3; the concatenation h_all = (h1, c1, h2, c2, h3, c3) feeds the attentive gating networks G1, G2, G3, whose attention factors a1, a2, a3 modulate the contributions before they are summed (Granger-causally grounded) into the output y.

  14. Attentive Mixture of Experts (AME): one independent expert per feature / feature group.

  15. Attentive Mixture of Experts (AME): attentive gates control expert contributions.

  16. Attentive Mixture of Experts (AME): experts can only contribute to y after modulation by a_i.
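One possible reading of the AME forward pass described above, sketched in NumPy. All weights, sizes, and the exact gating parameterization here are illustrative assumptions, not the paper's implementation (see github.com/d909b/AME for the real one):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()  # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes: 3 scalar input features, hidden size 4 per expert.
n_experts, d_h = 3, 4

# One independent expert per feature: x_i -> hidden state h_i -> contribution c_i.
W_h = rng.normal(size=(n_experts, d_h))
W_c = rng.normal(size=(n_experts, d_h))
# Gating networks score every expert from the concatenated (h_i, c_i) pairs h_all.
W_g = rng.normal(size=(n_experts, n_experts * (d_h + 1)))

def ame_forward(x):
    h = [np.tanh(x[i] * W_h[i]) for i in range(n_experts)]  # hidden states h_i
    c = [float(h[i] @ W_c[i]) for i in range(n_experts)]    # contributions c_i
    h_all = np.concatenate([np.append(h[i], c[i]) for i in range(n_experts)])
    a = softmax(W_g @ h_all)      # attention factors a_1, ..., a_n
    y = float(a @ np.array(c))    # experts reach y only after modulation by a_i
    return y, a

y, a = ame_forward(np.array([0.5, -1.2, 0.3]))
```

Because the attention factors sum to one and each expert sees only its own feature, a_i can be read directly as that feature's share of the prediction.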

  17. However, on its own this structure has the same issues as naive soft attention mechanisms:
  - No incentive to learn to output accurate feature importance estimates [1].
  - Often collapses to using only very few experts, or a single one, early during training [2, 3].
  [1] Sundararajan, Taly, and Yan 2017; [2] Bengio et al. 2015; [3] Shazeer et al. 2017

  18. Granger-causal Objective
  • Granger (1969) postulated Granger-causality: a relationship X → Y is declared if we are better able to predict Y using all available information than if all information apart from X had been used.*
  * Other assumptions apply that are not relevant in the presented setting.
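Granger's error-comparison idea can be illustrated on a toy linear time series. The series, its coefficients, and the least-squares predictors below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy series in which x Granger-causes y: y_t depends on x_{t-1}.
T = 500
x = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

def lstsq_error(features, target):
    """Mean squared residual of an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    return float(np.mean((target - features @ coef) ** 2))

# Predict y_t from y_{t-1} alone vs. from (y_{t-1}, x_{t-1}).
err_without_x = lstsq_error(np.column_stack([y[:-1]]), y[1:])
err_with_x = lstsq_error(np.column_stack([y[:-1], x[:-1]]), y[1:])

# X -> Y is declared because including x's past lowers the prediction error.
assert err_with_x < err_without_x
```

The same comparison, with the auxiliary predictors standing in for the least-squares fits, is what the Granger-causal objective applies per expert.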

  19. Granger-causal Objective: diagram. An auxiliary predictor f_aux is fed all experts E1-E4; a second predictor f_aux,1 is fed all experts apart from E1.

  20. Granger-causal Objective: ε_X is the error when considering all information.

  21. Granger-causal Objective: ε_X\{1} is the error when considering all information apart from E1.

  22. Granger-causal Objective: we define feature importance as the reduction in prediction error associated with adding that feature, i.e. ε_X\{1} − ε_X for expert E1.

  23. Granger-causal Objective: the error reductions are mapped to attention targets a on a scale from 0 to 1.

  24. Granger-causal Objective: the same comparison is made for E2 (ε_X\{2} − ε_X, via f_aux,2).

  25. Granger-causal Objective: ... and for E3 (ε_X\{3} − ε_X, via f_aux,3).

  26. Granger-causal Objective: ... and for E4 (ε_X\{4} − ε_X, via f_aux,4).

  27. Granger-causal Objective: we now have a differentiable link between labels (prediction error) and feature importance.
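A minimal sketch of turning these per-expert error differences into normalized attention targets. The clipping at zero and the uniform fallback are illustrative assumptions; the paper's exact normalization may differ:

```python
import numpy as np

def granger_importance(err_all, err_without):
    """Granger-causal importance targets from auxiliary prediction errors.

    err_all:     scalar error eps_X of the predictor using all experts.
    err_without: array of errors eps_X\{i}, each obtained without expert i.
    """
    # Error reduction from adding each feature, clipped at zero.
    delta = np.maximum(np.asarray(err_without) - err_all, 0.0)
    total = delta.sum()
    if total == 0.0:  # no feature helps: fall back to uniform importance
        return np.full_like(delta, 1.0 / delta.size)
    return delta / total  # targets in [0, 1] that sum to 1

# Example: eps_X = 0.10; leaving out each expert raises error by a different amount.
omega = granger_importance(0.10, np.array([0.30, 0.12, 0.10, 0.18]))
```

These targets can then supervise the attention factors a_i, which is what makes the link between labels and importance differentiable end to end.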

  28. Evaluation

  29. Important Features in Handwritten Digits

  30. Important Features in Handwritten Digits Estimation accuracy comparable to SHAP.

  31. Important Features in Handwritten Digits: orders of magnitude faster at importance estimation.

  32. Important Features in Handwritten Digits

  33. Important Features in Handwritten Digits Lower MGE correlates with better feature importance estimates.

  34. Drivers of Medical Prescription Demand: bar chart of SMAPE [%] for the compared models (values 34.98, 33.85, 33.08, 32.87, 32.79). Slightly lower prediction accuracy when using the AME architecture.

  35. Drivers of Medical Prescription Demand: same chart. Slightly lower prediction accuracy when using the Granger-causal objective.

  36. Discriminatory Genes across Cancer Types: heatmaps of importance scores a_i per gene for AME, LIME, and SHAP across cancer types (All, BRCA, KIRC, COAD, LUAD, PRAD), with the top-ranked gene identifiers listed for each method.

  37. Discriminatory Genes across Cancer Types: same figure. AME discriminates well between (1) cancer types, and (2) important and unimportant genes.

  38. Discriminatory Genes across Cancer Types: bar chart of Recall@10 (values 10, 10, 10, 8, 8, 7, 2, 1) for AME and baseline attribution methods. Associations discovered by AMEs are consistent with those reported by domain experts.

  39. Discriminatory Genes across Cancer Types: same chart. The Granger-causal objective is crucial for estimation accuracy.

  40. Limitations
  • No information about the direction of importance, i.e. negative evidence
  • Large numbers of experts (>200) can become slow at training time; workaround: feature grouping
  • Requires a specific model architecture

  41. Conclusion

  42. Conclusion: We present a feature importance estimation approach that
  • learns to estimate importance from labelled data ✔
  • produces accurate predictions and importance scores in a single model ✔
  • is orders of magnitude faster at estimating importance than perturbation-based approaches ✔
  • is consistent with associations reported by domain experts ✔

  43. Questions? Patrick Schwab (@schwabpa, patrick.schwab@hest.ethz.ch), Institute of Robotics and Intelligent Systems, ETH Zurich. Schwab, Patrick, Miladinovic, Djordje, and Karlen, Walter. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks. AAAI 2019. Source code: github.com/d909b/AME
