Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks
Patrick Schwab¹ (@schwabpa), Djordje Miladinovic², and Walter Karlen¹
¹Institute of Robotics and Intelligent Systems, ETH Zurich; ²Department of Computer Science, ETH Zurich
Motivation
[Diagram: inputs (Age, Weight, Blood Pressure) feed a black-box model that outputs a Heart Failure Risk.]
What was the decision based on? We desire an explanation.
The Idea
Can we train a neural network to output both (1) accurate predictions and (2) feature importance scores?
Use Cases
• Model understanding
• Human-ML cooperation: why was this decision made?
• Does this decision make sense?
• Are my model's decisions justifiable?
• What patterns has my model discovered?
Approach
Attentive Mixture of Experts (AME)
[Diagram: experts E1, E2, E3 produce hidden states h_1, h_2, h_3 and contributions c_1, c_2, c_3; an attentive gating network reads h_all = (h_1, c_1, h_2, c_2, h_3, c_3) and outputs attention factors a_1, a_2, a_3; the gated contributions are summed to produce y.]
• One independent expert per feature / feature group.
• Attentive gates control expert contributions.
• Experts can only contribute to y after modulation by a_i.
• The attention factors are Granger-causally grounded.
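The gated sum above can be sketched in a few lines. This is a minimal NumPy toy, not the paper's exact architecture: the linear experts, the shared gating projection u, and all shapes are illustrative assumptions; the key point is that each expert's scalar contribution c_i reaches y only after modulation by its softmax-normalised attention weight a_i.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy AME forward pass: one expert per scalar input feature.
n_experts, hidden = 3, 4
x = rng.normal(size=n_experts)                 # one feature per expert
W_h = rng.normal(size=(n_experts, hidden))     # illustrative expert projections
w_c = rng.normal(size=(n_experts, hidden))     # per-expert contribution heads
u = rng.normal(size=hidden)                    # shared gating projection

h = np.stack([x[i] * W_h[i] for i in range(n_experts)])  # hidden states h_i
c = np.array([h[i] @ w_c[i] for i in range(n_experts)])  # contributions c_i

# Attentive gating: softmax over scores, so the a_i sum to 1 and are
# directly readable as importance scores.
a = softmax(h @ u)

y = float(a @ c)  # experts contribute to y only after modulation by a_i
print(a, y)
```

Because the a_i are normalised to sum to one, they double as a per-example importance distribution over the features, which is what the Granger-causal objective later supervises.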
However, on its own this structure has the same issues as naive soft attention mechanisms:
- No incentive to learn to output accurate feature importance estimates [1].
- Often collapses to using only very few experts, or a single expert, early in training [2, 3].
[1] Sundararajan, Taly, and Yan 2017; [2] Bengio et al. 2015; [3] Shazeer et al. 2017
Granger-causal Objective
• Granger (1969) postulated Granger-causality:
• X is declared to cause Y (X ➞ Y) if we are better able to predict Y using all available information than if all information apart from X had been used.*
* Other assumptions apply that are not relevant in the presented setting.
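Granger's criterion can be illustrated with a toy lagged time series; this sketch uses my own made-up coefficients and ordinary least squares as a stand-in predictor, not anything from the paper. X drives Y with a one-step lag, so including X's history must reduce the prediction error for Y.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: X Granger-causes Y via a one-step lag (coefficients are arbitrary).
T = 500
x = rng.normal(size=T)
y = np.empty(T)
y[0] = rng.normal()
for t in range(1, T):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

def lstsq_mse(features, target):
    """MSE of an ordinary-least-squares fit (illustrative stand-in predictor)."""
    coef, *_ = np.linalg.lstsq(features, target, rcond=None)
    resid = target - features @ coef
    return float(np.mean(resid ** 2))

past_y = y[:-1].reshape(-1, 1)
past_x = x[:-1].reshape(-1, 1)
target = y[1:]

err_without_x = lstsq_mse(past_y, target)                    # all information apart from X
err_with_x = lstsq_mse(np.hstack([past_y, past_x]), target)  # all information

# err_with_x < err_without_x, so X Granger-causes Y in this toy setting.
print(err_without_x, err_with_x)
```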
Granger-causal Objective
[Diagram: an auxiliary predictor f_aux over all experts E1..E4 yields error ε_X; for each expert i, an auxiliary predictor f_aux,i that withholds expert i yields error ε_X\{i}. Each difference ε_X\{i} − ε_X is mapped to an attention target a_i ∈ [0, 1].]
We define feature importance as the reduction in prediction error associated with adding that feature.
We now have a differentiable link between labels (prediction error) and feature importance.
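The mapping from error reductions to attention targets can be sketched as follows. The error values, the clipping at zero, and the squared-error penalty are my own illustrative assumptions; the sketch only shows the principle that each target is the (normalised) reduction in auxiliary prediction error when the corresponding expert is added.

```python
import numpy as np

# Hypothetical auxiliary errors: eps_all uses all experts, eps_without[i]
# withholds expert i (values are made up for illustration).
eps_all = 0.20
eps_without = np.array([0.90, 0.25, 0.20, 0.55])

# Importance target = error reduction from adding the feature, clipped at 0
# and normalised so the targets sum to 1, matching softmax attention a_i.
delta = np.maximum(eps_without - eps_all, 0.0)
a_target = delta / delta.sum()

# A differentiable objective can then penalise the gap between the attention
# weights a and these targets, e.g. np.sum((a - a_target) ** 2), linking
# prediction error to importance through gradients.
print(a_target)
```

Expert 3 here removes no error, so its target importance is exactly zero; expert 1, whose removal hurts most, receives the largest target.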
Evaluation
Important Features in Handwritten Digits
[Figure: per-pixel importance maps on handwritten digits; image omitted.]
• Estimation accuracy comparable to SHAP.
• Orders of magnitude faster at importance estimation.
• Lower MGE correlates with better feature importance estimates.
Drivers of Medical Prescription Demand
[Bar chart: SMAPE (%) of baseline models and AME variants, values ranging from 32.79% to 34.98%; method labels garbled in extraction.]
• Slightly lower prediction accuracy when using the AME architecture.
• Slightly lower prediction accuracy when using the Granger-causal objective.
Discriminatory Genes across Cancer Types
[Figure: per-gene attention heatmaps across cancer types (All, BRCA, KIRC, COAD, LUAD, PRAD) for AME, LIME, and SHAP; gene labels omitted.]
AME discriminates well between (1) cancer types, and (2) important and unimportant genes.
Discriminatory Genes across Cancer Types
[Bar chart: Recall@10 for AME and baseline methods (including SHAP, LIME, and DeepLIFT); AME variants reach up to 10, while the ablation without the Granger-causal objective scores far lower; method labels partially garbled in extraction.]
• Associations discovered by AMEs are consistent with those reported by domain experts.
• The Granger-causal objective is crucial for estimation accuracy.
Limitations
• No information about the direction of importance, i.e. no negative evidence.
• Large numbers of experts (>200) can become slow at training time.
  • Workaround: feature grouping.
• Requires a specific model architecture.
Conclusion
Conclusion
We present a feature importance estimation approach that
✔ learns to estimate importance from labelled data,
✔ produces accurate predictions and importance scores in a single model,
✔ is orders of magnitude faster at estimating importance than perturbation-based approaches,
✔ is consistent with associations reported by domain experts.
Questions?
Patrick Schwab (@schwabpa), patrick.schwab@hest.ethz.ch
Institute of Robotics and Intelligent Systems, ETH Zurich
Schwab, Patrick, Miladinovic, Djordje, and Karlen, Walter. Granger-causal Attentive Mixtures of Experts: Learning Important Features with Neural Networks. AAAI 2019.
Source Code: github.com/d909b/AME