Conditional distribution variability measures for causality



  1. NIPS 2013 Workshop on Causality Conditional distribution variability measures for causality detection José A. R. Fonollosa December 9, 2013

  2. Outline
  • Introduction
  • Preprocessing
  • Conditional distribution similarity measures
  • Additional features
  • Model
  • Results
  • Conclusions

  3. Introduction
  • Heterogeneous cause-effect pairs
  • Statistical / machine-learning approach (3 classes)
  • Standard features
  • Measures of the similarity of the 'shape' of the conditional distributions
  • Robust estimation methods:
    – Limited number of samples
    – Noise
    – Quantization
    – Avoid overfitting
  • Tree-based ensemble learning model (Gradient Boosting)

  4. Preprocessing
  • Mean and variance normalization: all the features are scale and mean invariant.
  • Homogeneous set of features from mixed numerical/categorical data:
    – Discretization of numerical variables
    – Relabeling of categorical variables
  [Figure: two example histograms, one over discretized bins 0–3 and one over categories A–D, labelled "Arbitrary labels or numbers".]
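A minimal sketch of this preprocessing step in Python, assuming quantile-based discretization and simple integer relabeling; the function names and the number of bins are illustrative, not the author's exact implementation.

```python
import numpy as np

def normalize(x):
    """Mean and variance normalization so that features are scale and mean invariant."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def discretize(x, n_bins=10):
    """Map a numerical variable to integer bin labels (quantile bins; bin count is an assumption)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(x, edges)

def relabel(x):
    """Replace arbitrary categorical labels with consecutive integers 0..K-1."""
    _, labels = np.unique(x, return_inverse=True)
    return labels
```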

  5. Conditional distribution similarity
  • Rationale: the conditional distribution P(Y|X=x) is expected to be simpler to describe in the causal direction. Similar:
    – Normalized shape/histogram for different values of the given variable x
    – Similar entropy and moments
    – Similar Bayesian error probability
  • Related to functional causal models y = f(x) + g_x(e), but f(x) is replaced by the conditional mean in an interval
  • Independence tests are replaced by similarity measures
  (Image from a presentation by Kun Zhang on functional causal models)
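As a rough illustration of these similarity/variability measures, one can bin the conditioning variable and compare a statistic of P(Y|X=x) across bins; low variability across bins points to the causal direction. The binning scheme and the choice of statistic below are assumptions for illustration, not the exact features used.

```python
import numpy as np

def conditional_variability(x, y, stat=np.std, n_bins=10):
    """Standard deviation across x-bins of a statistic of P(Y|X=x).

    Small values mean the conditional distributions keep a similar 'shape',
    which is expected in the causal direction X -> Y."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(x, edges)
    values = [stat(y[bins == b]) for b in np.unique(bins) if np.sum(bins == b) > 1]
    return np.std(values)

# Compare both directions, e.g. with stat=np.std, scipy.stats.skew or kurtosis:
# score_xy = conditional_variability(x, y)   # hypothesis X -> Y
# score_yx = conditional_variability(y, x)   # hypothesis Y -> X
```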

  6. Additional Features (I)
  • Information-theoretic measures
    – Discrete entropy and joint entropy
    – Discrete conditional entropy
    – Discrete mutual information (+ 2 normalized versions)
    – Adjusted (discrete) mutual information
    – Gaussian divergence (differential entropy)
    – Uniform divergence
  • Slope-based Information Geometric Causal Inference (IGCI)
  • Hilbert-Schmidt Independence Criterion (HSIC)
  • Pearson R (adapted versions)
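A sketch of the plug-in estimates behind the discrete information-theoretic features, computed on the discretized/relabeled variables; the helper names are assumptions.

```python
import numpy as np

def discrete_entropy(x):
    """Plug-in entropy estimate of a discrete variable (in nats)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def joint_entropy(x, y):
    """Plug-in entropy of the joint distribution of two discrete variables."""
    _, counts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(y, x):
    """H(Y|X) = H(X,Y) - H(X)."""
    return joint_entropy(x, y) - discrete_entropy(x)

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return discrete_entropy(x) + discrete_entropy(y) - joint_entropy(x, y)
```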

  7. Additional Features (II)
  • Number of samples and number of unique samples
  • Moments and mixed moments: skewness, kurtosis and mixed moments (1,2), (1,3)
  • Polynomial fit (order 2)
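One possible reading of these features in code; interpreting "mixed moment (i,j)" as E[x^i y^j] on normalized variables is an assumption about the exact definition.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def moment_features(x, y):
    """Skewness, kurtosis and mixed moments (1,2) and (1,3) on normalized data."""
    xn = (x - x.mean()) / x.std()
    yn = (y - y.mean()) / y.std()
    return {
        "skew_x": skew(xn), "skew_y": skew(yn),
        "kurt_x": kurtosis(xn), "kurt_y": kurtosis(yn),
        "moment_12": np.mean(xn * yn ** 2),
        "moment_13": np.mean(xn * yn ** 3),
    }

def polyfit_features(x, y):
    """Order-2 polynomial fit of y on x and its mean squared residual error."""
    coeffs = np.polyfit(x, y, 2)
    error = np.mean((np.polyval(coeffs, x) - y) ** 2)
    return coeffs, error
```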

  8. Model schemes
  • Ternary symmetric problem, single output: (+1) A is a cause of B, (-1) B is a cause of A, (0) neither.
  • A single ternary classification model over the features: output P_a(1) - P_a(-1).
  • Two binary models, one for 1 versus -1 (P_c) and one for 0 versus the rest (P_i): output P_c · P_i.
  • Two binary models, one for class 1 versus the rest (P_s(1)) and one for class -1 versus the rest (P_s(-1)): output ½ (P_s(1) - P_s(-1)).
  • The schemes give similar performance.
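A sketch of the last scheme (two one-versus-rest binary models combined into a symmetric score); the choice of classifier and the label encoding (+1, 0, -1) are assumptions for illustration.

```python
from sklearn.ensemble import GradientBoostingClassifier

def symmetric_score(train_features, train_labels, test_features):
    """Train one binary model for class +1 vs. rest and one for class -1 vs. rest,
    then output 0.5 * (P_s(1) - P_s(-1)) as in the third scheme above."""
    clf_pos = GradientBoostingClassifier().fit(train_features, train_labels == 1)
    clf_neg = GradientBoostingClassifier().fit(train_features, train_labels == -1)
    p_pos = clf_pos.predict_proba(test_features)[:, 1]   # P_s(1)
    p_neg = clf_neg.predict_proba(test_features)[:, 1]   # P_s(-1)
    return 0.5 * (p_pos - p_neg)
```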

  9. Gradient Boosting Model (GBM)
  • Gradient boosting
    – Large number of boosting stages = 500
    – Large tree size = 9 (higher-order interactions)
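In scikit-learn the stated configuration might look like the snippet below; mapping "tree size = 9" to the number of terminal nodes (max_leaf_nodes) is an assumption, and it could equally correspond to a depth setting.

```python
from sklearn.ensemble import GradientBoostingClassifier

# 500 boosting stages and relatively large trees so that the ensemble can
# capture higher-order interactions between the features.
model = GradientBoostingClassifier(n_estimators=500, max_leaf_nodes=9)
# model.fit(train_features, train_labels)
# probabilities = model.predict_proba(test_features)
```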

  10. Results
  Feature set (number of features) and score:
  • Baseline (21): 0.742
  • Baseline (21) + Moment31 (2): 0.750
  • Baseline (21) + Moment21 (2): 0.757
  • Baseline (21) + Error probability (2): 0.749
  • Baseline (21) + Polyfit (2): 0.757
  • Baseline (21) + Polyfit error (2): 0.757
  • Baseline (21) + Skewness (2): 0.754
  • Baseline (21) + Kurtosis (2): 0.744
  • Baseline (21) + the above statistics set (14): 0.790
  • Baseline (21) + Standard deviation of conditional distributions (2): 0.779
  • Baseline (21) + Standard deviation of the skewness of conditional distributions (2): 0.765
  • Baseline (21) + Standard deviation of the kurtosis of conditional distributions (2): 0.759
  • Baseline (21) + Standard deviation of the entropy of conditional distributions (2): 0.759
  • Baseline (21) + Measures of variability of the conditional distribution (8): 0.789
  • Full set (43 features): 0.820
  Training time: 45 minutes (4-core server). Test predictions: 12 minutes.

  11. Conclusions
  • A statistical machine-learning approach to deal with heterogeneous cause-effect pairs.
  • Several features need to be combined to obtain good results (higher-order interactions).
  • The proposed measures of the similarity of the conditional distributions provide significant additional performance.
  • Competitive results, open-source code, simple and fast.
  • Next step: a detailed study of the performance on different types of data pairs.
