NIPS 2013 Workshop on Causality
Conditional distribution variability measures for causality detection
José A. R. Fonollosa
December 9, 2013
Outline
• Introduction
• Preprocessing
• Conditional distributions similarity measures
• Additional features
• Model
• Results
• Conclusions
Introduction
• Heterogeneous cause-effect pairs
• Statistical / machine learning approach (3 classes)
• Standard features
• Measures of the similarity of the 'shape' of the conditional distributions
• Robust estimation methods:
  – Limited number of samples
  – Noise
  – Quantization
  – Avoid overfitting
• Tree-based ensemble learning model (gradient boosting)
Preprocessing
• Mean and variance normalization: all the features are scale and mean invariant.
• Homogeneous set of features from mixed numerical/categorical data:
  – Discretization of numerical variables
  – Relabeling of categorical variables
[Figure: example histograms over numeric bins 0–3 and categories A–D; arbitrary labels or numbers are mapped to a common discrete representation.]
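A minimal sketch of this preprocessing, assuming NumPy; the equal-width binning strategy and the number of bins are assumptions for illustration, not taken from the slides:

```python
import numpy as np

def normalize(x):
    """Mean/variance normalization so features are scale and mean invariant."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    return (x - x.mean()) / std if std > 0 else x - x.mean()

def discretize(x, max_bins=10):
    """Map a numerical variable to integer bin labels (equal-width bins here;
    the actual binning strategy is an assumption)."""
    x = np.asarray(x, dtype=float)
    edges = np.linspace(x.min(), x.max(), max_bins + 1)
    return np.digitize(x, edges[1:-1])  # labels in 0..max_bins-1

def relabel(x):
    """Replace arbitrary categorical labels by consecutive integers."""
    _, codes = np.unique(x, return_inverse=True)
    return codes
```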
Conditional distributions similarity
Rationale: the conditional distribution P(Y|X=x) is expected to be simpler to describe in the causal direction. Similar:
– Normalized shape/histogram for different values of the given variable x
– Similar entropy and moments
– Similar Bayesian error probability
Related to functional causal models y = f(x) + g_x(e), but f(x) is replaced by the conditional mean in an interval, and independence tests are replaced by similarity measures.
(Image from a presentation by Kun Zhang on functional causal models)
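As an illustration of the 'similar shape' idea, the following Python sketch measures how much a per-bin statistic of P(Y|X=x) varies across the values of x; the function name and the particular statistics shown are assumptions for illustration:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def conditional_variability(x_bins, y, stat=np.std):
    """Spread of a per-bin statistic of P(Y | X = x) across the bins of X.
    A low spread suggests the conditionals share a similar 'shape', which is
    what we expect in the causal direction."""
    x_bins, y = np.asarray(x_bins), np.asarray(y)
    values = []
    for b in np.unique(x_bins):
        y_b = y[x_bins == b]
        if len(y_b) > 1:
            values.append(stat(y_b))
    return np.std(values) if values else 0.0

# Example: spread of the std, skewness and kurtosis of the conditionals
# features = [conditional_variability(xb, y, s) for s in (np.std, skew, kurtosis)]
```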
Additional Features (I)
• Information-theoretic measures
  – Discrete entropy and joint entropy
  – Discrete conditional entropy
  – Discrete mutual information (+ 2 normalized versions)
  – Adjusted (discrete) mutual information
  – Gaussian divergence (differential entropy)
  – Uniform divergence
• Slope-based Information Geometric Causal Inference (IGCI)
• Hilbert-Schmidt Independence Criterion (HSIC)
• Pearson R (adapted versions)
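A few of the listed information-theoretic features can be computed directly with SciPy and scikit-learn on the discretized variables; this sketch covers only the discrete entropy and mutual-information variants, and the function names are illustrative:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import (mutual_info_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

def discrete_entropy(labels):
    """Plug-in entropy estimate from label counts (natural log)."""
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts)

def info_features(x_bins, y_bins):
    """A subset of the information-theoretic features listed above."""
    return {
        "entropy_x": discrete_entropy(x_bins),
        "entropy_y": discrete_entropy(y_bins),
        "mi": mutual_info_score(x_bins, y_bins),
        "nmi": normalized_mutual_info_score(x_bins, y_bins),
        "ami": adjusted_mutual_info_score(x_bins, y_bins),
    }
```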
Additional Features (II)
• Number of samples and number of unique samples
• Moments and mixed moments: skewness, kurtosis and mixed moments (1,2), (1,3)
• Polynomial fit (order 2)
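A possible reading of the moment and polynomial-fit features, assuming NumPy/SciPy; the exact definition of the mixed moments (which variable gets which power, and that both directions are computed) is an assumption:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def moment_features(x, y):
    """Skewness, kurtosis, mixed moments and order-2 polynomial fit residual
    on normalized variables (a sketch of the feature list above)."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    feats = {
        "skew_x": skew(x), "kurt_x": kurtosis(x),
        "moment12": np.mean(x * y**2),   # mixed moment (1,2), assumed definition
        "moment13": np.mean(x * y**3),   # mixed moment (1,3), assumed definition
    }
    coeffs = np.polyfit(x, y, deg=2)
    feats["polyfit_error"] = np.mean((np.polyval(coeffs, x) - y) ** 2)
    return feats
```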
Model schemes
Ternary symmetric problem, single output: (+1) A is a cause of B, (-1) B is a cause of A, (0) neither.
• A single ternary classification model: features → P_a(+1), P_a(0), P_a(-1); output P_a(+1) - P_a(-1).
• Two binary models: a model for +1 versus -1 (P_c) and a model for 0 versus the rest (P_i); the two probabilities are combined into the final score.
• Two binary models: a model for class +1 versus the rest (P_s(+1)) and a model for -1 versus the rest (P_s(-1)); output ½ (P_s(+1) - P_s(-1)).
The three schemes give similar performance.
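A small sketch of the first and third scoring schemes (the combination rule of the second scheme is not fully legible on the slide, so it is left out); the class ordering in the probability vector is an assumption:

```python
def score_ternary(p_a):
    """Scheme 1: single ternary model. p_a = [P(-1), P(0), P(+1)]
    (assumed ordering). Output: P_a(+1) - P_a(-1)."""
    return p_a[2] - p_a[0]

def score_one_vs_rest(p_s_plus, p_s_minus):
    """Scheme 3: two one-vs-rest binary models.
    Output: 0.5 * (P_s(+1) - P_s(-1))."""
    return 0.5 * (p_s_plus - p_s_minus)
```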
Gradient Boosting Model (GBM)
• Gradient boosting
  – Large number of boosting stages = 500
  – Large tree size = 9 (higher-order interactions)
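A minimal scikit-learn sketch of such a model; mapping the slide's "tree size = 9" to max_leaf_nodes is an assumption (it could equally be a depth parameter), and the remaining hyperparameters are library defaults:

```python
from sklearn.ensemble import GradientBoostingClassifier

# Gradient boosting with many stages and fairly large trees, so that
# higher-order feature interactions can be captured.
model = GradientBoostingClassifier(n_estimators=500, max_leaf_nodes=9)

# Typical usage on the feature matrix built from the cause-effect pairs:
# model.fit(X_train, y_train)          # y in {-1, 0, +1}
# probs = model.predict_proba(X_test)  # fed into one of the scoring schemes
```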
Results

Features: Score
Baseline (21): 0.742
Baseline (21) + Moment31 (2): 0.750
Baseline (21) + Moment21 (2): 0.757
Baseline (21) + Error probability (2): 0.749
Baseline (21) + Polyfit (2): 0.757
Baseline (21) + Polyfit error (2): 0.757
Baseline (21) + Skewness (2): 0.754
Baseline (21) + Kurtosis (2): 0.744
Baseline (21) + the above statistics set (14): 0.790
Baseline (21) + Standard deviation of conditional distributions (2): 0.779
Baseline (21) + Standard deviation of the skewness of conditional distributions (2): 0.765
Baseline (21) + Standard deviation of the kurtosis of conditional distributions (2): 0.759
Baseline (21) + Standard deviation of the entropy of conditional distributions (2): 0.759
Baseline (21) + Measures of variability of the conditional distributions (8): 0.789
Full set (43 features): 0.820

Training time: 45 minutes (4-core server). Test predictions: 12 minutes.
Conclusions
• A statistical machine learning approach to deal with heterogeneous cause-effect pairs
• Several features need to be combined to obtain good results (higher-order interactions)
• The proposed measures of the similarity of the conditional distributions provide a significant additional performance gain
• Competitive results, open source code, simple and fast
• Next step: detailed study of the performance on different types of data pairs