Causal and Non-Causal Feature Selection for Ridge Regression

Gavin Cawley
School of Computing Sciences, University of East Anglia, Norwich, United Kingdom
gcc@cmp.uea.ac.uk

Wednesday 3rd June 2008
Introduction

◮ Causal feature selection is useful in the presence of covariate shift.
◮ What works best?
  ◮ Use the same base classifier (ridge regression).
  ◮ Careful (but efficient) optimisation of the ridge parameter.
  ◮ Minimal pre-processing.
◮ Causal feature selection strategies:
  ◮ Markov blanket.
  ◮ Direct causes + direct effects.
  ◮ Direct causes.
◮ For comparison:
  ◮ Non-causal feature selection (BLogReg).
  ◮ No feature selection (regularisation only).
◮ Entry for the WCCI-2008 Causality and Prediction Challenge.
◮ The solution is a little "heuristic"!
Ridge Regression

◮ Linear classifier with a regularised sum-of-squares loss function:
\[
L = \frac{1}{2}\sum_{i=1}^{\ell}\left[ y_i - \hat{y}_i \right]^2 + \frac{\lambda}{2}\|\beta\|^2,
\quad \text{where} \quad \hat{y}_i = x_i \cdot \beta .
\]
◮ Weights are found via the "normal equations":
\[
\left[ X^T X + \lambda I \right] \beta = X^T y .
\]
◮ The regularisation parameter, λ, is optimised via virtual leave-one-out (VLOO) cross-validation (sketch below):
\[
P(\lambda) = \frac{1}{\ell}\sum_{i=1}^{\ell}\left[ \hat{y}_i^{(-i)} - y_i \right]^2,
\quad \text{where} \quad
\hat{y}_i^{(-i)} - y_i = \frac{\hat{y}_i - y_i}{1 - h_{ii}}
\]
and the "hat" matrix is
\[
H = [h_{ij}]_{i,j=1}^{\ell} = X\left[ X^T X + \lambda I \right]^{-1} X^T .
\]
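The VLOO error above can be computed from a single fit using the leverages h_ii. The following is a minimal NumPy sketch of that computation, not the author's MATLAB implementation; the function name `ridge_vloo` and the dense solve are illustrative choices.

```python
import numpy as np

def ridge_vloo(X, y, lam):
    """Primal ridge regression plus its virtual leave-one-out (PRESS) error.

    X is an (ell x d) design matrix, y the targets, lam the regularisation
    parameter; a sketch only, assuming everything fits in memory.
    """
    ell, d = X.shape
    A = X.T @ X + lam * np.eye(d)                    # normal equations matrix
    beta = np.linalg.solve(A, X.T @ y)               # [X'X + lam I] beta = X'y
    y_hat = X @ beta
    h = np.sum((X @ np.linalg.inv(A)) * X, axis=1)   # leverages h_ii of the hat matrix
    loo_resid = (y_hat - y) / (1.0 - h)              # yhat^(-i) - y_i
    return beta, np.mean(loo_resid ** 2)             # weights and P(lambda)
```

The returned mean squared VLOO residual is exactly the criterion P(λ) that is minimised when choosing λ.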
Linear Kernel Ridge Regression

◮ Useful for problems with more features than patterns.
◮ Dual representation of the model (sketch below):
\[
\hat{y}_i = \sum_{j=1}^{\ell} \alpha_j \langle x_j, x_i \rangle .
\]
◮ Model parameters are given by a system of linear equations:
\[
\left[ X X^T + \lambda I \right] \alpha = y .
\]
◮ The regularisation parameter, λ, is optimised via VLOO:
\[
P(\lambda) = \frac{1}{\ell}\sum_{i=1}^{\ell}\left[ \hat{y}_i^{(-i)} - y_i \right]^2,
\quad \text{where} \quad
\hat{y}_i^{(-i)} - y_i = \frac{\alpha_i}{C_{ii}}
\quad \text{and} \quad
C = \left[ X X^T + \lambda I \right]^{-1} .
\]
◮ Computational complexity is O(ℓ³) instead of O(d³).
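A corresponding NumPy sketch of the dual form, again only an illustration; it assumes the reading of C as the inverse of the regularised Gram matrix given above.

```python
import numpy as np

def linear_krr_vloo(X, y, lam):
    """Linear kernel ridge regression with its virtual LOO (PRESS) error."""
    ell = X.shape[0]
    K = X @ X.T                                   # linear kernel (Gram matrix), ell x ell
    C = np.linalg.inv(K + lam * np.eye(ell))      # C = [XX' + lam I]^{-1}
    alpha = C @ y                                 # dual parameters
    loo_resid = alpha / np.diag(C)                # yhat^(-i) - y_i
    press = np.mean(loo_resid ** 2)
    predict = lambda X_new: X_new @ X.T @ alpha   # sum_j alpha_j <x_j, x>
    return alpha, press, predict
```

Only an ℓ×ℓ system is solved, which is why the dual form is preferred when there are many more features than patterns.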
Optimisation of the Regularisation Parameter

◮ A sneaky trick well known to statisticians!
◮ Eigen-decomposition of the covariance matrix: \( X^T X = V \Lambda V^T \).
◮ The normal equations can then be re-written as
\[
\left[ \Lambda + \lambda I \right] \alpha = V^T X^T y,
\quad \text{where} \quad \alpha = V^T \beta .
\]
◮ Similarly, the "hat" matrix can be written as
\[
H = X V \left[ \Lambda + \lambda I \right]^{-1} V^T X^T .
\]
◮ Note that only a diagonal matrix need be inverted.
◮ Performing the eigen-decomposition is expensive, but the cost is amortised across the investigation of many values of λ (sketch below).
◮ The regularisation parameter, λ, is optimised via gradient descent.
◮ A similar trick can be implemented for KRR.
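A sketch of how the decomposition is reused: it is computed once, after which each candidate λ costs only a diagonal solve. For simplicity the sketch searches a grid of λ values rather than using the gradient descent mentioned above; the function name and the grid are assumptions made for illustration.

```python
import numpy as np

def optimise_lambda(X, y, lambdas):
    """Pick the ridge parameter minimising the virtual LOO error P(lambda)."""
    eigvals, V = np.linalg.eigh(X.T @ X)   # X'X = V diag(eigvals) V'
    Z = X @ V                              # rotated design matrix
    b = V.T @ (X.T @ y)                    # rotated right-hand side
    best_lam, best_press = None, np.inf
    for lam in lambdas:
        inv_diag = 1.0 / (eigvals + lam)   # only a diagonal is "inverted"
        alpha = inv_diag * b               # solves [Lambda + lam I] alpha = V'X'y
        y_hat = Z @ alpha                  # predictions (beta = V alpha)
        h = (Z ** 2) @ inv_diag            # leverages h_ii of the hat matrix
        press = np.mean(((y_hat - y) / (1.0 - h)) ** 2)
        if press < best_press:
            best_lam, best_press = lam, press
    return best_lam

# Example usage with a hypothetical grid:
# lam = optimise_lambda(X, y, np.logspace(-6, 3, 50))
```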
Feature Selection

◮ Non-causal: logistic regression with a Laplace prior (BLogReg).
  ◮ The regularisation parameter is integrated out using a reference prior.
◮ Causal feature selection using the Causal Explorer toolbox:
  ◮ Select the Markov blanket using HITON-MB.
  ◮ Direct the edges of the DAG:
    ◮ PC algorithm for problems with continuous features.
    ◮ MMHC algorithm for binary-only problems.
    ◮ Use HITON-MB to pre-select features.
◮ Use an ensemble of 100 models (see the sketch below):
  ◮ Averages over the variability of the feature selection methods.
  ◮ Gives an indication of generalisation performance.
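The ensemble idea can be outlined as follows. This is only a sketch under assumptions the slides do not spell out: `select_features` stands in for whichever selector is used (BLogReg or a Causal Explorer routine such as HITON-MB, which are MATLAB tools and are not called here), each member is assumed to be trained on a bootstrap resample, and the ridge parameter is fixed rather than tuned.

```python
import numpy as np

def ensemble_scores(X, y, X_test, select_features, n_models=100, lam=1.0, seed=0):
    """Average the outputs of n_models ridge classifiers, each built on a
    resample with its own feature subset (a sketch of the idea only)."""
    rng = np.random.default_rng(seed)
    ell = X.shape[0]
    scores = np.zeros(X_test.shape[0])
    for _ in range(n_models):
        idx = rng.integers(0, ell, ell)          # bootstrap resample (assumed)
        Xb, yb = X[idx], y[idx]
        feats = select_features(Xb, yb)          # placeholder selector: returns column indices
        A = Xb[:, feats].T @ Xb[:, feats] + lam * np.eye(len(feats))
        beta = np.linalg.solve(A, Xb[:, feats].T @ yb)
        scores += X_test[:, feats] @ beta        # accumulate member decision values
    return scores / n_models                     # averaged ensemble output
```

Averaging over members smooths the run-to-run variability of the selection step, and the spread of member outputs gives some indication of generalisation performance.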
Results for the REGED Benchmark

◮ Non-causal feature selection works well.

Dataset  Selection         FNUM    FSCORE  DSCORE  TSCORE  AUC
REGED0   None              999.00  0.9204  1.0000  0.9983  0.9612
         Non-causal         14.69  0.8070  1.0000  0.9997  0.9997
         Markov blanket     26.86  0.8988  0.9999  0.9997  0.9994
         Causes & effects    8.60  0.8095  0.9999  0.9996  0.9978
         Causes only         1.56  0.7143  0.9984  0.9955  0.9346
REGED1   None              999.00  0.9078  1.0000  0.9321  —
         Non-causal         14.69  0.7798  1.0000  0.9508  —
         Markov blanket     24.85  0.8438  0.9999  0.9346  —
         Causes & effects    8.60  0.7822  0.9999  0.9329  —
         Causes only         1.56  0.7124  0.9984  0.8919  —
REGED2   None              999.00  0.9950  1.0000  0.7184  —
         Non-causal         14.69  0.9980  1.0000  0.7992  —
         Markov blanket     24.85  0.9975  0.9999  0.7644  —
         Causes & effects    8.60  0.9970  0.9999  0.7989  —
         Causes only         1.56  0.9970  0.9984  0.7653  —
Results for the SIDO Benchmark

◮ Very large dataset: not all results are available.
◮ Best performance achieved without feature selection.

Dataset  Selection        FNUM     FSCORE  DSCORE  TSCORE  AUC
SIDO0    None             4928.00  0.5890  0.9840  0.9427  0.9472
         Non-causal         28.96  0.5160  0.9482  0.9294  0.9226
         Markov blanket    136.47  0.5818  0.9563  0.9418  0.9356
SIDO1    None             4928.00  0.5314  0.9840  0.7532  —
         Non-causal         28.96  0.4909  0.9482  0.6971  —
         Markov blanket    136.47  0.5348  0.9563  0.6948  —
SIDO2    None             4928.00  0.5314  0.9840  0.6684  —
         Non-causal         28.96  0.4909  0.9482  0.6298  —
         Markov blanket    136.47  0.5348  0.9563  0.6298  —
Results for the CINA Benchmark

◮ Non-causal, Markov blanket and no selection all work well.

Dataset  Selection         FNUM    FSCORE  DSCORE  TSCORE  AUC
CINA0    None              132.00  0.7908  0.9677  0.9674  0.9664
         Non-causal         29.44  0.5708  0.9682  0.9679  0.9660
         Markov blanket     55.30  0.7708  0.9669  0.9669  0.9660
         Causes & effects   21.21  0.6826  0.9654  0.9661  0.9653
         Causes              1.02  0.5174  0.7923  0.7911  0.5351
CINA1    None              132.00  0.5865  0.9677  0.7953  —
         Non-causal         29.44  0.6436  0.9682  0.7609  —
         Markov blanket     55.30  0.5261  0.9669  0.7979  —
         Causes & effects   21.21  0.5477  0.9654  0.7749  —
         Causes              1.02  0.5114  0.7923  0.5402  —
CINA2    None              132.00  0.5865  0.9677  0.5502  —
         Non-causal         29.44  0.6436  0.9682  0.5464  —
         Markov blanket     55.30  0.5261  0.9669  0.5469  —
         Causes & effects   21.21  0.5477  0.9654  0.5394  —
         Causes              1.02  0.5114  0.7923  0.4825  —
Pre-processing for the MARTI Benchmark

◮ MARTI has correlated noise.
◮ Use KRR to estimate the noise as a function of the (x, y) co-ordinates of each spot (sketch below):
\[
y_i = \phi(x_i) \cdot W + \varepsilon_i,
\quad \text{where} \quad
\varepsilon_i \sim \mathcal{N}\!\left(0, \sigma_i^2 I\right).
\]
◮ A radial basis function kernel defines φ(x).
◮ Iteratively re-estimate the noise variance for each spot using the residuals.

[Figure: three microarray image panels titled "Raw", "Noise" and "Signal".]
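A rough NumPy sketch of this pre-processing step, under assumptions: the kernel width `gamma`, ridge parameter `lam`, number of iterations and the residual-squared re-weighting rule are illustrative choices, not the exact scheme used for the challenge entry.

```python
import numpy as np

def estimate_correlated_noise(coords, values, lam=1.0, gamma=0.1, n_iter=5):
    """Estimate the smooth, spatially correlated noise on one array.

    coords: (n, 2) array of (x, y) spot positions; values: (n,) raw intensities.
    Returns (noise estimate, de-noised signal).
    """
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * d2)                      # RBF kernel over spot positions
    sigma2 = np.ones(len(values))                # per-spot noise variances
    for _ in range(n_iter):
        # Heteroscedastic KRR fit: [K + lam * diag(sigma2)] alpha = y
        alpha = np.linalg.solve(K + lam * np.diag(sigma2), values)
        noise = K @ alpha                        # smooth correlated-noise surface
        resid = values - noise
        sigma2 = np.maximum(resid ** 2, 1e-6)    # re-estimate spot variances from residuals
    return noise, values - noise
```

Spots with large residuals (those likely to carry signal) receive large estimated variances and so have little influence on the next fit; the fit therefore tracks the background noise surface, matching the Raw/Noise/Signal panels above.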
Results for the MARTI Benchmark

◮ Markov blanket and non-causal selection work well.

Dataset  Selection         FNUM     FSCORE  DSCORE  TSCORE  AUC
MARTI0   None              1024.00  0.7980  1.0000  0.9970  0.9950
         Non-causal          15.19  0.8029  0.9998  0.9993  0.9986
         Markov blanket      26.86  0.8862  1.0000  0.9994  0.9994
         Causes & effects     8.60  0.7894  0.9987  0.9986  0.9978
         Causes only          1.56  0.5714  0.9821  0.9775
MARTI1   None              1024.00  0.7923  1.0000  0.9085  —
         Non-causal          15.19  0.7752  0.9998  0.9310  —
         Markov blanket      26.86  0.8264  1.0000  0.9234  —
         Causes & effects     8.60  0.7820  0.9987  0.8929  —
         Causes only          1.56  0.5347  0.9821  0.6370  —
MARTI2   None              1024.00  0.9951  1.0000  0.9085  —
         Non-causal          15.19  0.9976  0.9998  0.7975  —
         Markov blanket      26.86  0.9966  1.0000  0.7740  —
         Causes & effects     8.60  0.9956  0.9987  0.7416  —
         Causes only          1.56  0.7485  0.9821  0.6607  —
Results for the Final Submission

◮ Ridge regression provides a satisfactory base classifier.
◮ An ARD/RBF kernel classifier may be better for CINA.
◮ Feature selection is beneficial for the manipulated datasets.

Dataset  Fnum   Fscore  Dscore  Tscore  Top Ts  Max Ts  Rank
cina0     128   0.5166  0.9737  0.9743  0.9765  0.9788
cina1     128   0.5860  0.9737  0.8691  0.8691  0.8977    3
cina2      64   0.5860  0.9734  0.7031  0.8157  0.8910
marti0    128   0.8697  1.0000  0.9996  0.9996  0.9996
marti1     32   0.8064  1.0000  0.9470  0.9470  0.9542    1
marti2     64   0.9956  0.9998  0.7975  0.7975  0.8273
reged0    128   0.9410  0.9999  0.9997  0.9998  1.0000
reged1     32   0.8393  0.9970  0.9787  0.9888  0.9980    2
reged2      8   0.9985  0.9996  0.8045  0.8600  0.9534
sido0    4928   0.5890  0.9840  0.9427  0.9443  0.9467
sido1    4928   0.5314  0.9840  0.7532  0.7532  0.7893    1
sido2    4928   0.5314  0.9840  0.6684  0.6684  0.7674

(Fnum, Fscore and Dscore relate to causal discovery; Tscore, Top Ts and Max Ts to target prediction. Rank is awarded per benchmark and is shown once for each group of three datasets.)
Summary

◮ Things that worked well:
  ◮ Regularisation can suppress irrelevant features.
  ◮ Using an ensemble to average over sources of uncertainty.
  ◮ Pre-processing is important (e.g. MARTI).
◮ Things that didn't work so well:
  ◮ Computational expense: more efficient tools are needed.
  ◮ Effective non-linear models for large datasets (e.g. CINA).
◮ The challenge makes a convincing case for causal feature selection:
  ◮ It can deal with covariate shift.
  ◮ It is rather difficult!
◮ Availability of MATLAB code:
  ◮ http://theoval.cmp.uea.ac.uk/~gcc/cbl/blogreg/
  ◮ http://theoval.cmp.uea.ac.uk/~gcc/projects/gkm/