Augmenting simple models with machine learning
Jim Savage, Data Science Lead, Lendable
@khakieconomist
Cheers to • Sarah Tan • David Miller • Chris Edmond • Eugene Dubossarsky
Outline • Estimating causal relationships • Proximity score matching • The problem of model non-stationarity in time series models
Problem #1: drawing causal inference from observational data
Question for the audience: What is a college degree worth? How would you go about estimating it?
Experimental vs observational data Experimental data = easy causal inference. Observational data = hard “causal” inference • We want to know E(dy | exogenous intervention in X): how much we expect y to change • We have never observed an exogenous intervention in X
Not a predictive problem! • Predictive models give us E(y|X)—fancy correlation • Looks the same, but is wildly different • In absence of causal reasoning, more data & fancier models often just make us more certain of the wrong answer.
Neyman-Rubin causal model The fundamental problem of causal inference: we can never observe the same unit both treated and untreated, and booting up a parallel universe whenever we want to draw causal inference is too much work.
How do we estimate treatment effects? • Regression with controls (adjust for observed confounders) • Panel data • Natural experiments • Matching
Regression helps us deal with observed confounders
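A minimal sketch of this idea on simulated data (all names and numbers are hypothetical, chosen to echo the college-degree question): when the confounder is observed, adding it as a control pulls the coefficient on the treatment back toward the true effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
ability = rng.normal(size=n)                        # the confounder, observed here
college = (ability + rng.normal(size=n) > 0).astype(float)
wage = 1.0 * college + 2.0 * ability + rng.normal(size=n)

# Naive: regress wage on college alone -- the coefficient is inflated
naive = sm.OLS(wage, sm.add_constant(college)).fit()

# Controlled: include the confounder -- the coefficient moves back toward 1.0
controlled = sm.OLS(wage, sm.add_constant(np.column_stack([college, ability]))).fit()
print(naive.params[1], controlled.params[1])
```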
But be careful what you control for: conditioning on post-treatment variables or colliders can introduce bias rather than remove it
Multiple observations of the same unit over time can help control for unobserved confounders that don’t vary over time
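A hedged illustration of the fixed-effects idea (simulated data, hypothetical numbers): demeaning within each unit sweeps out any time-invariant confounder, so the slope is recovered even though the confounder is never included in the regression.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_units, n_periods = 200, 5
unit = np.repeat(np.arange(n_units), n_periods)
alpha = rng.normal(size=n_units)[unit]    # time-invariant unobserved confounder
x = alpha + rng.normal(size=n_units * n_periods)
y = 1.0 * x + alpha + rng.normal(size=n_units * n_periods)

df = pd.DataFrame({"unit": unit, "x": x, "y": y})

# Within transformation: demeaning by unit sweeps out alpha entirely
demeaned = df.groupby("unit")[["x", "y"]].transform(lambda s: s - s.mean())
beta = np.linalg.lstsq(demeaned[["x"]].values, demeaned["y"].values, rcond=None)[0]
print(beta)   # close to the true effect of 1.0
```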
IV & natural experiments can help… but good instruments are difficult to find, and the exclusion restriction is impossible to verify
Pros and cons • All the above methods are better than comparing averages! • Often no good natural experiments exist (makes IV hard!) • Often we’re worried that unobserved confounders vary over time (fixed effects assumption violated) • Decisions still have to be made
Matching methods • Idea: build up a control group that is as similar as possible to the treatment group • Run your analysis (comparison of groups, regression, etc.) on this sub-group; discard observations that were never likely to take up treatment • Once you have this “synthetic control”, use some causal model • Pray it has balanced your groups on the factors that matter
Exact matching • Pair each treated observation with an untreated observation that is identical on observed covariates • With more than a few covariates you run out of exact matches very quickly (the curse of dimensionality)…
Matching using a metric • Define some metric for matching (Euclidean, Mahalanobis, etc.) • Group observations that are “close” in the X space • Run analysis on this subset (a sketch follows below) • But which Xs matter?
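A minimal sketch with simulated data (sizes and distributions are arbitrary): Mahalanobis matching rescales distances by the covariance of the Xs, but note it still treats every X as equally relevant to the outcome.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
X_treated = rng.normal(loc=0.5, size=(50, 3))
X_control = rng.normal(size=(500, 3))

# Mahalanobis distance rescales by the inverse covariance of the pooled Xs,
# so it is scale-aware -- but which Xs matter to y is still unanswered
VI = np.linalg.inv(np.cov(np.vstack([X_treated, X_control]).T))
D = cdist(X_treated, X_control, metric="mahalanobis", VI=VI)
matched_controls = D.argmin(axis=1)   # nearest control for each treated unit
```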
Propensity score matching • Estimate model of “propensity to get treatment” • p(treated | X) • For each treated observation, choose an untreated observation whose modelled propensity is closest (or some other matching technique).
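A hedged sketch of the two-step procedure (simulated data; logistic regression and 1-nearest-neighbour matching are just one common choice of propensity model and matching rule):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
treated = X[:, 0] + rng.normal(size=1000) > 0

# Step 1: model the propensity p(treated | X)
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Step 2: for each treated unit, pick the untreated unit with the
# closest estimated propensity (1-nearest-neighbour matching)
treat_idx = np.where(treated)[0]
ctrl_idx = np.where(~treated)[0]
gaps = np.abs(ps[treat_idx][:, None] - ps[ctrl_idx][None, :])
matched_controls = ctrl_idx[gaps.argmin(axis=1)]
```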
Propensity score matching • Big problem: the “Smith-Todd” critique • Change your propensity model and your estimated treatment effect changes; estimates can be meaningless • Despite this, it is very widely used (~15k citations)
Proximity matching • Like metric matching, we match on the Xs • Like propensity score matching, we take into account how the Xs affect treatment probability. • Use the Random Forest proximity matrix
CART (Classification and Regression Trees)
The Random Forest • Essentially a collection of CART models • Each estimated on a random subset of the data • In each node, a random sample of the Xs is drawn to be considered for the split • Each tree is fairly different.
Random forest proximity • When two people end up in the same terminal node of a tree, they are said to be proximate • The proximity score (i, j) is the proportion of trees in which individuals i and j land in the same terminal node • We calculate it on held-out observations • It is a measure of similarity between two individuals in terms of their Xs • But only the similarity in terms of the Xs that matter to y • A metric-free, scale-invariant, supervised similarity score (see the sketch below)
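A minimal sketch of computing the proximity matrix with scikit-learn (simulated data; for brevity this version uses all observations, whereas the held-out version described above would restrict to out-of-bag cases):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(size=500)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)

# leaf_ids[i, t] = index of the terminal node observation i falls into in tree t
leaf_ids = rf.apply(X)

# proximity[i, j] = share of trees in which i and j share a terminal node
n, n_trees = leaf_ids.shape
proximity = np.zeros((n, n))
for t in range(n_trees):
    proximity += leaf_ids[:, t][:, None] == leaf_ids[:, t][None, :]
proximity /= n_trees
```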
Introduction to analogy weighting • Motivation 1: want parameters most relevant to today. • Motivation 2: want to know when model is least likely to do a good job.
Unprincipled approaches
Analogy weighting: the idea • Train a random forest on the dependent variable of interest with potentially many Xs • Take the proximity matrix from the random forest • Use the relevant row from this matrix to weight the observations in your parametric model • This is akin to training your model on the relevant history
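A hedged sketch of analogy weighting end to end (simulated data; the choice of `today` as the last observation and the two-regressor parametric model are purely illustrative):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 10))
y = X[:, 0] + rng.normal(size=500)

# Steps 1-2: train a random forest on y, then extract leaf memberships
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
leaves = rf.apply(X)

# Step 3: the relevant row of the proximity matrix -- each observation's
# share of trees in which it lands in the same terminal node as "today"
today = -1
w = (leaves == leaves[today]).mean(axis=1)

# Step 4: fit the simple parametric model on the analogy-weighted history
fit = sm.WLS(y, sm.add_constant(X[:, :2]), weights=w).fit()
print(fit.params)
```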
Implementing • For very simple models, canned functions normally take a weights argument • For complex models, weights are not normally included • Use Stan • Make a direct call to increment_log_prob (target += in later Stan versions) rather than using sampling notation, so each observation’s log likelihood can be scaled by its weight
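The slides recommend Stan for the complex-model case; as a language-neutral sketch of the same trick, here is a weighted Gaussian-regression log likelihood optimised directly (data and weights simulated and purely illustrative):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([0.5, 1.0]) + rng.normal(size=200)
w = rng.uniform(0.1, 1.0, size=200)   # e.g. a row of the proximity matrix

def weighted_neg_log_lik(params):
    beta, log_sigma = params[:-1], params[-1]
    # Scale each observation's log density by its weight -- the same
    # idea as a weighted target += statement in Stan
    return -np.sum(w * norm.logpdf(y, loc=X @ beta, scale=np.exp(log_sigma)))

fit = minimize(weighted_neg_log_lik, x0=np.zeros(3))
print(fit.x)   # weighted estimates of beta and log sigma
```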
When should I ignore my model?
And when history is not relevant?
Covariance in scale-correlation form: Σ = diag(σ) Ω diag(σ) • Here, σ is a vector of standard deviations and Ω is a correlation matrix • We can give σ a non-negative prior (say, half-Cauchy) and Ω an LKJ prior • LKJ is a one-parameter distribution over correlation matrices • Low values of the parameter (approaching 1) give a uniform prior over correlation matrices • High values (approaching infinity) concentrate mass on the identity matrix
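A small sketch of the decomposition itself (the numbers are arbitrary): given standard deviations σ and a correlation matrix Ω, the covariance matrix is recovered as diag(σ) Ω diag(σ).

```python
import numpy as np

sigma = np.array([0.1, 0.2, 0.15])            # standard deviations
Omega = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.5],
                  [0.1, 0.5, 1.0]])           # correlation matrix

# Sigma = diag(sigma) @ Omega @ diag(sigma)
Sigma = np.diag(sigma) @ Omega @ np.diag(sigma)
print(Sigma)
```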
Application: volatility modelling during the financial crisis • Most volatility models work like so: returns(t) ~ multivariate distribution(expected returns(t), covariance(t)) • The expected-returns model is just a forecasting model • The covariance needs to be explicitly modelled • Multivariate GARCH is common • CCC-GARCH allows time-varying shock magnitudes (with constant correlations) • DCC allows time-varying correlations that update with correlated shocks
LKJ as a “danger prior” in volatility models • Idea: when we have relevant histories, we learn the correlation structure from the data • When we have no relevant history, the likelihood barely moves the posterior and we revert to the prior • Using an LKJ prior with a low shape parameter allows for highly correlated returns in unprecedented states
Questions?