  1. Augmenting simple models with machine learning Jim Savage Data Science Lead Lendable @khakieconomist

  2. Cheers to • Sarah Tan • David Miller • Chris Edmond • Eugene Dubossarsky

  3. Outline • Estimating causal relationships • Proximity score matching • The problem of model non-stationarity in time series models

  4. Problem #1: drawing causal inference from observational data Question for the audience: What is a college degree worth? How would you go about estimating it?

  5. Experimental vs observational data • Experimental data = easy causal inference • Observational data = hard “causal” inference • We want to know E(dy | exogenous intervention in X): how much we expect y to change • We have never observed an exogenous intervention in X

  6. Not a predictive problem! • Predictive models give us E(y | X): fancy correlation • Looks the same, but is wildly different • In the absence of causal reasoning, more data and fancier models often just make us more certain of the wrong answer.

  7. Neyman-Rubin causal model The fundamental problem of causal inference is that booting up a parallel universe whenever we want to draw causal inference is too much work.

  8. How do we estimate treatment effects? • Regression with controls (try to take care of the effect of observed confounders) • Panel data • Natural experiments • Matching

  9. Regression helps us deal with observed confounders

  10. But be careful what you control for

  11. Multiple observations of the same unit over time can help control for unobserved confounders that don’t vary over time

  12. IV & natural experiments can help… and are difficult to find, impossible to verify

  13. Pros and cons • All the above methods are better than comparing averages! • Often no good natural experiments exist (makes IV hard!) • Often we’re worried that unobserved confounders vary over time (fixed effects assumption violated) • Decisions still have to be made

  14. Matching methods • Idea: build up a control group that is as similar as possible to the treatment group • Run your analysis (comparison of groups, regression, etc) on this sub-group. Discard those who were never likely to take up treatment.

  15. Matching methods • Idea: build up a control group that is as similar as possible to the treatment group • Once you have this “synthetic control”, use some causal model. • Pray it has balanced your groups on the factors that matter.

  16. Exact matching • Pair treated observation with untreated observation that is the same on observed covariates. • Run out of dimensions very quickly…
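
A minimal sketch of exact matching in Python with pandas; the data frame and its columns are invented for illustration:

```python
# Exact matching sketch: pair treated rows with control rows that are
# identical on every observed covariate. Column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "treated": [True, True, False, False, False],
    "age":     [30,   45,   30,    45,    52],
    "sex":     ["f",  "m",  "f",   "m",   "m"],
})

# An inner merge keeps only treated units with at least one exact match
matched = df[df["treated"]].merge(
    df[~df["treated"]],
    on=["age", "sex"],
    suffixes=("_treated", "_control"),
)
```

With more than a handful of covariates, most treated units find no exact match, which is the dimensionality problem the slide describes.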

  17. Matching using a metric • Define some metric for matching (Euclidean, Mahalanobis, etc.) • Group observations that are “close” in the X space • Run analysis on this subset. • But which Xs matter?
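
A sketch of metric matching using the Mahalanobis distance; `X` and `treated` are hypothetical stand-ins for the covariate matrix and treatment indicator:

```python
# Metric-matching sketch: find, for each treated unit, the control unit
# closest in Mahalanobis distance. All data here are simulated.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                # covariate matrix
treated = rng.random(200) < 0.3              # treatment indicator

VI = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance for Mahalanobis
D = cdist(X[treated], X[~treated], metric="mahalanobis", VI=VI)

control_idx = np.flatnonzero(~treated)
matches = control_idx[D.argmin(axis=1)]      # closest control for each treated unit
```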

  18. Propensity score matching • Estimate model of “propensity to get treatment” • p(treated | X) • For each treated observation, choose an untreated observation whose modelled propensity is closest (or some other matching technique).
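
A sketch of propensity score matching along the same lines, with a logistic-regression propensity model and nearest-neighbour matching on the score (again, `X` and `treated` are hypothetical):

```python
# Propensity-score-matching sketch: model p(treated | X), then match each
# treated unit to the control with the closest fitted propensity.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
treated = rng.random(200) < 0.3

ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]  # p(treated | X)

control_idx = np.flatnonzero(~treated)
# Nearest-neighbour match on the one-dimensional propensity score
matches = control_idx[
    np.abs(ps[treated][:, None] - ps[~treated][None, :]).argmin(axis=1)
]
```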

  19. Propensity score matching • Big problem: “Smith-Todd” • Change your propensity model, change your treatment effect. Can be meaningless. • Despite this, very widely used (~15k citations)

  20. Proximity matching • Like metric matching, we match on the Xs • Like propensity score matching, we take into account how the Xs affect treatment probability. • Use the Random Forest proximity matrix

  21. CART

  22. The Random Forest • Essentially a collection of CART models • Each estimated on a random subset of the data • In each node, a sample of possible Xs is drawn to be considered for a split • Each tree ends up fairly different.

  23. Random forest proximity • When two people end up in the same terminal node of a tree, they are said to be proximate • The proximity score for (i, j) is the proportion of trees in which individuals i and j land in the same terminal node • We calculate it on held-out observations • It is a measure of similarity between two individuals in terms of their Xs • But only the similarity in terms of the Xs that matter to y • A metric-free, scale-invariant, supervised similarity score
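
A sketch of how the proximity matrix can be computed with scikit-learn; for brevity it uses in-sample leaf assignments rather than the held-out version described above, and `X` and `y` are hypothetical:

```python
# Random-forest proximity sketch: proximity[i, j] is the share of trees in
# which observations i and j fall in the same terminal node.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] + rng.normal(size=200)) > 0   # hypothetical outcome

rf = RandomForestClassifier(n_estimators=500).fit(X, y)
leaves = rf.apply(X)                 # (n_samples, n_trees) terminal-node ids

# Compare every pair's leaf ids across trees, then average over trees
proximity = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
```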

  24. Introduction to analogy weighting • Motivation 1: want parameters most relevant to today. • Motivation 2: want to know when model is least likely to do a good job.

  25. Unprincipled approaches

  26. Analogy weighting: the idea • Train a random forest on the dependent variable of interest with potentially many Xs • Take the proximity matrix from the random forest • Use the relevant row from this matrix to weight the observations in your parametric model • This is akin to training your model on the relevant history
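
A sketch of the idea: train a forest on the outcome, take the proximity of each historical observation to “today”, and pass those proximities as weights to a simple parametric model (all names hypothetical):

```python
# Analogy-weighting sketch: weight the history by its proximity to the most
# recent observation, then fit the simple model on that weighted history.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=300)

leaves = RandomForestRegressor(n_estimators=500).fit(X, y).apply(X)

today = -1                                         # index of "today's" row
weights = (leaves == leaves[today]).mean(axis=1)   # proximity of each row to today

# The parametric model is effectively trained on the relevant history
weighted_model = LinearRegression().fit(X, y, sample_weight=weights)
```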

  27. Implementing • For very simple models, canned functions normally take a weights argument. • For complex models, weights are normally not included. • Use Stan: make a direct call to increment_log_prob (target += in current Stan) rather than using sampling notation
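
Since the slides reference Stan without showing code, here is the same trick sketched in Python: a weighted likelihood just multiplies each observation’s log-density by its weight before summing, which is what a direct increment_log_prob / target += statement lets you do in Stan. Names are hypothetical:

```python
# Weighted-likelihood sketch for a Gaussian linear model.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_weighted_loglik(params, X, y, w):
    beta, log_sigma = params[:-1], params[-1]
    # sum_i w_i * log p(y_i | x_i' beta, sigma), negated for minimisation
    return -np.sum(w * norm.logpdf(y, loc=X @ beta, scale=np.exp(log_sigma)))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=300)
w = rng.uniform(size=300)            # e.g. a row of the proximity matrix

fit = minimize(neg_weighted_loglik, x0=np.zeros(3), args=(X, y, w))
```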

  28. When should I ignore my model?

  29. And when history is not relevant?

  30. Covariance in scale-correlation form: Σ = diag(σ) Ω diag(σ) • Here, σ is a vector of standard deviations and Ω is a correlation matrix • We can give σ a non-negative prior (say, half-Cauchy) and Ω an LKJ prior • LKJ is a one-parameter distribution over correlation matrices. • A shape parameter of 1 gives a uniform prior over correlation matrices. • High values (approaching infinity) concentrate mass on the identity matrix.
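
A small numerical illustration of the decomposition (the values are invented); in Stan the same construction is available as quad_form_diag(Omega, sigma):

```python
# Scale-correlation decomposition: Sigma = diag(sigma) @ Omega @ diag(sigma)
import numpy as np

sigma = np.array([1.0, 2.0, 0.5])            # standard deviations
Omega = np.array([[ 1.0, 0.3, -0.2],
                  [ 0.3, 1.0,  0.1],
                  [-0.2, 0.1,  1.0]])        # a valid correlation matrix

Sigma = np.diag(sigma) @ Omega @ np.diag(sigma)   # implied covariance matrix
```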

  31. Application: volatility modelling during the financial crisis • Most volatility models work like so: returns_t ~ multivariate distribution(expected returns_t, covariance_t) • The expected-returns model is just a forecasting model • The covariance needs to be explicitly modelled • Multivariate GARCH is common. • CCC-GARCH allows time-varying shock magnitudes • DCC-GARCH allows time-varying correlations that update with correlated shocks

  32. LKJ as a “danger prior” in volatility models • Idea: when we have relevant histories, we learn the correlation structure from the data. • When we have no relevant history, the likelihood does not move the posterior and we revert to the prior. • An LKJ prior with a low shape parameter (below 1) puts most of its mass on strong correlations, so in unprecedented states we assume highly correlated returns.

  33. Questions?
