
Economics for Data Science Chiara Binelli Academic year 2019-2020 - PowerPoint PPT Presentation



  1. Economics for Data Science Chiara Binelli Academic year 2019-2020 Email: chiara.binelli@unimib.it

  2. Data Science and Economics
     • Economics approach:
       1. Have a theory that identifies a relationship of interest (e.g. the impact of completing college on wages).
       2. Estimate the impact of a treatment (e.g. completing college) on an outcome variable (e.g. wages), holding everything else constant.
          → Focus on a few coefficients of interest to estimate causal effects.
          → Effort to estimate unbiased effects with carefully constructed standard errors.
     • Data Science approach (data-driven):
       1. Predict how a given outcome varies with a large number of potential predictors.
       2. May or may not use prior theory to establish which predictors are relevant.
          → Data-driven model selection to identify meaningful predictive variables.
          → Less attention to statistical uncertainty and standard errors, and more to model uncertainty.

  3. Data Science and Economics
     Two main limitations of the Data Science approach:
       1. Lack of theory: the approach is data-driven (predictive models are chosen using data-driven cross-validation methods).
       2. Lack of statistical significance: the focus is on predictions that minimize mean squared error, with little attention to statistical significance, since the exact source of variation identifying the prediction is difficult to assess.
          – Thus, bias is allowed in order to reduce variance.
          – Example: LASSO penalizes the inclusion of covariates, so if two covariates are highly correlated it tends to include only one of them, and that coefficient then reflects the impact of both the included and the excluded covariate.
            → Omitted variable bias (OVB)!
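The omitted variable bias in the LASSO example can be written out for the simplest case. This is a sketch under assumptions that are not on the slide: a linear model with two covariates $x_1$ and $x_2$, an error term that is mean-independent of both, and a plain least-squares regression of $y$ on $x_1$ alone (so it ignores the additional shrinkage that LASSO applies). The symbols $\beta_0, \beta_1, \beta_2$ and $\tilde{\beta}_1$ are introduced here for illustration.

```latex
\begin{align*}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon,
  \qquad E[\varepsilon \mid x_1, x_2] = 0 \\
\operatorname{plim} \tilde{\beta}_1
  &= \frac{\operatorname{Cov}(x_1, y)}{\operatorname{Var}(x_1)}
   = \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}
\end{align*}
```

The second term is the bias: it vanishes only if the excluded covariate has no effect ($\beta_2 = 0$) or is uncorrelated with the included one.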

  4. Data Science and Economics
     • Economists are often interested in assessing the effectiveness of a policy or in testing theories that predict a causal relationship.
       – The main goal is to identify statistically significant causal effects.
       – A model with a high degree of predictive fit is seen as secondary to finding an empirical specification that identifies a causal effect.
     • Common Data Science techniques such as classification and regression trees, lasso, boosting, and cross-validation have not been much used in Economics.

  5. Data Science and Economics
     • Concrete example (Einav and Levin 2014): assess whether taking online classes improves earnings.
     • Economics approach:
       – Either design an experiment that induces some workers to take online classes for reasons unrelated to their earning potential.
         • E.g. a change in the price of online classes.
       – Or, absent the experiment, use observational data to estimate the impact of online classes on earnings in an unbiased way.
       – Focus on:
         • Obtaining a precisely estimated point estimate of the impact of online classes on earnings.
         • Discussing whether there are omitted variables that might confound a causal interpretation (e.g. workers’ ambition driving a decision to take classes and work harder at the same time).
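The difference between the experimental and the observational strategy can be illustrated with a small simulation. This is a minimal sketch, not taken from Einav and Levin: the data-generating process, the unobserved variable `ambition`, the true effect of 2.0 and the function name `simulate` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate(n, randomized, true_effect=2.0, seed=0):
    """Return the difference in mean earnings between class takers and non-takers."""
    rng = random.Random(seed)
    takers, others = [], []
    for _ in range(n):
        ambition = rng.gauss(0.0, 1.0)  # unobserved confounder (assumed)
        if randomized:
            # Experiment: assignment is unrelated to ambition.
            takes_class = rng.random() < 0.5
        else:
            # Observational data: ambitious workers self-select into classes.
            takes_class = ambition + rng.gauss(0.0, 1.0) > 0.0
        earnings = (10.0 + true_effect * takes_class
                    + 3.0 * ambition + rng.gauss(0.0, 1.0))
        (takers if takes_class else others).append(earnings)
    return statistics.mean(takers) - statistics.mean(others)


observational = simulate(20000, randomized=False)
experimental = simulate(20000, randomized=True)
print(f"observational difference in means: {observational:.2f}")
print(f"experimental difference in means:  {experimental:.2f}")
```

Under these assumptions the experimental comparison should land close to the true effect, while the observational comparison is inflated because it also picks up the earnings premium of ambition.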

  6. Data Science and Economics
     • Data Science approach:
       – Identify which variables predict earnings, given a vast set of predictors in the data, and assess the potential for building a model that predicts earnings well, both in sample and out of sample.
       – Focus on:
         • A model that predicts earnings both for individuals who have and for individuals who have not taken online classes.
       – NOTE: the focus is not on causal effects and statistical significance but rather on prediction.

  7. Machine Learning and Statistical Inference
     • The flexibility of machine learning algorithms means that two different functions that use different variables can produce similar predictions:
       – In traditional estimation, large standard errors express the uncertainty in attributing effects.
       – In machine learning, the same uncertainty shows up as a lack of consistency in model selection: how should this be measured?
     • Computing standard errors in machine learning algorithms is difficult due to the data-driven approach:
       – Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to compute consistent estimates of model parameters after data-driven selection.

  8. Big Data and Statistical Inference
     • When big data represent all the data for a given set of variables, should we compute standard errors?
     • Very much YES!
       – The error of a model comes from two sources: omitted variables and measurement error.
         • Omitted variables error: some relevant explanatory variables are omitted and thus end up in the error term.
         • Measurement error: the dependent variable is measured with error.

  9. Big Data and Statistical Inference
     • Sample error is very different from model error: it is the difference between the sample-based regression results and the results based on the full population.
     • Probability theory tells us that, for a well-constructed sample, regression coefficients are unbiased estimates of population regression coefficients.
     • Tests of statistical significance are relevant both for samples and for entire data populations.
       – To read more on this, see Babones, S. J. 2013. Statistical Modeling with Cross-Sectional Designs, Chapter 5, pp. 107-118.
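Sample error can be made concrete by drawing repeated random samples from a known population and comparing the sample regression slopes with the population slope. This is a minimal sketch under assumed values: the population of 5,000 units, the linear relationship with slope 0.5, the sample size of 200 and the helper name `ols_slope` are hypothetical and only serve the illustration.

```python
import random
import statistics


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(1)

# Hypothetical full population (assumed data-generating process).
pop_x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
pop_y = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in pop_x]
population_slope = ols_slope(pop_x, pop_y)

# Repeatedly draw simple random samples and re-estimate the slope.
sample_slopes = []
for _ in range(500):
    idx = rng.sample(range(5000), 200)
    sample_slopes.append(
        ols_slope([pop_x[i] for i in idx], [pop_y[i] for i in idx])
    )

mean_slope = statistics.mean(sample_slopes)
spread = statistics.stdev(sample_slopes)
print(f"population slope:        {population_slope:.3f}")
print(f"mean of sample slopes:   {mean_slope:.3f}")
print(f"spread of sample slopes: {spread:.3f}")
```

The average of the sample slopes should sit close to the population slope, while the spread across samples is the sample error that a standard error is meant to quantify. Model error is a separate matter: even `population_slope` differs from the assumed 0.5 because of the noise in the outcome.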

  10. Data Science and Economics
     • Thanks to its grounding in theory, the Economics approach is more interpretable in terms of which variation identifies the impact of interest and its statistical significance.
     • The Data Science approach is better for predictions:
       – Examples: comparison of the performance of OLS vs machine learning algorithms (regression trees, random forest, LASSO, ensemble) in Mullainathan and Spiess (2017, Journal of Economic Perspectives); advantages of using ensemble methods to improve predictions (Athey et al. 2019).
       – Intuition: machine learning algorithms easily allow introducing pairwise interactions between all potential predictors.
     • The two approaches have mutual benefits.
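The intuition about pairwise interactions can be sketched in a few lines. The example below builds the interaction terms explicitly, which is only one way to obtain them (tree-based methods capture interactions implicitly); the function name `add_pairwise_interactions` and the predictors `age`, `schooling` and `tenure` are hypothetical.

```python
from itertools import combinations


def add_pairwise_interactions(row):
    """Return a copy of row with a product term for every pair of predictors."""
    expanded = dict(row)
    for a, b in combinations(sorted(row), 2):
        expanded[f"{a}*{b}"] = row[a] * row[b]
    return expanded


# Hypothetical worker with three predictors.
worker = {"age": 30.0, "schooling": 16.0, "tenure": 4.0}
features = add_pairwise_interactions(worker)
for name, value in features.items():
    print(name, value)
```

With p predictors this adds p(p-1)/2 terms, so the feature set grows quickly; this is one reason such specifications are usually paired with regularization or data-driven selection rather than estimated by plain OLS.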

  11. Economics for Data Science
     • From Economics to Data Science: 2 main contributions
       1. Provide a theory: theory helps to ask interesting questions and to analyze complex big datasets. With data complexity, it is crucial to have models that guide the choice of variables, the relationships between variables, the hypotheses to test and the experiments to run.
       2. Focus on causality: crucial to answer important questions.
     • From Data Science to Economics: 3 main contributions
       1. Test robustness to misspecifications (Athey and Imbens 2015).
       2. New tools for causal inference.
       3. Better predictions.

  12. From Economics to Data Science: 1. Provide a Theory
     • Example: online advertising auctions.
     • Important question for Google or Facebook:
       – Which ads to show online and how much to charge for the ads?
         1. Machine learning methods to build a predictive model that assesses the likelihood that a user will click on an ad. By exploiting the enormous amount of data available online, this predictive model tells us which ads to show.
         2. Economic theory to build auction models to set prices.
     • Several e-commerce companies have built teams of economists (often with PhDs in Economics), statisticians and computer scientists.
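How a click prediction and an auction rule can fit together is sketched below. The slide does not say which auction format is used, so this is an assumption: a single ad slot, ads ranked by bid times predicted click probability, and a second-price rule in which the winner pays the lowest per-click bid that would still have won. The function name `run_single_slot_auction` and all bids and probabilities are hypothetical.

```python
def run_single_slot_auction(ads):
    """Pick a winner for one ad slot and compute its price per click.

    ads: list of (name, bid_per_click, predicted_click_prob) tuples, where
    predicted_click_prob would come from a machine learning model.
    Returns (winner_name, price_per_click).
    """
    # Rank by expected revenue per impression: bid * click probability.
    ranked = sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    # Second-price rule: the smallest bid that still matches the runner-up's score.
    price = runner_up[1] * runner_up[2] / winner[2]
    return winner[0], price


# Hypothetical bids (per click) and predicted click probabilities.
ads = [("A", 2.00, 0.10), ("B", 3.00, 0.05), ("C", 1.00, 0.12)]
winner, price = run_single_slot_auction(ads)
print(f"winner: {winner}, price per click: {price:.2f}")
```

In this sketch ad B has the highest bid but loses to ad A, whose higher predicted click probability gives it a higher score, and A pays less than its own bid. The prediction model decides the ranking; the auction rule decides the price.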

  13. From Economics to Data Science: 1. Provide a Theory
     • A theory is a way to investigate the mechanism through which X affects Y. It is a way to make ML interpretable.
     • “Interpretable machine learning”: an ML field that goes beyond a “black box” approach to explain the logic behind predictions.
       – To interpret a model, we require the following insights:
         1. Which features are the most important.
         2. For any single prediction, the effect of each feature in the data on that particular prediction.
         3. The effect of each feature over a large number of possible predictions.
     • Molnar (2019): https://christophm.github.io/interpretable-ml-book/ and Kaggle crash course on ML explainability: https://www.kaggle.com/learn/machine-learning-explainability
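The first insight, identifying the most important features, is often approached with permutation importance: shuffle one feature and measure how much the prediction error grows. The sketch below is a minimal standard-library version under assumed data; the `predict` function stands in for any fitted black-box model, and the coefficients 3.0 and 0.5, the three features and the helper names are hypothetical.

```python
import random


def mse(predict, X, y):
    """Mean squared error of predict over rows X with targets y."""
    return sum((predict(row) - t) ** 2 for row, t in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, column, rng):
    """Increase in mean squared error after shuffling one feature column."""
    baseline = mse(predict, X, y)
    shuffled = [row[column] for row in X]
    rng.shuffle(shuffled)
    X_perm = [row[:column] + [v] + row[column + 1:]
              for row, v in zip(X, shuffled)]
    return mse(predict, X_perm, y) - baseline


rng = random.Random(0)

# Assumed data: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
X = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
y = [3.0 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.1) for r in X]


def predict(row):
    """Hypothetical fitted model, standing in for any black box."""
    return 3.0 * row[0] + 0.5 * row[1]


importances = [permutation_importance(predict, X, y, c, rng) for c in range(3)]
for c, imp in enumerate(importances):
    print(f"feature {c}: importance {imp:.3f}")
```

Shuffling a feature the model relies on heavily damages its predictions the most, so the ranking of the importances recovers the ranking of the features. Note that this says which features the model uses, not which features cause the outcome.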

  14. From Economics to Data Science: 2. Focus on Causality
     • Machine learning algorithms optimize properties of the observed data: they improve performance by optimizing parameters over a set of inputs. E.g. to build a predictive model we minimize overfitting.
       – “As long as we optimize some properties of the observed data, however noble or sophisticated, while making no reference to the world outside the observed data, we are limited to questions of association.” Pearl (2018)
     • However, lots of important questions involve cause-and-effect relationships. Until recently we had no mathematical framework to articulate and answer these questions.
     • “More has been learned about causal inference in the last few decades than the sum total of everything that had been learned about it in all prior recorded history.” Gary King, Harvard. “The Causal Revolution” (Pearl and Mackenzie, 2018).

  15. From Economics to Data Science: 2. Focus on Causality (Pearl 2018)
     • Human-level AI cannot emerge solely from model-blind learning machines; it requires the symbiotic collaboration of data and models.
     • Data science is only as much of a science as it facilitates the interpretation of data - a two-body problem, connecting data to reality.
     • Data alone are hardly a science, regardless how big they get and how skillfully they are manipulated.
       – We need a theory to interpret the data.
