Big data: What are you missing? The risks of assuming data equals “all”

  1. Big data: What are you missing? The risks of assuming data equals “all”
     David J. Hand, Imperial College London and Winton Capital Management
     6th January 2016, Theory of Big Data, UCL

  2. BACKGROUND
     The promise of big data: McKinsey’s big data report:
     ‘we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture, all driven by big data as consumers, companies, and economic sectors exploit its potential’
     and endless ditto by others

  3. However, as I have argued elsewhere
     1) big data is not the solution: it’s what you do with it that counts
     2) big data carries risks

  4. Two kinds of big data opportunities
     1) Computer science: through data manipulation (merging, linking, matching, concatenating, sorting, basic arithmetic, ...)
        Database heritage: could conceivably have all the data
        - e.g. stock in the warehouse
        - e.g. employees in the firm
     2) Statistics: through inference and predictive analytics
        Many (most?) problems cannot have all the data
        - e.g. observations in clinical trials
        - e.g. forecasting
        - e.g. physics experiments

  5. The challenges of big data
     1) Computational and mathematical challenges
        - large n and/or d
        - speed of acquisition, real-time analysis
        Hand’s Law: the requirements for increased computer power always increase faster than the increase in power itself
     2) Inferential and statistical challenges
        - complexity: networks, mixed data types, ...

  6. 3) Data challenges
     - data quality
     - non-stationarity
     - formulating the question
     - correlation vs causation
     - ...

  7. THE AIM OF THIS PAPER: To focus on one problem and show how it is pervasive in big data opportunities
     - risking misleading conclusions
     - incorrect understanding
     - mistaken decisions
     - wasted money
     - ...
     And to show what’s needed to tackle it

  8. This is the problem of SELECTION BIAS

  10. SOME EXAMPLES: Example 1: Potholes
      - Streetbump smartphone app
      - Detects potholes using accelerometer and emails location to local authority using GPS
      - “Big data”, but no sophisticated computation or analytics
      - But lower income people less likely to have smartphones and cars, older people less likely to have smartphones, ... → streets in richer areas get fixed

  11. Example 2: Hurricane Sandy
      20 million tweets between 27 October and 1 November 2012
      But a distorted impression of where problems are:
      - most tweets came from Manhattan
      - few from “more severely affected locations, such as Breezy Point, Coney Island and Rockaway”
      - because of relative density of population/smartphones
      - because power outages meant phones not recharged
      → distorted impression of where the damage occurred

  12. Example 3: Retail finance scorecard construction
      Aim: build model to decide which applicants should be given a loan
      Data: characteristics and (default/repay) outcome of those granted loans in past
      - but those granted loans in the past were selected on the basis of some previous scorecard
      - they do not represent the entire population of applicants
      Same structure for student selection, staff recruitment, ...
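A minimal simulation sketch of why outcomes observed only for previously accepted applicants misrepresent the applicant population. Everything in it is an illustrative assumption rather than something taken from the talk: the uniform `score`, the rule P(default) = 1 - score, the acceptance `cutoff` of 0.5, and the function name `simulate_default_rates`.

```python
import random


def simulate_default_rates(n_applicants, cutoff, seed):
    """Compare default rates for all applicants vs. those a previous scorecard accepted."""
    rng = random.Random(seed)
    all_outcomes = []
    accepted_outcomes = []
    for _ in range(n_applicants):
        # Hypothetical creditworthiness score in [0, 1); higher is safer.
        score = rng.random()
        # Assumed relationship: probability of default is 1 - score.
        defaulted = rng.random() < 1.0 - score
        all_outcomes.append(defaulted)
        # The previous scorecard granted loans only at or above the cutoff,
        # so only these outcomes would appear in the historical data.
        if score >= cutoff:
            accepted_outcomes.append(defaulted)
    population_rate = sum(all_outcomes) / len(all_outcomes)
    accepted_rate = sum(accepted_outcomes) / len(accepted_outcomes)
    return population_rate, accepted_rate


population_rate, accepted_rate = simulate_default_rates(100_000, cutoff=0.5, seed=1)
print(f"default rate, all applicants:      {population_rate:.3f}")
print(f"default rate, accepted applicants: {accepted_rate:.3f}")
```

Under these assumptions the default rate among accepted applicants sits near 0.25 while the rate over all applicants sits near 0.5, so a model built on the accepted cases alone sees a much safer population than the one it will be applied to.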

  13. Example 4: Crime rates
      Points to note:
      1) The difference between the CSE&W and PRC
      2) The dramatic fall in CSE&W from 1995

  14. 1) Crime Survey for E&W versus Police Recorded Crime
      CSE&W: aged ≥ 16; children 10-15; not group residences; not crimes against commercial or public sector bodies; victim-based (does not include murder); not fraud and cyber; capping repeat victimisation; ...
      PRC: reported to and recorded by police; crime defined by “Notifiable Offence List” (incl. murder, public order, ...); incl. residents of institutions and tourists; incl. commercial bodies
      2) CSE&W: 19m in 1995 to 7m in y.e. June 2015
      Less crime, or shifting patterns of crime, e.g. to fraud, which is not measured on CSE&W

  15. Plastic card fraud in the UK, 2004-2014

  16. Example 5: Publication bias
      Relevant factors include:
      - tendency not to submit negative results (file-drawer effect)
      - positive results are more interesting to editors
      - anomalous results may be regarded as errors, and not submitted
      In an exploration of publication bias in the Cochrane database of systematic reviews:
      “In the meta-analyses of efficacy, outcomes favoring treatment had on average a 27% ... higher probability to be included than other outcomes. In the meta-analyses of safety, results showing no evidence of adverse effects were on average 78% ... more likely to be included than results demonstrating that adverse effects existed.” Kicinski et al. 2015
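A small simulation sketch of the file-drawer effect, under assumptions that are not from the talk: the true effect is exactly zero, each study reports a z-score drawn from a standard normal, significantly positive results (one-sided 5% level) are always published, and other results are published with a hypothetical probability of 0.2. The function name `simulate_publication` is likewise illustrative.

```python
import random


def simulate_publication(n_studies, publish_prob_nonsignificant, seed):
    """Return the mean z-score over all studies and over published studies only."""
    rng = random.Random(seed)
    all_z = []
    published_z = []
    for _ in range(n_studies):
        # True effect is zero, so the estimate (in standard-error units) is pure noise.
        z = rng.gauss(0.0, 1.0)
        all_z.append(z)
        significant_positive = z > 1.645  # one-sided test at the 5% level
        # Assumed selection rule: significant positives are always published,
        # everything else only with a small probability.
        if significant_positive or rng.random() < publish_prob_nonsignificant:
            published_z.append(z)
    mean_all = sum(all_z) / len(all_z)
    mean_published = sum(published_z) / len(published_z)
    return mean_all, mean_published


mean_all, mean_published = simulate_publication(10_000, 0.2, seed=2)
print(f"mean z, all studies:       {mean_all:+.3f}")
print(f"mean z, published studies: {mean_published:+.3f}")
```

Even though no study measures a real effect, the published subset averages a clearly positive z-score, which is the pattern a meta-analysis of published results alone would inherit.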

  17. WHAT DRIVES SELECTION BIAS:
      1) Natural mechanisms
      Abraham Wald and the WWII bomber armour
      The bullet holes in returning bombers showed where they could be hit without bringing them down
      A lesson for business schools? Look at the failures, not the successes
      Francis Bacon: “when they showed him hanging in a temple a picture of those who had paid their vows as having escaped shipwreck, and would have him say whether he did not now acknowledge the power of the gods, ‘Aye,’ asked he again, ‘but where are they painted that were drowned after their vows?’”

  18. 2) Non-response and refusals
      LFS quarterly survey wave-specific response rates: March-May 2000 to July-Sept 2015
      http://www.ons.gov.uk/ons/guide-method/method-quality/specific/labour-market/labour-force-survey/index.html

  19. 3) Self-selection
      (i) The magazine survey which asks the one question: do you reply to magazine surveys?
      (ii) The Literary Digest’s disastrous prediction that Landon would beat Roosevelt in the 1936 presidential election
      Standard explanation: the prediction was based on polling people with phones, who are more likely to be Republican
      But this is a myth
      In fact 10m people were polled, but only 2.3m replied
      A self-selected sample, and in this election the anti-Roosevelt voters felt more strongly than the pro

  20. (iii) The Actuary edition of July 2006 included an editorial which said
      ‘A couple of months ago I invited you - all 16,245 of you - to participate in our online survey concerning the sex of actuarial offspring. ... Well, I’m pleased to say that a number of you (13, in fact) replied to our poll.’
      Particularly web-based surveys
      - who replies?
      - under-representation of some groups
      - multiple responding
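The self-selection mechanism behind the Literary Digest poll (10m polled, 2.3m replies) can be sketched with a little arithmetic. The numbers below are illustrative assumptions, not historical estimates: a true Roosevelt share of 0.60, and reply rates of 0.15 for pro-Roosevelt and 0.35 for anti-Roosevelt voters, chosen only so that the overall reply rate matches 2.3m out of 10m. The function name `respondent_share` is hypothetical.

```python
def respondent_share(true_share, reply_rate_pro, reply_rate_anti):
    """Share of supporters among those who reply, when reply rates differ by group."""
    # Fraction of everyone polled who is a supporter AND replies.
    pro_replies = true_share * reply_rate_pro
    # Fraction of everyone polled who is an opponent AND replies.
    anti_replies = (1.0 - true_share) * reply_rate_anti
    overall_reply_rate = pro_replies + anti_replies
    share_among_respondents = pro_replies / overall_reply_rate
    return share_among_respondents, overall_reply_rate


share_among_respondents, overall_reply_rate = respondent_share(
    true_share=0.60,       # assumed share of the polled population backing Roosevelt
    reply_rate_pro=0.15,   # assumed: pro-Roosevelt voters reply less often
    reply_rate_anti=0.35,  # assumed: anti-Roosevelt voters feel more strongly and reply more
)
print(f"overall reply rate:              {overall_reply_rate:.2f}")
print(f"Roosevelt share among responses: {share_among_respondents:.3f}")
```

With these assumed rates, a 60% majority in the polled population shows up as roughly 39% among the replies. The size of the sample does nothing to correct this, because the distortion comes from who chooses to respond.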

  21. 4) Data dredging
      Test enough (true null) hypotheses and you expect some to be significant by chance
      This does not have to be dishonest: if 1000 teams each test one true null hypothesis at the 5% level ...
      Charles Babbage termed such data dredging “cooking”: “make multitudes of observations, and out of these to select only those which agree, or very nearly agree. If a hundred observations are made, the cook must be very unlucky if he cannot pick out fifteen or twenty which will do for serving up”
      Robert Millikan, Gregor Mendel, ...
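The 1000-teams remark can be made concrete with a short sketch. It assumes the tests are independent and that, under a true null hypothesis, each p-value is uniform on (0, 1), which holds for continuous test statistics. The function name `count_false_positives` and the seed are illustrative choices.

```python
import random


def count_false_positives(n_tests, alpha, seed):
    """Simulate n_tests independent tests of true null hypotheses; count 'significant' ones."""
    rng = random.Random(seed)
    # Under a true null, a p-value from a continuous test statistic is Uniform(0, 1).
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in p_values)


n_tests = 1000  # one true null hypothesis per team
alpha = 0.05    # 5% significance level

# Expected number of teams that find a 'significant' result purely by chance.
expected_false_positives = n_tests * alpha
# Probability that at least one team does so.
prob_at_least_one = 1.0 - (1.0 - alpha) ** n_tests
observed = count_false_positives(n_tests, alpha, seed=0)

print(f"expected false positives: {expected_false_positives:.0f}")
print(f"P(at least one):          {prob_at_least_one:.6f}")
print(f"simulated false positives: {observed}")
```

On average about 50 of the 1000 teams obtain a significant result with nothing there to find, and if only those teams report, the published record consists entirely of false positives without anyone having behaved dishonestly.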

  22. 5) Harking
      Hypothesising after the results are known
      Presenting post-hoc hypotheses as if they were a priori
      Popperian science:
      Step 1: data suggest theory
      Step 2: theory is tested with new data
      Step 3: loop through steps 1 and 2
      Harking arises when the same data are used in Steps 1 and 2

  23. 6) Feedback and asymmetric information
      (i) The market for lemons
      The buyer of a used car, with no further information on the vehicle in question, offers the average price of such vehicles
      The seller can keep the better quality ones and sell only the poor quality ones
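A toy sketch of the lemons mechanism. The slide describes a single step; repeating it, with the buyer re-offering the average of whatever is still for sale, is an assumption added here to show how the pool on offer degrades. The car values (integers 1 to 100, in arbitrary units) and the function name `lemons_offers` are hypothetical.

```python
def lemons_offers(values, rounds):
    """Iterate: buyer offers the average value on the market; better cars are withdrawn."""
    on_market = list(values)
    offers = []
    for _ in range(rounds):
        if not on_market:
            break
        # The buyer cannot tell cars apart, so offers the average value of those for sale.
        offer = sum(on_market) / len(on_market)
        offers.append(offer)
        # Sellers know their own car's value and keep any car worth more than the offer.
        on_market = [v for v in on_market if v <= offer]
    return offers, on_market


values = range(1, 101)  # hypothetical true values of 100 used cars
offers, remaining = lemons_offers(values, rounds=4)
print("offers by round:", offers)
print("cars still for sale:", remaining)
```

In this toy setting the offer falls from 50.5 to 25.5, 13.0 and 7.0 over four rounds, and only the seven worst cars remain for sale. Data on completed sales would therefore describe a market quite different from the full population of cars.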

  24. (ii) Crimemaps

  25. But people will not bother to report minor crime if they feel there’s no point or for other reasons
      “More than 5.2 million people have not reported crimes for fear of deterring home buyers or renters since the online crime map was launched in February 2011”
      “A quarter (24 per cent) of people would not report a crime for fear it would harm their chances of selling or renting their property”
      http://www.directline.com/media/archive-2011/news-11072011

  26. (iii) Evaluating new scorecards
      Apply incumbent and challenger to a sample of customers
      But this sample will have been accepted by the incumbent → data asymmetry
      Standard scorecard performance measures favour the challenger

  27. (iv) Credit card transaction fraud detection
      Transaction stream terminated when incumbent detects a fraudulent transaction, not when the challenger does → data asymmetry
      Standard fraud detection measures favour the incumbent
