challenges of forecasting with fat tailed data

Challenges of forecasting with fat tailed data (Aaron Clauset)


  1. Challenges of forecasting with fat tailed data. Aaron Clauset (@aaronclauset), Assistant Professor, Computer Science and BioFrontiers Institute, University of Colorado Boulder; External Faculty, Santa Fe Institute. 15 October 2013.

  2. joint work with Mark Newman, Cosma Shalizi, Ryan Woodard

  3. 1. predicting the unpredictable; 2. modeling rare events; 3. historical probability; 4. statistical forecast; 5. financial data; 6. outlook

  4. 1. predicting the unpredictable. complex systems: “heavy” or “fat” tailed quantities • book sales • earthquakes • terrorist attacks • civil or international wars • stock market crashes • electrical power outages • solar flare intensity • etc.

  5. 24 empirical data sets. [Figure: 24 log-log panels, (a) through (x), each plotting P(x) against x. Panel titles: cities, email, fires, words, proteins, metabolic, flares, quakes, religions, Internet, calls, wars, surnames, wealth, citations, terrorism, HTTP, species, authors, web hits, web links, birds, blackouts, book sales.]

  6. 1. predicting the unpredictable. complex systems: “heavy” or “fat” tailed quantities • book sales • earthquakes • terrorist attacks • civil or international wars • stock market crashes • electrical power outages • solar flare intensity • etc.

  7. 1906 San Francisco, M7.8; 2008 Sichuan, M7.9; 2011 Japan, M8.9

  8. earthquake physics: Gutenberg-Richter law, frequency vs. size. $(\text{frequency}) \propto (\text{seismic moment})^{-\alpha}$. [Figure: magnitude $M$ against earthquake number; inset: proportion $\ge M$ against magnitude $M$.]

  9. earthquakes vs. wars: inter-state wars, 1816-2007. $(\text{frequency}) \propto (\text{deaths})^{-\alpha}$. [Figure: battle deaths $S$ (severity) against interstate war number (1816-2007), with WWI and WWII labeled; inset: proportion $\ge S$ against $S$.]

  10. earthquakes vs. global terrorism: 13,274 deadly attacks worldwide, Jan. 1998-2008. Richardson’s law (1941): $(\text{frequency}) \propto (\text{deaths})^{-\alpha}$. [Figure: severity $S$ (deaths) against attack number (Jan 1998-2008), with 9-11 labeled; inset: proportion $\ge S$ against $S$.]

  11. earthquakes: Gutenberg-Richter law, $F \propto M^{-\alpha}$ • physics largely known • processes fixed • forecasting possible (years of successes) • prediction very hard (years of failures)
      terrorism & insurgency: Richardson’s law, $F \propto S^{-\alpha}$ • processes largely unknown • processes dynamic, adaptive • how do we forecast? • what can we predict? • what can we not predict?

  12. 1. predicting the unpredictable; 2. modeling rare events; 3. historical probability; 4. statistical forecast; 5. financial data; 6. outlook

  13. 2. modeling rare events • not in financial markets (yet) • but in global terrorism • how probable was a 9/11-sized event? • how probable is another 9/11-sized event?

  14. deadly terrorist events, 1968-2008 (RAND-MIPT event database). Number of incidents by deaths per attack: 1-9: 12280 • 10-99: 957 • 100-999: 36 • 1000+: 1

  15. deadly terrorist events, 1968-2008 (RAND-MIPT event database). Number of incidents by deaths per attack: 1-9: 12280 • 10-99: 957 • 100-999: 36 • 1000+: 1. Annotations: “normal,” 92% • large, 8% • very large, 0.3%
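
The percentages on this slide can be checked against the bin counts. The sketch below assumes a mapping that the flattened slide does not state outright: "normal" is the 1-9 bin, "large" is every attack with 10 or more deaths, and "very large" is every attack with 100 or more deaths.

```python
# Incident counts per deaths-per-attack bin, as printed on the slide.
counts = {"1-9": 12280, "10-99": 957, "100-999": 36, "1000+": 1}

total = sum(counts.values())


def share(bins):
    """Fraction of all incidents that fall in the given bins."""
    return sum(counts[b] for b in bins) / total


# ASSUMPTION: how the slide's three labels map onto the bins.
normal = share(["1-9"])
large = share(["10-99", "100-999", "1000+"])
very_large = share(["100-999", "1000+"])

print(total)
print(f"normal {normal:.1%}, large {large:.1%}, very large {very_large:.2%}")
```

Under that mapping the counts sum to the 13,274 deadly events quoted elsewhere in the deck, and the three shares land close to the rounded figures on the slide.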

  16. how probable was a 9/11-sized event? requires a probability model $\Pr(x)$

  17. how probable was a 9/11-sized event? requires a probability model $\Pr(x)$. key observations • care only about large events: disproportionate consequences • unknown upper tail structure: several models fit well • little data in upper tail: large statistical uncertainty

  18. how probable was a 9/11-sized event? requires a probability model $\Pr(x)$. key observations • care only about large events (disproportionate consequences) → separate tail from body • unknown upper tail structure (several models fit well) → multiple tail models • little data in upper tail (large statistical uncertainty) → distribution over conclusions. model-based, data-driven forecasts

  19. 1. predicting the unpredictable; 2. modeling rare events; 3. historical probability; 4. statistical forecast; 5. financial data; 6. outlook

  20. step 1: the data
      Terrorism event data from RAND-MIPT Terrorism Knowledge Base (2008). 40 years of data (1968-2007). Worldwide (~200 countries). 13,274 deadly events.
      Each event is localized in time and space, and MIPT records its severity (deaths). 9/11 recorded as three events; the NYC event records 2749 deaths.
      [Figure: $\Pr(X \ge x)$ against severity $x$ (deaths) on log-log axes, with 9/11 marked.]
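
The curve on this slide is an empirical complementary CDF, the fraction of events at least as severe as each observed value. A minimal sketch of how such a curve might be computed from a list of severities follows; the function name is illustrative and the toy input is not from the RAND-MIPT data.

```python
def empirical_ccdf(values):
    """Return the distinct values x (ascending) and Pr(X >= x) for each."""
    ordered = sorted(values)
    n = len(ordered)
    xs = sorted(set(ordered))
    ccdf = []
    i = 0  # number of observations strictly below the current x
    for x in xs:
        while i < n and ordered[i] < x:
            i += 1
        ccdf.append((n - i) / n)
    return xs, ccdf


# Toy severities, for illustration only.
xs, probs = empirical_ccdf([1, 1, 2, 5, 10])
print(list(zip(xs, probs)))
```

Plotting `probs` against `xs` on log-log axes gives the kind of view the slide shows, where a power-law tail appears as a roughly straight line.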

  21. step 2: separate tail from body
      Choose $x_{\min} = y$ such that $d\left[\, S(x \ge y),\; F(x \ge y \mid \hat{\theta}) \,\right]$ is minimized. Here, we let $d[\cdot,\cdot]$ be the KS-statistic.
      [Figure: $\Pr(X \ge x)$ against severity $x$ (deaths) on log-log axes, with the body and tail regions and 9/11 marked.]
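
The threshold search can be sketched as a scan over candidate values of y: fit the tail model above each candidate, measure the KS distance between the empirical and fitted tails, and keep the candidate with the smallest distance. The sketch below assumes a continuous power-law tail and its closed-form maximum-likelihood exponent, which is a simplification for integer-valued death counts; `min_tail` is a hypothetical guard against very short tails, not something the slide specifies.

```python
import bisect
import math


def fit_alpha(tail, x_min):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / x_min))."""
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def ks_distance(sorted_tail, x_min, alpha):
    """Largest gap between the empirical and fitted CDFs on the tail."""
    n = len(sorted_tail)
    worst = 0.0
    for i, x in enumerate(sorted_tail):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst


def choose_x_min(data, min_tail=10):
    """Scan candidate thresholds y and keep the one with the smallest KS distance.

    Returns (ks, x_min, alpha), or None if no candidate has a usable tail.
    """
    ordered = sorted(data)
    best = None
    for y in sorted(set(ordered)):
        tail = ordered[bisect.bisect_left(ordered, y):]
        # Skip tails that are too short or that contain only the value y.
        if len(tail) < min_tail or tail[-1] == y:
            continue
        alpha = fit_alpha(tail, y)
        d = ks_distance(tail, y, alpha)
        if best is None or d < best[0]:
            best = (d, y, alpha)
    return best
```

Calling `choose_x_min(severities)` on a list of positive event sizes returns the selected threshold together with the exponent fitted above it.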

  22. step 2: model the upper tail
      Let $\Pr(x) \propto x^{-\alpha}$ for values $x \ge x_{\min}$ (Pareto distribution).
      For the empirical data, we estimate $\hat{\alpha} = 2.4 \pm 0.1$, with $x_{\min} = 10$. This yields 994 tail events (7.5%).
      A Monte Carlo hypothesis test finds $p = 0.68 \pm 0.03$, meaning the power law cannot be rejected as a model of these data.
      A likelihood ratio test finds the stretched exponential and log-normal distributions also plausible.
      [Figure: $\Pr(X \ge x)$ against severity $x$ (deaths) on log-log axes, with the fitted Pareto tail, the body, and 9/11 marked.]
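
One way a Monte Carlo hypothesis test of this kind can work: fit the power law, draw many synthetic tails from the fitted model, refit each one, and report the fraction whose KS distance is at least as large as the observed one. The sketch below is a simplified version under stated assumptions, not the procedure behind the slide's numbers: it uses the continuous power law, holds `x_min` fixed instead of re-estimating it, and the names `gof_p_value` and `n_sims` are illustrative.

```python
import math
import random


def fit_alpha(tail, x_min):
    """Continuous power-law MLE for the exponent."""
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def ks_distance(tail, x_min, alpha):
    """Largest gap between the empirical and fitted CDFs on the tail."""
    ordered = sorted(tail)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        model_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)
        worst = max(worst, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return worst


def sample_power_law(n, x_min, alpha, rng):
    """Inverse-transform draws from the continuous power law on x >= x_min."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def gof_p_value(tail, x_min, n_sims=200, seed=0):
    """Return (alpha, observed KS, fraction of synthetic fits at least as bad)."""
    rng = random.Random(seed)
    alpha = fit_alpha(tail, x_min)
    d_obs = ks_distance(tail, x_min, alpha)
    at_least_as_bad = 0
    for _ in range(n_sims):
        synthetic = sample_power_law(len(tail), x_min, alpha, rng)
        d_syn = ks_distance(synthetic, x_min, fit_alpha(synthetic, x_min))
        if d_syn >= d_obs:
            at_least_as_bad += 1
    return alpha, d_obs, at_least_as_bad / n_sims
```

A large p-value means the observed tail looks no worse than typical samples from the fitted model, so the power law is not rejected; a small one means the fit is worse than chance alone would explain.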

  23. step 3: bootstrap the data and repeat
      Given $n$ observed event sizes, generate $Y$ by drawing $y_j$, $j = 1, \dots, n$, uniformly at random, with replacement, from the observed events $X = \{x_i\}$.
      For each tail model $\Pr(x \mid \theta, x_{\min})$, the MLE parameter choice $\theta(Y, x_{\min})$ is deterministic.
      The bootstrap produces a distribution $\Pr(\theta, x_{\min})$ that captures the statistical uncertainty within this model.
      [Figure: $\Pr(X \ge x)$ against severity $x$ (deaths) on log-log axes, with the Pareto tail and 9/11 marked; inset: bootstrap distribution of $\hat{\alpha}$, axis from 2.2 to 2.6.]
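
The resampling loop can be sketched as follows. This is a simplified illustration under assumptions: it uses the continuous power-law MLE and keeps `x_min` fixed across resamples, whereas the slide describes a distribution over both the parameters and `x_min`. The function names are hypothetical.

```python
import math
import random


def fit_alpha(tail, x_min):
    """Continuous power-law MLE for the exponent."""
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def bootstrap_alpha(data, x_min, n_boot=500, seed=0):
    """Refit the exponent on resampled data; returns the sorted estimates."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Draw n events uniformly at random, with replacement.
        resample = rng.choices(data, k=n)
        tail = [x for x in resample if x >= x_min]
        # The fit needs at least one event strictly above x_min.
        if tail and max(tail) > x_min:
            estimates.append(fit_alpha(tail, x_min))
    estimates.sort()
    return estimates


def percentile_interval(sorted_vals, level=0.9):
    """Central interval read off the sorted bootstrap estimates."""
    k = len(sorted_vals) - 1
    lo = sorted_vals[round((1.0 - level) / 2.0 * k)]
    hi = sorted_vals[round((1.0 + level) / 2.0 * k)]
    return lo, hi
```

Because each resample contains a different number of tail events, the spread of the estimates reflects both the sampling noise in the exponent and the variation in tail size.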

  24. step 4: repeat with alternative models
      Repeat the above steps, but with additional tail models. Here, we choose:
      Stretched exponential: $\Pr(x) \propto x^{\beta - 1} e^{-\lambda x^{\beta}}$
      Log-normal: $\Pr(x) \propto \frac{1}{x} e^{-(\ln x - \mu)^2 / (2\sigma^2)}$
      Neither can be rejected, under a LRT, as a model of events $x \ge x_{\min} = 10$.
      Using multiple tail models better represents model uncertainty.
      [Figure: $\Pr(X \ge x)$ against severity $x$ (deaths) on log-log axes, with Pareto, stretched exponential, and log-normal tail fits and 9/11 marked.]
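
To compare tail models by likelihood, each density has to be normalized on the tail $x \ge x_{\min}$. The sketch below writes out log-densities for the three models under assumptions the slide does not spell out: continuous $x$, the stretched exponential read as $x^{\beta-1} e^{-\lambda x^{\beta}}$, and normalization constants derived by integrating each form from $x_{\min}$ to infinity. Parameter fitting is left out; the functions only evaluate a model at given parameters.

```python
import math


def log_power_law(x, x_min, alpha):
    """log of (alpha - 1) / x_min * (x / x_min) ** (-alpha), for x >= x_min."""
    return math.log(alpha - 1.0) - math.log(x_min) - alpha * math.log(x / x_min)


def log_stretched_exp(x, x_min, beta, lam):
    """log of beta * lam * x**(beta-1) * exp(-lam * (x**beta - x_min**beta))."""
    return (math.log(beta) + math.log(lam) + (beta - 1.0) * math.log(x)
            - lam * (x ** beta - x_min ** beta))


def log_lognormal(x, x_min, mu, sigma):
    """log of a log-normal density truncated to x >= x_min."""
    z_min = (math.log(x_min) - mu) / (sigma * math.sqrt(2.0))
    log_tail_mass = math.log(0.5 * math.erfc(z_min))
    return (-math.log(x) - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
            - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2) - log_tail_mass)


def log_likelihood_ratio(tail, log_pdf_a, log_pdf_b):
    """Sum of pointwise log-density differences; positive values favor model a."""
    return sum(log_pdf_a(x) - log_pdf_b(x) for x in tail)
```

For example, `log_likelihood_ratio(tail, lambda x: log_power_law(x, 10, 2.4), lambda x: log_lognormal(x, 10, mu, sigma))` compares a power law against a log-normal once `mu` and `sigma` have been fitted by some other routine. A ratio near zero relative to its sampling variability means the data do not distinguish the two models.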
