
From MUD to MIRE: Managing Inherent Risk in the Enterprise



  1. From MUD to MIRE: Managing Inherent Risk in the Enterprise. Peter J. Haas, IBM Almaden Research Center, San Jose, CA. MUD Workshop, September 2010.

  2. The Two Perpetual Questions
     • “Where do the probabilities come from?”
     • “Who is going to use this stuff in the real world?”

  3. My background in probabilistic DB

  4. RAQA: Resolution-Aware Query Answering for Business Intelligence (Sismanis et al. 2009)
     • OLAP querying (datacubes: roll-up, drill-down)
     • Uncertainty due to entity resolution
     • Bounds on query answers
     • Implemented via SQL queries
     • Conservative approach

     Sum(Sales) group by City,State:
       City           State  Strict range  Status
       San Francisco  CA     [$30,$230]    guaranteed
       San Jose       CA     [$70,$200]    non-guaranteed

     Sum(Sales) group by State:
       State  Strict range  Status
       CA     [$230,$230]   guaranteed

  5. The MCDB System (with Chris Jermaine & students)
     [Diagram: system architecture]
     • Inputs: schema, VG functions, parameter tables
     • Monte Carlo generator: produces d_1, d_2, ..., d_n, i.i.d. samples of the random DB D from the possible-worlds dist’n
     • Query Q (e.g., Select SUM(sales) AS t_sales) is run on each sample: Q(d_1), Q(d_2), ..., Q(d_n) are i.i.d. samples from the query-result dist’n Q(D)
     • Monte Carlo estimator and inference: estimates of E[t_sales], Var[t_sales], and the quantile q_.01[t_sales]; histogram; error bounds
     • Many implementation tricks to ensure acceptable performance
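
The Monte Carlo loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not MCDB's actual implementation: the parameter table, the normal noise model, and the names `vg_sales`, `query_total_sales`, `monte_carlo`, and `estimate` are all assumptions made for the example.

```python
import random
import statistics

# Hypothetical parameter table: (mean, std dev) of sales for each customer row.
PARAMS = [(100.0, 10.0), (250.0, 40.0), (80.0, 5.0)]

def vg_sales(params, rng):
    """Hypothetical VG function: draw one possible world (one sales value per row)."""
    return [rng.gauss(mu, sigma) for mu, sigma in params]

def query_total_sales(world):
    """The query Q: SELECT SUM(sales) AS t_sales."""
    return sum(world)

def monte_carlo(params, n, seed=0):
    """Run Q on n i.i.d. sampled worlds; returns samples from the query-result dist'n."""
    rng = random.Random(seed)
    return [query_total_sales(vg_sales(params, rng)) for _ in range(n)]

def estimate(samples, q=0.01):
    """Estimate E[t_sales], Var[t_sales], and the q-quantile of t_sales."""
    ordered = sorted(samples)
    k = min(len(ordered) - 1, int(q * len(ordered)))
    return statistics.fmean(samples), statistics.variance(samples), ordered[k]

samples = monte_carlo(PARAMS, n=10_000)
mean, var, q01 = estimate(samples)
print(f"E[t_sales] ~ {mean:.1f}  Var[t_sales] ~ {var:.1f}  q_.01[t_sales] ~ {q01:.1f}")
```

Because the query runs unchanged on every sampled database instance, any estimator that works on i.i.d. samples (mean, variance, quantiles, histograms) can be applied to the results.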

  6. Query-Result Distributions
     [Figure: frequency histograms of the query results for four queries]
     • Q1: revenue change (x-axis scale ×10^9)
     • Q2: days until completion (long tail in delivery times)
     • Q3: total supplier cost (x-axis scale ×10^10)
     • Q4: additional profits (x-axis scale ×10^10)

  7. MC³: MapReduce + MCDB (Xu et al. 2009)
     [Diagram: software stack]
     • Jaql: high-level query language for semi-structured JSON data (www.jaql.org, //code.google.com/p/jaql)
     • Hadoop: Map-Reduce (parallel batch processing) and HDFS (distributed file system)
     • Tricks to manage pseudo-random numbers

  8. Where do the probabilities come from?

  9. Data-Warehouse Uncertainty
     • Data integration: the source records {John Smith, San Jose}, {John Smith, San Jose}, {John Smith, Los Angeles} pass through ETL and yield
         Name        City
         John Smith  (SJ, 0.66), (LA, 0.33)
     • Similarity join: (Name: John Smith, City: LA) joined with (Name: J. Smith, Sales: $50K) yields (City: LA, Sales: $50K), marked ? (0.92)
     • Information extraction (Michelakis et al., 2009): the System T hotel annotator reads “A lovely thing to behold is Paris Hilton in the Springtime …” and yields
         Hotels
         NY Marriott
         Paris Hilton  ? (0.20)
     • Text miner: the message “09/09/2007 Re: system crash. This morning, my ORACLE system on LINUX exploded in a spectacular fireball …” yields
         Source    Problem type
         Cust0385  (DBMS, 0.8), (OS, 0.2)

  10. Data-Warehouse Uncertainty – Cont’d
      Measurement uncertainty:
      • Sensor: the record (Sensor_ID: S23, Temp (F): 78.32) is shown with a density f(t) over temperature t, marked at 78.32
      • System monitor: the record (Event: Buffer overflow, Time: 10/17/2007:18:20:02) is shown with a density f(t) over time t

  11. Real-World Challenges with Data-Warehouse Uncertainty
      • People don’t like to admit that it exists!
        – Retailers view uncertainty as failure of security, supply chain management
          • IBM research relationship manager for retail
        – Law enforcement
          • Photo ID in meth dealer trial
        – Scientists pretend data is perfect: uncertainty undermines results
          • Hans-Joachim Lenz
        – Database vendors
          • Data “cleaning” products
      • Data warehouse may not even exist!
        – Ex: cancer data at medical center
        – Ex: tomato soup supply chain data

  12. Stochastic Predictive Analytics on Big Data
      • Uncertain data describes future or hypothetical events
        – Based on complex, fine-grained stochastic model over big data
        – Minimizes denial problem
      • Intense recent interest in “business analytics” driven by
        – Need for low risk, quick payback projects (flexibility, low cost, fine data granularity)
        – Technical advances
          • Cloud computing
          • Software as a Service (SaaS)
          • Next generation tools, portals, visualization
      • Often with a spreadsheet front end
        – $8 Billion of such tools [Gnatovich06]
        – IBM services pricing
      • Lots of prototype activity
        – Fox/GreenPlum [Cohen09 MAD analytics paper]
        – VISA/IBM [Das10 SIGMOD paper]

  13. Ex. 1: Portfolio Values
      Customer:
        CustID      OptionID  NumShares  …
        John Smith  23        50         …
        …           …         …          …
      EuroCallOptions:
        OptionID  InitVal  …  StrikeP  OVal
        23        $2.35    …  $4.00    ?
        …         …        …  …        …
      Query:
        SELECT SUM (c.NumShares * o.OVal)
        FROM Customer c, EuroCallOptions o
        WHERE c.OptionID = o.OptionID
        AND c.CustType = 'Institutional'
      OVal is the option value one month from now (exercise date).
      Modified Black-Scholes model for European call option:
        \[ \mathrm{OVal} = \max\bigl(V(t_{\mathrm{final}}) - S,\ 0\bigr), \qquad dV = rV\,dt + a(V)\,V\,dW \]
      Simulation approximation (Euler approach), where each Z_j is a sample from the Normal dist’n:
        \[ V(t + \Delta t) = V(t) + rV(t)\,\Delta t + a\bigl(V(t)\bigr)\,V(t)\,\sqrt{\Delta t}\,Z_j \]
      Also CMOs, etc.
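
As a rough illustration of the Euler approach named on this slide, the sketch below repeatedly applies the update V(t + Δt) = V(t) + r V(t) Δt + a(V(t)) V(t) √Δt Z and then takes max(V(t_final) − S, 0). Only InitVal ($2.35) and StrikeP ($4.00) come from the slide; the interest rate, the volatility function `assumed_vol`, the step count, and the function names are assumptions, and S is taken here to be the strike price.

```python
import math
import random

def simulate_oval(v0, strike, r, a, t_final, n_steps, rng):
    """Simulate one option value OVal = max(V(t_final) - S, 0) using Euler steps."""
    dt = t_final / n_steps
    v = v0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # Z_j: one sample from the standard normal dist'n
        v = v + r * v * dt + a(v) * v * math.sqrt(dt) * z
    return max(v - strike, 0.0)

def assumed_vol(v):
    """Hypothetical volatility function a(V); the slide does not specify it."""
    return 1.0

rng = random.Random(42)
# One month expressed in years, 30 Euler steps, 5% annual rate: all assumed values.
ovals = [simulate_oval(2.35, 4.00, 0.05, assumed_vol, 1 / 12, 30, rng)
         for _ in range(5_000)]
print("estimated E[OVal]:", sum(ovals) / len(ovals))
```

Each call produces one sample of the uncertain OVal attribute; summing NumShares * OVal over the matching rows for each sample would give one sample of the query result.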

  14. Ex. 2: Pricing Decisions
      [Diagram: Bayes’ Theorem applied to demand-versus-price data]
      • Data for all customers → global demand distribution (prior)
      • Data for one customer → individual demand distribution (posterior)
          CustID    Unit Price  Order Amount
          J. Smith  $10.20      500
          …         …           …
      • Can analyze arbitrary dynamically-defined customer segments when determining effect of price increase

  15. Ex. 3: Individual Click Behavior (eBay)
      [Diagram: two Markov models over pages p1, p2, p3, p4, with transition parameters x13, x14, x24, x32, x34 (global) and y13, y14, y24, y32, y34 (individual)]
      • Click data for all eBay customers → global Markov model distribution (Dirichlet prior)
      • Data for one customer → individual Markov model distribution (posterior)
      • Can analyze arbitrary dynamic customer segments when determining effect of changing eBay pages
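
One standard way to go from a Dirichlet prior to an individual posterior is the conjugate Dirichlet-multinomial update, sketched below for the transitions out of a single page. The slide does not say which update the system uses, so treat this as an assumption; the pseudo-counts, click counts, and the function name `posterior_transition_probs` are hypothetical.

```python
def posterior_transition_probs(prior_alpha, counts):
    """Posterior-mean transition probabilities for one page (one Markov-chain row).

    prior_alpha: Dirichlet pseudo-counts fitted to all customers (global model).
    counts: observed clicks of one customer from this page to each next page.
    """
    total = sum(prior_alpha.values()) + sum(counts.get(k, 0) for k in prior_alpha)
    return {k: (a + counts.get(k, 0)) / total for k, a in prior_alpha.items()}

# Hypothetical numbers: from page p1 the global model favors p3 over p4,
# but this customer went to p4 in 6 of 8 observed clicks.
prior_p1 = {"p3": 3.0, "p4": 1.0}
clicks_p1 = {"p3": 2, "p4": 6}
print(posterior_transition_probs(prior_p1, clicks_p1))
```

With few observed clicks the individual model stays close to the global one; as a customer's click history grows, the observed counts dominate the prior pseudo-counts.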

  16. Ex. 4: Clinic-Capacity Risk
      [Diagram: data sources and models]
      • Data: medical data for all customers, pharmacy data for all customers
      • Models: Cox hazard-rate disease model, stochastic dosage model, clinic-resource demand model
      • Result:
          CustID      Time period  Resource needed
          Jane Smith  June-Sept    ?
          …           …

  17. MCDB: Improvement of Traditional Analytics Workflow
      [Diagram: traditional workflow. Data reduction, model fitting, and model application & querying, with the model held in external tools (Arena, R, Matlab, …); an analyst (PhD) develops the model]
      • Data extraction slow and bug-prone
      • Hard to re-link model results to DB
      • Only coarse-grained modeling
      • Hard to deal with data updates
      • No encapsulation for user
      • Sensitivity, what-if analysis are hard
      Goal: Integrate model with database

  18. Where do the probabilities come from?
      From stochastic predictive models over big data

  19. Who is going to use this stuff in the real world?

  20. Key Driver: Risk Management
      • Ex: Projected sales under micromarketing campaign
      • Ex: ERP
        – # OS experts for help desk
        – Demand projected from historical text data (2x uncertainty)
        – Provide principled safety factor
      • Regulatory pressure
        – Basel II, Solvency II
      • Business pressure
        – Ex.: Energy Risk Professionals
      Example query:
        SELECT SUM (s.amount)
        FROM SALES s, CUST c
        WHERE s.ID = c.ID
        AND c.city = 'Los Angeles'
      [Figure: query-result distribution (probability vs. total LA sales, with the expected answer marked) and loss distribution (probability vs. loss, with the expected loss and the 5% VaR marked)]
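
The two quantities marked on the loss distribution, expected loss and 5% VaR, can both be estimated from Monte Carlo samples. The sketch below assumes the common empirical definition of VaR as the loss level that only 5% of sampled losses exceed; the lognormal loss samples and the function names are hypothetical stand-ins for real query results.

```python
import random

def expected_loss(losses):
    """Sample mean of the simulated losses."""
    return sum(losses) / len(losses)

def value_at_risk(losses, tail_pct=5):
    """Empirical VaR: smallest sampled loss that at least (100 - tail_pct)%
    of the samples do not exceed."""
    ordered = sorted(losses)
    # Integer ceiling of (100 - tail_pct) * n / 100, avoiding float rounding.
    rank = ((100 - tail_pct) * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

# Hypothetical loss samples standing in for Monte Carlo query results.
rng = random.Random(7)
losses = [rng.lognormvariate(0.0, 0.5) for _ in range(10_000)]
print("expected loss:", expected_loss(losses))
print("5% VaR:", value_at_risk(losses))
```

For a right-skewed loss distribution like this one, the 5% VaR sits well above the expected loss, which is the gap a safety factor is meant to cover.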

  21. Challenge: Decision-makers’ Poor Intuition About Risk
      • Flaw of averages (weak form): mean correct, variance ignored
      • Flaw of averages (strong form): wrong value of mean, f(E[X]) ≠ E[f(X)]
      • Sam Savage’s book (why we underestimate risk)

  22. Examples
      • Red River (ND) flooding: “Expected to crest at 50 feet”
      • Perishable Inventory (Red Lobster)
      • U.S. accounting standards (FASB)
      • Project completion time: 10 parallel tasks, E[T_i] = 6 mo.
      • Data cleansing
      • Machine learning
      • Trio agg. paper (MUD 2008)
      • Basic probability
      [Figure: cost ($0 to $800) vs. demand (0 to 10), with stock = E[demand] = 5]
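
The project-completion example can be checked with a short simulation. With f = max, plugging in the averages gives f(E[T_1], ..., E[T_10]) = 6 months, but the expected completion time E[max T_i] is larger. The sketch assumes, purely for illustration, that the 10 task durations are independent and uniform on 3 to 9 months (mean 6); the slide does not give a distribution. Under that assumption the exact expectation is 3 + 6·10/11 ≈ 8.45 months.

```python
import random

def completion_time(n_tasks, rng, low=3.0, high=9.0):
    """The project finishes when the slowest of n parallel tasks finishes."""
    return max(rng.uniform(low, high) for _ in range(n_tasks))

def simulate(n_tasks=10, n_runs=20_000, seed=1):
    """Estimate E[max T_i] and P(project done within 6 months)."""
    rng = random.Random(seed)
    times = [completion_time(n_tasks, rng) for _ in range(n_runs)]
    mean_time = sum(times) / n_runs
    on_time = sum(t <= 6.0 for t in times) / n_runs
    return mean_time, on_time

mean_time, on_time = simulate()
print(f"E[max T_i] ~ {mean_time:.2f} months, P(done within 6 months) ~ {on_time:.4f}")
```

Under these assumptions, finishing in 6 months requires all 10 tasks to beat their own average at once, which happens with probability (1/2)^10, roughly 0.1%.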

  23. Probability Management and Interactive Spreadsheets
      • DIST 1.1 standard
        – DIST = distribution string
        – IID Monte Carlo (multivariate) samples
        – Compressed, with metadata
      • Ensures correct, coherent risk computations throughout enterprise and beyond
        – E.g., Royal Dutch Shell
      • “Electricity network” for probability
        – Royal Dutch Shell, Merck Pharmaceutical, Oracle, Wells Fargo Bank, Bessemer Trust, and IBM
      • DISTs can be manipulated like numbers
        – Facilitates interactive spreadsheets (demo)
      [Image: audit seal of approval]

  24. Demo 1

  25. Demo 2
