
Application of Graphical Models, Part II: Applications in Database Systems



  1. Application of Graphical Models
     Part II: Applications in Database Systems
     Some slides courtesy of Amol Deshpande

  2. Outline
     - Selectivity Estimation and Query Optimization
     - Probabilistic Relational Models
     - Probabilistic Databases
     - Sensor/Stream Data Management
     - References

  3. Selectivity Estimation
     - Estimate the intermediate result sizes that may be generated during query processing
     - Equivalently, estimate the selectivity of predicates over tables
     - Key to obtaining good plans during optimization
     Single-table predicates (on Customer):
         income > 90,000 and homeowner = ‘yes’
     Multi-table predicates (over Customer c and Purchases p):
         c.homeowner = ‘yes’ and p.amount > 10,000 and p.ssn = c.ssn
     Customer
         SSN   ..   Income   ..   Homeowner?
         ..    ..   100000   ..   Yes
         ..    ..   11000    ..   Yes
     Purchases
         SSN   Store   ..   Amount

  4. Optimizer’s Assumption
     - Attribute value independence assumption
       - Attributes assumed to be independently distributed
       - Rarely true in practice
     - Estimate p(income > 90,000 and homeowner = yes) as p(income > 90,000) * p(homeowner = yes)
       - Can result in severe underestimation
     - In reality: p(income > 90,000, homeowner = yes) ≈ p(homeowner = yes)
     Customer
         SSN   ..   Income   ..   Homeowner?
         ..    ..   100000   ..   Yes
         ..    ..   11000    ..   Yes
         ..    ..   50000    ..   No
         ..    ..   30000    ..   No
         ..    ..   200000   ..   Yes
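As a rough illustration of the gap described on this slide, the sketch below computes both quantities on the five Customer rows shown. Only the Income and Homeowner? values come from the slide; the row layout and the function names are assumptions made for the example.

```python
# Five Customer rows from the slide, reduced to (income, homeowner).
CUSTOMER = [
    (100000, "yes"),
    (11000, "yes"),
    (50000, "no"),
    (30000, "no"),
    (200000, "yes"),
]


def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for row in rows if predicate(row)) / len(rows)


def high_income(row):
    return row[0] > 90000


def is_homeowner(row):
    return row[1] == "yes"


# Estimate under attribute value independence: product of the marginals.
independent_estimate = selectivity(CUSTOMER, high_income) * selectivity(CUSTOMER, is_homeowner)

# Actual selectivity of the conjunction.
actual = selectivity(CUSTOMER, lambda row: high_income(row) and is_homeowner(row))

print(f"independence estimate: {independent_estimate:.2f}")
print(f"actual selectivity:    {actual:.2f}")
```

On these five rows the product of marginals is 0.24 while the true fraction is 0.40, so the independence assumption underestimates the result size.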

  5. Optimizer’s Assumption
     - Join uniformity assumption
       - Tuples from one relation are assumed equally likely to join with tuples from the other relation
       - Real datasets exhibit large skews
     Purchases
         SSN   Store   ..   Amount
     Customer
         SSN   ..   Income    ..   Homeowner?
         ..    ..   100,000   ..   Yes
         ..    ..   11,000    ..   Yes
         ..    ..   50,000    ..   No
         ..    ..   30,000    ..   No
         ..    ..   200,000   ..   Yes
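A small sketch of why skew hurts the uniformity assumption. The slide shows no Purchases rows, so the join keys below are entirely hypothetical, as are the function names; the point is only to contrast a uniform per-customer estimate with an exact count.

```python
# Hypothetical join keys; the slide shows no Purchases rows.
customer_ssn = ["a", "b", "c", "d", "e"]
purchases_ssn = ["a"] * 7 + ["b"] * 2 + ["c"]  # skewed toward customer "a"


def uniform_estimate(selected, customers, purchases):
    """Join size if every customer matched the same number of purchases."""
    purchases_per_customer = len(purchases) / len(customers)
    return len(selected) * purchases_per_customer


def exact_join_size(selected, purchases):
    """Actual number of purchases that belong to the selected customers."""
    return sum(1 for ssn in purchases if ssn in selected)


selected = {"a"}  # e.g. the customers that pass some filter
estimate = uniform_estimate(selected, customer_ssn, purchases_ssn)
actual = exact_join_size(selected, purchases_ssn)

print(f"uniform estimate: {estimate}, actual: {actual}")
```

With this made-up data the uniform estimate is 2 joined tuples, while customer "a" actually accounts for 7 of the 10 purchases.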

  6. Selectivity Estimation using PGMs
     - Eliminating the attribute value independence assumption [GTK’01, DGR’01, LWV’03, PMW’03]
     - Learn a PGM from the Customer table
     - Approximate CPDs using histograms
     - Learning process modified to optimize for accuracy as well as storage space
     [Figure: PGM with nodes Income, Home owner?, Age]
     Customer
         SSN   age   Income   zipcode   Home..?
         ..    ..    100000   ..        Yes
         ..    ..    11000    ..        Yes
         ..    ..    50000    ..        No
         ..    ..    30000    ..        No
         ..    ..    200000   ..        Yes

  7. Selectivity Estimation using PGMs
     - Eliminating the attribute value independence assumption [GTK’01, DGR’01, LWV’03, PMW’03]
     - Learn a PGM from the Customer table
     - Approximate CPDs using histograms
     - Query -> Inference Algorithm -> Selectivity Estimates
     [Figure: PGM with nodes Income, Home owner?, Age]
     Customer
         SSN   age   Income   zipcode   Home..?
         ..    ..    100000   ..        Yes
         ..    ..    11000    ..        Yes
         ..    ..    50000    ..        No
         ..    ..    30000    ..        No
         ..    ..    200000   ..        Yes
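To make the idea concrete, here is a minimal sketch of a two-node model, Income -> Homeowner, whose CPDs are stored per histogram bucket and learned by counting. This is not the method of the cited papers: the network structure, the bucket boundaries, and all function names are assumptions, and only the five rows come from the slide.

```python
from collections import Counter

# (income, homeowner) pairs from the Customer table on the slide.
ROWS = [(100000, "yes"), (11000, "yes"), (50000, "no"), (30000, "no"), (200000, "yes")]
BUCKET_UPPER_BOUNDS = [40000, 90000]  # hypothetical histogram boundaries


def bucket(income):
    """Index of the histogram bucket that contains the income."""
    for i, upper in enumerate(BUCKET_UPPER_BOUNDS):
        if income <= upper:
            return i
    return len(BUCKET_UPPER_BOUNDS)


def learn(rows):
    """Estimate p(bucket) and p(homeowner | bucket) by counting."""
    bucket_counts = Counter(bucket(income) for income, _ in rows)
    joint_counts = Counter((bucket(income), own) for income, own in rows)
    p_bucket = {b: n / len(rows) for b, n in bucket_counts.items()}
    p_own = {(b, own): n / bucket_counts[b] for (b, own), n in joint_counts.items()}
    return p_bucket, p_own


def selectivity(p_bucket, p_own, buckets, homeowner):
    """Sum p(bucket) * p(homeowner | bucket) over the buckets the predicate selects."""
    return sum(p_bucket.get(b, 0.0) * p_own.get((b, homeowner), 0.0) for b in buckets)


p_bucket, p_own = learn(ROWS)
# income > 90,000 selects only the last bucket (index 2).
estimate = selectivity(p_bucket, p_own, buckets=[2], homeowner="yes")
print(f"estimated selectivity: {estimate:.2f}")
```

Because the model keeps the dependence between income and home ownership, the estimate for `income > 90,000 and homeowner = yes` is 0.40 on these rows, which matches the true fraction. A real system would face the accuracy versus storage trade-off the slide mentions when choosing structure and bucket counts.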

  8. Outline
     - Selectivity Estimation and Query Optimization
     - Probabilistic Relational Models
     - Probabilistic Databases
     - Sensor/Stream Data Management
     - References

  9. Example Probabilistic Database
     - Example from Dalvi and Suciu [2004]
     S
         tuple   A   B   prob
         s1      m   1   0.6
         s2      n   1   0.5
     T
         tuple   C   D   prob
         t1      1   p   0.4
     Possible worlds
         instance        probability
         {s1, s2, t1}    0.12
         {s1, s2}        0.18
         {s1, t1}        0.12
         {s1}            0.18
         {s2, t1}        0.08
         {s2}            0.12
         {t1}            0.08
         {}              0.12
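The possible-worlds table can be reproduced by enumeration if the three tuples are treated as independent, which is what the listed probabilities imply. The sketch below uses the tuple probabilities from the slide; the function and variable names are made up for the example.

```python
from itertools import product

# Tuple existence probabilities from the slide.
TUPLE_PROB = {"s1": 0.6, "s2": 0.5, "t1": 0.4}


def possible_worlds(tuple_prob):
    """Yield (world, probability) for every subset of tuples, assuming independence."""
    names = list(tuple_prob)
    for present in product([True, False], repeat=len(names)):
        world = frozenset(n for n, keep in zip(names, present) if keep)
        prob = 1.0
        for n, keep in zip(names, present):
            prob *= tuple_prob[n] if keep else 1.0 - tuple_prob[n]
        yield world, prob


worlds = dict(possible_worlds(TUPLE_PROB))
for world, prob in worlds.items():
    print(sorted(world), round(prob, 2))
```

For instance, the world {s1, s2} has probability 0.6 * 0.5 * (1 - 0.4) = 0.18, as in the table.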

  10. Probabilistic Databases
      - Much of probabilistic data is naturally correlated
        - E.g. sensor data, data integration [AFM’06]
      - If not, query processing introduces correlation
      - Can use graphical models to capture such correlations

  11. Example: Mutual Exclusiveness
      S
          tuple   A   B   prob
          s1      m   1   0.6
          s2      n   1   0.5
      T
          tuple   C   D   prob
          t1      1   p   0.4
      Factor f1 over X_s1 and X_t1
          X_s1   X_t1   f1()
          0      0      0
          0      1      0.4
          1      0      0.6
          1      1      0
      Factor f2 over X_s2
          X_s2   f2()
          0      0.5
          1      0.5
      Possible worlds
          instance        probability
          {s1, s2, t1}    0
          {s1, s2}        0.3
          {s1, t1}        0
          {s1}            0.3
          {s2, t1}        0.2
          {s2}            0
          {t1}            0.2
          {}              0
      Possible worlds (if desired) computed using inference
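As a sketch of how the world probabilities follow from the two factors, the code below multiplies f1(X_s1, X_t1) by f2(X_s2) for every assignment. The factor tables are taken from the slide; the names are illustrative. No normalization step is included because each factor here already sums to 1.

```python
from itertools import product

# Factors from the slide. f1 couples s1 and t1; f2 covers s2 alone.
F1 = {(0, 0): 0.0, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.0}
F2 = {0: 0.5, 1: 0.5}


def world_probabilities():
    """Map each world (set of present tuples) to f1(x_s1, x_t1) * f2(x_s2)."""
    worlds = {}
    for x_s1, x_s2, x_t1 in product([0, 1], repeat=3):
        present = [n for n, x in (("s1", x_s1), ("s2", x_s2), ("t1", x_t1)) if x]
        worlds[frozenset(present)] = F1[(x_s1, x_t1)] * F2[x_s2]
    return worlds


worlds = world_probabilities()
for world, prob in worlds.items():
    print(sorted(world), prob)
```

Every world that contains both s1 and t1, or neither of them, gets probability 0, which is exactly the mutual exclusion the factor f1 encodes.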

  12. Outline
      - Selectivity Estimation and Query Optimization
      - Probabilistic Relational Models
      - Probabilistic Databases
      - Sensor/Stream Data Management
      - References

  13. Motivation
      - Unprecedented, and rapidly increasing, instrumentation of our everyday world
        - Distributed measurement networks (e.g. GPS)
        - RFID
        - Wireless sensor networks
        - Network monitoring
        - Industrial monitoring

  14. Outline
      - A generic temporal model for sensor stream data
      - A range of applications
        - Model-based query processing
        - Object tracking and monitoring
        - …

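  15. [Figure: graphical model over a sensor network with nodes 1 to 5]
      - Hidden variables X_{1,t} ... X_{5,t}: X_{i,t} is the true temperature at sensor i at time t
      - Observed variables O_{1,t} ... O_{5,t}: O_{i,t} is the observed temperature at sensor i at time t
      - Interpretation: X_{4,t} is independent of X_{2,t} given X_{1,t} and X_{5,t}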

  16. Markov Property
      [Figure: model for a sensor network with nodes 1, 2, 3, unrolled over time steps t-1, t, t+1, with hidden variables X_{i,t} and observed variables O_{i,t}]
      - Interpretation: {X_{i,t+1}} is independent of {X_{i,t-1}} given {X_{i,t}}

  17. [Figure: same unrolled model as on slide 16]
      - State evolution can be modeled as a Dynamic Bayesian Network

  18. Parameters? (1) System model
      [Figure: same unrolled model as on slide 16]
      - Prior: p(X_{1,0}, X_{2,0}, X_{3,0})
      - Evolution: p(X_{1,t}, X_{2,t}, X_{3,t} | X_{1,t-1}, X_{2,t-1}, X_{3,t-1})

  19. Parameters? (2) Measurement model
      [Figure: same unrolled model as on slide 16]
      - p(O_{1,t}, O_{2,t}, O_{3,t} | X_{1,t}, X_{2,t}, X_{3,t})
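One common way to instantiate a prior, an evolution model, and a measurement model is a linear-Gaussian model filtered with a Kalman filter. The slides do not say which distributions are used, so the scalar random-walk model below (one sensor instead of the joint three-sensor state) and every parameter name are assumptions chosen to keep the sketch short.

```python
def kalman_step(mean, var, observation, process_var, measurement_var):
    """One predict/update step for a scalar random-walk model.

    Evolution (assumed):   X_t = X_{t-1} + noise, noise ~ N(0, process_var)
    Measurement (assumed): O_t = X_t + noise,     noise ~ N(0, measurement_var)
    """
    # Predict: push the current belief through the evolution model.
    pred_mean = mean
    pred_var = var + process_var
    # Update: condition the prediction on the new observation.
    gain = pred_var / (pred_var + measurement_var)
    new_mean = pred_mean + gain * (observation - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


def filter_stream(observations, prior_mean, prior_var, process_var, measurement_var):
    """Run the filter over a stream, returning (mean, variance) after each reading."""
    mean, var = prior_mean, prior_var
    estimates = []
    for obs in observations:
        mean, var = kalman_step(mean, var, obs, process_var, measurement_var)
        estimates.append((mean, var))
    return estimates


if __name__ == "__main__":
    # Synthetic temperature readings, for illustration only.
    for mean, var in filter_stream([22.9, 22.6, 22.8], 20.0, 4.0, 0.05, 0.25):
        print(f"mean={mean:.2f} var={var:.3f}")
```

Each step touches one reading and keeps only a mean and a variance, which is the kind of one-pass, constant-work-per-item processing that stream settings need.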

  20. Application: Model-based Query Processing [DGMHH’04, SBEMY’06]
      [Figure: a USER sends a declarative query to a Query Processor backed by a Probabilistic Model and receives query results; the Query Processor sends an observation plan to the SENSOR NETWORK (nodes 1 to 6) and receives data]
      - Declarative query: Select nodeID, temp ± .1C, conf(.95) Where nodeID in {1..6}
      - Query results: 1, 22.73, 100% … 6, 22.1, 99%
      - Observation plan: {[temp, 1], [voltage, 3], [voltage, 6]}
      - Data: 1, temp = 22.73; 3, voltage = 2.73; 6, voltage = 2.65

  21. Application: Model-based Query Processing [DGMHH’04, SBEMY’06]
      [Figure: a USER sends a declarative query to a Query Processor backed by a Probabilistic Model and receives query results; the Query Processor sends an observation plan to the SENSOR NETWORK (nodes 1 to 6) and receives data]
      - Declarative query: Select nodeID, temp ± .1C, conf(.95) Where nodeID in {1..6}
      - Query results: 1, 22.73, 100% … 6, 22.1, 99%
      - Observation plan: {[temp, 1], [voltage, 3], [voltage, 6]}
      - Data: 1, temp = 22.73; 3, voltage = 2.73; 6, voltage = 2.65
      - Advantages:
        - Exploit correlations
        - Handle noise, biases in the data
        - Predict missing or future values
        - Reduce communication cost
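The following sketch shows one way a model could decide whether a cheap voltage reading is enough to answer a `temp ± .1C, conf(.95)` query. It assumes a bivariate Gaussian over temperature and voltage at a single node; that modeling choice, the parameter values, and the function names are all hypothetical and are not taken from the slides or the cited systems.

```python
import math


def condition_on_voltage(mu_t, mu_v, var_t, var_v, cov_tv, voltage):
    """Mean and variance of temperature given an observed voltage (bivariate Gaussian)."""
    mean = mu_t + cov_tv / var_v * (voltage - mu_v)
    var = var_t - cov_tv ** 2 / var_v
    return mean, var


def confidence_within(var, epsilon):
    """P(|T - mean| <= epsilon) for a Gaussian with the given variance."""
    if var <= 0.0:
        return 1.0
    return math.erf(epsilon / math.sqrt(2.0 * var))


# Hypothetical model parameters, not taken from the slides.
mean, var = condition_on_voltage(mu_t=22.0, mu_v=2.7, var_t=1.0, var_v=0.01,
                                 cov_tv=0.099, voltage=2.65)
conf = confidence_within(var, epsilon=0.1)

# If the model alone is not confident enough, the plan must include a temperature reading.
needs_reading = conf < 0.95
print(f"predicted temp={mean:.2f}, confidence={conf:.2f}, acquire temp: {needs_reading}")
```

With these made-up parameters the confidence falls short of 0.95, so the observation plan would have to request the temperature itself for that node. A stronger correlation or a wider error bound would let the cheaper voltage reading suffice.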

  22. Object Tracking and Monitoring
      - Mobile RFID readers
        - Handheld, robot-mounted
      - Incomplete, noisy data
        - Environmental factors
        - Orientation of reading
      - Not directly queryable
        - Raw data: <tag id, reader id, ts>
        - Data needed for querying: e.g., precise object locations

  23. Graphical Modeling
      - A generative model p(X,O)
        - X: true object location (x,y,z)
        - O: boolean for RFID readings
      - How the state of the world changes
        - Object movement, reader motion
      - How sensing generates data from the state of the world
        - Sensor measurement model
      Reference: T. Tran, C. Sutton, R. Cocci, Y. Nie, Y. Diao, and P. Shenoy. Probabilistic inference over RFID streams in mobile environments. ICDE 2009.

  24. Inference over RFID Streams
      - Probabilistic inference over streams: p(X|O)
      - Particle filtering: sampling-based inference
      - Key to performance: using a small number of samples
                        Particle filtering                 Our optimizations
          Accuracy      0.6 - 0.8 foot                     0.1 - 0.5 foot
          Performance   0.1 reading/sec for 20 objects     > 1000 readings/sec for 20,000 objects
      - 7 orders of magnitude improvement!
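For readers unfamiliar with particle filtering, here is a minimal bootstrap particle filter for a single object on a line with boolean tag readings. It does not reproduce the optimizations reported on the slide; the sensor model, the random-walk motion model, the read range, and the synthetic readings are all assumptions made for illustration.

```python
import random


def detection_prob(obj_x, reader_x, read_range=1.0):
    """Hypothetical sensor model: p(tag is read | object and reader positions)."""
    return 0.9 if abs(obj_x - reader_x) <= read_range else 0.05


def particle_filter_step(particles, reader_x, detected, rng, motion_std=0.1):
    """One bootstrap particle filter step for a 1-D object location."""
    # Propagate each particle through an assumed random-walk motion model.
    moved = [x + rng.gauss(0.0, motion_std) for x in particles]
    # Weight each particle by the likelihood of the boolean reading.
    weights = []
    for x in moved:
        p = detection_prob(x, reader_x)
        weights.append(p if detected else 1.0 - p)
    # Resample in proportion to the weights.
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]

# A reader sweeps along the line; the tag is read only near x = 4 (synthetic readings).
for reader_x in [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]:
    detected = abs(reader_x - 4.0) <= 1.0
    particles = particle_filter_step(particles, reader_x, detected, rng)

estimate = sum(particles) / len(particles)
print(f"estimated location: {estimate:.2f}")
```

The cost of each step grows linearly with the number of particles per object, which is why the slide singles out a small sample count as the key to performance.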

  25. Open Discussion
      - Where does our contribution lie when applying graphical models?
      - Devise the right model
        - Local probability distributions
        - Parameter estimation
      - Efficiency and scalability
        - Number of variables (e.g., objects)
        - Inference on streams (one pass, constant time/item)
      - Distributed query processing
        - The giant graphical model is distributed
