
Freddies: DHT-Based Adaptive Query Processing via Federated Eddies



  1. Freddies: DHT-Based Adaptive Query Processing via Federated Eddies
     Ryan Huebsch, Shawn Jeffery
     CS 294-4 Peer-to-Peer Systems, 12/9/03

  2. Outline
     • Background: PIER
     • Motivation: Adaptive Query Processing (Eddies)
     • Federated Eddies (Freddies)
     • System Model
     • Routing Policies
     • Implementation
     • Experimental Results
     • Conclusions and Continuing Work

  3. PIER
     • Fully decentralized relational query processing engine
     • Principles:
       • Relaxed consistency
       • Organic Scaling
       • Data in its Natural Habitat
       • Standard Schemas via Grassroots software
     • Relational queries can be executed in a number of logically equivalent ways
       • An optimization step chooses the best one performance-wise
     • Currently, PIER has no means to optimize queries

  4. Adaptive Query Processing
     • Traditional query optimization occurs at query time and is based on statistics. This is hard because:
       • The catalog (statistics) must be accurate and maintained
       • The optimizer cannot recover from poor choices
     • The story gets worse!
       • Long-running queries:
         • Changing selectivities/costs of operators
         • Assumptions made at query time may no longer hold
       • Federated/autonomous data sources:
         • No control over or knowledge of statistics
       • Heterogeneous data sources:
         • Different arrival rates
     • Thus, adaptive query processing systems attempt to change the execution order during the query
       • Query Scrambling, Tukwila, Wisconsin, Eddies

  5. Eddies
     • Eddy: a tuple router that dynamically chooses the order of operators in a query plan (sketched below)
       • Optimizes the query at runtime on a per-tuple basis
       • Monitors the selectivities and costs of operators to determine where to send a tuple next
     • Currently centralized in design and implementation
       • Some other efforts for distributed Eddies from Wisconsin & Singapore (neither uses a DHT)
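To make the per-tuple routing idea concrete, here is a minimal Java sketch of a centralized eddy. It is illustrative only: the `Eddy` class, the `Predicate` stand-ins for operators, and the selectivity bookkeeping are assumptions, not PIER's or the Eddies paper's actual code. Each tuple carries per-operator done-bits, and the router greedily sends it to the not-yet-applied operator that has been observed to drop the most tuples.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;

// Minimal eddy sketch: route each tuple, one decision at a time, to the
// unprocessed operator with the lowest observed selectivity (output/input).
class Eddy {
    private final List<Predicate<int[]>> operators; // filter-style stand-ins for operators
    private final long[] in;   // tuples sent to each operator
    private final long[] out;  // tuples that survived each operator

    Eddy(List<Predicate<int[]>> operators) {
        this.operators = operators;
        this.in = new long[operators.size()];
        this.out = new long[operators.size()];
    }

    // Returns true if the tuple survives every operator and reaches the output.
    boolean route(int[] tuple) {
        BitSet done = new BitSet(operators.size());
        while (done.cardinality() < operators.size()) {
            int next = pickNext(done);     // per-tuple routing decision
            done.set(next);
            in[next]++;
            if (!operators.get(next).test(tuple)) return false; // tuple dropped
            out[next]++;
        }
        return true; // all done-bits set
    }

    // Greedy policy: favor the operator currently filtering most aggressively.
    private int pickNext(BitSet done) {
        int best = -1;
        double bestSel = Double.MAX_VALUE;
        for (int i = 0; i < operators.size(); i++) {
            if (done.get(i)) continue;
            double sel = in[i] == 0 ? 0.5 : (double) out[i] / in[i]; // optimistic prior
            if (sel < bestSel) { bestSel = sel; best = i; }
        }
        return best;
    }
}
```

The original Eddies work uses a lottery-based scheme that also accounts for operator cost and back-pressure rather than this pure greedy choice, but the per-tuple decision loop has the same shape.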

  6. Why use Eddies in P2P? (The easy answers)
     • Much of the promise of P2P lies in its fully distributed nature
       • No central point of synchronization ⇒ no central catalog
       • A distributed catalog with statistics helps, but does not solve all problems
         • Possibly stale, hard to maintain
         • Need CAP to do the best optimization
       • No knowledge of available resources or the current state of the system (load, etc.)
       • This is the PIER Philosophy!
     • Eddies were designed for a federated query processor
       • Changing operator selectivities and costs
       • Federated/heterogeneous data sources

  7. Why Eddies in P2P? (The not-so-obvious answers)
     • Available compute resources in a P2P network are heterogeneous and dynamically changing
       • Where should the query be processed?
     • In a large P2P system, local data distributions, arrival rates, etc. may be different than the global ones

  8. Freddies: Federated Eddies
     • A Freddy is an adaptive query processing operator within the PIER framework
     • Goals:
       • Show the feasibility of adaptive query processing in PIER
       • Build a foundation and infrastructure for smarter adaptive query processing
       • Establish a baseline for Freddy performance to improve upon with smarter routing policies

  9. An Example Freddy
     [Figure: a Freddy at one node. Get(R), Get(S), and Get(T) pull tuples from the DHT into the Freddy; local operators evaluate R join S and S join T; Put(Join Value RS) and Put(Join Value ST) rehash tuples back to the DHT; finished tuples go to the Output operator.]

  10. System Model
      • Same functionality as a centralized Eddy
        • Allows easy concept reuse
      • A Freddy uses its routing policy to determine the next operator for a tuple
      • Tuples in a Freddy are tagged with DoneBits indicating which operators have processed them (sketched below)
      • The Freddy does all state management, so existing operators require no modifications
      • Local processing comes first (in most cases)
        • Conserves network bandwidth
      • Not as simple as it seems
        • The Freddy must decide how to rehash a tuple
        • This determines the join order
      • Challenge: the routing decision is decoupled from the operator, so most Eddy techniques are no longer valid
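A minimal sketch of the DoneBits tagging described above, with hypothetical names (`FreddyTuple`, `markDone`); PIER's actual classes differ:

```java
import java.util.BitSet;

// Sketch of a Freddy-tagged tuple. The Freddy owns all of this state,
// which is why existing operators need no modification.
class FreddyTuple {
    final Object[] fields;   // the tuple's attribute values
    final BitSet doneBits;   // one bit per operator in the query plan

    FreddyTuple(Object[] fields, int numOperators) {
        this.fields = fields;
        this.doneBits = new BitSet(numOperators); // source tuples start clean
    }

    void markDone(int operatorId) {
        doneBits.set(operatorId);
    }

    boolean allDone(int numOperators) {
        // Every operator has processed this tuple.
        return doneBits.cardinality() == numOperators;
    }
}
```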

  11. Query Processing in Freddies
      • The query origin creates a query plan with a Freddy
        • The possible routings are determined at this time, but not their order
      • Freddy operators on all participating nodes initiate data flow
      • As tuples arrive, the Freddy determines the next operator for each tuple based on its DoneBits and the routing policy (a dispatch sketch follows below)
        • Source tuples are tagged with clean DoneBits and routed appropriately
      • When all DoneBits are set, the tuple is sent to the output operator (returned to the query origin)
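Putting the pieces together, the per-tuple flow above might look like the following sketch, which builds on the hypothetical FreddyTuple from the previous slide. The `RoutingPolicy` interface and the dispatcher are assumptions for illustration, not PIER's API.

```java
import java.util.BitSet;

// The routing policy sees only the DoneBits (plus whatever local
// statistics it keeps internally) and picks the next operator.
interface RoutingPolicy {
    int chooseNext(BitSet doneBits); // index of the next operator to try
}

class FreddyDispatcher {
    private final RoutingPolicy policy;
    private final int numOperators;

    FreddyDispatcher(RoutingPolicy policy, int numOperators) {
        this.policy = policy;
        this.numOperators = numOperators;
    }

    // Called whenever a tuple arrives at this node's Freddy.
    void onArrival(FreddyTuple t) {
        if (t.allDone(numOperators)) {
            sendToOutput(t);                      // return to the query origin
        } else {
            int next = policy.chooseNext(t.doneBits);
            sendToOperator(next, t);              // a local operator, or a DHT Put to rehash
        }
    }

    void sendToOutput(FreddyTuple t)           { /* deliver to the query origin */ }
    void sendToOperator(int op, FreddyTuple t) { /* run locally or rehash via the DHT */ }
}
```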

  12. Tuple Routing Policy
      • Determines to which operator to send a tuple
      • Local information
        • Messages are expensive
        • Monitor local usage and adjust locally
      • “Processing Buddy” information
        • During processing, discover general trends in input/output nodes’ processing capabilities, output rates, etc.
        • For instance, we may want to alert the previous Freddy of poor PUT decisions
      • The design space is huge ⇒ a large research area

  13. Freddy Routing Policies
      • Simple (KISS):
        • Static
        • Random: not as bad as you may think
        • Local stat monitoring (sampling)
      • More complex:
        • Queue lengths (see the sketch below)
          • Somewhat analogous to the “back-pressure” effect
          • Monitors DHT PUT ACKs
        • Load balancing through “learning” of the global join key distribution
          • Piggyback stats on other messages
          • Doesn’t need global information, only stats about processing buddies (nodes with which we communicate)
          • A different sample than the local one; may or may not be better
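As a concrete example of the queue-length idea, here is a sketch of a back-pressure policy that reuses the hypothetical `RoutingPolicy` interface from the dispatch sketch above. The slides do not give this exact logic; the counter maintenance via PUT/ACK hooks is an assumption.

```java
import java.util.BitSet;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Back-pressure sketch: among the operators a tuple still needs, prefer
// the one whose destination has the fewest outstanding (un-ACKed) DHT PUTs.
class QueueLengthPolicy implements RoutingPolicy {
    private final AtomicIntegerArray outstandingPuts; // one counter per operator

    QueueLengthPolicy(int numOperators) {
        this.outstandingPuts = new AtomicIntegerArray(numOperators);
    }

    // Bookkeeping hooks, driven by the DHT layer.
    void onPutSent(int op)  { outstandingPuts.incrementAndGet(op); }
    void onPutAcked(int op) { outstandingPuts.decrementAndGet(op); }

    @Override
    public int chooseNext(BitSet doneBits) {
        int best = -1;
        int shortest = Integer.MAX_VALUE;
        // Iterate over operators whose done-bit is still clear.
        for (int op = doneBits.nextClearBit(0);
             op < outstandingPuts.length();
             op = doneBits.nextClearBit(op + 1)) {
            int queued = outstandingPuts.get(op);
            if (queued < shortest) { shortest = queued; best = op; }
        }
        return best;
    }
}
```

A long PUT queue suggests a slow or overloaded destination, so routing around it approximates the back-pressure effect the slide mentions.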

  14. Implementation & Experimental Setup
      • Design decisions:
        • Simplicity is key
          • Roughly 300 NCSS (PIER is about 5300)
        • A single query processing operator
        • A separate routing policy module is loaded at query time (see the sketch below)
        • Possible routing orders are determined by a simple optimizer
      • Required generalizations to the PIER execution engine to deal with generic operators
        • Allow PIER to run any dataflow operator
      • Simulator with 256 nodes, 100 tuples/table/node
        • Feasibility, not scalability
      • In the absence of global (or stale) knowledge, a static optimizer could choose any join ordering ⇒ we compare Freddy performance to all possible static plans
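The "separate routing policy module loaded at query time" could be as simple as reflection over a class name carried in the query plan. This is a guess at the mechanism, not PIER's actual API:

```java
// Hypothetical loader: the query plan names a no-arg policy class,
// which is instantiated when the query starts.
class PolicyLoader {
    static RoutingPolicy load(String policyClassName) throws Exception {
        return (RoutingPolicy) Class.forName(policyClassName)
                                    .getDeclaredConstructor()
                                    .newInstance();
    }
}
```

Policies that need configuration (like the queue-length sketch above, whose constructor takes an operator count) would need a slightly richer factory.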

  15. 3-way join
      • R join S join T
        • R join S is expensive (multiplies the tuple count by 25)
        • S join T is highly selective (drops 90% of tuples)
      • [Figure: the two possible static join orderings, (R join S) join T and (S join T) join R]

  16. 3-Way Join Results
      [Chart: completion time (s), 0–1000, vs. bandwidth per node (25, 50, 100, 150 KB/s) for the static plans RST and STR and for the Eddy]

  17. 4-way join
      • R join S join T join U
        • S join T is expensive
      • [Figure: the possible static join orderings, left-deep trees plus one bushy plan. Note: a traditional optimizer can’t make the bushy plan.]

  18. 4-Way Join
      [Chart: completion time (s), 0–350, vs. bandwidth per node (50, 75, 100, 125, 150 KB/s) for the static plans RSTU, STRU, STUR, and TUSR, the bushy plan, and the Eddy]

  19. The Promise of Routing Policy
      • An illustrative example of how routing policy can improve performance
      • This is not meant to be an exhaustive comparison of policies, but rather to show the possibilities
      • EddyQL considers the number of outstanding PUTs (queue length) to decide where to send tuples
      • [Chart: aggregate bandwidth (MB/s), 0–120, for RST, STR, Eddy, and EddyQL]

  20. Conclusions and Continuing Work
      • Freddies provide adaptive query processing in a P2P system
        • Require no global knowledge
      • Baseline performance shows promise for smarter policies
      • In the future…
        • Explore Freddy performance in a dynamic environment
        • Explore more complex routing policies

  21. Questions? Comments? Snide remarks for Ryan? Glorious praise for Shawn? Thanks!
