Freddies: DHT-Based Adaptive Query Processing via Federated Eddies
Ryan Huebsch, Shawn Jeffery
CS 294-4: Peer-to-Peer Systems
12/9/03
Outline
• Background: PIER
• Motivation: Adaptive Query Processing (Eddies)
• Federated Eddies (Freddies)
  • System Model
  • Routing Policies
  • Implementation
• Experimental Results
• Conclusions and Continuing Work
PIER
• Fully decentralized relational query processing engine
• Principles:
  • Relaxed Consistency
  • Organic Scaling
  • Data in its Natural Habitat
  • Standard Schemas via Grassroots Software
• Relational queries can be executed in a number of logically equivalent ways
  • An optimization step chooses the best-performing one
• Currently, PIER has no means to optimize queries
Adaptive Query Processing
• Traditional query optimization occurs at query time and is based on statistics. This is hard because:
  • The catalog (statistics) must be accurate and maintained
  • The optimizer cannot recover from poor choices
• The story gets worse!
  • Long-running queries:
    • Changing selectivities/costs of operators
    • Assumptions made at query time may no longer hold
  • Federated/autonomous data sources:
    • No control over or knowledge of statistics
  • Heterogeneous data sources:
    • Different arrival rates
• Thus, adaptive query processing systems attempt to change execution order during the query
  • Query Scrambling, Tukwila, Wisconsin, Eddies
Eddies
• Eddy: a tuple router that dynamically chooses the order of operators in a query plan
  • Optimizes the query at runtime on a per-tuple basis
  • Monitors selectivities and costs of operators to determine where to send a tuple next (sketched below)
• Currently centralized in design and implementation
  • Some other efforts toward distributed Eddies from Wisconsin & Singapore (neither uses a DHT)
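The following is a minimal, self-contained sketch (in Java, since PIER is a Java system) of the per-tuple Eddy routing idea: operators are treated as filters, the Eddy tracks each operator's observed pass rate, and each tuple is greedily routed to the most selective operator it has not yet visited. All class and method names are hypothetical illustrations, not PIER or Eddies code, and the original Eddies work uses lottery-style scheduling rather than this greedy rule.

import java.util.*;

interface Operator {
    boolean apply(Map<String, Object> tuple);   // true if the tuple survives this filter
}

class Eddy {
    private final List<Operator> ops;
    private final long[] seen;     // tuples routed to each operator so far
    private final long[] passed;   // tuples that survived each operator

    Eddy(List<Operator> ops) {
        this.ops = ops;
        this.seen = new long[ops.size()];
        this.passed = new long[ops.size()];
    }

    // Routes one tuple through all operators, choosing the order adaptively.
    boolean route(Map<String, Object> tuple) {
        BitSet done = new BitSet(ops.size());
        while (done.cardinality() < ops.size()) {
            int next = pickNext(done);
            done.set(next);
            seen[next]++;
            if (!ops.get(next).apply(tuple)) {
                return false;                    // tuple dropped: skip the remaining operators
            }
            passed[next]++;
        }
        return true;                             // tuple satisfied every operator
    }

    // Prefer the not-yet-applied operator with the lowest observed pass rate.
    private int pickNext(BitSet done) {
        int best = -1;
        double bestRate = Double.MAX_VALUE;
        for (int i = 0; i < ops.size(); i++) {
            if (done.get(i)) continue;
            double rate = (seen[i] == 0) ? 0.5 : (double) passed[i] / seen[i];
            if (rate < bestRate) { bestRate = rate; best = i; }
        }
        return best;
    }
}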
Why use Eddies in P2P? (The easy answers)
• Much of the promise of P2P lies in its fully distributed nature
  • No central point of synchronization, hence no central catalog
  • A distributed catalog with statistics helps, but does not solve all problems
    • Possibly stale, hard to maintain
    • Need CAP to do the best optimization
  • No knowledge of available resources or the current state of the system (load, etc.)
  • This is the PIER philosophy!
• Eddies were designed for a federated query processor
  • Changing operator selectivities and costs
  • Federated/heterogeneous data sources
Why Eddies in P2P? (The not-so-obvious answers)
• Available compute resources in a P2P network are heterogeneous and dynamically changing
  • Where should the query be processed?
• In a large P2P system, local data distributions, arrival rates, etc. may differ from the global ones
Freddies: Federated Eddies
• A Freddy is an adaptive query processing operator within the PIER framework
• Goals:
  • Show the feasibility of adaptive query processing in PIER
  • Build a foundation and infrastructure for smarter adaptive query processing
  • Establish a baseline for Freddy performance to improve upon with smarter routing policies
An Example Freddy
[Diagram: source tuples arrive from the DHT via Get(R), Get(S), Get(T); the Freddy routes them among local operators (R join S, S join T), rehashing results to the DHT via Put(Join Value RS) and Put(Join Value ST), and produces the query output.]
System Model
• Same functionality as a centralized Eddy
  • Allows easy concept reuse
• The Freddy uses its routing policy to determine the next operator for a tuple
• Tuples in a Freddy are tagged with DoneBits indicating which operators have processed them
• The Freddy does all state management, so existing operators require no modifications
• Local processing comes first (in most cases)
  • Conserves network bandwidth
  • Not as simple as it seems
• Freddy: decide how to rehash a tuple
  • This determines the join order
• Challenge: decoupling of the routing decision from the operator means most Eddy techniques are no longer valid
Query Processing in Freddies
• The query origin creates a query plan with a Freddy
  • Possible routings are determined at this time, but not their order
• Freddy operators on all participating nodes initiate data flow
• As tuples arrive, the Freddy determines the next operator for each tuple based on its DoneBits and the routing policy (see the sketch below)
  • Source tuples are tagged with clean DoneBits and routed appropriately
• When all DoneBits are set, the tuple is sent to the output operator (returned to the query origin)
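As a concrete illustration of the flow above, here is a minimal sketch of the per-tuple routing step a Freddy performs when a tuple arrives: consult the routing policy for the next operator among those whose DoneBit is still clear, rehash the tuple into the DHT toward that operator, and send it to the output once every bit is set. The class names, the RoutingPolicy interface, and the DHT calls are hypothetical stand-ins, not PIER's actual API; setting the DoneBit at routing time (rather than when the join result is produced) is a simplification.

import java.util.BitSet;

class FreddyTuple {
    Object[] fields;
    BitSet doneBits;                  // one bit per operator in the query plan
}

interface RoutingPolicy {
    // Chooses the next operator index among those whose DoneBit is still clear.
    int chooseNext(FreddyTuple tuple, BitSet doneBits);
}

class Freddy {
    private final int numOperators;
    private final RoutingPolicy policy;

    Freddy(int numOperators, RoutingPolicy policy) {
        this.numOperators = numOperators;
        this.policy = policy;
    }

    // Invoked whenever a (local or remote) tuple arrives at this node's Freddy.
    void onTupleArrival(FreddyTuple tuple) {
        if (tuple.doneBits.cardinality() == numOperators) {
            sendToOutput(tuple);                  // all operators applied: return to the query origin
            return;
        }
        int next = policy.chooseNext(tuple, tuple.doneBits);
        tuple.doneBits.set(next);                 // simplification: mark the operator done when routing to it
        rehashToOperator(next, tuple);            // DHT put keyed on the next operator's join attribute
    }

    private void sendToOutput(FreddyTuple tuple) {
        // forward the finished tuple back to the node that issued the query
    }

    private void rehashToOperator(int operatorIndex, FreddyTuple tuple) {
        // e.g. dht.put(joinKey(operatorIndex, tuple), tuple) -- hypothetical DHT call
    }
}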
Tuple Routing Policy
• Determines to which operator to send a tuple
• Local information
  • Messages are expensive
  • Monitor local usage and adjust locally
• “Processing buddy” information
  • During processing, discover general trends in input/output nodes’ processing capabilities, output rates, etc.
  • For instance, we may want to alert the previous Freddy of poor PUT decisions
• The design space is huge: a large research area
Freddy Routing Policies
• Simple (KISS):
  • Static
  • Random: not as bad as you may think
  • Local stat monitoring (sampling)
• More complex:
  • Queue lengths (sketched below)
    • Somewhat analogous to the “back-pressure” effect
    • Monitors DHT PUT ACKs
  • Load balancing through “learning” of the global join key distribution
  • Piggyback stats on other messages
    • Don’t need global information, only stats about processing buddies (nodes with which we communicate)
    • A different sample than the local one; may or may not be better
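To make the queue-length idea concrete, here is a minimal sketch of a back-pressure routing policy of the kind the EddyQL experiment later uses: each operator's destination is charged for every un-acknowledged DHT PUT, and tuples are routed toward the eligible operator with the fewest outstanding PUTs. The counters, method names, and ACK hooks are assumptions for illustration, not PIER's actual interfaces.

import java.util.BitSet;

class QueueLengthPolicy {
    private final int[] outstandingPuts;   // PUTs sent toward each operator but not yet ACKed

    QueueLengthPolicy(int numOperators) {
        this.outstandingPuts = new int[numOperators];
    }

    // Called when a tuple is rehashed toward an operator's destination node.
    void onPutSent(int operatorIndex)  { outstandingPuts[operatorIndex]++; }

    // Called when the DHT acknowledges one of those PUTs.
    void onPutAcked(int operatorIndex) { outstandingPuts[operatorIndex]--; }

    // Choose the not-yet-done operator whose destination currently looks least loaded.
    int chooseNext(BitSet doneBits) {
        int best = -1;
        for (int i = 0; i < outstandingPuts.length; i++) {
            if (doneBits.get(i)) continue;
            if (best < 0 || outstandingPuts[i] < outstandingPuts[best]) best = i;
        }
        return best;
    }
}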
Implementation & Experimental Setup � Design Decisions: � Simplicity is key � Roughly 300 of NCSS (PIER is about 5300) � Single query processing operator � Separate routing policy module loaded at query time � Possible routing orders determined by simple optimizer � Required generalizations to the PIER execution engine to deal with generic operators � Allow PIER to run any dataflow operator � Simulator with 256 nodes, 100 tuples/table/node � Feasibility, not scalability � In the absence of global (or stale) knowledge, a static optimizer could chose any join ordering � we compare Freddy performance to all possible static plans
3-Way Join
• R join S join T
• R join S is expensive (multiplies the tuple count by 25)
• S join T is highly selective (drops 90%)
• Possible static join orderings: (R join S) join T, and (S join T) join R
  [Diagram: the two static join trees]
3-Way Join Results
[Chart: completion time (s) vs. bandwidth per node (25–150 KB/s) for the static plans RST and STR and the Eddy.]
4-Way Join
• R join S join T join U
• S join T is expensive
• Possible static join orderings: RSTU, STRU, STUR, TUSR, and a bushy plan
  [Diagram: the corresponding join trees]
• Note: a traditional optimizer can’t produce the bushy plan
4-Way Join Results
[Chart: completion time (s) vs. bandwidth per node (50–150 KB/s) for the static plans RSTU, STRU, STUR, TUSR, the bushy plan, and the Eddy.]
The Promise of Routing Policy
• Illustrative example of how routing policy can improve performance
• This is not meant to be an exhaustive comparison of policies, but rather to show the possibilities
• EddyQL considers the number of outstanding PUTs (queue length) to decide where to send a tuple
[Chart: aggregate bandwidth (MB/s) for RST, STR, Eddy, and EddyQL.]
Conclusions and Continuing Work
• Freddies provide adaptive query processing in a P2P system
  • Require no global knowledge
  • Baseline performance shows promise for smarter policies
• In the future…
  • Explore Freddy performance in a dynamic environment
  • Explore more complex routing policies
Questions? Comments? Snide remarks for Ryan? Glorious praise for Shawn? Thanks!