 
From a Calculus to an Execution Environment for Stream Processing
Robert Soulé, Martin Hirzel, Buğra Gedik, Robert Grimm
Cornell University, IBM Research, Bilkent University, New York University
DEBS 2012
… to an Execution Environment
• Source languages: CQL (StreamSQL), Sawzall (MapReduce), StreamIt (SDF)
• Optimizations: Fusion (merge ops), Fission (replicate ops), Placement (assign hosts)
• River (execution environment)
• System S (platform)
• Benefits of execution environment:
– Language portability
– Optimization reuse
From a Calculus …
• Calculus = formal language + semantics
– Stream calculus, Soulé et al. [ESOP’10]
• Graph language:
– Stream operators with functions (F)
– Queues (Q)
– Variables (V)
• Semantics:
– Small-step
– Operational
– Sequence of “operator firings”
Diagram: an operator with function f, input queue q, output queue q', and variable v.
Execution: $F \vdash \langle Q_1, V_1 \rangle \rightarrow_b \langle Q_2, V_2 \rangle \rightarrow_b^{*} \cdots$
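The "sequence of operator firings" idea can be illustrated with a minimal Python sketch. It is not the Brooklet semantics itself: the dictionary layout, the names `fire`, `run`, and `count_fn`, and the choice to always fire the first ready operator (the calculus leaves that choice open) are all assumptions made for illustration.

```python
from collections import deque

def fire(op, queues, variables):
    """One firing: pop one item from the operator's input queue, apply its
    pure function to the item and the current variable value, then store
    the new variable value and append the outputs to the output queue."""
    item = queues[op["in"]].popleft()
    state = variables[op["var"]]
    outputs, new_state = op["fn"](item, state)
    variables[op["var"]] = new_state
    queues[op["out"]].extend(outputs)

def run(ops, queues, variables):
    """Fire operators one at a time until no operator has input data."""
    firings = 0
    while True:
        ready = [op for op in ops if queues[op["in"]]]
        if not ready:
            return firings
        fire(ready[0], queues, variables)  # assumption: pick the first ready operator
        firings += 1

def count_fn(item, state):
    """Hypothetical pure operator function: state is threaded in and out."""
    new_state = state + 1
    return [(item, new_state)], new_state

queues = {"q": deque(["a", "b", "c"]), "q_out": deque()}
variables = {"v": 0}
ops = [{"in": "q", "out": "q_out", "var": "v", "fn": count_fn}]
firings = run(ops, queues, variables)
```

Each call to `fire` plays the role of one small step from one pair of queues and variables to the next; the function never mutates state itself, which mirrors the "pure functions, state threaded through invocations" view of the calculus.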
Benefits of Calculus: Translation Correctness Proofs
Diagram: two rows, each reading Input → Execute → Output, connected by two Translate arrows.
From Abstractions to the Real World
Brooklet calculus | River execution environment
Sequence of atomic steps | Operators execute concurrently
Pure functions, state threaded through invocations | Stateful functions, protected with automatic locking
Non-deterministic execution | Restricted execution: bounded queues and back-pressure
Opaque functions | Function implementations
No physical platform, independent from runtime | Abstract representation of platform, e.g. placement
Finite execution | Indefinite execution
Concurrent Execution Case 1: No Shared State
Diagram: single-threaded operators o1, o2, o3 with variables v, w, x; atomic queue operations.
• Brooklet operators fire one at a time
• River operators fire concurrently
• For both, data must be available
Concurrent Execution Case 2: With Shared State
Diagram: operators o1, o2, o3 with variables v, w, w (o2 and o3 share w); minimal locking.
• Locks form equivalence classes over shared variables
• Every shared variable is protected by one lock
• Shared variables in the same class are protected by the same lock
• Locks are acquired and released in a standard order
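One way to compute such lock classes is a union-find pass over the variables each operator uses. The sketch below is an illustration only: the slides do not say how River builds the classes, so the rule used here (variables shared by two or more operators are merged whenever one operator uses both) is an assumption, as are the names `lock_classes` and `locks_for` and the extra operator `o4`.

```python
from collections import Counter

def lock_classes(op_vars):
    """Map each shared variable to the representative of its lock class."""
    uses = Counter(v for vs in op_vars.values() for v in vs)
    shared = {v for v, n in uses.items() if n > 1}  # used by 2+ operators
    parent = {v: v for v in shared}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for vs in op_vars.values():
        group = sorted(v for v in vs if v in shared)
        for other in group[1:]:
            ra, rb = find(group[0]), find(other)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)  # smallest name represents the class
    return {v: find(v) for v in shared}

def locks_for(op, op_vars, classes):
    """Locks an operator needs, in one global (sorted) acquisition order."""
    return sorted({classes[v] for v in op_vars[op] if v in classes})

# Hypothetical operator-to-variable map for illustration.
op_vars = {"o1": {"v"}, "o2": {"w", "x"}, "o3": {"w"}, "o4": {"x"}}
classes = lock_classes(op_vars)
```

Sorting the class representatives gives every operator the same acquisition order, which is the usual way to rule out lock-ordering deadlocks.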
Restricted Execution: Bounded Queues
Diagram: operators o1, o2, o3 with variables v, w, w and a bounded queue q.
– o2 waits because output queue q is full
– o3 waits because o2 locked w
– Deadlock!
• Naïve approach: block when output queue is full
Restricted Execution: Safe Back-Pressure
Diagram: operators o1, o2, o3 with variables v, w, w and a bounded queue q.
1. Acquire locks
2. Fire operator
3. Buffer data in local queue
4. Release locks
5. Move data to output queue
• Our approach: only block on output queue when not holding locks on variables
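The five steps can be sketched with Python's standard `threading` and `queue` modules. This is a minimal illustration under stated assumptions, not River's implementation: `fire_safely`, `count_fn`, and the single-lock demo are hypothetical, and the caller is assumed to pass locks already sorted in the global order.

```python
import queue
import threading

def fire_safely(fn, item, state, locks, out_queue):
    """Fire one operator without ever blocking on a full queue while locked."""
    local = []                               # local buffer for outputs
    for lock in locks:                       # 1. acquire locks (fixed order)
        lock.acquire()
    try:
        local.extend(fn(item, state))        # 2. fire operator, 3. buffer locally
    finally:
        for lock in reversed(locks):         # 4. release locks
            lock.release()
    for out in local:                        # 5. move data to the output queue;
        out_queue.put(out)                   #    may block, but holds no locks

def count_fn(item, state):
    """Hypothetical stateful operator: counts the items it has seen."""
    state["count"] += 1
    return [(item, state["count"])]

state = {"count": 0}
w_lock = threading.Lock()
out_q = queue.Queue(maxsize=1)               # bounded output queue
received = []

def consumer():
    for _ in range(3):
        received.append(out_q.get())

t = threading.Thread(target=consumer)
t.start()
for item in ["a", "b", "c"]:
    fire_safely(count_fn, item, state, [w_lock], out_q)
t.join()
```

Because the blocking `put` happens only after the locks are released, another operator that shares `w` can still make progress while this one waits for queue space.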
Applications of an Execution Environment
• Easier to develop source languages
– Implementation language
– Language modules
– Operator templates
• Possible to reuse optimizations
– Annotations provide additional information between source and intermediate language
Function Implementations and Translations
    logs : {origin : string; target : string} stream;
    hits : {origin : string; count : int} stream
      = select istream(origin, count(origin))
        from logs[range 300]
        where origin != target
• Expose operators, communication, and state
– Diagram: operators Select, Range, Aggr, IStream; state win, count
• Pre-existing operator templates
– Template: Bag.filter (fun x -> #expr)
– Instantiated: Bag.filter (fun x -> origin != target)
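To make the operator decomposition concrete, here is a rough Python sketch of the query as a pipeline of four small functions. It is not River's generated code: the function names, the ordering Range → Select → Aggr → IStream, the grouping of the count by origin, and the timestamped sample data are all assumptions made for illustration.

```python
from collections import Counter

def range_window(timed_logs, now, width=300):
    """Range: keep records whose timestamp lies in (now - width, now]."""
    return [rec for (t, rec) in timed_logs if now - width < t <= now]

def select(bag, pred):
    """Select: the filter template, instantiated with a predicate."""
    return [rec for rec in bag if pred(rec)]

def aggr_count(bag):
    """Aggr: count records per origin (grouping by origin is an assumption)."""
    return Counter(rec["origin"] for rec in bag)

def istream(previous, current):
    """IStream: emit only the (origin, count) pairs that are new or changed."""
    return {(o, c) for o, c in current.items() if previous.get(o) != c}

def hits_at(timed_logs, now):
    win = range_window(timed_logs, now)
    kept = select(win, lambda rec: rec["origin"] != rec["target"])
    return aggr_count(kept)

# Hypothetical timestamped log records.
timed_logs = [
    (10, {"origin": "a", "target": "b"}),
    (20, {"origin": "a", "target": "a"}),   # dropped: origin == target
    (400, {"origin": "c", "target": "a"}),
]
before = hits_at(timed_logs, 100)
after = hits_at(timed_logs, 400)
new_hits = istream(before, after)
```

The lambda passed to `select` plays the role of the `#expr` hole in the operator template: the translator fills it with the query's `where` predicate.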
Translation Support: Pluggable Compiler Modules
    select istream(*)
    from quotes[now], history
    where quotes.ask<=history.low
    and quotes.ticker=history.ticker
Diagram: Symbol table, SQL analyzer, Expression analyzer, and CQL analyzer, connected by has-a and is-a relationships.
• CQL = SQL + Streaming + Expressions
Optimization Support: Extensible Annotations
• Optimizer needs to know:
– Safety
– Profitability
• Source language: establishes by construction, e.g., Sawzall reducers commute
• River (execution environment)
• System S (platform): establishes, e.g., available resources
Optimization Support: Current Annotations
Annotation | Description | Optimization
@Fuse(ID) | Fuse operators with same ID in the same process | Fusion
@Parallel() | Perform fission on an operator | Fission
@Commutative() | An operator’s function is commutative | Fission
@Keys(k1,…,kn) | An operator’s state is partitionable by fields k1,…,kn | Fission
@Group(ID) | Place operators with same ID on the same machine | Placement
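The reason a `@Keys` annotation makes fission safe can be sketched as key-based routing: if every record with the same key fields goes to the same replica, each replica keeps a disjoint slice of the state. The Python below is an illustration only; `route`, `fission_count`, the CRC32 hash, and the sample records are assumptions, not River's mechanism.

```python
import zlib
from collections import Counter

def route(record, keys, replicas):
    """Pick a replica from the key fields, using a hash that is stable across runs."""
    key = "|".join(str(record[k]) for k in keys)
    return zlib.crc32(key.encode("utf-8")) % replicas

def fission_count(records, keys, replicas):
    """Run a per-key counting operator as `replicas` independent copies."""
    states = [Counter() for _ in range(replicas)]
    for rec in records:
        i = route(rec, keys, replicas)
        states[i][tuple(rec[k] for k in keys)] += 1
    return states

# Hypothetical records keyed by the "origin" field.
records = [
    {"origin": "a.com"}, {"origin": "b.com"}, {"origin": "a.com"},
    {"origin": "c.com"}, {"origin": "a.com"},
]
states = fission_count(records, ["origin"], replicas=4)
merged = sum(states, Counter())
```

Merging the replica states gives the same counts as a single unreplicated operator, and no key appears in more than one replica, which is the property the annotation asserts.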
Evaluation
• Four benchmark applications
– CQL linear road
– StreamIt FM radio
– Sawzall web log analyzer (batch)
– CQL web log analyzer (continuous)
• Three optimizations
– Placement
– Fission
– Fusion
Distributed Linear Road
(simplified version from Arasu/Babu/Widom [VLDBJ’06])
Diagram: operator graph built from now, range, project, select, distinct, join, aggregate, partition, dup-split, istream, and rstream operators.
• First distributed CQL implementation
CQL: Placement, Fusion, Fission
• Placement + Fusion
– 4x speedup on 4 machines
• Fission
– 2x speedup on 16 machines
• Insufficient work per operator
StreamIt: Placement
• Optimization reuse
– 1.8x speedup on 4 machines
Sawzall (MapReduce on River): Fission + Fusion
• Same fission optimizer for Sawzall as for CQL
• 8.92x speedup on 16 machines, 14.80x on 64 cores
• With fusion, 50.32x on 64 cores
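A quick way to compare these numbers is parallel efficiency, defined as speedup divided by the number of workers. That metric is not on the slide; it is a standard derived quantity computed here from the reported speedups, and the function name `efficiency` is illustrative.

```python
def efficiency(speedup, workers):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / workers

reported = {
    "fission, 16 machines": (8.92, 16),
    "fission, 64 cores": (14.80, 64),
    "fission + fusion, 64 cores": (50.32, 64),
}
efficiencies = {name: efficiency(s, n) for name, (s, n) in reported.items()}
for name, e in efficiencies.items():
    print(f"{name}: {e:.2f}")
```

On 64 cores, adding fusion raises the efficiency from about 0.23 to about 0.79.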
Related Work
Diagram: three overlapping areas (stream processing, execution environment, translators from languages to IL), with this paper at their intersection.
• SVM, Labonte et al. [PACT’04]
• CQL, Arasu et al. [VLDB J.’06]
• P-Code, Nelson [CC’79]
Conclusions
• River, execution environment for streaming
• Semantics specified by formal calculus
– Brooklet, Soulé et al. [ESOP’10]
• 3 source languages, 3 optimizations
– First distributed CQL
– Language compiler module reuse
– Optimization enabled by annotations
• Encourages innovation in stream processing
• http://www.cs.nyu.edu/brooklet/