Dynamic Expressivity with Static Optimization for Streaming Languages


  1. Dynamic Expressivity with Static Optimization for Streaming Languages. Robert Soulé (Cornell), Michael I. Gordon (MIT), Saman Amarasinghe (MIT), Robert Grimm (NYU), Martin Hirzel (IBM). DEBS 2013.

  2. Problem. A stream is a FIFO queue between operators; an operator's “rate” is the number of queue pushes/pops per operator firing. In a video decoder (VideoInput → Huffman → IQuant → IDCT), the edge out of VideoInput has a dynamic rate (varies at runtime), which requires dynamic expressivity; the remaining edges have static rates (known at compile time), which enable static optimization. How to get both? Observation: applications are “mostly static” (Thies, Amarasinghe [PACT 2010]).

  3. StreamIt, a Streaming Language Designed for Static Optimization.

      float->float pipeline ABC {
        add float->float filter A() {
          work pop … push 2 { … }     // A pushes 2 per firing
        }
        add float->float filter B() {
          work pop 3 push 1 { … }     // B pops 3 and pushes 1 per firing
        }
        add float->float filter C() {
          work pop 2 push … { … }     // C pops 2 per firing
        }
      }

     Statically known push/pop rates (SDF = Synchronous Dataflow).

  4. SDF Steady-State Schedule. [Diagram: the repeating firing sequence of A, B, and C, with the token counts on each FIFO queue between firings.] Statically known firing order and FIFO queue sizes.
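The firing counts in such a steady-state schedule come from the SDF balance equations (producer firings × push rate = consumer firings × pop rate). A minimal sketch for the A → B → C pipeline from the previous slide, assuming a connected graph with consistent rates (the propagation helper is illustrative, not the StreamIt scheduler; A's pop and C's push were elided on the slide and are not needed):

```python
from fractions import Fraction
from math import gcd

# Each edge is (producer, push rate, consumer, pop rate).
edges = [("A", 2, "B", 3),   # A pushes 2 per firing, B pops 3
         ("B", 1, "C", 2)]   # B pushes 1 per firing, C pops 2

def steady_state(edges):
    """Smallest positive integer firing counts r with r[p]*push == r[c]*pop."""
    reps = {edges[0][0]: Fraction(1)}  # seed the first operator with 1 firing
    changed = True
    while changed:                      # propagate rates across each edge
        changed = False
        for prod, push, cons, pop in edges:
            if prod in reps and cons not in reps:
                reps[cons] = reps[prod] * push / pop
                changed = True
            elif cons in reps and prod not in reps:
                reps[prod] = reps[cons] * pop / push
                changed = True
    # Scale the fractional solution to the least integer firing counts.
    lcm = 1
    for r in reps.values():
        lcm = lcm * r.denominator // gcd(lcm, r.denominator)
    return {op: int(r * lcm) for op, r in reps.items()}

print(steady_state(edges))  # → {'A': 3, 'B': 2, 'C': 1}
```

With these counts, A produces 3 × 2 = 6 tokens per period and B consumes 2 × 3 = 6, so every queue returns to its initial occupancy, which is what makes the queue sizes statically boundable.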

  5. Scalarization. [Diagram: each FIFO slot between A, B, and C replaced by reads and writes of named temporaries r1…r6.] Implement the FIFO queue via local variables, or even registers (more intricate with “peek”, not shown in this talk).
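The idea can be shown in miniature: once the steady-state schedule fixes the exact push/pop order, each queue slot becomes a plain local variable. A sketch with illustrative operators (not StreamIt's generated code):

```python
from collections import deque

# Before scalarization: two fused firings communicate through a real FIFO.
def fused_with_queue(x):
    q = deque()
    q.append(x * 2)        # producer firing: push twice
    q.append(x * 2 + 1)
    a = q.popleft()        # consumer firing: pop twice
    b = q.popleft()
    return a + b

# After scalarization: the schedule pairs each push with its pop, so every
# queue slot is just a local variable (often a register after compilation).
def fused_scalarized(x):
    r1 = x * 2             # was q.append(...)
    r2 = x * 2 + 1
    return r1 + r2         # was q.popleft() + q.popleft()

assert fused_with_queue(5) == fused_scalarized(5)
```

The transformation is only possible because the static rates make the push/pop pairing known at compile time.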

  6. Fission (Data Parallelism). [Diagram: a round-robin split feeding replicated operators X1, X2, followed by a round-robin merge.] Round-robin split and merge rely on static rates.
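A sequential sketch of what fission does, assuming a stateless operator with static rate pop 1 / push 1 (the function name and list-based "streams" are illustrative; in the real compiler the replicas run in parallel):

```python
def fission(stream, f, replicas=2):
    # Round-robin split: element i goes to replica i % replicas.
    lanes = [stream[i::replicas] for i in range(replicas)]
    # Each replica applies the operator to its lane (parallel in practice).
    lanes = [[f(x) for x in lane] for lane in lanes]
    # Round-robin merge: interleave the lanes back in the original order.
    out = []
    for i in range(max(len(lane) for lane in lanes)):
        for lane in lanes:
            if i < len(lane):
                out.append(lane[i])
    return out

print(fission([1, 2, 3, 4, 5], lambda x: x * 10))  # → [10, 20, 30, 40, 50]
```

The split and merge weights are exactly the operator's static rates, which is why this transformation is unavailable on dynamic-rate edges.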

  7. Dynamic Rates.

      float->float pipeline Decoder {
        add float->float filter VideoInput() {
          work pop 1 push 1 { … }
        }
        add float->float filter Huffman() {
          work pop * push 1 { … }     // dynamic pop rate
        }
        add float->float filter IQuant() {
          work pop 64 push 64 { … }
        }
        add float->float filter IDCT() {
          work pop 8 push 8 { … }
        }
      }

     No more static optimization?

  8. Dynamic Scheduling Approaches.

      Scheduling approach | Description                           | Representative citation
      OS Thread           | Each operator has its own thread      | SPC, Amini et al. [DMSSP 2006]
      Demand              | Recruit from thread pool              | Aurora, Abadi et al. [VLDBJ 2003]
      No-op               | Static rate, send nonce when no data  | CQL, Arasu et al. [VLDBJ 2006]

  9. Our Approach: Locally Static + Globally Dynamic.
     1. Partitioning into static subgraphs
     2. Locally optimize static subgraphs
        2a. Fusion
        2b. Scalarization
        2c. Fission
     3. Placement
        3a. Core placement
        3b. Thread placement
     4. Globally dynamic scheduling

  10. Partition into Static Subgraphs. A static subgraph is a weakly connected component after deleting dynamic edges. [Diagram: the decoder split at the dynamic edge into {VideoInput} and {Huffman, IQuant, IDCT}.]
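This step is a standard connected-components computation; a sketch on the decoder graph from the talk (the edge list and helper are illustrative, not the compiler's data structures):

```python
def static_subgraphs(ops, edges):
    """edges: (src, dst, is_dynamic) triples; returns sorted component lists."""
    # Keep only static edges and treat them as undirected (weak connectivity).
    adj = {op: set() for op in ops}
    for src, dst, is_dynamic in edges:
        if not is_dynamic:
            adj[src].add(dst)
            adj[dst].add(src)
    seen, parts = set(), []
    for op in ops:                      # depth-first search from each unseen op
        if op in seen:
            continue
        comp, stack = set(), [op]
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        parts.append(sorted(comp))
    return parts

ops = ["VideoInput", "Huffman", "IQuant", "IDCT"]
edges = [("VideoInput", "Huffman", True),   # the dynamic-rate edge (*)
         ("Huffman", "IQuant", False),
         ("IQuant", "IDCT", False)]
print(static_subgraphs(ops, edges))
# → [['VideoInput'], ['Huffman', 'IDCT', 'IQuant']]
```

Deleting the single dynamic edge leaves VideoInput alone and keeps the three static-rate operators together, so all the SDF optimizations still apply inside the larger component.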

  11. Locally Optimize Static Subgraphs. [Diagram: the decoder subgraphs after each step — partitioning, fusion, scalarization, and fission.]

  12. Core Placement. Static weight estimate and greedy bin-packing; place fission replicas on all cores. [Diagram: the decoder's operators assigned across cores.]
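A sketch of the greedy bin-packing idea: assign operators, heaviest first, to the currently least-loaded core (the weights and helper name are illustrative, not the compiler's actual estimates):

```python
def greedy_place(weights, cores=2):
    """Greedy bin-packing: heaviest operator first onto the lightest core."""
    load = [0.0] * cores
    placement = {}
    for op, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        core = load.index(min(load))   # pick the least-loaded core
        placement[op] = core
        load[core] += w
    return placement, load

# Hypothetical static weight estimates for the decoder's operators.
weights = {"Huffman": 5.0, "IQuant": 3.0, "IDCT": 4.0, "VideoInput": 1.0}
placement, load = greedy_place(weights, cores=2)
print(placement, load)  # → loads balance to [6.0, 7.0]
```

Greedy placement is a heuristic; it will not always find the optimal balance, but it is cheap and works well when the static weight estimates are roughly accurate.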

  13. Thread Placement. One pinned thread per static subgraph per core (must be able to suspend the dynamic reader when there is no input). [Diagram: threads pinned across the placed operators.]

  14. Dynamic Scheduling. Use condition variables for hand-off to the successor. [Diagram legend: control edges, barrier.]
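A minimal sketch of the condition-variable hand-off across a dynamic edge: the upstream thread signals when it produces, and the downstream reader suspends while its queue is empty (queue, sentinel, and operator bodies are illustrative):

```python
import threading
from collections import deque

q = deque()                  # the dynamic-rate queue between the two threads
cv = threading.Condition()
DONE = object()              # sentinel marking end of stream

def producer(items):
    for x in items:
        with cv:
            q.append(x)
            cv.notify()      # hand off: wake the suspended reader
    with cv:
        q.append(DONE)
        cv.notify()

def consumer(out):
    while True:
        with cv:
            while not q:
                cv.wait()    # suspend: no input on the dynamic edge yet
            x = q.popleft()
        if x is DONE:
            return
        out.append(x * 2)    # stand-in for the downstream operator's work

out = []
t1 = threading.Thread(target=producer, args=([1, 2, 3],))
t2 = threading.Thread(target=consumer, args=(out,))
t2.start(); t1.start(); t1.join(); t2.join()
print(out)  # → [2, 4, 6]
```

The `cv.wait()` loop is what lets a pinned thread suspend cheaply instead of spinning when its dynamic input is empty, which is the requirement stated on the thread-placement slide.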

  15. Data Pipelining. Use a buffer for pipeline parallelism. [Diagram legend: control edges, data edges, barrier.]

  16. Dynamic vs. Static Performance. [Diagram: FileReader → Work operator (weight W/2) → dynamic edge (*) → Work operator (weight W/2) → FileWriter.] Close enough for heavy operators, but what about light operators?

  17. Amortizing the Thread Switching Overhead.
      1. Partitioning into static subgraphs
      2. Locally optimize static subgraphs
         2a. Fusion
         2b. Scalarization
         2c. Fission
         2d. Batching
      3. Placement
         3a. Core placement
         3b. Thread placement
      4. Globally dynamic scheduling

  18. Benefit of Batching. [Diagram: FileReader → Work (weight 1) → N dynamic queues (*) → Work (weight 1) → FileWriter.] Amortize thread switching overhead without heavy operators.
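The arithmetic behind the benefit is simple: moving N elements per hand-off divides the number of cross-thread switches by N. A sketch with an illustrative count (not a measurement from the paper):

```python
def handoffs(n_items, batch):
    """Cross-thread hand-offs needed to move n_items in batches of `batch`."""
    return -(-n_items // batch)   # ceiling division

items = 1_000
print(handoffs(items, 1))     # → 1000 hand-offs without batching
print(handoffs(items, 100))   # → 10 hand-offs with batches of 100
```

With light operators the per-hand-off cost dominates, so cutting hand-offs by 100x recovers most of the lost throughput without needing the operators themselves to be heavy.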

  19. Ours vs. Other Dynamic Schedulers.

      Scheduling approach | Experiment                               | Performance result
      OS Thread           | 32 threads, 1 core, work 31 per operator | Our scheduler is 10x faster
      Demand              | Huffman encoder and decoder              | Our scheduler is 1.2x faster
      No-op               | 2 programs: VWAP and predicate filter    | Our scheduler is 5.1x and 4.9x faster

      Our scheduler was faster in all cases (see paper for details).

  20. Conclusions.
      • Static streaming languages such as StreamIt enable powerful optimizations
      • But many real-world applications require dynamic rates
      • We extend the StreamIt optimizing compiler to handle dynamic rates
