
Data Flow Computing. James Spooner, VP of Acceleration. QCon, Finance Track



  1. Acceleration in the Wild, with Data Flow Computing. James Spooner, VP of Acceleration. QCon, Finance Track, 08 March 2012

  2. Acceleration in the Wild with Data Flow
     • Deliberate, focused approach to improving application speed
       – Involves adding Data Flow Engines (DFEs)
       – Makes some of the program faster
       – Will be programmed intentionally and be architecture specific
       – Will exploit as much available parallelism as possible
       – May require transformations to expose parallelism
       – May have multiple implementations
     Maxeler is an acceleration specialist, delivering end-to-end performance for a range of clients in the banking and oil/gas exploration industries.

  3. Making efficient use of Silicon

  4. Computing History… - J. P. Eckert, Jr. (Co-Inventor of ENIAC). Credit: Prof. Paul H.J. Kelly

  5. Computing History… “The parallel approach to computing does require that some original thinking be done about numerical analysis and data management in order to secure efficient use. In an environment which has represented the absence of the need to think as the highest virtue this is a decided disadvantage.” -Daniel Slotnick (Chief Architect of ILLIAC IV), 1967 Credit: Prof. Michael J. Flynn

  6. So what happened?
     • Eckert (and Amdahl) were right, Slotnick was wrong, until…
     • Serial computing hit the wall(s) last decade:
       – The memory wall: the increasing gap between processor and memory speeds. This effect pushes cache sizes larger in order to mask the latency of memory. This helps only to the extent that memory bandwidth is not the bottleneck in performance.
       – The ILP wall: the increasing difficulty of finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy.
       – The power wall: the trend of consuming exponentially increasing power with each factorial increase of operating frequency. This increase can be mitigated by "shrinking" the processor by using smaller traces for the same logic. The power wall poses manufacturing, system design and deployment problems that have not been justified in the face of the diminished gains in performance due to the memory wall and ILP wall.
     $P_{avg} \propto C_{load} \, V_{DD}^{2} \, f$
     Source: Wikipedia
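As a rough worked example of the power-wall relation on this slide, assume dynamic power scales as load capacitance times supply voltage squared times frequency. The scaling factors below (voltage to 0.8 of its old value, frequency to half) are hypothetical numbers chosen only for illustration, and the load capacitance is assumed unchanged.

```latex
\frac{P_{\text{new}}}{P_{\text{old}}}
  = \frac{C_{load}\,(0.8\,V_{DD})^{2}\,(0.5\,f)}{C_{load}\,V_{DD}^{2}\,f}
  = 0.64 \times 0.5
  = 0.32
```

Under these assumptions, halving the clock while trimming the voltage by a fifth cuts dynamic power to roughly a third. This is one way to see why many slower parallel units can be more power-efficient than one fast serial unit.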

  7. Using silicon efficiently - parallelism (by level of parallelism: examples and costs)
     • Coarse Grained
       – Examples: multi-node, multi-chip, multi-core; process / thread level parallelism
       – Costs: developing a distributed system; locks, mutexes, queues, etc.
     • Fine Grained
       – Examples: instruction level parallelism (ILP), i.e. out-of-order execution, superscalar, instruction pipelining, speculative execution; data level parallelism, i.e. SIMD / SSE
       – Costs: lots of silicon; compiler can do some work upfront
     • Ultra Fine Grained
       – Examples: Data Flow architectures, i.e. massively parallel, lock free, hazard free, streaming datapaths
       – Costs: resolve once

  8. How is modern silicon used? Intel 6-Core X5680 “Westmere”

  9. How is modern silicon used? Intel 6-Core X5680 “Westmere”. Labels: Computation; Support logic for fine grained parallelism

  10. What is Dataflow Computing? Computing with control flow processors vs. computing with dataflow engines (DFEs)

  11. MPC-X1000: 1U dataflow cloud providing dynamically scalable compute capability over Infiniband
      • 8 Vectis dataflow engines (DFEs)
      • 192GB of DFE RAM
      • Dynamic allocation of DFEs to conventional CPU servers
        – Zero-copy RDMA between CPUs and DFEs over Infiniband
      • Equivalent performance to 40-60 x86 servers

  12. Dataflow Programming

  13. Application Components. Diagram labels: Host application, CPU, SLiC, MaxelerOS, PCI Express, Memory, Manager, Kernels (DataFlow), Memory

  14. Programming with MaxCompiler. Diagram labels: C / C++ / Fortran; MaxJ; SLiC

  15. MaxCompiler Development Process. Diagram labels: CPU, Main Memory, CPU Code. Computation: $y_i = x_i \times x_i + 30$
      CPU Code (.c):
          int *x, *y;
          for (int i = 0; i < DATA_SIZE; i++)
              y[i] = x[i] * x[i] + 30;
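For reference, the CPU loop on this slide can be wrapped into a self-contained function. This is only a sketch: the function name `calc_cpu` and the explicit length parameter `n` (standing in for `DATA_SIZE`) are assumptions made for illustration, not names from the deck.

```c
#include <assert.h>
#include <stddef.h>

/* CPU baseline: y[i] = x[i] * x[i] + 30 for each of the n elements.
 * calc_cpu and n are illustrative names, not part of the original slides. */
void calc_cpu(const int *x, int *y, size_t n)
{
    /* Null buffers are only acceptable when there is nothing to process. */
    assert(n == 0 || (x != NULL && y != NULL));

    for (size_t i = 0; i < n; i++) {
        y[i] = x[i] * x[i] + 30;
    }
}
```

Called on the inputs 0 through 5, this fills `y` with 30, 31, 34, 39, 46, 55, the same values the deck later streams out of the hardware kernel.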

  16. MaxCompiler Development Process. Diagram labels: CPU, Main Memory, CPU Code, SLiC, MaxelerOS, PCI Express, Chip, Memory, Manager; kernel graph x × x + 30 → y
      CPU Code (.c):
          #include "MaxSLiCInterface.h"
          #include "Calc.max"
          int *x, *y;
          for (int i = 0; i < DATA_SIZE; i++)
              y[i] = x[i] * x[i] + 30;
          Calc(x, y, DATA_SIZE)
      Manager (.java):
          Manager m = new Manager("Calc");
          Kernel k = new MyKernel();
          m.setKernel(k);
          m.setIO(link("x", PCIE), link("y", PCIE));
          m.addMode(modeDefault());
          m.build();
      MyKernel (.java):
          HWVar x = io.input("x", hwInt(32));
          HWVar result = x * x + 30;
          io.output("y", result, hwInt(32));

  17. MaxCompiler Development Process. Diagram labels: CPU, Main Memory, Host Code, SLiC, MaxelerOS, PCI Express, Chip, Memory (y), Manager; kernel graph x × x + 30 → y
      CPUCode (.c):
          #include "MaxSLiCInterface.h"
          #include "Calc.max"
          int *x, *y;
          device = max_open_device(maxfile, "/dev/maxeler0");
          Calc(x, DATA_SIZE)
      Manager (.java):
          Manager m = new Manager();
          Kernel k = new MyKernel();
          m.setKernel(k);
          m.setIO(link("x", PCIE), link("y", DRAM_LINEAR1D));
          m.addMode(modeDefault());
          m.build();
      MyKernel (.java):
          HWVar x = io.input("x", hwInt(32));
          HWVar result = x * x + 30;
          io.output("y", result, hwInt(32));

  18. The Full Kernel. Diagram: kernel graph x × x + 30 → y
          public class MyKernel extends Kernel {
              public MyKernel(KernelParameters parameters) {
                  super(parameters);
                  HWVar x = io.input("x", hwInt(32));
                  HWVar result = x * x + 30;
                  io.output("y", result, hwInt(32));
              }
          }

  19. Kernel Streaming: In Hardware. Input stream waiting: 0, 1, 2, 3, 4, 5 (0 enters first). Pipeline empty. y: (none)

  20. Kernel Streaming: In Hardware. Input x = 0. y: (none)

  21. Kernel Streaming: In Hardware. Input x = 1; multiplier output 0. y: (none)

  22. Kernel Streaming: In Hardware. Input x = 2; multiplier output 1; adder output 30. y: (none)

  23. Kernel Streaming: In Hardware. Input x = 3; multiplier output 4; adder output 31. y: 30

  24. Kernel Streaming: In Hardware. Input x = 4; multiplier output 9; adder output 34. y: 30 31

  25. Kernel Streaming: In Hardware. Input x = 5; multiplier output 16; adder output 39. y: 30 31 34

  26. Kernel Streaming: In Hardware. Input exhausted; multiplier output 25; adder output 46. y: 30 31 34 39

  27. Kernel Streaming: In Hardware. Adder output 55. y: 30 31 34 39 46

  28. Kernel Streaming: In Hardware. Pipeline drained. y: 30 31 34 39 46 55
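The animation in slides 19 to 28 can be approximated in software with a tick-by-tick simulation. The sketch below is not Maxeler code: it assumes a three-register pipeline (input, multiplier, adder) inferred from the slide frames, and the names `stage_t` and `stream_kernel` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One pipeline register: a value plus a flag saying whether it holds data. */
typedef struct {
    bool valid;
    int value;
} stage_t;

/* Streams n inputs through an assumed three-stage pipeline (input register,
 * multiplier, adder) and writes x*x + 30 results to y.
 * Returns the number of clock ticks until the last result is written. */
size_t stream_kernel(const int *x, int *y, size_t n)
{
    assert(n == 0 || (x != NULL && y != NULL));

    stage_t in = {false, 0}, mul = {false, 0}, add = {false, 0};
    size_t next_in = 0, next_out = 0, ticks = 0;

    while (next_out < n) {
        /* Update stages back to front so each reads the previous tick's value. */
        if (add.valid) {
            y[next_out++] = add.value;      /* result leaves the pipeline */
        }
        add.valid = mul.valid;
        add.value = mul.value + 30;         /* adder stage */
        mul.valid = in.valid;
        mul.value = in.value * in.value;    /* multiplier stage */
        in.valid = next_in < n;
        in.value = in.valid ? x[next_in++] : 0;  /* next stream element, if any */
        ticks++;
    }
    return ticks;
}
```

With the six inputs 0 through 5 this model takes nine ticks, one per frame in slides 20 to 28: three ticks to fill the pipeline, then one result per tick. Once the pipeline is full, all stages work on different elements at the same time, which is the point the animation makes.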

  29. Data flow graph as generated by MaxCompiler: 4866 nodes; about 250x100

  30. How we approach Acceleration

  31. What always makes Acceleration hard?
      • Messy code
      • Complicated build dependences
      • Confused control-flow
      • Impenetrable data access
      • Pointer-intensive data structures
      • Premature optimization
      Code on slide: for (i=0; i<N; ++i) { points[i]->incx(); }
      Diagram labels: records with fields x, y, z; r, θ; p, q

  32. Conflicting Goals
      • Some well-motivated software structures have real value, but make acceleration harder
      • Examples:
        – Virtual method calls inside a loop
        – Collections with non-uniform type
        – Substructure sharing
      Code on slide: for (i=0; i<N; ++i) { points[i]->incx(); }
      Diagram labels: records with fields x, y, z; r, θ; p, q

  33. What makes Acceleration easier?
      • Self-evident data dependences
      • Computing on large collections of uniform data
      • Appropriate representation hiding
      • Getting the abstraction right
      Diagram: three uniform arrays, one each for x, y and z
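One way to read the contrast between the pointer-chasing loop on the previous slides and the uniform arrays shown here is a struct-of-arrays layout. The sketch below is an illustration under assumptions: the type `points_soa`, the function `incx_all`, and the guess that `incx()` adds one to the x coordinate are all hypothetical and not taken from the deck.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct-of-arrays layout: one contiguous array per field,
 * instead of an array of pointers to individually allocated points. */
typedef struct {
    float *x;
    float *y;
    float *z;
    size_t count;
} points_soa;

/* Increment every x coordinate (assumed meaning of incx()).
 * Each iteration touches only x[i], so iterations are independent. */
void incx_all(points_soa *pts)
{
    assert(pts != NULL && (pts->count == 0 || pts->x != NULL));

    for (size_t i = 0; i < pts->count; i++) {
        pts->x[i] += 1.0f;
    }
}
```

Here the data dependences are visible from the loop itself: there is no virtual dispatch and no aliasing between elements, and the x values are read as one linear stream, which is the shape of access a streaming datapath wants.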

  34. Maximum Performance Computing
      • Identify parallelism and take advantage of it
        – Fully understand data dependencies
      • Minimize memory bandwidth
        – Data reuse and representation
      • Regularize the computation and data
        – Minimize control flow complexity
      • Find optimal balance for underlying architecture
        – Memory hierarchy bandwidth(s) and size(s) and latency(s)
        – Communication bandwidth(s) and latency(s)
        – Math performance
        – Branch cost (control divergence)
        – Axes of Parallelism

  35. Maxeler Acceleration Process
      • Run the code with profiling tools
      • Understand data and loop structures and data access patterns
      • Investigate transformation options for these structures and access patterns
      • Decide which parts of the code need acceleration
      • Implement and validate
      Diagram stages: Code, Analysis, Transformation, Partitioning, Implementation, Result. Annotations: Sets theoretical performance bounds; Achieve performance

  36. Application Analysis

  37. Partitioning Options. Diagram labels: Data Access Plans, Code Partitioning, Transformations; Pareto Optimal Options plotted against Development Time and Runtime. Try to minimise runtime and development time, while maximising flexibility and precision.

  38. Credit Derivatives Valuation & Risk
      • Compute value of complex financial derivatives (CDOs)
      • Typically run overnight, but beneficial to compute in real-time
      • Many independent jobs
      • Speedup: 220-270x
      • Power consumption per node drops from 250W to 235W/node
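A back-of-the-envelope reading of these figures in terms of energy per job follows. It rests on assumptions the slide does not state: that 250 W is the node's draw before acceleration and 235 W the draw after, that both are average draws over the whole run, and that run time shrinks by exactly the speedup factor $S$.

```latex
\frac{E_{\text{before}}}{E_{\text{after}}}
  = \frac{250\,\mathrm{W} \times T}{235\,\mathrm{W} \times T / S}
  = \frac{250}{235}\, S
  \approx 1.064\, S
  \quad\Rightarrow\quad
  \approx 234 \text{ for } S = 220, \qquad
  \approx 287 \text{ for } S = 270
```

Under those assumptions, almost all of the energy saving per job comes from the shorter run time rather than from the small drop in node power.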

  39. Discovering the Dataflow of an Application
