Performance-driven system generation for distributed vertex-centric graph processing on multi-FPGA systems


  1. Performance-driven system generation for distributed vertex-centric graph processing on multi-FPGA systems. Nina Engelhardt, C.-H. Dominic Hung, Hayden K.-H. So, The University of Hong Kong, 28th August 2018.

  2. The GraVF graph processing framework. [Diagram: grid of PEs, with the user kernel inserted into each PE.] Vertex-centric graph processing framework on FPGA. User provides a kernel, which is inserted into the framework architecture. PEs exchange messages over an on-chip network.
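
To make the "user kernel" concrete, the following is a minimal software sketch of what a vertex-centric kernel computes in one superstep, using BFS as an example. The function shape, phase split, and message format are illustrative assumptions, not the actual GraVF hardware kernel interface.

```python
# Illustrative software model of a vertex-centric kernel (BFS-style).
# The phase split and message format are assumptions for exposition,
# not the actual GraVF hardware kernel interface.

INF = float("inf")

def bfs_kernel(vertex_state, incoming_messages):
    """One superstep of BFS for a single vertex.

    vertex_state      -- current distance of this vertex from the root
    incoming_messages -- candidate distances received from neighbours

    Returns (new_state, outgoing_value); outgoing_value is None when the
    vertex did not change and therefore sends no messages.
    """
    best = min(incoming_messages, default=INF)
    if best < vertex_state:
        return best, best + 1    # updated: propagate new distance on all edges
    return vertex_state, None    # unchanged: stay silent
```

Conceptually, each PE would apply such a kernel to the vertices it owns and forward the outgoing value along every out-edge as messages for the next superstep.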

  3. Now extended to multiple FPGAs. Allows PEs to exchange messages with PEs on different FPGAs. The network is extended by adding an external interface and routing messages destined for PEs on other FPGAs over 10GbE. [Diagram: PE arrays on FPGA 1 through FPGA n, each with a 10GbE interface, connected through an Ethernet switch.]
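
The added routing decision is simple in principle: a message carries a destination vertex, and the router either delivers it to a local PE or forwards it to the 10GbE interface. A minimal sketch, assuming a round-robin vertex-to-PE mapping (the actual partitioning used by the framework is not specified on this slide):

```python
# Minimal sketch of the message routing decision in the extended network.
# The round-robin vertex-to-PE mapping is an assumption for illustration.

def route(dest_vertex, local_fpga, n_fpga, n_pe_per_fpga):
    """Decide where a message for dest_vertex should go."""
    dest_pe_global = dest_vertex % (n_fpga * n_pe_per_fpga)
    dest_fpga = dest_pe_global // n_pe_per_fpga
    dest_pe = dest_pe_global % n_pe_per_fpga
    if dest_fpga == local_fpga:
        return ("local", dest_pe)           # deliver over the on-chip network
    return ("10GbE", dest_fpga, dest_pe)    # forward via the Ethernet switch
```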

  4. What’s the performance? The vertex kernel has a well-defined interface, so the resources needed to process one edge can be calculated. [Diagram: PE pipeline: receive message, update vertex, read edge, send message.] Build a roofline-style performance model based on platform resources. Use the model to automatically pick a configuration when generating the system.

  5. Limiting factors. Four limits are considered: processing element throughput, memory bandwidth, network interface bandwidth, and total network bandwidth.

  6. Processing element throughput: T_sys ≤ n_FPGA × n_PE/FPGA × f_clk / CPE_PE  (L_PE). Cycles per edge (CPE): analogous to processor CPI; used together with the clock frequency to determine individual PE throughput. CPE is affected by the PE architecture, data hazards, and the kernel implementation. Multiplied by the number of PEs in the system.
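
As a worked example, the limit L_PE can be evaluated directly. The 100 MHz clock and the 16-PE configuration below are assumptions for illustration; CPE = 1.6 is one of the values that appears in the results.

```python
def pe_throughput_limit(n_fpga, n_pe_per_fpga, f_clk_hz, cpe):
    """L_PE: edges per second the processing elements themselves can sustain."""
    return n_fpga * n_pe_per_fpga * f_clk_hz / cpe

# Assumed example: 1 FPGA, 16 PEs, 100 MHz clock, CPE = 1.6.
print(pe_throughput_limit(1, 16, 100e6, 1.6) / 1e6, "MTEPS")  # -> 1000.0
```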

  7. Memory bandwidth: T_sys ≤ n_FPGA × BW_mem / m_edge  (L_mem). Edges can be stored off-chip to increase the processable graph size, but they can only be processed as fast as they can be loaded.
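
The limit L_mem works the same way: the achievable DRAM bandwidth divided by the size of one edge record, summed over boards. The 64-bit edge record below is an assumption for illustration; the 2.5 Gbps figure is the random-access bandwidth quoted in the single-FPGA results later.

```python
def memory_limit(n_fpga, bw_mem_bits_per_s, m_edge_bits):
    """L_mem: edges per second limited by off-chip memory bandwidth."""
    return n_fpga * bw_mem_bits_per_s / m_edge_bits

# BW_mem = 2.5 Gbps (random-access figure from the results); an edge record
# size of 64 bits is assumed for illustration.
print(memory_limit(1, 2.5e9, 64) / 1e6, "MTEPS")  # -> ~39.1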

  8. Network interface bandwidth: T_sys ≤ n_FPGA × BW_if / (2 × (n_FPGA − 1)/n_FPGA × m_message)  (L_if). When using multiple FPGAs, messages need to be transferred over the external network interface. Assuming an equal distribution of vertices, a message has a (n_FPGA − 1)/n_FPGA chance of being sent to a different board. Each message traverses an interface twice, once when sending and once when receiving. This is really a per-board limit; the extra factor n_FPGA on both sides gives the system throughput.
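
The limit L_if can be sketched the same way; the 64-bit message size and the 10 Gbps line rate in the example are assumptions for illustration.

```python
def network_interface_limit(n_fpga, bw_if_bits_per_s, m_message_bits):
    """L_if: edges per second limited by each board's external interface.

    A fraction (n_fpga - 1) / n_fpga of messages leave the board, and each
    such message crosses an interface twice (send and receive); the per-board
    limit is then scaled by n_fpga for the whole system.
    """
    if n_fpga == 1:
        return float("inf")    # no external traffic with a single board
    remote_fraction = (n_fpga - 1) / n_fpga
    per_board = bw_if_bits_per_s / (2 * remote_fraction * m_message_bits)
    return n_fpga * per_board

# Assumed example: 4 boards, 10 Gbps interfaces, 64-bit messages.
print(network_interface_limit(4, 10e9, 64) / 1e6, "MTEPS")  # -> ~416.7
```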

  9. Total network bandwidth: T_sys ≤ BW_network / ((n_FPGA − 1)/n_FPGA × m_message)  (L_network). Total amount of messages transferable by the external network. Again, a fraction (n_FPGA − 1)/n_FPGA of messages needs to cross the external network.
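
Putting the four limits together gives the roofline-style prediction: the system throughput is the minimum of L_PE, L_mem, L_if, and L_network, and the generator can sweep candidate configurations and keep the best predicted one. In the sketch below only the formulas follow slides 6 through 9; the clock frequency, edge and message sizes, and the candidate configuration grid are illustrative assumptions (2.5 Gbps and 6.7 Gbps are the bandwidth figures quoted in the results).

```python
def total_network_limit(n_fpga, bw_network_bits_per_s, m_message_bits):
    """L_network: edges per second limited by total external network bandwidth."""
    if n_fpga == 1:
        return float("inf")    # no external traffic with a single board
    remote_fraction = (n_fpga - 1) / n_fpga
    return bw_network_bits_per_s / (remote_fraction * m_message_bits)

def predicted_throughput(n_fpga, n_pe, f_clk, cpe,
                         bw_mem, m_edge, bw_if, bw_net, m_msg):
    """Roofline-style prediction: the tightest of the four limits wins."""
    l_pe = n_fpga * n_pe * f_clk / cpe
    l_mem = n_fpga * bw_mem / m_edge
    if n_fpga > 1:
        frac = (n_fpga - 1) / n_fpga
        l_if = n_fpga * bw_if / (2 * frac * m_msg)
        l_net = bw_net / (frac * m_msg)
    else:
        l_if = l_net = float("inf")
    return min(l_pe, l_mem, l_if, l_net)

# Sweep candidate configurations and keep the best predicted one, mirroring
# how the generator can pick a configuration automatically.  All parameter
# values below are illustrative assumptions.
best = max(
    ((n_fpga, n_pe) for n_fpga in (1, 2, 4) for n_pe in (1, 2, 4, 8, 16)),
    key=lambda c: predicted_throughput(*c, f_clk=100e6, cpe=1.6,
                                       bw_mem=2.5e9, m_edge=64,
                                       bw_if=10e9, bw_net=6.7e9, m_msg=64),
)
print("best (n_FPGA, n_PE/FPGA):", best)
```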

  10. Results. [Plots: System Throughput (MTEPS) vs. Number of PEs (1 to 16) for PR, BFS, and CC on RMAT and uniform graphs, with computation limits for CPE = 1.2, 1.3, and 1.6.] Everything on-chip: no external memory, no network. PE throughput is limiting. Close to the limit for uniform graphs; slowdown due to imbalance for RMAT graphs.

  11. Results. [Plot: System Throughput (MTEPS) vs. Number of PEs (1 to 8) for BFS, PR, and CC on RMAT and uniform graphs, with the DDR limit and the computation limit (CPE = 1.6).] Using external memory, but only one FPGA. The Xilinx MIG DDR3 controller’s random-access performance (BW_mem = 2.5 Gbps) is limiting. Better performance at 1 PE, as accesses are more sequential.

  12. Results. [Plot: System Throughput (MTEPS) vs. Number of FPGAs (1 to 4) for BFS and CC on RMAT graphs, with the network limit.] Using 4 FPGAs, no external memory. The network interface bandwidth (BW_network = 6.7 Gbps) is limiting. Imbalance has a greater impact, further degrading performance.

  13. Conclusion. Graph algorithms are very communication-intensive, so the interfaces need to be optimized. The model predicts performance reasonably accurately, except for imbalance (which depends on input properties).

  14. Thank you for listening! Questions? Visit poster board 8!
