
Relational Graph Processing on GPUs



  1. Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
     Haicheng Wu (1), Daniel Zinn (2), Molham Aref (2), Sudhakar Yalamanchili (1)
     (1) Georgia Institute of Technology, (2) LogicBlox Inc.
     School of Electrical and Computer Engineering | Georgia Institute of Technology

  2. System Diversity Today
     - Amazon EC2 GPU Instances
     - Mobile Platforms (DSP, GPUs)
     - Keeneland System (GPUs)
     - Cray Titan (GPUs)
     Hardware Diversity is Mainstream

  3. GPU and CUDA
     - GPU is a many core co-processor
       - 1000s of cores
       - 1000s of concurrent threads
       - Higher memory bandwidth
       - Smaller memory capacity
     - CUDA and OpenCL are the dominant programming models
     - Well suited for data parallel apps
       - Molecular Dynamics, Options Pricing, Ray Tracing, etc.
     - Commodity: led by NVIDIA, AMD, and Intel
     Figure (host and device): CPU (multi core, 2-16 cores) with main memory (~128GB, ~50 GB/s); GPU (~3000 cores) with GPU memory (~6GB, ~300 GB/s); PCI-E link (16GB/s). Steps: ① input data, ② launch kernel, ③ execute, ④ result.
     Figure (execution model) labels: CUDA kernel, Cooperative Thread Arrays (CTA), Streaming Multiprocessor (SM) with ALUs and shared memory, Warp 1 to Warp N, thread, branch, end of branch, coalesced access to DRAM addresses 0, 4, 8, C, 10, 14, 18, 1C.

  4. Relational Queries and Data Analytics
     - The Opportunity
       - Significant potential data parallelism
     - The Problem
       - Need to process 1-50 TBs of data [1]
       - Small memory capacity and small PCIe bandwidth
       - Irregularity
         - Fine grained computation
         - Data dependent
         - Low locality
     Figure labels: Applications, Large Graphs.
     [1] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

  5. Candidate Application Domains and The Challenge
     - Candidate application domain: Large Graphs
     - New Applications and Software Stacks: …… LargeQty(p) <- Qty(q), q > 1000. ……
     - New Accelerator Architectures
     - Relational Computations Over Massive Unstructured Data Sets: Sustain 10X – 100X throughput over multicore

  6. Multipredicate Join
     - Goal: Implementation of Leapfrog Triejoin (LFTJ) on GPU
       - A worst-case optimal multi-predicate join algorithm
       - Details (e.g., complexity analysis) in T. L. Veldhuizen, ICDT 2014
     - Benefits
       - Smaller memory footprint and less data movement
       - No data reorganization (e.g. sorting or rebuilding a hash table) after changing the join key
     - Approach
       - CPU version
       - CPU-Friendly GPU version
       - Customized GPU version

  7. An Important Example – Graph Problems
     - Finding cliques (Multi-predicate Join)
       - triangle(x,y,z)<-E(x,y),E(y,z),E(x,z), x<y<z.
       - 4cl(x,y,z,w)<-E(x,y),E(x,z),E(x,w),E(y,z),E(y,w),E(z,w), x<y<z<w.
     Edge table (From, To): (0,1), (1,2), (1,3), (2,3), (2,4), (3,5)
     Figure: the example graph on nodes 0 to 5 with these edges.
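
As a reference point for the triangle rule above, the following minimal Python sketch evaluates it naively over the example edge table. The edge pairs are read off the slide's Edge table; the names `EDGES` and `triangles` are illustrative and do not come from the presentation, and this brute-force loop is not the join algorithm the slides go on to describe.

```python
# Edge relation from the slide's example, stored as (from, to) pairs.
EDGES = {(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)}


def triangles(edges):
    """Naive evaluation of triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), x<y<z."""
    result = []
    for (x, y) in sorted(edges):
        for (y2, z) in sorted(edges):
            # The two edges must share y, respect the ordering, and be closed by E(x,z).
            if y2 == y and x < y < z and (x, z) in edges:
                result.append((x, y, z))
    return result


print(triangles(EDGES))  # [(1, 2, 3)]
```

The nested loops touch every pair of edges, which is the kind of intermediate work a multi-predicate join is meant to avoid.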

  8. Leapfrog Join (LFJ)
     - LFJ is the base of LFTJ
     - Essentially multi-way-intersections
     - Basic primitives: seek(), next()
     Figure (three sorted lists with the operations applied to each):
       A = 0 1 3 4 5 6 7 8 9 11: seek(2), seek(8), seek(10)
       B = 0 2 6 7 8 9: seek(3), seek(8), seek(10)
       C = 2 4 5 8 10: seek(6), next()
     Courtesy: T. L. Veldhuizen, ICDT 2014
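
The multi-way intersection built from seek() and next() can be sketched in Python as below. This is a minimal illustration under assumptions, not the authors' code: the class `SortedIter` and the function `leapfrog_join` are hypothetical names, the inputs are assumed to be sorted and duplicate-free, and seek() is realized with `bisect_left` from the standard library.

```python
from bisect import bisect_left


class SortedIter:
    """Cursor over a sorted, duplicate-free list with seek() and next()."""

    def __init__(self, keys):
        self.keys = keys
        self.pos = 0

    def at_end(self):
        return self.pos >= len(self.keys)

    def key(self):
        return self.keys[self.pos]

    def next(self):
        self.pos += 1

    def seek(self, target):
        # Move forward to the first key >= target (binary search from current position).
        self.pos = bisect_left(self.keys, target, self.pos)


def leapfrog_join(lists):
    """Intersect sorted lists by letting the smallest cursor leap over the largest."""
    iters = [SortedIter(keys) for keys in lists]
    if any(it.at_end() for it in iters):
        return []
    iters.sort(key=lambda it: it.key())
    out = []
    p = 0                       # index of the cursor holding the smallest key
    max_key = iters[-1].key()   # largest key currently under any cursor
    while True:
        it = iters[p]
        if it.key() == max_key:
            out.append(max_key)  # smallest == largest, so all cursors agree
            it.next()
        else:
            it.seek(max_key)
        if it.at_end():
            return out
        max_key = it.key()
        p = (p + 1) % len(iters)


A = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
B = [0, 2, 6, 7, 8, 9]
C = [2, 4, 5, 8, 10]
print(leapfrog_join([A, B, C]))  # [8]
```

On the three lists from the slide, the only key present in all of them is 8.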

  9. Trie Data Structure
     - LFTJ works on a trie data structure
     Edge table (From, To): (0,1), (1,2), (1,3), (2,3), (2,4), (3,5)
     Trie for Edge:
       Root
       From level: 0 1 2 3
       To level: 0 -> 1; 1 -> 2, 3; 2 -> 3, 4; 3 -> 5
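
One way to picture this trie in code is a mapping from each From key to its sorted list of To children. The sketch below assumes that representation; `build_trie` is an illustrative name, and a GPU implementation would more likely use flat sorted arrays than a Python dictionary.

```python
def build_trie(edges):
    """Two-level trie: each 'From' key maps to its sorted list of 'To' children."""
    trie = {}
    for src, dst in sorted(set(edges)):
        # Sorted input means keys are inserted in order and children stay sorted.
        trie.setdefault(src, []).append(dst)
    return trie


EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
trie = build_trie(EDGES)
print(trie)  # {0: [1], 1: [2, 3], 2: [3, 4], 3: [5]}
```

Keeping both levels sorted is what lets seek() be answered by binary search at every level.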

  10. LFTJ Algorithm – join 3 tries
      Figure: the Edge trie drawn three times, once per predicate E(x,y), E(y,z), E(x,z), with the join variables x, y, z marking the trie levels they bind. Slides 11 to 25 step through this same figure.

  11. LFTJ Algorithm – open() level x

  12. LFTJ Algorithm – seek(0) in E(x,z) level x

  13. LFTJ Algorithm – open() level y

  14. LFTJ Algorithm – seek(1) in E(y,z) level y

  15. LFTJ Algorithm – open() level z

  16. LFTJ Algorithm – seek(2) in E(x,z) level z and failed

  17. LFTJ Algorithm – up() to level y

  18. LFTJ Algorithm – up() to level x

  19. LFTJ Algorithm – seek(1) in E(x,z) level x

  20. LFTJ Algorithm – open() level y

  21. LFTJ Algorithm – seek(2) in E(y,z) level y

  22. LFTJ Algorithm – open() level z

  23. LFTJ Algorithm – seek(3) in E(x,z) level z

  24. LFTJ Algorithm – next()

  25. LFTJ Algorithm – final result
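
The walk through the three tries can be approximated by the Python sketch below, which intersects candidate keys level by level (x, then y, then z) for the triangle query. It is a simplified stand-in rather than the presentation's implementation: `build_trie`, `leapfrog_intersect`, and `lftj_triangles` are hypothetical names, the intersection seeks every list to the current maximum instead of rotating cursors one at a time, and it assumes every edge is stored with From < To (as in the example), so the x<y<z condition holds without an explicit check.

```python
from bisect import bisect_left

EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]


def build_trie(edges):
    """Two-level trie: each 'From' key maps to its sorted list of 'To' children."""
    trie = {}
    for src, dst in sorted(set(edges)):
        trie.setdefault(src, []).append(dst)
    return trie


def leapfrog_intersect(lists):
    """Yield keys common to all sorted lists, advancing with seek (binary search)."""
    if any(not keys for keys in lists):
        return
    pos = [0] * len(lists)
    while True:
        target = max(keys[p] for keys, p in zip(lists, pos))
        all_match = True
        for i, keys in enumerate(lists):
            pos[i] = bisect_left(keys, target, pos[i])  # seek(target)
            if pos[i] == len(keys):
                return                                   # one list is exhausted
            if keys[pos[i]] != target:
                all_match = False
        if all_match:
            yield target
            pos[0] += 1                                  # next()
            if pos[0] == len(lists[0]):
                return


def lftj_triangles(edges):
    trie = build_trie(edges)  # one trie serves E(x,y), E(y,z) and E(x,z)
    keys = sorted(trie)       # top-level ("From") keys
    out = []
    # Level x: x must be a first attribute in both E(x,y) and E(x,z).
    for x in leapfrog_intersect([keys, keys]):
        # Level y: children of x in E(x,y), and first attributes of E(y,z).
        for y in leapfrog_intersect([trie[x], keys]):
            # Level z: children of y in E(y,z), and children of x in E(x,z).
            for z in leapfrog_intersect([trie[y], trie[x]]):
                out.append((x, y, z))
    return out


print(lftj_triangles(EDGES))  # [(1, 2, 3)]
```

On the example graph the search under x = 0 finds no z and backs up, while x = 1, y = 2 leads to z = 3, in line with the seek values named in the slide titles.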

  26. LFTJ Algorithm – short conclusion
      - Very simple set of primitives to implement
      - A sequential algorithm
      - Traverse the trie in depth first order
      - Two methods for applying this technique with GPUs
        - CPU algorithm per GPU thread
        - Customized data parallel application

  27. LFTJ-GPU: First Algorithm
      - Evenly map the top level of the leftmost trie to GPU threads
      - Run sequential LFTJ in each GPU thread
      - seek() is implemented as binary search
      - Data dependent control flow
      - No spatial or temporal locality
      Example: mapping to 2 GPU threads (t0, t1)
      Figure: the three tries E(x,y), E(y,z), E(x,z), with the top level of the leftmost trie split between t0 and t1.
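
The mapping can be illustrated on the CPU with the Python sketch below, where each "thread" is simulated by an ordinary function call over its slice of the top-level keys. Everything here is an assumption made for illustration: `partition`, `contains`, and `thread_work` are hypothetical names, the chunks are contiguous and near-equal in size, the per-thread search is a simplified triangle search that uses binary search for lookups rather than a full LFTJ, and no actual GPU code is involved.

```python
from bisect import bisect_left

EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]


def build_trie(edges):
    """Two-level trie: each 'From' key maps to its sorted list of 'To' children."""
    trie = {}
    for src, dst in sorted(set(edges)):
        trie.setdefault(src, []).append(dst)
    return trie


def partition(keys, num_threads):
    """Split the top-level keys into num_threads contiguous, near-equal chunks."""
    chunk = (len(keys) + num_threads - 1) // num_threads
    return [keys[t * chunk:(t + 1) * chunk] for t in range(num_threads)]


def contains(sorted_keys, key):
    """seek() as binary search: is key present in the sorted list?"""
    i = bisect_left(sorted_keys, key)
    return i < len(sorted_keys) and sorted_keys[i] == key


def thread_work(trie, keys, my_xs):
    """Sequential triangle search restricted to one thread's slice of level x."""
    out = []
    for x in my_xs:
        for y in trie[x]:
            if not contains(keys, y):      # y must also start an edge E(y,z)
                continue
            for z in trie[y]:
                if contains(trie[x], z):   # close the triangle with E(x,z)
                    out.append((x, y, z))
    return out


trie = build_trie(EDGES)
keys = sorted(trie)
chunks = partition(keys, 2)                # [[0, 1], [2, 3]] for t0 and t1
results = [thread_work(trie, keys, xs) for xs in chunks]
print(chunks, results)
```

In this example all of the output comes from the first chunk, which hints at why the amount of work per thread depends on the data rather than on the number of keys assigned.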
