Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Haicheng Wu (1), Daniel Zinn (2), Molham Aref (2), Sudhakar Yalamanchili (1)
(1) Georgia Institute of Technology, (2) LogicBlox Inc.
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

System Diversity Today
Hardware diversity is mainstream:
- Amazon EC2 GPU instances
- Mobile platforms (DSPs, GPUs)
- Keeneland system (GPUs)
- Cray Titan (GPUs)

GPU and CUDA
- The GPU is a many-core co-processor
  - 1000s of cores
  - 1000s of concurrent threads
  - Higher memory bandwidth
  - Smaller memory capacity
- CUDA and OpenCL are the dominant programming models
- Well suited for data-parallel applications
  - Molecular dynamics, options pricing, ray tracing, etc.
- Commodity: led by NVIDIA, AMD, and Intel
[Figure: a multicore CPU (2-16 cores, main memory ~128 GB at ~50 GB/s) connected over PCIe (16 GB/s) to a GPU (~3000 cores, GPU memory ~6 GB at ~300 GB/s). Steps: (1) input data, (2) launch kernel, (3) execute, (4) result.]
[Figure: a CUDA kernel is made of cooperative thread arrays (CTAs) whose threads are grouped into warps 1..N and run on a streaming multiprocessor (SM) with ALUs and shared memory. Threads may diverge at a branch and rejoin at the end of the branch. Coalesced accesses go to DRAM addresses 0, 4, 8, C, 10, 14, 18, 1C.]

Relational Queries and Data Analytics
- The opportunity
  - Significant potential data parallelism
- The problem
  - Need to process 1-50 TB of data [1]
  - Small memory capacity and low PCIe bandwidth
  - Irregularity
    - Fine-grained computation
    - Data dependent
    - Low locality
[Figure labels: Applications, Large Graphs]
[1] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

The Challenge
- Candidate application domains: large graphs
- New applications and software stacks, for example:
    ……
    LargeQty(p) <- Qty(q), q > 1000.
    ……
- New accelerator architectures
Relational computations over massive unstructured data sets: sustain 10X to 100X throughput over multicore

Multipredicate Join
- Goal: implementation of Leapfrog Triejoin (LFTJ) on GPU
  - A worst-case optimal multi-predicate join algorithm
  - Details (e.g., complexity analysis) in T. L. Veldhuizen, ICDT 2014
- Benefits
  - Smaller memory footprint and less data movement
  - No data reorganization (e.g., sorting or rebuilding a hash table) after changing the join key
- Approach
  - CPU version
  - CPU-friendly GPU version
  - Customized GPU version

An Important Example – Graph Problems
- Finding cliques (a multi-predicate join)
  - triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), x<y<z.
  - 4cl(x,y,z,w) <- E(x,y), E(x,z), E(x,w), E(y,z), E(y,w), E(z,w), x<y<z<w.

Edge:
From  To
0     1
1     2
1     3
2     3
2     4
3     5

[Figure: the example graph on nodes 0-5 with the edges listed in the Edge table]
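
To make the semantics of the two rules concrete, here is a minimal Python sketch that evaluates them by brute force over the Edge table from this slide. It is not the join algorithm the talk proposes; the helper name `cliques` and the enumeration strategy are assumptions chosen only for illustration.

```python
from itertools import combinations

# Edge relation from the slide, as (From, To) pairs.
EDGES = {(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)}

def cliques(edges, k):
    """Return all k-cliques (v1 < v2 < ... < vk) such that E(vi, vj)
    holds for every pair i < j. For k=3 this is the triangle rule,
    for k=4 the 4cl rule. Brute force: checks every k-subset of nodes."""
    nodes = sorted({v for edge in edges for v in edge})
    result = []
    # combinations() of a sorted list yields tuples in increasing order,
    # so the x<y<z (and x<y<z<w) constraint holds by construction.
    for candidate in combinations(nodes, k):
        if all((a, b) in edges for a, b in combinations(candidate, 2)):
            result.append(candidate)
    return result

triangles = cliques(EDGES, 3)
four_cliques = cliques(EDGES, 4)
```

On this six-edge example the only triangle is (1, 2, 3) and there is no 4-clique. The brute-force cost grows with the number of k-subsets of nodes, which is the kind of blow-up a dedicated multi-predicate join is meant to avoid.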

Leapfrog Join (LFJ)
- LFJ is the basis of LFTJ
- Essentially a multi-way intersection
- Basic primitives: seek(), next()

Example inputs (sorted):
A: 0 1 3 4 5 6 7 8 9 11
B: 0 2 6 7 8 9
C: 2 4 5 8 10
[Figure: leapfrog trace over A, B and C, annotated with seek(2), seek(8), seek(10) on A; seek(3), seek(8), seek(10) on B; seek(6), next() on C]
Courtesy: T. L. Veldhuizen, ICDT 2014
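
The two primitives can be sketched in a few lines of sequential Python. This is an illustrative reading of the leapfrog idea, not the authors' code: the class name `SortedIter`, the use of `bisect_left` for seek(), and the assumption that each input is a sorted list without duplicates are all choices made here.

```python
from bisect import bisect_left

class SortedIter:
    """Cursor over a sorted list of distinct keys (hypothetical helper)."""
    def __init__(self, keys):
        self.keys = keys
        self.pos = 0

    def at_end(self):
        return self.pos >= len(self.keys)

    def key(self):
        return self.keys[self.pos]

    def next(self):
        self.pos += 1

    def seek(self, target):
        # Advance to the first key >= target, searching only forward.
        self.pos = bisect_left(self.keys, target, self.pos)

def leapfrog_join(lists):
    """Intersect several sorted lists by repeatedly moving the iterator
    with the smallest key up to the current largest key."""
    iters = [SortedIter(keys) for keys in lists]
    if any(it.at_end() for it in iters):
        return []
    iters.sort(key=lambda it: it.key())   # cyclic order: smallest first
    max_key = iters[-1].key()
    out = []
    p = 0
    while True:
        it = iters[p]
        if it.key() == max_key:
            # Smallest key equals largest key, so all iterators agree.
            out.append(max_key)
            it.next()
        else:
            it.seek(max_key)
        if it.at_end():
            return out
        max_key = it.key()
        p = (p + 1) % len(iters)

A = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
B = [0, 2, 6, 7, 8, 9]
C = [2, 4, 5, 8, 10]
common = leapfrog_join([A, B, C])
```

For the three lists on this slide the only common key is 8. Each seek() skips over runs of keys that cannot match, which is why the primitive is usually backed by a binary or galloping search rather than a linear scan.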

Trie Data Structure
- LFTJ works on a trie data structure

Edge:
From  To
0     1
1     2
1     3
2     3
2     4
3     5

Trie for Edge:
Root
  From 0: To 1
  From 1: To 2, 3
  From 2: To 3, 4
  From 3: To 5
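
One way to hold such a two-level trie in memory is as three flat sorted arrays. The layout below (first-level keys, offsets, second-level values) is an assumption made for illustration, and the slides do not say which layout the authors use; `build_trie` and `children_of` are hypothetical names.

```python
from bisect import bisect_left

# Edge relation from the slide, as (From, To) pairs.
EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

def build_trie(edges):
    """Build a two-level trie as flat arrays.
    keys[i]                       : i-th distinct From value (sorted)
    children[offsets[i]:offsets[i+1]] : sorted To values under keys[i]"""
    keys, offsets, children = [], [], []
    for src, dst in sorted(set(edges)):
        if not keys or keys[-1] != src:
            keys.append(src)
            offsets.append(len(children))  # where this key's children start
        children.append(dst)
    offsets.append(len(children))          # end marker for the last key
    return keys, offsets, children

def children_of(trie, key):
    """Return the sorted To values stored under a From value (empty if absent)."""
    keys, offsets, children = trie
    i = bisect_left(keys, key)
    if i == len(keys) or keys[i] != key:
        return []
    return children[offsets[i]:offsets[i + 1]]

trie = build_trie(EDGES)
```

For the Edge table this gives first-level keys 0 1 2 3 and second-level values 1 2 3 3 4 5, the same two rows that appear in the trie figure. Because both levels are sorted, seek() at either level can be a binary search over a contiguous range.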

LFTJ Algorithm – join 3 tries
[Figure, repeated on every step below: the tries for E(x,y), E(y,z) and E(x,z). Each is the Edge trie (Root; first level 0 1 2 3; second level 1 | 2 3 | 3 4 | 5), aligned to the join levels x, y and z.]
LFTJ Algorithm – open() level x
LFTJ Algorithm – seek(0) in E(x,z) level x
LFTJ Algorithm – open() level y
LFTJ Algorithm – seek(1) in E(y,z) level y
LFTJ Algorithm – open() level z
LFTJ Algorithm – seek(2) in E(x,z) level z and failed
LFTJ Algorithm – up() to level y
LFTJ Algorithm – up() to level x
LFTJ Algorithm – seek(1) in E(x,z) level x
LFTJ Algorithm – open() level y
LFTJ Algorithm – seek(2) in E(y,z) level y
LFTJ Algorithm – open() level z
LFTJ Algorithm – seek(3) in E(x,z) level z
LFTJ Algorithm – next()
LFTJ Algorithm – final result
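
The walkthrough can be replayed with a short sequential Python sketch of the triangle query over the Edge trie. It is a simplification rather than the authors' implementation: it materializes the intersection at each level instead of driving trie iterators with open() and up(), it stores the trie as a dict of sorted lists, and it assumes every edge has From < To (true for the slide's table), so x<y<z needs no extra filter. All function names are hypothetical.

```python
from bisect import bisect_left

# Edge relation from the slides, as (From, To) pairs.
EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

def build_trie(edges):
    """Map each From value to its sorted list of To values."""
    trie = {}
    for src, dst in sorted(set(edges)):
        trie.setdefault(src, []).append(dst)
    return trie

def intersect(a, b):
    """Intersect two sorted lists; the lagging side seeks (binary search)
    to the other side's current key, as in leapfrog join."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i)   # seek(b[j]) in a
        else:
            j = bisect_left(b, a[i], j)   # seek(a[i]) in b
    return out

def lftj_triangles(edges):
    """triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), visited depth first
    in the variable order x, y, z."""
    trie = build_trie(edges)
    top = sorted(trie)                    # first level shared by all three tries
    result = []
    for x in top:                                  # level x: E(x,y) and E(x,z)
        for y in intersect(trie[x], top):          # level y: E(x,y) and E(y,z)
            for z in intersect(trie[x], trie[y]):  # level z: E(x,z) and E(y,z)
                result.append((x, y, z))
    return result

triangles = lftj_triangles(EDGES)
```

With x = 0 the only candidate y is 1, and the z level finds nothing in common between {1} and {2, 3}, which corresponds to the failed seek(2) step. With x = 1 and y = 2, the z level intersects {2, 3} with {3, 4} and produces the single triangle (1, 2, 3).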

LFTJ Algorithm – short conclusion
- Very simple set of primitives to implement
- A sequential algorithm
  - Traverses the trie in depth-first order
- Two methods for applying this technique on GPUs
  - CPU algorithm per GPU thread
  - Customized data-parallel application

LFTJ-GPU: First Algorithm
- Evenly map the top level of the leftmost trie to GPU threads
- Run sequential LFTJ in each GPU thread
  - seek() is implemented as binary search
  - Data-dependent control flow
  - No spatial or temporal locality
Example: mapping to 2 GPU threads
[Figure: the tries for E(x,y), E(y,z) and E(x,z), with the top-level x values divided between threads t0 and t1]
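
The mapping can be simulated on the CPU with a Python sketch in which each "thread" is an ordinary function call that handles a contiguous chunk of top-level x values. This is only a model of the scheme described on the slide, not CUDA code and not the authors' kernel. The flat-array trie layout, the chunking formula, the assumption that t0 receives the first half of the keys, and all function names are assumptions.

```python
from bisect import bisect_left

# Edge relation from the slides, as (From, To) pairs (From < To for every edge).
EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]

def build_arrays(edges):
    """Flat trie: children[offsets[i]:offsets[i+1]] are the To values of keys[i]."""
    keys, offsets, children = [], [], []
    for src, dst in sorted(set(edges)):
        if not keys or keys[-1] != src:
            keys.append(src)
            offsets.append(len(children))
        children.append(dst)
    offsets.append(len(children))
    return keys, offsets, children

def seek(arr, lo, hi, target):
    """Binary-search seek: first index in [lo, hi) whose value is >= target."""
    return bisect_left(arr, target, lo, hi)

def thread_triangles(tid, n_threads, keys, offsets, children):
    """Work of one simulated thread: sequential triangle listing for its x chunk."""
    n = len(keys)
    chunk = (n + n_threads - 1) // n_threads       # even split of the top level
    out = []
    for xi in range(tid * chunk, min(n, (tid + 1) * chunk)):
        x, x_lo, x_hi = keys[xi], offsets[xi], offsets[xi + 1]
        for p in range(x_lo, x_hi):                # candidate y under x
            y = children[p]
            yi = seek(keys, 0, n, y)               # seek(y) in the top level of E(y,z)
            if yi == n or keys[yi] != y:
                continue
            a, b, b_hi = x_lo, offsets[yi], offsets[yi + 1]
            while a < x_hi and b < b_hi:           # level z: E(x,z) with E(y,z)
                if children[a] == children[b]:
                    out.append((x, y, children[a]))
                    a += 1
                    b += 1
                elif children[a] < children[b]:
                    a = seek(children, a, x_hi, children[b])
                else:
                    b = seek(children, b, b_hi, children[a])
    return out

keys, offsets, children = build_arrays(EDGES)
per_thread = [thread_triangles(t, 2, keys, offsets, children) for t in range(2)]
```

With two simulated threads, the first handles x in {0, 1} and finds (1, 2, 3), while the second handles x in {2, 3} and finds nothing. The trip counts of the inner loops and the positions touched by each seek() depend on the data under each x, which is the data-dependent control flow and lack of locality the slide points out.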