Multipredicate Join Algorithms for Accelerating Relational Graph Processing on GPUs
Haicheng Wu (1), Daniel Zinn (2), Molham Aref (2), Sudhakar Yalamanchili (1)
(1) Georgia Institute of Technology, (2) LogicBlox Inc.
School of Electrical and Computer Engineering | Georgia Institute of Technology

System Diversity Today
Hardware diversity is mainstream:
  Amazon EC2 GPU instances
  Mobile platforms (DSPs, GPUs)
  Keeneland system (GPUs)
  Cray Titan (GPUs)

GPU and CUDA
The GPU is a many-core co-processor:
  1000s of cores
  1000s of concurrent threads
  Higher memory bandwidth
  Smaller memory capacity
CUDA and OpenCL are the dominant programming models.
Well suited for data-parallel apps: molecular dynamics, options pricing, ray tracing, etc.
Commodity: led by NVIDIA, AMD, and Intel.
[Figure: a multi-core CPU (2-16 cores, main memory ~128 GB at ~50 GB/s) connected over PCI-E (16 GB/s) to a GPU (~3000 cores, GPU memory ~6 GB at ~300 GB/s). Steps: (1) input data, (2) launch kernel, (3) execute, (4) result.]
[Figure: a CUDA kernel made of Cooperative Thread Arrays (CTAs) of threads, run as warps 1 to N on a Streaming Multiprocessor (SM) with ALUs and shared memory. Panels show a branch and end of branch within a warp, and coalesced access to DRAM addresses 0, 4, 8, C, 10, 14, 18, 1C.]

Relational Queries and Data Analytics
The Opportunity
  Significant potential data parallelism
The Problem
  Applications need to process 1-50 TB of data [1]
  Small memory capacity and small PCIe bandwidth
  Irregularity: fine-grained computation, data dependent, low locality
[Figure: large graphs]
[1] Independent Oracle Users Group. A New Dimension to Data Warehousing: 2011 IOUG Data Warehousing Survey.

The Challenge
Candidate application domains: large graphs
New applications and software stacks:
  ……
  LargeQty(p) <- Qty(q), q > 1000.
  ……
New accelerator architectures
Relational computations over massive unstructured data sets: sustain 10X-100X throughput over multicore

Multipredicate Join
Goal: implementation of Leapfrog Triejoin (LFTJ) on the GPU
  A worst-case optimal multi-predicate join algorithm
  Details (e.g., complexity analysis) in T. L. Veldhuizen, ICDT 2014
Benefits
  Smaller memory footprint and less data movement
  No data reorganization (e.g., sorting or rebuilding a hash table) after changing the join key
Approach
  CPU version
  CPU-friendly GPU version
  Customized GPU version

An Important Example: Graph Problems
Finding cliques is a multi-predicate join:
  triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), x<y<z.
  4cl(x,y,z,w) <- E(x,y), E(x,z), E(x,w), E(y,z), E(y,w), E(z,w), x<y<z<w.
Edge:
  From  To
  0     1
  1     2
  1     3
  2     3
  2     4
  3     5
[Figure: the example graph on nodes 0 to 5]

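
The two rules above can be checked against the example Edge table with a brute-force reference. This is a minimal sketch for illustration only, not the join algorithm the talk develops; the names `EDGES`, `triangles`, and `four_cliques` are hypothetical, and the code assumes every edge is stored with From < To, as in the table.

```python
from itertools import combinations

# Example Edge relation from the slide, stored as (From, To) pairs.
EDGES = {(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)}

def triangles(edges):
    """triangle(x,y,z) <- E(x,y), E(y,z), E(x,z), x<y<z (brute force)."""
    nodes = sorted({v for e in edges for v in e})
    # combinations() over sorted nodes yields only tuples with x < y < z.
    return [(x, y, z) for x, y, z in combinations(nodes, 3)
            if (x, y) in edges and (y, z) in edges and (x, z) in edges]

def four_cliques(edges):
    """4cl(x,y,z,w): all six edges among x<y<z<w must be present."""
    nodes = sorted({v for e in edges for v in e})
    return [q for q in combinations(nodes, 4)
            if all(pair in edges for pair in combinations(q, 2))]

print(triangles(EDGES))     # [(1, 2, 3)]
print(four_cliques(EDGES))  # []
```

The brute-force version tests every node triple, which is what a multi-predicate join avoids by intersecting sorted lists instead.
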
Leapfrog Join (LFJ)
LFJ is the base of LFTJ
Essentially a multi-way intersection
Basic primitives: seek(), next()
Example (sorted inputs):
  A: 0 1 3 4 5 6 7 8 9 11
  B: 0 2 6 7 8 9
  C: 2 4 5 8 10
[Figure: arrows labeled seek(2), seek(3), seek(6), seek(8), seek(10), and next() leapfrogging across A, B, and C]
Courtesy: T. L. Veldhuizen, ICDT 2014

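
As a rough illustration of how seek() and next() combine into a multi-way intersection, the following sketch runs a leapfrog join over the lists A, B, and C from the slide. The class `SortedIter` and the function `leapfrog_join` are hypothetical names, and the inputs are assumed to be sorted and duplicate-free; seek() is done with binary search from the current position.

```python
from bisect import bisect_left

class SortedIter:
    """Cursor over a sorted, duplicate-free list."""
    def __init__(self, values):
        self.values = values
        self.pos = 0

    def at_end(self):
        return self.pos >= len(self.values)

    def key(self):
        return self.values[self.pos]

    def next(self):
        self.pos += 1

    def seek(self, target):
        # Advance to the first element >= target, never moving backwards.
        self.pos = bisect_left(self.values, target, self.pos)

def leapfrog_join(lists):
    """Return the sorted intersection of all input lists."""
    iters = [SortedIter(v) for v in lists]
    if any(it.at_end() for it in iters):
        return []
    iters.sort(key=lambda it: it.key())   # smallest key first
    out = []
    p = 0                                 # index of the smallest cursor
    max_key = iters[-1].key()             # largest key seen so far
    while True:
        it = iters[p]
        if it.key() == max_key:
            # Smallest equals largest, so every cursor sits on max_key.
            out.append(max_key)
            it.next()
        else:
            it.seek(max_key)              # leapfrog over the others
        if it.at_end():
            return out
        max_key = it.key()
        p = (p + 1) % len(iters)

A = [0, 1, 3, 4, 5, 6, 7, 8, 9, 11]
B = [0, 2, 6, 7, 8, 9]
C = [2, 4, 5, 8, 10]
print(leapfrog_join([A, B, C]))  # [8]
```

Each cursor only moves forward, so the work is bounded by the positions skipped rather than by the product of the list sizes.
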
Trie Data Structure
LFTJ works on a trie data structure
Edge:
  From  To
  0     1
  1     2
  1     3
  2     3
  2     4
  3     5
Trie of Edge:
  Root
  From:  0    1      2      3
  To:    1    2 3    3 4    5

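
One possible in-memory layout for this two-level trie is a pair of flat sorted arrays plus offsets, which makes seek() a binary search within a range. The slides do not say how the trie is stored, so the layout and the names `build_trie` and `children_of` below are assumptions made for illustration.

```python
def build_trie(edges):
    """Build a two-level trie from (From, To) pairs.

    Returns (keys, offsets, children), where keys holds the distinct From
    values in sorted order and children[offsets[i]:offsets[i + 1]] holds
    the sorted To values under keys[i].
    """
    keys, offsets, children = [], [], []
    for src, dst in sorted(set(edges)):
        if not keys or keys[-1] != src:
            keys.append(src)               # new first-level node
            offsets.append(len(children))  # where its children start
        children.append(dst)
    offsets.append(len(children))          # sentinel closing the last range
    return keys, offsets, children

def children_of(trie, i):
    """Second-level values under the i-th first-level key."""
    keys, offsets, children = trie
    return children[offsets[i]:offsets[i + 1]]

EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
trie = build_trie(EDGES)
print(trie[0])               # [0, 1, 2, 3]
print(trie[1])               # [0, 1, 3, 5, 6]
print(trie[2])               # [1, 2, 3, 3, 4, 5]
print(children_of(trie, 1))  # [2, 3]
```
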
LFTJ Algorithm: joining 3 tries
The walkthrough joins three tries, E(x,y), E(y,z), and E(x,z), one level per variable (x, then y, then z).
[Figure, repeated on every step: the tries E(x,z), E(y,z), and E(x,y), each with Root, first level 0 1 2 3, and second level 1 2 3 3 4 5, marked with the current x, y, and z levels]
Steps shown, in order:
  1. open() level x
  2. seek(0) in E(x,z), level x
  3. open() level y
  4. seek(1) in E(y,z), level y
  5. open() level z
  6. seek(2) in E(x,z), level z, and failed
  7. up() to level y
  8. up() to level x
  9. seek(1) in E(x,z), level x
  10. open() level y
  11. seek(2) in E(y,z), level y
  12. open() level z
  13. seek(3) in E(x,z), level z
  14. next()
  15. final result

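
The walkthrough can be approximated sequentially as three nested intersections, one per variable level. This is a simplified sketch rather than the iterator-based open()/up() formulation on the slides: it materializes each intersection as a list, the names `build_adjacency`, `intersect`, and `triangles_lftj` are hypothetical, and it assumes every edge is stored with From < To so that x<y<z holds without an extra filter.

```python
from bisect import bisect_left

def build_adjacency(edges):
    """Map each From value to its sorted list of To values."""
    adj = {}
    for src, dst in sorted(set(edges)):
        adj.setdefault(src, []).append(dst)
    return adj

def intersect(a, b):
    """Intersect two sorted lists, using binary search as seek()."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i = bisect_left(a, b[j], i + 1)   # seek(b[j]) in a
        else:
            j = bisect_left(b, a[i], j + 1)   # seek(a[i]) in b
    return out

def triangles_lftj(edges):
    adj = build_adjacency(edges)
    firsts = sorted(adj)                      # top level of the Edge trie
    result = []
    for x in firsts:                          # level x: E(x,y) and E(x,z)
        # level y: children of x in E(x,y) joined with top level of E(y,z)
        for y in intersect(adj[x], firsts):
            # level z: children of x in E(x,z) joined with children of y
            for z in intersect(adj[x], adj[y]):
                result.append((x, y, z))
    return result

EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
print(triangles_lftj(EDGES))  # [(1, 2, 3)]
```

On the example data the x=0 branch finds y=1 but no common z, and the x=1, y=2 branch finds z=3, which matches the order of the steps listed above.
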
LFTJ Algorithm: short conclusion
Very simple set of primitives to implement
A sequential algorithm: traverses the trie in depth-first order
Two methods for applying this technique on GPUs:
  CPU algorithm per GPU thread
  Customized data-parallel application

LFTJ-GPU: First Algorithm
Evenly map the top level of the leftmost trie to GPU threads
Run sequential LFTJ in each GPU thread
  seek() is implemented as binary search
  Data-dependent control flow
  No spatial or temporal locality
Example: mapping to 2 GPU threads, t0 and t1
[Figure: the tries E(y,z), E(x,z), and E(x,y), with the top level 0 1 2 3 of the leftmost trie divided between t0 and t1]

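
The mapping described above can be emulated on the CPU to show how the work is divided. This is a serial Python sketch, not CUDA code and not the authors' kernel: each call to the hypothetical `thread_triangles` stands in for one GPU thread, the flat trie layout from `build_trie` is an assumption, and edges are assumed to be stored with From < To.

```python
from bisect import bisect_left

def build_trie(edges):
    """Flat two-level trie: sorted keys, child offsets, sorted children."""
    keys, offsets, children = [], [], []
    for src, dst in sorted(set(edges)):
        if not keys or keys[-1] != src:
            keys.append(src)
            offsets.append(len(children))
        children.append(dst)
    offsets.append(len(children))
    return keys, offsets, children

def find(values, target, lo, hi):
    """seek() as binary search in values[lo:hi]; index of target or -1."""
    i = bisect_left(values, target, lo, hi)
    return i if i < hi and values[i] == target else -1

def thread_triangles(tid, num_threads, trie):
    """Sequential triangle join over this thread's slice of the top level."""
    keys, offsets, children = trie
    n = len(keys)
    start = tid * n // num_threads            # even split of top-level keys
    end = (tid + 1) * n // num_threads
    found = []
    for xi in range(start, end):
        x, x_lo, x_hi = keys[xi], offsets[xi], offsets[xi + 1]
        for k in range(x_lo, x_hi):           # y candidates from E(x,y)
            y = children[k]
            yi = find(keys, y, 0, n)          # seek y in top level of E(y,z)
            if yi < 0:
                continue
            for m in range(offsets[yi], offsets[yi + 1]):  # z from E(y,z)
                z = children[m]
                if find(children, z, x_lo, x_hi) >= 0:     # seek z in E(x,z)
                    found.append((x, y, z))
    return found

def triangles_gpu_style(edges, num_threads=2):
    trie = build_trie(edges)
    out = []
    for tid in range(num_threads):  # on a GPU these would run concurrently
        out.extend(thread_triangles(tid, num_threads, trie))
    return out

EDGES = [(0, 1), (1, 2), (1, 3), (2, 3), (2, 4), (3, 5)]
print(triangles_gpu_style(EDGES, num_threads=2))  # [(1, 2, 3)]
```

With two threads, t0 handles x in {0, 1} and t1 handles x in {2, 3}. The amount of work per thread depends on the data under each x, which is one source of the data-dependent control flow noted on the slide.
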