DATABASE CRACKING: Fancy Scan, not Poor Man’s Sort! Hardware Folks Cracking Folks Don Holger Pirk Eleni Petraki Strato Idreos Stefan Manegold Martin Kersten
EVALUATING RANGE PREDICATES
COMPLEXITY ON PAPER • Scanning: O(n) • Sorting: O(n × log(n)) • Cracking: O(n) • Essentially a single Quicksort-Step
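The "single Quicksort-Step" can be sketched as one branching partition pass. This is a minimal illustration, not the authors' implementation; the function name `crack_in_two` and its signature are assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical helper: one quicksort-style partitioning pass.
 * Rearranges data[0..n) so that all values < pivot precede all values >= pivot
 * and returns the index of the first value >= pivot. Runs in O(n). */
size_t crack_in_two(int *data, size_t n, int pivot) {
    size_t lo = 0;  /* data[0..lo) is known to be < pivot   */
    size_t hi = n;  /* data[hi..n) is known to be >= pivot  */
    while (lo < hi) {
        if (data[lo] < pivot) {
            lo++;                   /* value already on the correct side */
        } else {
            hi--;                   /* move value to the right-hand piece */
            int tmp = data[lo];
            data[lo] = data[hi];
            data[hi] = tmp;
        }
    }
    return lo;
}
```

The `if` inside the loop is a data-dependent branch, so on uniform random input with the pivot in the middle it is taken roughly half the time.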
COSTS IN REALITY • Implement microbenchmarks • 1 Billion uniform random integer values • Pivot in the middle of the range • Workstation machine (16 GB RAM, 4 Sandy Bridge Cores)
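A scaled-down sketch of such a microbenchmark for the scan baseline might look as follows. Everything here is an assumption for illustration: the xorshift generator, the helper names, and the use of `clock()` (which reports CPU time and only approximates wallclock time for a single-threaded run). The slides do not say which generator or timer was used.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* xorshift32: small deterministic PRNG, a stand-in for the unknown generator. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *state = x;
}

/* Fill data with roughly uniform non-negative values in [0, 2^31). */
void fill_uniform(int32_t *data, size_t n, uint32_t seed) {
    uint32_t state = seed ? seed : 1u;  /* xorshift must not start at 0 */
    for (size_t i = 0; i < n; i++)
        data[i] = (int32_t)(xorshift32(&state) >> 1);
}

/* Scan baseline: count the values below the pivot, without branching on data. */
size_t scan_count_lt(const int32_t *data, size_t n, int32_t pivot) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (data[i] < pivot);
    return count;
}

/* Time one scan; returns elapsed CPU seconds and stores the count. */
double time_scan(const int32_t *data, size_t n, int32_t pivot, size_t *count_out) {
    clock_t start = clock();
    *count_out = scan_count_lt(data, n, pivot);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With values in [0, 2^31), a pivot of `1 << 30` sits in the middle of the range, so about half of the values qualify.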
COSTS IN REALITY [Bar chart: wallclock time in s (axis 0 to 13) for Parallel Scanning, Cracking, Parallel Sorting]
SO: WHAT’S GOING ON?
CACHE MISSES? NOPE! [Bar chart: L1I, L1D, L2 and L3 misses (axis 0 to 1.5B) for Scanning, Cracking, Sorting]
CPU COSTS [Flowchart. Decision nodes: Micro-ops Issued? Allocation Stall? Micro-op Ever Retire? Outcomes: Frontend Bound, Backend Bound, Bad Speculation, Retiring. Stall categories: Cache Miss Stalls, Other Stalls]
CPU COSTS [Stacked bar chart: fraction (axis 0 to 1.0) of Bad Speculation, Data Stalls, Pipeline Frontend, Retiring, Pipeline Backend for Scanning, Cracking, Sorting. Annotations added in successive builds: "14 % !!!" and "Lots of Potential"]
WHAT CAN WE DO ABOUT IT?
INCREASING CPU EFFICIENCY
PREDICATION
Branching version:
    for (i = 0; i < size; i++)
        if (input[i] < pivot) {
            output[outI] = input[i];
            outI++;
        }
Predicated version:
    for (i = 0; i < size; i++) {
        output[outI] = input[i];
        outI += (input[i] < pivot);
    }
PREDICATION • Turns control dependencies into data dependencies • Eliminates Branch Mispredictions • Causes unconditional (potentially unnecessary) I/O • (limited to caches) • Works only for out-of-place algorithms
PREDICATED CRACKING
PREDICATED CRACKING [Animation: one iteration of predicated cracking on an example array with pivot 5, tracking the values cmp, active and backup. Steps: State Before Iteration; Evaluate Predicate & Write; Advance Cursor; Read Next Element]
PREDICATED CRACKING • Predication for in-place algorithms • No branching ⇒ No branch mispredictions • Somewhat intricate • Lots of copying stuff around (integer granularity ⇒ inefficient) • Bulk-copying would be more efficient
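One way to write a branch-free in-place crack along these lines is sketched below. It is a reconstruction under assumptions, not the authors' code: the function name is hypothetical, and the variables `active`, `backup` and `cmp` are borrowed from the slide labels. Two values are held in registers, which leaves two free slots in the array (one at each cursor), so the unconditional write to both cursors never destroys a value that has not been read yet.

```c
#include <stddef.h>

/* Hypothetical sketch: in-place partition with no data-dependent branch.
 * Returns the index of the first value >= pivot. */
size_t crack_predicated(int *data, size_t n, int pivot) {
    if (n == 0) return 0;
    size_t lo = 0, hi = n - 1;
    int active = data[lo];   /* value being placed in this iteration      */
    int backup = data[hi];   /* value saved from the other free slot      */
    while (lo < hi) {
        size_t cmp = (active >= pivot);  /* 1: active belongs on the right */
        /* Evaluate predicate and write: both slots are free, so write both. */
        data[lo] = active;
        data[hi] = active;
        /* Advance exactly one cursor; only that write becomes permanent. */
        lo += 1 - cmp;
        hi -= cmp;
        /* Read the next element from the cursor that moved. */
        int next = data[cmp * hi + (1 - cmp) * lo];
        active = backup;
        backup = next;
    }
    /* One slot and one unplaced value remain. */
    data[lo] = active;
    return lo + (active < pivot);
}
```

The only remaining branch is the loop condition, which is easy to predict. Every iteration performs two stores and one load at integer granularity, which matches the "lots of copying" caveat in the bullets.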
VECTORIZED CRACKING
VECTORIZED CRACKING • Turns in-place cracking into out-of-place cracking • Copies vector-sized chunks and cracks them into the array • Makes vanilla predication possible • Uses SIMD copying for vector copying • Challenge: ensure that values aren't "accidentally" overwritten
VECTORIZED CRACKING [Diagram: alternating steps: copy, partition, copy, partition]
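The copy/partition alternation can be sketched as below. This is one possible scheme under stated assumptions, not necessarily the authors': the chunk size `VEC`, the staging `pool`, and the rule "refill a side whenever it has fewer than `VEC` free slots" are choices made for this example, and `memcpy` stands in for SIMD copying. The refill rule is what keeps the unconditional writes from overwriting unread values.

```c
#include <stddef.h>
#include <string.h>

#define VEC 64  /* assumed chunk size, in elements */

/* Hypothetical sketch: chunks are copied out of the array into a small pool,
 * then partitioned back into the freed space with plain predication.
 * Returns the index of the first value >= pivot. */
size_t crack_vectorized(int *data, size_t n, int pivot) {
    int pool[4 * VEC];      /* staged values; never holds more than 4*VEC - 2 */
    size_t pool_n = 0;
    size_t lo = 0, hi = n;  /* free slots: [lo, rl) on the left, [rr, hi) on the right */
    size_t rl = 0, rr = n;  /* unread values: [rl, rr) */

    while (rl < rr || pool_n > 0) {
        /* Copy step: make sure each side has room for a whole chunk. */
        if (rl < rr && rl - lo < VEC) {
            size_t c = (rr - rl < VEC) ? rr - rl : VEC;
            memcpy(pool + pool_n, data + rl, c * sizeof *data);
            pool_n += c;
            rl += c;
        }
        if (rl < rr && hi - rr < VEC) {
            size_t c = (rr - rl < VEC) ? rr - rl : VEC;
            memcpy(pool + pool_n, data + rr - c, c * sizeof *data);
            pool_n += c;
            rr -= c;
        }
        /* Partition step: predicated, out-of-place from pool into data. */
        size_t m = (pool_n < VEC) ? pool_n : VEC;
        const int *src = pool + pool_n - m;
        for (size_t i = 0; i < m; i++) {
            int v = src[i];
            size_t lt = (v < pivot);
            data[lo] = v;       /* unconditional write to both free ends */
            data[hi - 1] = v;
            lo += lt;           /* only one of the two writes is kept */
            hi -= 1 - lt;
        }
        pool_n -= m;
    }
    return lo;
}
```

The number of free slots always equals the number of staged values, so once both sides hold at least `VEC` free slots (or the unread region is empty), a chunk of up to `VEC` values can be written without checking bounds per element.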
RESULTS
RESULTS [Stacked bar chart: fraction (axis 0 to 1.0) of Bad Speculation, Data Stalls, Pipeline Frontend, Retiring, Pipeline Backend for Vectorized, Predicated, Original]
RESULTS: WORKSTATION [Bar chart: wallclock time in s (axis 0 to 4.3) for Scan, Vectorized, Predicated (Register), Predicated (Cache), Original]
RESULTS: SERVER [Bar chart: wallclock time in s (axis 0 to 5.3) for Scan, Vectorized, Predicated (Register), Predicated (Cache), Original. Annotation: "Not there yet!"]
PARALLELIZATION
PARALLELIZATION • Obvious Solution: Partitioning
CRACK & MERGE [Diagram, step "Partition": array pieces x1 y4 y1 x2 y2 x3 y3 x4]
CRACK & MERGE [Diagram, step "Merge": array pieces x1 y4 y1 x2 y2 x3 y3 x4]
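The partition-then-merge idea can be sketched sequentially as follows. The assumptions are: equal-sized contiguous slices, a plain branching crack per slice, a merge that swaps misplaced values one at a time, and hypothetical names throughout. In a parallel version each iteration of the first loop would run on its own thread, since the slices do not overlap.

```c
#include <stddef.h>

#define MAX_SLICES 64  /* assumed upper bound on the number of slices */

/* Crack data[begin, end); returns the absolute index of the first value >= pivot. */
static size_t crack_slice(int *data, size_t begin, size_t end, int pivot) {
    size_t lo = begin, hi = end;
    while (lo < hi) {
        if (data[lo] < pivot) {
            lo++;
        } else {
            hi--;
            int t = data[lo]; data[lo] = data[hi]; data[hi] = t;
        }
    }
    return lo;
}

/* Hypothetical sketch: crack k slices independently, then merge.
 * Returns the index of the first value >= pivot in the whole array. */
size_t crack_and_merge(int *data, size_t n, int pivot, size_t k) {
    size_t begin[MAX_SLICES + 1], split[MAX_SLICES];
    if (k == 0) k = 1;
    if (k > MAX_SLICES) k = MAX_SLICES;
    for (size_t s = 0; s < k; s++) begin[s] = s * n / k;
    begin[k] = n;

    /* Phase 1 (parallelizable): slice s becomes [x_s | y_s]. */
    size_t boundary = 0;  /* total number of values < pivot */
    for (size_t s = 0; s < k; s++) {
        split[s] = crack_slice(data, begin[s], begin[s + 1], pivot);
        boundary += split[s] - begin[s];
    }

    /* Phase 2: swap y-values left of the boundary with x-values right of it. */
    size_t ls = 0, li = split[0];          /* li walks y-parts left to right      */
    size_t rs = k - 1, ri = split[k - 1];  /* ri - 1 walks x-parts right to left  */
    for (;;) {
        while (ls + 1 < k && li >= begin[ls + 1]) { ls++; li = split[ls]; }
        while (rs > 0 && ri <= begin[rs]) { rs--; ri = split[rs]; }
        if (li >= boundary || ri <= boundary) break;  /* nothing misplaced remains */
        int t = data[li]; data[li] = data[ri - 1]; data[ri - 1] = t;
        li++;
        ri--;
    }
    return boundary;
}
```

The merge uses the recorded split points rather than re-comparing values against the pivot, so it touches only the misplaced values.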
REFINED CRACK & MERGE [Diagram, step "Partition": array pieces x1 x2 x3 x4 y4 y3 y2 y1]
REFINED CRACK & MERGE [Diagram, step "Smaller Merge": array pieces x1 x2 x3 x4 y4 y3 y2 y1]
RESULTS: WORKSTATION [Bar chart: time in seconds (axis 0 to 1.6) for Scan, RVPCrack, RPCrack, PVCrack, PCrack, Vectorized]
RESULTS: SERVER [Bar chart: time in seconds (axis 0 to 3.0) for Scan, RVPCrack, RPCrack, PVCrack, PCrack, Vectorized]
IMPACT OF SELECTIVITY: WORKSTATION [Line chart: wallclock time in s (axis 0 to 1.5) against Qualifying Tuples/Pivot (0 to 100) for Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, Vectorized Refined Partition & Merge, Scanning]
IMPACT OF SELECTIVITY: SERVER [Line chart: wallclock time in s (axis 0 to 2.6) against Qualifying Tuples/Pivot (0 to 100) for Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, Vectorized Refined Partition & Merge, Scanning]
CONCLUSIONS