

  1. DATABASE CRACKING: Fancy Scan, not Poor Man’s Sort! Hardware Folks / Cracking Folks: Holger Pirk, Eleni Petraki, Stratos Idreos, Stefan Manegold, Martin Kersten

  2. EVALUATING RANGE PREDICATES

  3. COMPLEXITY ON PAPER • Scanning: O(n) • Sorting: O(n × log(n)) • Cracking: O(n) • Essentially a single Quicksort step (sketched below)
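
  A minimal sketch of that claim (my own illustration, not code from the talk): the cracking step is one quicksort-style partitioning pass that moves values below the pivot to the left and the rest to the right, returning the split point.

      #include <stddef.h>

      /* Crack data[0..n) on pivot: one quicksort partitioning step.
       * Returns the index of the first value >= pivot. */
      size_t crack(int *data, size_t n, int pivot)
      {
          size_t lo = 0, hi = n;
          while (lo < hi) {
              if (data[lo] < pivot) {
                  lo++;                      /* already on the correct side */
              } else {
                  int tmp = data[--hi];      /* swap it to the right side   */
                  data[hi] = data[lo];
                  data[lo] = tmp;
              }
          }
          return lo;                         /* start of the >= pivot piece */
      }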

  4. COSTS IN REALITY • Implemented microbenchmarks (harness sketched below) • 1 billion uniform random integer values • Pivot in the middle of the range • Workstation machine (16 GB RAM, 4 Sandy Bridge cores)
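
  A hedged sketch of such a microbenchmark (my own reconstruction of the setup described on this slide, not the authors' harness; the crack() kernel is the sketch from above):

      #define _POSIX_C_SOURCE 199309L
      #include <stdio.h>
      #include <stdlib.h>
      #include <time.h>

      extern size_t crack(int *data, size_t n, int pivot);   /* any cracking kernel */

      #define N 1000000000ULL          /* 1 billion values, ~4 GB of ints */

      int main(void)
      {
          int *data = malloc(N * sizeof *data);
          if (!data) return 1;
          srand(42);
          for (size_t i = 0; i < N; i++)
              data[i] = rand();                  /* uniform-ish random values        */
          int pivot = RAND_MAX / 2;              /* pivot in the middle of the range */

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          size_t split = crack(data, N, pivot);
          clock_gettime(CLOCK_MONOTONIC, &t1);
          double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;

          printf("split at %zu, %.3f s wallclock\n", split, secs);
          free(data);
          return 0;
      }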

  5. COSTS IN REALITY [Bar chart: wall-clock time in seconds for Parallel Scanning, Cracking, and Parallel Sorting]

  6. SO: WHAT’S GOING ON?

  7. CACHE MISSES? NOPE! [Bar chart: L1I, L1D, L2, and L3 miss counts for Scanning, Cracking, and Sorting]

  8. CPU COSTS — Classifying pipeline slots (top-down): Micro-op issued? Yes → Did it ever retire? Yes → Retiring; No → Bad Speculation. No → Allocation stall caused by the back end? Yes → Backend Bound (Cache-Miss Stalls / Other Stalls); No → Frontend Bound.

  9. CPU COSTS [Stacked bar chart: share of cycles spent in Bad Speculation, Data Stalls, Pipeline Frontend, Pipeline Backend, and Retiring for Scanning, Cracking, and Sorting]

  10. CPU COSTS — 14 % !!! [Same stacked bar chart as the previous slide, with one component called out]

  11. CPU COSTS — Lots of Potential [Same stacked bar chart, calling out the available headroom]

  12. WHAT CAN WE DO ABOUT IT?

  13. INCREASING CPU EFFICIENCY

  14. PREDICATION

      /* branching version */
      for (i = 0; i < size; i++) {
          if (input[i] < pivot) {
              output[outI] = input[i];
              outI++;
          }
      }

      /* predicated version */
      for (i = 0; i < size; i++) {
          output[outI] = input[i];
          outI += (input[i] < pivot);
      }

  15. PREDICATION • Turns control dependencies into data dependencies • Eliminates branch mispredictions • Causes unconditional (potentially unnecessary) I/O (limited to caches) • Works only for out-of-place algorithms

  16. PREDICATED CRACKING

  17. PREDICATED CRACKING [Diagram: array to be cracked around pivot 5, with active and backup cursors]

  18. PREDICATED CRACKING [Diagram: pivot, active and backup values at the start of the walk]

  19. PREDICATED CRACKING [Diagram: pivot, cmp, active and backup values — state before iteration]

  20. PREDICATED CRACKING [Diagram: evaluate predicate & write]

  21. PREDICATED CRACKING [Diagram: advance cursor]

  22. PREDICATED CRACKING [Diagram: read next element]

  23. PREDICATED CRACKING [Diagram: state after the iteration]

  24. PREDICATED CRACKING • Predication for in-place algorithms • No branching ⇒ No branch mispredictions • Somewhat intricate • Lots of copying stuff around (integer granularity ⇒ inefficient) • Bulk-copying would be more efficient
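
  The following is a minimal branch-free sketch of this idea (my own reconstruction of the technique illustrated on slides 17-23, not the authors' code): the element that is about to be overwritten is kept in a backup register, every value is written unconditionally to both candidate slots, and the comparison result only decides which cursor advances.

      #include <stddef.h>

      /* Predicated (branch-free) in-place crack of data[0..n), n > 0.
       * Returns the index of the first value >= pivot. */
      size_t predicated_crack(int *data, size_t n, int pivot)
      {
          size_t lo = 0, hi = n - 1;
          int active = data[lo];               /* value currently being placed      */
          int backup = data[hi];               /* value rescued from the other end  */
          while (lo < hi) {
              int cmp = active < pivot;
              data[lo] = active;               /* write to both candidate slots;    */
              data[hi] = active;               /* only the correct one is kept      */
              lo += cmp;                       /* advance exactly one cursor,       */
              hi -= 1 - cmp;                   /* chosen by the comparison result   */
              active = backup;                 /* the rescued value is placed next  */
              backup = cmp ? data[lo] : data[hi];  /* rescue the slot at risk next  */
          }
          data[lo] = active;                   /* cursors met: place the last value */
          return lo + (active < pivot);
      }

  The ternary in the backup step typically compiles to a conditional move, so the loop contains no data-dependent branches; the per-element copying is what the next slides attack with bulk copies.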

  25. VECTORIZED CRACKING

  26. VECTORIZED CRACKING • Turns in-place cracking into out-of-place cracking • Copies vector-sized chunks and cracks them into the array • Makes vanilla predication possible • Uses SIMD copying for the chunk copies • Challenge: ensure that values aren't "accidentally" overwritten (see the sketch below)
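
  A simplified, hypothetical reconstruction of that idea (not the authors' code, and using plain memcpy where the talk uses SIMD copies): chunks of VEC values are copied out of the unread middle into a small buffer, which frees their array slots; buffered values are then written back branch-free to the two write cursors. Copying the next chunk out of a side before its write cursor catches up with its read cursor is what keeps unread values from being overwritten.

      #include <stddef.h>
      #include <string.h>

      #define VEC 8   /* chunk size: one SIMD register / cache line worth of values */

      /* Copy the next chunk of up to VEC unread values into buf, freeing their
       * array slots; from_left selects which end of the unread middle to take. */
      static size_t refill(int *buf, size_t count, const int *data,
                           size_t *readLo, size_t *readHi, int from_left)
      {
          size_t c = *readHi - *readLo;
          if (c > VEC) c = VEC;
          if (from_left) { memcpy(buf + count, data + *readLo, c * sizeof *buf); *readLo += c; }
          else           { *readHi -= c; memcpy(buf + count, data + *readHi, c * sizeof *buf); }
          return count + c;
      }

      /* Vectorized (chunk-buffered, out-of-place) crack of data[0..n).
       * Returns the index of the first value >= pivot. */
      size_t vectorized_crack(int *data, size_t n, int pivot)
      {
          int buf[2 * VEC];                        /* values currently "in flight"   */
          size_t count = 0;
          size_t writeLo = 0, writeHi = n;         /* write cursors (left/right)     */
          size_t readLo = 0, readHi = n;           /* bounds of the unread middle    */

          while (count > 0 || readLo < readHi) {
              /* keep a free slot at both ends before the predicated write below;
               * copying a chunk out of a side is what creates free slots there   */
              if (count == 0)
                  count = refill(buf, count, data, &readLo, &readHi, 1);
              if (readLo < readHi && writeLo == readLo)
                  count = refill(buf, count, data, &readLo, &readHi, 1);
              if (readLo < readHi && writeHi == readHi)
                  count = refill(buf, count, data, &readLo, &readHi, 0);

              /* branch-free write of one buffered value: write to both candidate
               * slots, then advance exactly one cursor based on the comparison   */
              int v = buf[--count];
              int cmp = v < pivot;
              data[writeLo] = v;
              data[writeHi - 1] = v;
              writeLo += cmp;
              writeHi -= 1 - cmp;
          }
          return writeLo;
      }

  With VEC = 1 this degenerates into a variant of the predicated cracking above; larger chunks amortize the cursor bookkeeping and let the copies use wide SIMD loads and stores.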

  27. VECTORIZED CRACKING [Diagram: alternating copy and partition steps over vector-sized chunks]

  28. RESULTS

  29. RESULTS [Stacked bar chart: CPU-cost breakdown (Bad Speculation, Data Stalls, Pipeline Frontend, Pipeline Backend, Retiring) for the Vectorized, Predicated, and Original cracking variants]

  30. RESULTS: WORKSTATION [Bar chart: wall-clock time in seconds for Scan, Vectorized, Predicated (Register), Predicated (Cache), and Original]

  31. RESULTS: SERVER — Not there yet! [Bar chart: wall-clock time in seconds for Scan, Vectorized, Predicated (Register), Predicated (Cache), and Original]

  32. PARALLELIZATION

  33. PARALLELIZATION • Obvious solution: partition the array across threads and crack each part (sketched below)
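
  A hedged sketch of that partitioned approach (my own illustration using OpenMP; the parallel_crack and splits names are assumptions, not from the talk): each thread cracks its own contiguous chunk, after which the per-chunk results still have to be merged, which is what the next slides address.

      #include <stddef.h>

      extern size_t crack(int *data, size_t n, int pivot);   /* any cracking kernel from above */

      /* Crack each of `threads` contiguous chunks independently; splits[t] receives
       * the global index where the < pivot part of chunk t ends. */
      void parallel_crack(int *data, size_t n, int pivot, int threads, size_t *splits)
      {
          #pragma omp parallel for num_threads(threads)
          for (int t = 0; t < threads; t++) {
              size_t begin = n * (size_t)t / (size_t)threads;
              size_t end   = n * (size_t)(t + 1) / (size_t)threads;
              splits[t] = begin + crack(data + begin, end - begin, pivot);
          }
          /* the chunks are now locally cracked (x_t | y_t); a merge step
           * (next slides) still has to produce one global partition      */
      }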

  34. CRACK & MERGE [Diagram: array split into chunks, each cracked into an x part (< pivot) and a y part (>= pivot): x1 y1 | x2 y2 | x3 y3 | x4 y4 — Partition step]

  35. CRACK & MERGE [Diagram: the per-chunk x and y regions are moved together into one global partition — Merge step]

  36. REFINED CRACK & MERGE [Diagram: chunks laid out so the x parts grow toward the front and the y parts toward the back: x1 x2 x3 x4 … y4 y3 y2 y1 — Partition step]

  37. REFINED CRACK & MERGE [Diagram: with that layout only the middle has to be fixed up — Smaller Merge]

  38. RESULTS: WORKSTATION [Bar chart: wall-clock time in seconds for Scan, RVPCrack, RPCrack, PVCrack, PCrack, and Vectorized]

  39. RESULTS: SERVER [Bar chart: wall-clock time in seconds for Scan, RVPCrack, RPCrack, PVCrack, PCrack, and Vectorized]

  40. IMPACT OF SELECTIVITY: WORKSTATION [Line chart: wall-clock time in seconds vs. qualifying tuples per pivot (0–100) for Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, Vectorized Refined Partition & Merge, and Scanning]

  41. IMPACT OF SELECTIVITY: SERVER [Line chart: wall-clock time in seconds vs. qualifying tuples per pivot (0–100) for Vectorized, Partition & Merge, Vectorized Partition & Merge, Refined Partition & Merge, Vectorized Refined Partition & Merge, and Scanning]

  42. CONCLUSIONS
