  1. DBMS on a modern processor: where does time go? Anastasia Ailamaki, David DeWitt, Mark Hill and David Wood, University of Wisconsin–Madison. Presented by: Bogdan Simion

  2. Current DBMS Performance [figure]

  3. Where is query execution time spent? Identify performance bottlenecks in CPU and memory

  4. Outline • Motivation • Background • Query execution time breakdown • Experimental results and discussions • Conclusions

  5. Hardware performance standards • Processors are designed and evaluated with simple programs • Benchmarks: SPEC, LINPACK • What about DBMSs?

  6. DBMS bottlenecks • Initially, the bottleneck was I/O • Nowadays: memory- and compute-intensive apps • Modern platforms: – sophisticated execution hardware – fast, non-blocking caches and memory • Still… – DBMS hardware behaviour is suboptimal compared to scientific workloads

  7. Execution pipeline [diagram: fetch/decode unit feeding an instruction pool, dispatch/execute unit, and retire unit, backed by the L1 I-cache, L1 D-cache, L2 cache, and main memory] Stalls can be overlapped with useful work!

  8. Execution time breakdown: T_Q = T_C + T_M + T_B + T_R − T_OVL • T_C – computation • T_M – memory stalls (L1D, L1I, L2D, L2I, DTLB, ITLB) • T_B – branch mispredictions • T_R – stalls on execution resources (functional units, dependency stalls) • T_OVL – stall time overlapped with useful work
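The breakdown above can be sketched numerically. The cycle counts below are hypothetical, chosen only to show how the overlap term T_OVL keeps the stall components from over-counting total execution time:

```python
# Hypothetical cycle counts (not measurements from the paper), chosen
# only to illustrate the breakdown T_Q = T_C + T_M + T_B + T_R - T_OVL.
t_c = 400     # computation cycles
t_m = 350     # memory stall cycles (L1D + L1I + L2D + L2I + TLBs)
t_b = 60      # branch-misprediction stall cycles
t_r = 90      # resource stall cycles (functional units + dependencies)
t_ovl = 100   # stall cycles hidden by overlap with useful work

t_q = t_c + t_m + t_b + t_r - t_ovl
print(f"T_Q = {t_q} cycles")  # 800

# Contribution of each component to the (non-overlapped) sum of parts.
components = {"computation": t_c, "memory": t_m,
              "branch": t_b, "resource": t_r}
for name, t in components.items():
    print(f"{name}: {100 * t / sum(components.values()):.1f}%")
```

Because T_OVL is subtracted, summing the measured stall components always overestimates T_Q, which is why the paper treats the per-component percentages as pessimistic bounds.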

  9. DB setup • DB is memory resident => no I/O interference • No dynamic or random parameters; no concurrency control among transactions

  10. Workload choice • Simple queries: – Single-table range selections (sequential, index) – Two-table equijoins • Easy to set up and run • Fully controllable parameters • Isolate basic operations • Enable iterative hypotheses! • Building blocks for complex workloads?

  11. Execution Time Breakdown (%) [charts: query execution time for 10% sequential scan, 10% indexed range selection, and join (no index) on DBMSs A–D, split into computation, memory, branch mispredictions, and resource stalls] • Stalls take at least 50% of execution time • Memory stalls are the major bottleneck

  12. Memory Stalls Breakdown (%) [charts: memory stall time for 10% sequential scan, 10% indexed range selection, and join (no index) on DBMSs A–D, split into L1 data, L1 instruction, L2 data, and L2 instruction stalls] • L1 data and L2 instruction stalls play an unimportant role • L2 data and L1 instruction stalls dominate • Memory bottlenecks vary across DBMSs and queries

  13. Effect of Record Size [charts: L2 data misses and L1 instruction misses per record for a 10% sequential scan, record sizes 20, 48, 100, 200 bytes, systems A–D] • L2D misses increase: reduced locality + page crossing (except D) • L1I misses increase: page-boundary crossing costs
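A back-of-the-envelope model hints at why L2 data misses per record grow with record size under a sequential scan. The 32-byte cache line is an assumption for illustration, not a figure from the slides:

```python
CACHE_LINE = 32  # bytes per L2 cache line (assumed, for illustration)

def l2d_misses_per_record(record_size: int) -> float:
    """Lines fetched per record when records are densely packed:
    a scan must bring in every byte of every record once, so it
    incurs roughly record_size / CACHE_LINE line fetches per record."""
    return record_size / CACHE_LINE

for size in (20, 48, 100, 200):
    print(f"{size:>3} B record -> {l2d_misses_per_record(size):.2f} line fetches")
```

This simple model captures the roughly linear growth in the charts; the extra jump at page boundaries (records split across slotted pages) sits on top of it, which is why system D, which avoids page crossing, deviates.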

  14. Memory Bottlenecks • Memory is important – increasing memory–processor performance gap – deeper memory hierarchies expected • Stalls due to L2 cache data misses – expensive fetches from main memory – L2 grows (8 MB), but will be slower • Stalls due to L1 I-cache misses – buffer pool code is expensive – L1 I-cache not likely to grow as much as L2

  15. Branch Mispredictions Are Expensive [charts: branch misprediction rates and their share of query execution time for DBMSs A–D, for sequential scan, index scan, and join (no index)] • Rates are low, but their contribution is significant • Prediction is a compiler task, but decisive for L1I performance

  16. Mispredictions vs. L1-I Misses [charts: branch mispredictions and L1 I-cache misses per 1000 instructions for 10% sequential scan, 10% indexed range selection, and join (no index) on DBMSs A–D] • More branch mispredictions incur more L1I misses • Index code is more complicated and needs optimization

  17. Resource-related Stalls [charts: dependency-related stalls (T_DEP) and functional-unit-related stalls (T_FU) as a percentage of query execution time for DBMSs A–D, for sequential scan, index scan, and join (no index)] • High T_DEP for all systems: low ILP opportunity • A's sequential scan: memory-unit load buffers?

  18. Microbenchmarks vs. TPC: CPI Breakdown [charts: clock ticks per instruction on systems B and D, comparing sequential scan with TPC-D and secondary-index scan with TPC-C, split into computation, memory, branch misprediction, and resource stalls] • Sequential-scan breakdown is similar to TPC-D • Secondary index and TPC-C: higher CPI, more memory stalls (mostly L2 data and instruction)
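The CPI on the y-axis of these charts is simply retired clock ticks divided by retired instructions, with each stacked segment attributing its share of the cycles. A minimal sketch with made-up counter values (hypothetical, not measurements from either benchmark):

```python
# Hypothetical hardware-counter readings (not real measurements)
# showing how a stacked CPI bar is derived: cycles / instructions.
instructions = 1_000_000        # instructions retired
cycles_by_component = {         # cycles attributed to each component
    "computation": 900_000,
    "memory": 1_400_000,
    "branch misprediction": 150_000,
    "resource": 350_000,
}

cpi = sum(cycles_by_component.values()) / instructions
print(f"overall CPI = {cpi:.2f}")  # total height of the stacked bar

for component, cycles in cycles_by_component.items():
    print(f"{component}: {cycles / instructions:.2f} CPI")
```

A CPI well above 1 on a superscalar processor, as in the index and TPC-C bars, signals that stalls rather than issue width are limiting throughput.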

  19. Conclusions • Execution time breakdown shows trends • L1I and L2D are major memory bottlenecks • We need to: – reduce page crossing costs – optimize instruction stream – optimize data placement in L2 cache – reduce stalls at all levels • TPC may not be necessary to locate bottlenecks

  20. Five years later – Becker et al 2004 • Same DBMSs, setup, workloads (memory resident) and metrics • Outcome: stalls still take a lot of time – Seq scans: L1I stalls and branch mispredictions much lower – Index scans: no improvement – Joins: improvements, similar to seq scans – Bottleneck shifts to L2D misses => must improve data placement – What works well on some hardware doesn't on others

  21. Five years later – Becker et al 2004 • C on a quad P3 700 MHz, 4 GB RAM, 16 KB L1, 2 MB L2 • B on a single P4 3 GHz, 1 GB RAM, 8 KB L1D + 12K-µop trace cache, 512 KB L2, BTB 8× larger than the P3's • P3 results: – similar to 5 years ago: major bottlenecks are L1I and L2D • P4 results: – memory stalls almost entirely due to L1D and L2D stalls – L1D stalls higher: smaller cache and larger cache line – L1I stalls removed by the trace cache (esp. for seq. scan, but still some for index) Hardware awareness is important!

  22. References • DBMS on a modern processor: where does time go? Revisited – CMU Tech Report 2004 • Anastassia Ailamaki – VLDB’99 talk slides

  23. Questions?
