Database servers on chip multiprocessors: limitations and opportunities



  1. Database servers on chip multiprocessors: limitations and opportunities
N. Hardavellas, N. Mancheril, I. Pandis, A. Ailamaki, R. Johnson, B. Falsafi
Presented by Benjamin Reilly, September 27, 2011

  2. The fattened cache (and CPU)
• Growing cache capacity = more data on hand, but growing cache latency = higher cost to retrieve it
• CPUs show a similar development trend: continually larger and more complex

  3. OVERVIEW
• Motivation
• Experiment design
• Results and observations
• What now?
• Summary and discussion

  4. Dividing the CMPs
CMPs? Chip multiprocessors: several cores sharing on-chip resources (caches). They vary in:
• number of cores
• number of hardware threads (“contexts”)
• execution order
• pipeline depth

  5. The ‘Fat Camp’ (FC)
(diagram: Core 0 and Core 1, one thread context each)
Key characteristics:
• few, but powerful cores
• few (1-2) hardware contexts per core
• OoO: out-of-order execution
• ILP: instruction-level parallelism

  6. Hiding data stalls: FC
A fat core hides stalls by finding other work while an operation waits for its inputs:
• OoO (out-of-order execution): a later, independent operation (op 2) can execute while an earlier one (op 1) is still waiting
• ILP (instruction-level parallelism): independent instructions (e.g., a += b, b += c, d += e) can overlap in the same cycle
(see the sketch below)
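As a minimal illustration (mine, not the paper's or the slides') of why OoO and ILP help a fat core, compare a dependency chain with independent adds; the function names and values are invented:

```c
#include <stdio.h>

/* Each add reads the result of the previous one, so even an
 * out-of-order core must execute them one after another. */
long dependent_chain(long a, long b, long c, long d) {
    a += b;
    a += c;
    a += d;
    return a;
}

/* These adds share no true dependences, so a superscalar
 * out-of-order core can issue several in the same cycle (ILP). */
long independent_adds(long a, long b, long c,
                      long d, long e, long f) {
    a += b;
    c += d;
    e += f;
    return a + c + e;
}

int main(void) {
    printf("%ld %ld\n",
           dependent_chain(1, 2, 3, 4),
           independent_adds(1, 2, 3, 4, 5, 6));
    return 0;
}
```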

  7. The ‘Lean Camp’ (LC)
(diagram: eight cores, Core 0 through Core 7)
Key characteristics:
• many, but weaker cores
• several (4+) hardware contexts per core
• in-order execution (simpler)

  8. Hiding data stalls: LC
Hardware contexts are interleaved in round-robin fashion, skipping contexts that are in data stalls (see the sketch below).
Context states: running, idle (runnable), stalled (non-runnable).
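A minimal sketch (my own, not the paper's) of that policy: each cycle, the core issues from the next runnable hardware context in round-robin order, skipping any context blocked on memory. The stall latencies below are invented:

```c
#include <stdio.h>
#include <stdbool.h>

#define CONTEXTS 4

int main(void) {
    /* stall_until[i]: cycle at which context i's outstanding load returns */
    int stall_until[CONTEXTS] = {0, 3, 0, 5};
    int next = 0;
    for (int cycle = 0; cycle < 8; cycle++) {
        bool issued = false;
        for (int tried = 0; tried < CONTEXTS; tried++) {
            int ctx = (next + tried) % CONTEXTS;
            if (cycle >= stall_until[ctx]) {      /* runnable? */
                printf("cycle %d: issue from context %d\n", cycle, ctx);
                next = (ctx + 1) % CONTEXTS;      /* round-robin advance */
                issued = true;
                break;
            }
        }
        if (!issued)
            printf("cycle %d: all contexts stalled\n", cycle);
    }
    return 0;
}
```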

  9. (Un)saturated workloads
Workloads: DSS (decision support) and OLTP (online transaction processing).
The number of requests determines saturation:
• Saturated: work is always available for each hardware context
• Unsaturated: work is not always available

  10. LC vs. FC performance
FC's response-time advantage ranges from +12% (low ILP for FC) to +70% (high ILP for FC): LC has slower response time in unsaturated workloads.

  11. LC vs. FC performance
LC has +70% higher throughput in saturated workloads (ILP is not significant for FC).

  12. LC vs. FC performance
Observations:
• FC spends 46-64% of execution time on data stalls
• At best (saturated workloads), LC spends 76-80% of execution time on computation

  13. Data stall breakdown
Consider three components of data cache stalls.
1. Cache size: larger (and hence slower) caches yield diminishing returns (see the worked example below).
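To see why bigger caches stop paying off, a textbook back-of-the-envelope model helps (the formula is standard; the numbers are invented, not the paper's): average memory access time = hit time + miss rate × miss penalty. A 4 MB L2 with a 10-cycle hit, a 10% miss rate, and a 200-cycle miss penalty averages 10 + 0.10 × 200 = 30 cycles. Doubling it to 8 MB might cut the miss rate to 8% but raise the hit time to 15 cycles, averaging 15 + 0.08 × 200 = 31 cycles: the larger cache is slower overall.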

  14. Data stall breakdown
(chart: CPI contributions for OLTP and DSS)
L2 hit stalls are responsible for an increasingly large portion of the CPI (see the worked example below).
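As an illustrative decomposition (invented numbers, not the chart's): CPI = CPI_compute + (L2 hits per instruction × L2 hit latency) + (off-chip misses per instruction × miss latency). With CPI_compute = 1.0, 0.05 L2 hits at 20 cycles, and 0.01 off-chip misses at 300 cycles, CPI = 1.0 + 1.0 + 3.0 = 5.0, dominated by off-chip misses. If integration converts those misses into L2 hits (0.06 hits at 20 cycles), CPI drops to 1.0 + 1.2 = 2.2, and L2 hit stalls now account for over half the CPI, which is the trend the chart shows.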

  15. Data stall breakdown
2. Per-chip core integration:
                 SMP           CMP
  Processing     4x 1-core     1x 4-core
  L2 cache(s)    4 MB / CPU    16 MB shared
Fewer cores per chip = fewer L2 hits (those accesses go off-chip instead).

  16. Data stall breakdown
3. On-chip core count:
• 8 cores: 9% superlinear increase in throughput (for DSS)
• 16 cores: 26% sublinear decrease (OLTP): too much pressure on the L2

  17. How do we apply this?
1. Increase parallelism
• Divide! (more threads ⇒ more saturation)
• Pipeline/OLP (producer-consumer pairs; see the sketch below)
• Partition input (not ideal; static and complex)
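A minimal pthreads sketch of the pipelined (producer-consumer) idea: a hypothetical "scan" operator produces tuples into a bounded queue while a hypothetical "aggregate" operator consumes them, so the two relational operators can run concurrently on different cores or contexts. Operator names, queue size, and tuple count are all invented:

```c
#include <pthread.h>
#include <stdio.h>

#define QSIZE  8
#define TUPLES 32

static int queue[QSIZE];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

static void *scan(void *arg) {                 /* producer */
    (void)arg;
    for (int i = 0; i < TUPLES; i++) {
        pthread_mutex_lock(&lock);
        while (count == QSIZE)                 /* wait for queue space */
            pthread_cond_wait(&not_full, &lock);
        queue[tail] = i;
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *aggregate(void *arg) {            /* consumer */
    (void)arg;
    long sum = 0;
    for (int i = 0; i < TUPLES; i++) {
        pthread_mutex_lock(&lock);
        while (count == 0)                     /* wait for a tuple */
            pthread_cond_wait(&not_empty, &lock);
        sum += queue[head];
        head = (head + 1) % QSIZE;
        count--;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
    }
    printf("aggregate: sum = %ld\n", sum);
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, scan, NULL);
    pthread_create(&c, NULL, aggregate, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```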

  18. How do we apply this?
2. Improve data locality
• Reduce data stalls to help with unsaturated workloads
• Halt producers in favour of consumers
• Use cache-friendly algorithms (see the sketch below)
3. Use staged DBs
• Partition work by groups of relational operators
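One illustrative example of a cache-friendly algorithm (mine, not the paper's): summing a matrix in row-major order walks consecutive addresses, so each fetched cache line is fully used, while column-major order jumps a full row per access and misses far more often:

```c
#include <stdio.h>

#define N 1024

/* Cache-friendly: the inner loop touches consecutive ints. */
long sum_row_major(int (*m)[N]) {
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Cache-hostile: the inner loop strides N * sizeof(int) bytes,
 * touching a new cache line on nearly every access. */
long sum_col_major(int (*m)[N]) {
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void) {
    static int m[N][N];   /* zero-initialized */
    printf("%ld %ld\n", sum_row_major(m), sum_col_major(m));
    return 0;
}
```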

  19. Summary & Discussion
1. LC typically performs better than FC
• LC is best under saturated workloads
• Is there room for FC CMPs in DB applications?
2. L2 hits are a bottleneck
• Why were DBs ignored in HW design?
• How can we avoid incurring the cost of an L2 hit?
