  1. DBMS on a modern processor: where does time go? Anastasia Ailamaki, David DeWitt, Mark Hill and David Wood, University of Wisconsin–Madison. Presented by: Bogdan Simion

  2. Current DBMS Performance [figure]

  3. Where is query execution time spent? Identify performance bottlenecks in CPU and memory

  4. Outline • Motivation • Background • Query execution time breakdown • Experimental results and discussions • Conclusions

  5. Hardware performance standards • Processors are designed and evaluated with simple programs • Benchmarks: SPEC, LINPACK • What about DBMSs?

  6. DBMS bottlenecks • Initially, the bottleneck was I/O • Nowadays: memory- and compute-intensive apps • Modern platforms: – sophisticated execution hardware – fast, non-blocking caches and memory • Still… – DBMS hardware behaviour is suboptimal compared to scientific workloads

  7. Execution pipeline [diagram: fetch/decode unit feeding an instruction pool, dispatch/execute unit, and retire unit, backed by the L1 I-cache, L1 D-cache, L2 cache, and main memory] Stalls can be overlapped with useful work!

  8. Execution time breakdown: T_Q = T_C + T_M + T_B + T_R − T_OVL • T_C – computation • T_M – memory stalls (L1D, L1I, L2D, L2I, DTLB, ITLB) • T_B – branch mispredictions • T_R – stalls on execution resources (functional units, dependency stalls) • T_OVL – stall time overlapped with useful work
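The breakdown above can be sketched numerically. The cycle counts below are hypothetical, chosen only to show how the overlap term T_OVL keeps the stall components from over-counting total execution time:

```python
# Hypothetical cycle counts (not measurements from the paper), chosen
# only to illustrate the breakdown T_Q = T_C + T_M + T_B + T_R - T_OVL.
t_c = 400     # computation cycles
t_m = 350     # memory stall cycles (L1D + L1I + L2D + L2I + TLBs)
t_b = 60      # branch-misprediction stall cycles
t_r = 90      # resource stall cycles (functional units + dependencies)
t_ovl = 100   # stall cycles hidden by overlap with useful work

t_q = t_c + t_m + t_b + t_r - t_ovl
print(f"T_Q = {t_q} cycles")  # 800

# Contribution of each component to the (non-overlapped) sum of parts.
components = {"computation": t_c, "memory": t_m,
              "branch": t_b, "resource": t_r}
for name, t in components.items():
    print(f"{name}: {100 * t / sum(components.values()):.1f}%")
```

Because T_OVL is subtracted, summing the measured stall components always overestimates T_Q, which is why the paper treats the per-component percentages as pessimistic bounds.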

  9. DB setup • DB is memory resident => no I/O interference • No dynamic or random parameters; no concurrency control among transactions

  10. Workload choice • Simple queries: – Single-table range selections (sequential, index) – Two-table equijoins • Easy to set up and run • Fully controllable parameters • Isolate basic operations • Enable iterative hypotheses! • Building blocks for complex workloads?

  11. Execution Time Breakdown (%) [charts: query execution time for 10% sequential scan, 10% indexed range selection, and join (no index) on DBMSs A–D, split into computation, memory, branch mispredictions, and resource stalls] • Stalls take at least 50% of execution time • Memory stalls are the major bottleneck

  12. Memory Stalls Breakdown (%) [charts: memory stall time for 10% sequential scan, 10% indexed range selection, and join (no index) on DBMSs A–D, split into L1 data, L1 instruction, L2 data, and L2 instruction stalls] • L1 data and L2 instruction stalls play an unimportant role • L2 data and L1 instruction stalls dominate • Memory bottlenecks vary across DBMSs and queries

  13. Effect of Record Size [charts: L2 data misses and L1 instruction misses per record for a 10% sequential scan, record sizes 20, 48, 100, 200 bytes, systems A–D] • L2D misses increase: reduced locality + page crossing (except D) • L1I misses increase: page-boundary crossing costs
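A back-of-the-envelope model hints at why L2 data misses per record grow with record size under a sequential scan. The 32-byte cache line is an assumption for illustration, not a figure from the slides:

```python
CACHE_LINE = 32  # bytes per L2 cache line (assumed, for illustration)

def l2d_misses_per_record(record_size: int) -> float:
    """Lines fetched per record when records are densely packed:
    a scan must bring in every byte of every record once, so it
    incurs roughly record_size / CACHE_LINE line fetches per record."""
    return record_size / CACHE_LINE

for size in (20, 48, 100, 200):
    print(f"{size:>3} B record -> {l2d_misses_per_record(size):.2f} line fetches")
```

This simple model captures the roughly linear growth in the charts; the extra jump at page boundaries (records split across slotted pages) sits on top of it, which is why system D, which avoids page crossing, deviates.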

  14. Memory Bottlenecks • Memory is important – increasing memory–processor performance gap – deeper memory hierarchies expected • Stalls due to L2 cache data misses – expensive fetches from main memory – L2 grows (8 MB), but will be slower • Stalls due to L1 I-cache misses – buffer pool code is expensive – L1 I-cache not likely to grow as much as L2

  15. Branch Mispredictions Are Expensive [charts: branch misprediction rates and their share of query execution time for DBMSs A–D, for sequential scan, index scan, and join (no index)] • Rates are low, but their contribution is significant • Prediction is a compiler task, but decisive for L1I performance

  16. Mispredictions vs. L1-I Misses [charts: branch mispredictions and L1 I-cache misses per 1000 instructions for 10% sequential scan, 10% indexed range selection, and join (no index) on DBMSs A–D] • More branch mispredictions incur more L1I misses • Index code is more complicated and needs optimization

  17. Resource-related Stalls [charts: dependency-related stalls (T_DEP) and functional-unit-related stalls (T_FU) as a percentage of query execution time for DBMSs A–D, for sequential scan, index scan, and join (no index)] • High T_DEP for all systems: low ILP opportunity • A's sequential scan: memory-unit load buffers?

  18. Microbenchmarks vs. TPC: CPI Breakdown [charts: clock ticks per instruction on systems B and D, comparing sequential scan with TPC-D and secondary-index scan with TPC-C, split into computation, memory, branch misprediction, and resource stalls] • Sequential-scan breakdown is similar to TPC-D • Secondary index and TPC-C: higher CPI, more memory stalls (mostly L2 data and instruction)
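The CPI on the y-axis of these charts is simply retired clock ticks divided by retired instructions, with each stacked segment attributing its share of the cycles. A minimal sketch with made-up counter values (hypothetical, not measurements from either benchmark):

```python
# Hypothetical hardware-counter readings (not real measurements)
# showing how a stacked CPI bar is derived: cycles / instructions.
instructions = 1_000_000        # instructions retired
cycles_by_component = {         # cycles attributed to each component
    "computation": 900_000,
    "memory": 1_400_000,
    "branch misprediction": 150_000,
    "resource": 350_000,
}

cpi = sum(cycles_by_component.values()) / instructions
print(f"overall CPI = {cpi:.2f}")  # total height of the stacked bar

for component, cycles in cycles_by_component.items():
    print(f"{component}: {cycles / instructions:.2f} CPI")
```

A CPI well above 1 on a superscalar processor, as in the index and TPC-C bars, signals that stalls rather than issue width are limiting throughput.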

  19. Conclusions • Execution time breakdown shows trends • L1I and L2D are major memory bottlenecks • We need to: – reduce page crossing costs – optimize instruction stream – optimize data placement in L2 cache – reduce stalls at all levels • TPC may not be necessary to locate bottlenecks

  20. Five years later – Becker et al 2004 • Same DBMSs, setup, workloads (memory resident) and metrics • Outcome: stalls still take a lot of time – Seq scans: L1I stalls and branch mispredictions much lower – Index scans: no improvement – Joins: improvements, similar to seq scans – Bottleneck shifts to L2D misses => must improve data placement – What works well on some hardware doesn't on others

  21. Five years later – Becker et al 2004 • C on a quad P3 700 MHz, 4 GB RAM, 16 KB L1, 2 MB L2 • B on a single P4 3 GHz, 1 GB RAM, 8 KB L1D + 12K-µop trace cache, 512 KB L2, BTB 8× larger than the P3's • P3 results: – similar to 5 years ago: major bottlenecks are L1I and L2D • P4 results: – memory stalls almost entirely due to L1D and L2D stalls – L1D stalls higher: smaller cache and larger cache line – L1I stalls removed by the trace cache (esp. for seq. scan, but still some for index) Hardware awareness is important!

  22. References • DBMS on a modern processor: where does time go? Revisited – CMU Tech Report 2004 • Anastassia Ailamaki – VLDB’99 talk slides

  23. Questions?
