DATABASE SYSTEM IMPLEMENTATION GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ LECTURE #4: SYSTEM CATALOGS & DATABASE COMPRESSION
2 OFFICE HOURS Prashanth Dintyala → Office Hours: Mon, 1:30-2:30 PM → Location: Near KACB 3324 → Email: vdintyala3@gatech.edu Sonia Matthew → Office Hours: Wed, 11 AM-12 PM → Location: Near KACB 3324 → Email: smathew60@gatech.edu
3 HOMEWORK #1 We have posted clarifications on Piazza → More clarifications will be provided over time Separately submit the design document on Gradescope → Homework 1 - Design Doc
4 TODAY’S AGENDA System Catalogs Compression Background Naïve Compression OLAP Columnar Compression OLTP Index Compression
5 SYSTEM CATALOGS Group of tables and views that describe the structure of the database. Each system catalog table contains information about specific elements in the database. → Statistics related to queries → Statistics related to data distribution → List of indexes → List of user tables
6 SYSTEM CATALOGS Almost every DBMS stores a database's catalog inside itself (i.e., using its own storage manager). → Wrap an object abstraction around tuples to avoid invoking SQL queries to retrieve catalog data → Specialized code for "bootstrapping" catalog tables. The entire DBMS should be aware of transactions in order to automatically provide ACID guarantees for DDL (i.e., schema change) commands and concurrent txns.
7 SYSTEM CATALOGS MySQL → Special hard-coded scripts to alter the catalog → Non-transactional catalog changes PostgreSQL → Uses SQL commands to alter the catalog
8 SCHEMA CHANGES ADD COLUMN : → NSM : Copy tuples into new region in memory. → DSM : Just create the new column segment DROP COLUMN : → NSM #1 : Copy tuples into new region of memory. → NSM #2 : Mark column as "deprecated", clean up later. → DSM : Just drop the column and free memory. CHANGE COLUMN : → Check whether the conversion is allowed to happen. Depends on default values.
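A minimal sketch of why ADD COLUMN is cheap under DSM but requires a copy under NSM. The function names and table layouts are hypothetical, purely for illustration:

    # Hypothetical sketch: ADD COLUMN under NSM vs. DSM storage.
    # Structures are illustrative, not from a real DBMS.

    def add_column_nsm(tuples, default):
        # NSM stores whole tuples contiguously, so every tuple must be
        # copied into a new region with the extra attribute appended.
        return [row + (default,) for row in tuples]

    def add_column_dsm(columns, default, num_rows):
        # DSM stores each attribute separately, so we just allocate a
        # new column segment; existing segments are untouched.
        columns.append([default] * num_rows)
        return columns

    nsm_table = [(1, 'Andy'), (2, 'Prashanth')]
    dsm_table = [[1, 2], ['Andy', 'Prashanth']]
    print(add_column_nsm(nsm_table, 0))      # full rewrite of every tuple
    print(add_column_dsm(dsm_table, 0, 2))   # one new column segment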
9 SCHEMA CHANGES Relevant comment on HackerNews: → Schema changes fail to complete on databases > 2 TB → The operation requires double the amount of disk storage for copying; it can take close to a month on large tables → The big reason that DDL is slow is that these systems haven't tried to make it fast. So, blame the DB designers! → Wish it were on my day-job list of things I could work on.
10 INDEXES CREATE INDEX : → Scan the entire table and populate the index. → Must not block all other txns during index construction. → Have to record changes made by txns that modified the table while another txn was building the index. → When the scan completes, lock the table and resolve changes that were missed after the scan started. DROP INDEX : → Just drop the index logically from the catalog. → It only becomes "invisible" when the txn that dropped it commits. All existing txns will still have to update it.
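A toy sketch of the non-blocking CREATE INDEX protocol described above, assuming a side log that records changes made by concurrent txns while the scan runs (all names and structures hypothetical):

    # Hypothetical sketch of non-blocking index construction:
    # scan the table, let concurrent txns append to a side log,
    # then lock briefly and replay the log to catch up.

    table = {1: 'Andy', 2: 'Prashanth'}    # key -> name
    side_log = []                          # (key, new_value) from concurrent txns

    def build_index():
        index = {}
        # Phase 1: scan the table without blocking writers.
        for key, name in list(table.items()):
            index[name] = key
        # (a concurrent txn modifies the table while we scan)
        side_log.append((2, 'Sonia'))
        table[2] = 'Sonia'
        # Phase 2: lock the table (not shown) and replay missed changes.
        for key, new_name in side_log:
            index = {n: k for n, k in index.items() if k != key}
            index[new_name] = key
        return index

    print(build_index())   # {'Andy': 1, 'Sonia': 2}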
11 OBSERVATION I/O is the main bottleneck if the DBMS has to fetch data from disk. → CPU cost of decompressing data < I/O cost of fetching uncompressed data. Compression always helps in this setting. In-memory DBMSs are more complicated: → Compressing the database reduces DRAM requirements and processing. Key trade-off is speed vs. compression ratio → In-memory DBMSs (always?) choose speed.
12 REAL-WORLD DATA CHARACTERISTICS Data sets tend to have highly skewed distributions for attribute values. → Example: Zipfian distribution of the Brown Corpus → Words like “the”, “a” occur very frequently in books Data sets tend to have high correlation between attributes of the same tuple. → Example: Order Date to Ship Date (few days) → (June 5, +5) instead of (June 5, June 10)
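A sketch of exploiting that correlation: store the ship date as a small per-tuple delta from the order date instead of a full date (illustrative only):

    from datetime import date, timedelta

    # Store ship_date as a small offset (in days) from order_date,
    # exploiting the correlation between the two attributes.
    orders = [(date(2019, 6, 5), date(2019, 6, 10)),
              (date(2019, 6, 7), date(2019, 6, 9))]

    encoded = [(o, (s - o).days) for o, s in orders]    # (June 5, +5)
    decoded = [(o, o + timedelta(days=d)) for o, d in encoded]

    assert decoded == orders
    print(encoded)   # deltas fit in a single byte instead of a full date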
13 DATABASE COMPRESSION Goal #1: Must produce fixed-length values. → Allows us to be efficient while accessing tuples. Goal #2: Allow the DBMS to postpone decompression as long as possible during query execution. → Operate directly on compressed data. Goal #3: Must be a lossless scheme. → No data should be lost during this transformation.
14 LOSSLESS VS. LOSSY COMPRESSION When a DBMS uses compression, it is always lossless because people don’t like losing data. Any kind of lossy compression has to be performed at the application level. → Example: Sensor data. Readings are taken every second, but we may only store the average across every minute. Some newer DBMSs support approximate queries → Example: BlinkDB, SnappyData, XDB, Oracle (2017)
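For the sensor-data example, the application itself might downsample readings before insertion; a minimal sketch:

    # Application-level lossy compression (hypothetical): keep only the
    # per-minute average of per-second sensor readings before inserting.
    readings = [(t, 20.0 + (t % 3)) for t in range(120)]   # (second, value)

    minutes = {}
    for t, v in readings:
        minutes.setdefault(t // 60, []).append(v)

    averaged = {m: sum(vs) / len(vs) for m, vs in minutes.items()}
    print(averaged)   # 120 rows reduced to 2; the DBMS itself stays lossless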
15 ZONE MAPS Pre-computed aggregates for blocks of data. The DBMS can check the zone map first to decide whether it wants to access the block.

Original Data:
val
100
200
300
400
400

Zone Map:
type  | val
MIN   | 100
MAX   | 400
AVG   | 280
SUM   | 1400
COUNT | 5

SELECT * FROM table WHERE val > 600
→ The zone map's MAX is 400, so the DBMS can skip this block entirely.
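A minimal sketch of the zone-map check, with a hypothetical block layout:

    # Hypothetical zone-map check: consult precomputed aggregates
    # before touching the block itself.
    block = [100, 200, 300, 400, 400]
    zone_map = {'MIN': min(block), 'MAX': max(block),
                'AVG': sum(block) / len(block),
                'SUM': sum(block), 'COUNT': len(block)}

    def maybe_scan(lo):
        # SELECT * FROM table WHERE val > lo
        if zone_map['MAX'] <= lo:
            return []                     # skip: no value can qualify
        return [v for v in block if v > lo]

    print(maybe_scan(600))   # [] -- block skipped, never read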
18 COMPRESSION GRANULARITY Choice #1: Block-level → Compress a block of tuples for the same table. Choice #2: Tuple-level → Compress the contents of the entire tuple (NSM-only). Choice #3: Attribute-level → Compress a single attribute value within one tuple. → Can target multiple attributes for the same tuple. Choice #4: Column-level → Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).
19 NAÏVE COMPRESSION Compress data using a general-purpose algorithm. Scope of compression is only based on the data provided as input. → LZO (1996), LZ4 (2011), Snappy (2011), Zstd (2015) Considerations → Computational overhead (gzip is super slow) → Compress vs. decompress speed.
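The speed vs. ratio trade-off is visible even within a single library; a sketch using Python's zlib (levels 1 and 9 are standard zlib compression levels):

    import time, zlib

    data = b'the quick brown fox ' * 50000

    for level in (1, 9):   # 1 = fastest, 9 = best compression
        start = time.perf_counter()
        out = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        print(f'level {level}: {len(out)} bytes, {elapsed:.4f}s')
    # Higher levels shrink the output more but cost extra CPU time --
    # the same trade-off that separates LZ4/Snappy from gzip/Zstd.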
20 NAÏVE COMPRESSION Choice #1: Entropy Encoding → More common sequences use fewer bits to encode, less common sequences use more bits to encode. Choice #2: Dictionary Encoding → Build a lookup table that maps logical identifiers to data chunks. Example: 1 ~ “the” → Replace values in the original data with logical identifiers that can later be decompressed using the lookup table. Example: <1, 5, 3> ~ <“the lookup table”>
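A minimal dictionary-encoding sketch along the lines of the example above:

    # Dictionary encoding (sketch): map each distinct chunk to a small
    # integer code, then store codes instead of the original values.
    values = ['the', 'lookup', 'table', 'the', 'the']

    dictionary, codes = {}, []
    for v in values:
        code = dictionary.setdefault(v, len(dictionary) + 1)  # 1 ~ 'the'
        codes.append(code)

    decode = {c: v for v, c in dictionary.items()}
    print(codes)                        # [1, 2, 3, 1, 1]
    print([decode[c] for c in codes])   # round-trips losslessly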
21 MYSQL INNODB COMPRESSION [Figure: Buffer Pool and Disk Pages; Source: MySQL 5.7 Documentation] Disk pages are compressed to one of [1, 2, 4, 8] KB. Each compressed page carries a modification log ("mod log"): updates are appended to the log so the page does not have to be decompressed on every write. When the log fills up, the DBMS decompresses the page into a 16 KB uncompressed copy in the buffer pool, applies the logged changes, and recompresses it.
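A toy model of the mod-log idea, not InnoDB's actual code: buffer updates next to the compressed page and only decompress/recompress when the log fills up (all structures and the LOG_CAPACITY threshold are hypothetical):

    import json, zlib

    # Toy model of a compressed page with a modification log,
    # in the spirit of the InnoDB scheme above.
    PAGE = [100, 200, 300]
    compressed = zlib.compress(json.dumps(PAGE).encode())
    mod_log = []                # (slot, new_value) pairs
    LOG_CAPACITY = 2

    def update(slot, value):
        global compressed
        mod_log.append((slot, value))        # append-only, page untouched
        if len(mod_log) >= LOG_CAPACITY:     # log full: rebuild the page
            page = json.loads(zlib.decompress(compressed))
            for s, v in mod_log:
                page[s] = v
            compressed = zlib.compress(json.dumps(page).encode())
            mod_log.clear()

    update(0, 111)                           # buffered in the mod log
    update(2, 333)                           # triggers decompress + recompress
    print(json.loads(zlib.decompress(compressed)))   # [111, 200, 333]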
27 NAÏVE COMPRESSION The data has to be decompressed first before it can be read and (potentially) modified. → This limits the “scope” of the compression scheme. These schemes also do not consider the high-level meaning or semantics of the data. → Example: Relationship between order and shipping dates
28 OBSERVATION We can perform exact-match comparisons and natural joins on compressed data if predicates and data are compressed the same way. → Range predicates are more tricky …

Original:
SELECT * FROM users WHERE name = 'Andy'

NAME      | SALARY
Andy      | 99999
Prashanth | 88888

Compressed:
SELECT * FROM users WHERE name = XX

NAME | SALARY
XX   | AA
YY   | BB
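A sketch of rewriting the predicate into the compressed domain so the scan never decompresses; the 'XX'/'YY'/'AA'/'BB' codes are the stand-ins from the example above:

    # Sketch: compress the predicate constant with the same dictionary
    # as the column, then compare codes directly.
    name_dict = {'Andy': 'XX', 'Prashanth': 'YY'}
    users = [('XX', 'AA'), ('YY', 'BB')]      # dictionary-encoded tuples

    def exact_match(name):
        code = name_dict.get(name)            # name = 'Andy' -> name = XX
        return [row for row in users if row[0] == code]

    print(exact_match('Andy'))   # [('XX', 'AA')] -- no decompression needed
    # Range predicates (e.g. name > 'M') only work this way if the
    # dictionary assigns codes in sort order.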