data management systems
play

Data Management Systems Access Methods Denormalized tables Pages - PowerPoint PPT Presentation

Data Management Systems Access Methods Denormalized tables Pages and Blocks Indexing Log structured databases Access Methods in context No indexes Gustavo Alonso Institute of Computing Platforms Department of Computer Science


  1. Data Management Systems • Access Methods Denormalized tables • Pages and Blocks • Indexing Log structured databases • Access Methods in context No indexes Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich Access Methods in context 1

  2. Alternative designs • There are many different versions of what we have studied regarding access methods • New and old ideas keep being applied because the context changes: • Denormalized tables = store more than one table in the same blocks • It is a form of materialized view with a join already performed • Useful if tables are more often accessed together than on their own • Log Structured representation = do not store tuples but only updates to them • Main memory databases for OLTP • Cloud storage • Avoid creating indexes • Cloud databases (Snowflake) • Database Cracking (MonetDB) Access Methods in context 2

  3. Denormalized tables Access Methods in context 3

  4. Merging tables in advance • Normal forms are rules used to eliminate redundancy in the schema of a database • They force to split tables into differnet tables until each one represents a distinct concept • Good idea to save space but often tables will be joined in almost every query • Some systems allow to cluster tables into the same segment so that blocks contain data form both tables (indexed by the common attributes/keys) Access Methods in context 4

  5. Clustered Tables (Oracle) • Clustered tables make sense when the tables are really processed together most of the time • Reduces I/O, saves space • Tuples that are related are stored together in the same block, reducing access time • It is like a materialized join • Updates can become expensive Access Methods in context 5

  6. Log structured databases Access Methods in context 6

  7. Log structured file • Instead of storing tuples and modifying them as needed, keep a record of how the data was modified (a log): • Inserts records the entire tuple • Delete records that the tuple is invalidated • Updates records the attributes that have been modified • New entries are appended at the end of the file Block INSERT id = 17, Val = 3 New entries DELETE id = 11 appended at the end UPDATE id = 25, Val = 7 UPDATE id = 98, Val = 14 Access Methods in context 7

  8. Log structured database • It is much faster to update a file sequentially than to do random accesses (even in SSDs) • In cloud storage, file storage is typically append only • It minimizes the cost of making the data persistent (very expensive for OLTP databases with many transactions) • Not for heavy query analytics but very good for in memory transactional workloads Access Methods in context 8

  9. Optimizing log structured representations • Periodically compact the log: • Remove the history and add one entry for every tuple • Index the tuples and their modifications (entry could contain pointer to previous entry referring to that tuple) • It actually matches what often happens in many applications. • No updates in place • Record history of operations • Periodically apply all operations • Increasingly being used in a number of databases, especially cloud databases Access Methods in context 9

  10. No indexes Access Methods in context 10

  11. Snowflake: Micro-partitions instead of indexes • Micro- partitions are Snowflake’s name for extents • What is interesting is how they are organized to facilitate query processing • Size ranges between 50 and 500 MB (before compression, the data is always compressed when in S3) • Each micro partition has metadata describing what is inside • The metadata can be read without reading the whole micro-partition • The metadata is used to read just the part of the micro-partition that is relevant • Data in the micro-partition is stored in columnar form (by columns not by rows) Storage in Context 11

  12. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html Storage in Context 12

  13. Pruning based on metadata • The header for a micro-partition contains information about the data. SELECT * FROM T WHERE age > 45 • If header contains the min and max age in the data, we can decide we do not need that micro-partition simply by looking at the header • Same as indexing built in the micro- partitions Storage in Context 13

  14. Database Cracking • Pioneered in MonetDB (open source column store) • Do not create an index, instead, build the index incrementally while the data is being processed • Based on creating a copy of the column • Do it in successive passes so that the cost is not too high • Initial queries expensive, later cost is amortized as work has already been done Access Methods in context 14

Recommend


More recommend