Data Management Systems • Access Methods Denormalized tables • Pages and Blocks • Indexing Log structured databases • Access Methods in context No indexes Gustavo Alonso Institute of Computing Platforms Department of Computer Science ETH Zürich Access Methods in context 1
Alternative designs • There are many different versions of what we have studied regarding access methods • New and old ideas keep being applied because the context changes: • Denormalized tables = store more than one table in the same blocks • It is a form of materialized view with a join already performed • Useful if tables are more often accessed together than on their own • Log Structured representation = do not store tuples but only updates to them • Main memory databases for OLTP • Cloud storage • Avoid creating indexes • Cloud databases (Snowflake) • Database Cracking (MonetDB) Access Methods in context 2
Denormalized tables Access Methods in context 3
Merging tables in advance • Normal forms are rules used to eliminate redundancy in the schema of a database • They force to split tables into differnet tables until each one represents a distinct concept • Good idea to save space but often tables will be joined in almost every query • Some systems allow to cluster tables into the same segment so that blocks contain data form both tables (indexed by the common attributes/keys) Access Methods in context 4
Clustered Tables (Oracle) • Clustered tables make sense when the tables are really processed together most of the time • Reduces I/O, saves space • Tuples that are related are stored together in the same block, reducing access time • It is like a materialized join • Updates can become expensive Access Methods in context 5
Log structured databases Access Methods in context 6
Log structured file • Instead of storing tuples and modifying them as needed, keep a record of how the data was modified (a log): • Inserts records the entire tuple • Delete records that the tuple is invalidated • Updates records the attributes that have been modified • New entries are appended at the end of the file Block INSERT id = 17, Val = 3 New entries DELETE id = 11 appended at the end UPDATE id = 25, Val = 7 UPDATE id = 98, Val = 14 Access Methods in context 7
Log structured database • It is much faster to update a file sequentially than to do random accesses (even in SSDs) • In cloud storage, file storage is typically append only • It minimizes the cost of making the data persistent (very expensive for OLTP databases with many transactions) • Not for heavy query analytics but very good for in memory transactional workloads Access Methods in context 8
Optimizing log structured representations • Periodically compact the log: • Remove the history and add one entry for every tuple • Index the tuples and their modifications (entry could contain pointer to previous entry referring to that tuple) • It actually matches what often happens in many applications. • No updates in place • Record history of operations • Periodically apply all operations • Increasingly being used in a number of databases, especially cloud databases Access Methods in context 9
No indexes Access Methods in context 10
Snowflake: Micro-partitions instead of indexes • Micro- partitions are Snowflake’s name for extents • What is interesting is how they are organized to facilitate query processing • Size ranges between 50 and 500 MB (before compression, the data is always compressed when in S3) • Each micro partition has metadata describing what is inside • The metadata can be read without reading the whole micro-partition • The metadata is used to read just the part of the micro-partition that is relevant • Data in the micro-partition is stored in columnar form (by columns not by rows) Storage in Context 11
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html Storage in Context 12
Pruning based on metadata • The header for a micro-partition contains information about the data. SELECT * FROM T WHERE age > 45 • If header contains the min and max age in the data, we can decide we do not need that micro-partition simply by looking at the header • Same as indexing built in the micro- partitions Storage in Context 13
Database Cracking • Pioneered in MonetDB (open source column store) • Do not create an index, instead, build the index incrementally while the data is being processed • Based on creating a copy of the column • Do it in successive passes so that the cost is not too high • Initial queries expensive, later cost is amortized as work has already been done Access Methods in context 14
Recommend
More recommend