

  1. Data Management Systems
• Storage Management
• Memory hierarchy
• Segments and file storage
• Database buffer cache
• Storage techniques in context: Cloud native databases (Snowflake)
Gustavo Alonso, Institute of Computing Platforms, Department of Computer Science, ETH Zürich
Storage in Context 1

  2. Architecture of a database (layers, top to bottom)
• Application: queries, transactions (SQL) over relations and views
• Logical view: logical data (tables, schemas)
• Record interface: logical records (tuples), access paths
• Record access: physical records, physical data in memory
• Page access: page structure, pages in memory
• File access: storage allocation; blocks, files, segments
• Cloud storage and file system (Amazon S3)

  3. Goals of the lecture
• Put the previous material regarding storage in perspective
• Use a modern example of a database (Snowflake)
• Illustrate how hardware and the cloud change things
• Illustrate the many possibilities available to a database engine in terms of storing and managing physical data
• Start introducing different database engine designs specialized for concrete applications

  4. Snowflake
• A data warehouse specialized for analytical queries, developed entirely on the cloud (cloud native)
• Separates compute (nodes running VMs with a local disk) from storage (Amazon’s S3)
Paper: https://dl.acm.org/doi/10.1145/2882903.2903741
Documentation: https://docs.snowflake.com/en/user-guide-intro.html

  5. Amazon’s S3
• Simple Storage Service (S3) is an object storage service in the cloud that acts as the persistent storage available to applications
• Unlike conventional local disks or distributed file systems!
• Object store (key-value) [object = file]
• HTTP(S) PUT/GET/DELETE interface
• No update in place (objects must be written in full)
• Can read parts (ranges) of an object instead of the whole object
• High CPU overhead (because of HTTP)
• I/O is extra expensive (network bandwidth, latency, interface)
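The interface above can be sketched as a minimal in-memory object store. The class and method names here are illustrative, not S3's actual API (a real client would issue HTTP PUT/GET/DELETE, e.g. via boto3); the point is that writes always replace a whole object, while reads may fetch a byte range:

```python
class ObjectStore:
    """Toy in-memory sketch of an S3-like store (hypothetical API).

    Objects are immutable: put() always writes a whole object (no update
    in place), but get() can read just a byte range of an object.
    """

    def __init__(self):
        self._objects = {}  # key -> bytes

    def put(self, key, data):
        # Whole-object write: replaces any previous version entirely.
        self._objects[key] = bytes(data)

    def get(self, key, byte_range=None):
        data = self._objects[key]
        if byte_range is not None:          # like an HTTP Range header
            start, end = byte_range
            return data[start:end + 1]      # range ends are inclusive
        return data

    def delete(self, key):
        del self._objects[key]


store = ObjectStore()
store.put("table/part-001", b"columnar bytes...")
header = store.get("table/part-001", byte_range=(0, 7))  # partial read
```

Range reads are what later lets a query engine fetch only a micro-partition's header, or only the columns it needs, instead of the whole file.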

  6. Something familiar
• A virtual warehouse is a collection of worker nodes (EC2 instances in Amazon)
• Each worker node has a cache on its local disk where it stores the objects (table files or parts thereof) accessed before
• The cache uses a simple LRU replacement policy
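An LRU replacement policy like the one described can be sketched in a few lines. This is a generic illustration, not Snowflake's implementation; for simplicity the capacity is counted in objects rather than bytes:

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: evicts the least recently used entry when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # key -> cached object, oldest first

    def get(self, key):
        if key not in self._entries:
            return None                     # cache miss: fetch from S3
        self._entries.move_to_end(key)      # mark as most recently used
        return self._entries[key]

    def put(self, key, value):
        if key in self._entries:
            self._entries.move_to_end(key)
        self._entries[key] = value
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


cache = LRUCache(capacity=2)
cache.put("p1", "micro-partition 1")
cache.put("p2", "micro-partition 2")
cache.get("p1")                        # touch p1, so p2 is now the LRU entry
cache.put("p3", "micro-partition 3")   # full: evicts p2
```

A real worker cache would key entries by object name and byte range and account for file sizes, but the replacement logic is the same.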

  7. Micro-partitions
• Micro-partitions are Snowflake’s name for extents
• What is interesting is how they are organized to facilitate query processing
• Size ranges between 50 and 500 MB (before compression; the data is always compressed when in S3)
• Each micro-partition has metadata describing what is inside
• The metadata can be read without reading the whole micro-partition
• The metadata is used to read just the part of the micro-partition that is relevant
• Data in the micro-partition is stored in columnar form (by columns, not by rows)

  8. https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions.html

  9. A not so uncommon design
• Snowflake combines many tricks used before in different contexts:
• Horizontal partitioning of the tables (by row): allows the table to be read in parallel, different parts of the table to be placed on different processing nodes, and, if organized accordingly, only the needed tuples to be read instead of the whole table
• Columnar format (storage by column): the preferred storage format for analytics; improves cache locality, enables vectorized processing, facilitates projection operations (SQL), and allows processing only the part of the table that is relevant
• Storage-level processing to read only the part of the file that is needed (helped by the micro-partitions and the columnar format)
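The benefit of the columnar format for projections can be sketched as follows; the table and column names are made up for illustration:

```python
# Row layout: a projection must touch every field of every row.
rows = [
    {"id": 1, "name": "ann", "age": 34},
    {"id": 2, "name": "bob", "age": 51},
    {"id": 3, "name": "eve", "age": 47},
]
ages_from_rows = [r["age"] for r in rows]  # scans all columns of all rows

# Columnar layout: each column is stored contiguously, so
# SELECT age FROM T reads only the "age" array.
columns = {
    "id":   [1, 2, 3],
    "name": ["ann", "bob", "eve"],
    "age":  [34, 51, 47],
}
ages = columns["age"]  # projection touches one column only
```

On S3 the columnar arrays map to distinct byte ranges of the micro-partition file, so the unneeded columns are never even transferred over the network.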

  10. Pruning based on metadata
• The header for a micro-partition contains information about the data.

  SELECT * FROM T WHERE age > 45

• If the header contains the min and max age in the data, we can decide that we do not need that micro-partition simply by looking at the header
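A minimal sketch of min/max pruning for the query above; the metadata layout and file names are hypothetical, not Snowflake's actual header format:

```python
# Per-micro-partition metadata: min/max per column (hypothetical layout).
partitions = [
    {"file": "part-001", "age_min": 18, "age_max": 40},
    {"file": "part-002", "age_min": 30, "age_max": 60},
    {"file": "part-003", "age_min": 50, "age_max": 80},
]

def prune_for_age_gt(partitions, threshold):
    """Keep only partitions that might contain rows with age > threshold."""
    return [p["file"] for p in partitions if p["age_max"] > threshold]

# SELECT * FROM T WHERE age > 45:
# part-001 (max age 40) is skipped without ever being fetched from S3.
to_read = prune_for_age_gt(partitions, 45)
```

Note that pruning is conservative: a surviving partition may still contain no matching rows, but a pruned one is guaranteed to contain none.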

  11. Pruning
• Snowflake does not use indexes:
• Indexes require a lot of space
• Indexes induce random accesses (very bad for slow storage like S3)
• Indexes need to be maintained and selected correctly
• Instead, it uses the metadata to store information that allows micro-partitions to be filtered (min/max, #distinct values, #nulls, bloom filters, etc.)
• The metadata is much smaller than an index and easier to load than a whole index
• By splitting a table into potentially many micro-partitions, it can significantly optimize the data movement to and from storage

  12. Writing to disk
• S3 does not support update in place; a file is replaced in its entirety (immutable)
• Snowflake uses this feature to implement snapshots of the data (like shadow paging):
• When a micro-partition is modified, a new file is written
• The old micro-partition can be kept or discarded
• Allows time travel (read the data as it was in the past, up to 90 days) and provides fault tolerance (the old data can be recovered from the old micro-partitions)
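The snapshotting scheme above can be sketched as a versioned catalog over immutable files; the class, file names, and structure are illustrative, not Snowflake's internals:

```python
class VersionedTable:
    """Toy shadow-paging catalog: every write creates a new immutable
    file, and each version records which files make up the table."""

    def __init__(self):
        self.files = {}      # file name -> contents (never overwritten)
        self.versions = []   # version i -> list of file names

    def write(self, old_file, new_file, contents):
        """Replace old_file with a freshly written new_file."""
        self.files[new_file] = contents
        current = self.versions[-1] if self.versions else []
        self.versions.append(
            [f for f in current if f != old_file] + [new_file]
        )

    def read(self, version=-1):
        """Read the table at a given version (time travel)."""
        return [self.files[f] for f in self.versions[version]]


t = VersionedTable()
t.write(None, "part-001.v1", "rows A")
t.write("part-001.v1", "part-001.v2", "rows A'")  # an update writes a new file

latest = t.read()          # sees the new micro-partition
past = t.read(version=0)   # old snapshot is still readable (time travel)
```

Because old files are never modified, recovering a past state or recovering from a failure is just a matter of pointing the catalog at an earlier version list.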
