Data Management Systems Fall Semester 2020 Gustavo Alonso Institute of Computing Platforms Content Introduction Department of Computer Science ETH Zürich Course Introduction 1
Course contents • Intro & basics • Storage • Access methods • Query Processing • Concurrency Control and Recovery • Distributed Transaction Processing • Key Value Stores • Cloud Data Management Systems Course Introduction 2
Why data management? • Databases and data management are at the heart of many key applications • The “assembly code” of data science • Organizing and processing information • Data management engines • Architecture and algorithms • Scalability and distribution • Cloud and new hardware • What you learn here will help you be more effective as a computer scientist or as a scientist dealing with data Course Introduction 3
A Database System (DBMS) • A DBMS is a tool that helps develop and run data-intensive applications. • Push the complexity of dealing with the data (storage, processing, consistency) to the database rather than to the program • Share the database • The Database is a tool • Many shapes and forms • Many applications Course Introduction 4
Course Introduction 5
How does a database look like • Until recently, a database often meant a relational database • Today, there are many forms of data management engines (or databases) • Principles and ideas behind relational databases apply to almost all forms of data management Course Introduction 6
SQL Accelerators From Oracle M7 documentation Course Introduction 7
Why use a DBMS? • Avoid redundancy and inconsistency • Rich (declarative) access to the data • Synchronize concurrent data access • Recovery after system failures • Security and privacy • Facilitate reuse of the data • Reduce cost and pain of doing something useful • There is always an alternative!!! Course Introduction 8
What makes a database great? • Physical and logical data independence • Declarative language (SQL) • Query optimization • Data integrity • Access controls • Persistence • Concurrency Control • Recovery • Scalability and flexibility • … Course Introduction 9
Data independence Physical data independence Logical data independence • The database hides how the • The database allows to build data is actually stored views over the schema so that persistently and represented in different logical interpretations memory of the same data are possible • What the user sees are just • This is implemented through relations and a schema views and the engine is responsible for maintaining such • How to map physical data to the views schema is the engine’s problem Course Introduction 10
Declarative language • SQL (Structured Query TPC-H Query 1 SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) AS Language) specifies how the SUM_QTY, SUM(L_EXTENDEDPRICE) AS SUM_BASE_PRICE, result looks like and describes SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)) AS the tuples that should be part of SUM_DISC_PRICE, SUM(L_EXTENDEDPRICE*(1-L_DISCOUNT)*(1+L_TAX)) AS the result SUM_CHARGE, AVG(L_QUANTITY) AS AVG_QTY, AVG(L_EXTENDEDPRICE) AS AVG_PRICE, AVG(L_DISCOUNT) AS • It does not say how to get to the AVG_DISC, COUNT(*) AS COUNT_ORDER result (no control flow) FROM LINEITEM WHERE L_SHIPDATE <= dateadd(dd, -90, cast('1998-12-01' as • This is what allows the database date)) to optimize queries GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG,L_LINESTATUS • Standardized (to some extent) Course Introduction 11
Query optimization • Relational algebra allows to prove the equivalence of certain transformations. • One an SQL statement is transformed into an operator tree (a plan), the plan can be optimized using additional information, heuristics, cost estimations, etc. • This is automatic, user is not involved (hints allowed) Course Introduction 12
Data integrity • Databases use a number of features to preserve the integrity of the data: • Foreign key constraints (if a tuple appears in one table, it has to appear in another table) • NULL/NOT-NULL (preventing that key attributed are left empty) • Value constraints (preventing that attributes get assigned invalid values) • … • This allows to define rules controlling the consistency of the data, letting the engine enforce these rules at run time. These rules are declarative. Course Introduction 13
Access controls • Relations and queries are logically the same thing • Database engines provide access control by determining what data can be seen using queries • Different possibilities of implementation • The engine decides how to enforce access control • Allows a declarative specification of access control • Use similar tricks as query optimization … Course Introduction 14
Concurrency control • One of the big selling points of database engines • No matter how many users read and write the data concurrently, they all see a consistent database • The application does not need to worry about the fact that other applications/users are also modifying the data • The engine ensures that you read the correct data even if it is being modified and it also ensures your changes make it to the database even when others are also reading and writing the same data • A hot topic for decades and still very much being discussed (lots of fake news from companies when they do not know how to do it) Course Introduction 15
Recovery • Another big selling point of database engines • Data committed to the database will be persistent and recoverable even when failures occur • Crucial in a wide range of applications (financial, banking, etc.) • Upon restart (even after a failure) the database will restore its last consistent stage before shutting down • The database can recreate its past states and records all the changes made, keeping track of all modifications to the data or the system Course Introduction 16
Scalability and flexibility • Database engines today provide very high levels of performance and a wide variety of features • Very complex systems (potentially very expensive as well) with many extension and deployments • Established code basis have been optimized to the extreme and are highly reliable (code tested by very many users) • Often, old engines are better not because of their design but because of their reliability and hugely optimized code • In spite of diversification, relational engines remain the key benchmark in functionality as well as performance Course Introduction 17
Architecture Course Introduction 18
Database Architecture Relations, views Application Queries, Transactions (SQL) Logical data (tables, schemas) Logical view (logical data) Record Interface Logical records (tuples) Access Paths Record Access Physical records Physical data in memory Page access Page structure Pages in memory File Access Storage allocation Blocks, files, segments Physical storage Course Introduction 19
Database Architecture (another view) https://docs.oracle.com/cd/E11882_01/server.112/e40540/pro cess.htm#CNCPT902 Course Introduction 20
Database Architecture (yet another view) https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf Course Introduction 21
Basic Architectures – shared nothing • Each node runs on its own data and NETWORK on its own engine. They share neither disk nor memory • Function shipping = the query has to go to where the data is DB DB DB • Data Shipping = the data has to go Engine Engine Engine where the query is • Easy to maintain and to scale Disk Disk Disk • Ideal when data can be sharded • Replication used for fault tolerance and to minimize data shipping Course Introduction 22
Who uses shared nothing? • Shared nothing works well when: • Data can be sharded (partitioned) across the NETWORK nodes • There are not many updates across nodes • The data fits within the local storage of each node DB DB DB Engine Engine Engine • Replication does not add too much overhead • Queries can be parallelized across the nodes • Shared nothing does not work when Disk Disk Disk • Data cannot be easily partitioned • There are too many updates across all the data Course Introduction 23
Example of processing with shared nothing Merge and answer T Results Results SELECT * FROM T WHERE Age > 50 SELECT * DB SELECT * SELECT * DB DB FROM T Engine FROM T FROM T Engine Engine WHERE WHERE WHERE Age > 50 Age > 50 Age > 50 Disk Disk Disk T3 T1 T2 Course Introduction 24
Example of processing with shared nothing Process and answer Data Shipping (S) R S SELECT * FROM R,S WHERE R.name = S.name DB SELECT * DB DB Engine FROM R,S Engine Engine WHERE R.name = S.name Disk Disk Disk S R Course Introduction 25
Example of processing with shared nothing R S Function shipping SELECT * FROM R,S WHERE R.id = S.id SELECT * FROM R,S DB DB DB WHERE S Engine Engine Engine R.id = S.id Disk Disk Disk S R Course Introduction 26
Recommend
More recommend