CMU SCS CMU SCS Today’s Class Carnegie Mellon Univ. • Storage Models Dept. of Computer Science • System Architectures 15-415/615 - DB Applications • Vectorization • Compression • Data Modification C. Faloutsos – A. Pavlo Lecture#25: Column Stores Faloutsos/Pavlo CMU SCS 15-415/615 3 CMU SCS CMU SCS Wikipedia Example OLTP CREATE TABLE useracct ( userID INT PRIMARY KEY , • On-line Transaction Processing: userName VARCHAR UNIQUE , ⋮ – Short-lived txns. ); – Small footprint. CREATE TABLE pages ( pageID INT PRIMARY KEY , title VARCHAR UNIQUE , – Repetitive operations. latest INT REFERENCES revisions (revID), ); SELECT * FROM useracct SELECT P.*, R.* WHERE userName = ? FROM pages AS P CREATE TABLE revisions ( AND userPass = ? INNER JOIN revisions AS R revID INT PRIMARY KEY, ON P.latest = R.revID pageID INT REFERENCES pages (pageID), UPDATE useracct WHERE P.pageID = ? userID INT REFERENCES useracct (userID), content TEXT, SET lastLogin = NOW (), updated DATETIME hostname = ? INSERT INTO revisions WHERE userID = ? ); VALUES ( ? , ? …, ? ) 4 Faloutsos/Pavlo 5
CMU SCS CMU SCS OLAP Data Storage Models • On-line Analytical Processing: • There are different ways to store tuples. – Long running queries. • We have been assuming the n -ary storage – Complex joins. model this entire semester. – Exploratory queries. SELECT COUNT (U.lastLogin), EXTRACT (month FROM U.lastLogin) AS month FROM useracct AS U WHERE U.hostname LIKE ‘%.gov’ GROUP BY EXTRACT (month FROM U.lastLogin) Faloutsos/Pavlo CMU SCS 15-415/615 6 Faloutsos/Pavlo CMU SCS 15-415/615 7 CMU SCS CMU SCS n -ary Storage Model n -ary Storage Model SELECT * FROM useracct WHERE userName = ? • The DBMS stores all attributes for a single AND userPass = ? tuple contiguously in a block. B+Tree NSM Disk Page NSM Disk Page userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin - - - - - - - - - - Faloutsos/Pavlo CMU SCS 15-415/615 8 Faloutsos/Pavlo CMU SCS 15-415/615 9
CMU SCS CMU SCS n -ary Storage Model n -ary Storage Model SELECT * FROM useracct SELECT COUNT (U.lastLogin), WHERE userName = ? EXTRACT (month FROM U.lastLogin) AS month AND userPass = ? FROM useracct AS U WHERE U.hostname LIKE ‘%.gov’ B+Tree INSERT INTO useracct GROUP BY EXTRACT (month FROM U.lastLogin) VALUES ( ? , ? ,… ? ) NSM Disk Page NSM Disk Page userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID - userName userPass - - hostname - lastLogin - userID userName userPass hostname lastLogin Faloutsos/Pavlo CMU SCS 15-415/615 9 Faloutsos/Pavlo CMU SCS 15-415/615 10 CMU SCS CMU SCS n -ary Storage Model n -ary Storage Model SELECT COUNT (U.lastLogin), SELECT COUNT (U.lastLogin), EXTRACT (month FROM U.lastLogin) AS month EXTRACT (month FROM U.lastLogin) AS month FROM useracct AS U FROM useracct AS U WHERE U.hostname LIKE ‘%.gov’ WHERE U.hostname LIKE ‘%.gov’ GROUP BY EXTRACT (month FROM U.lastLogin) GROUP BY EXTRACT (month FROM U.lastLogin) NSM Disk Page NSM Disk Page userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin Faloutsos/Pavlo CMU SCS 15-415/615 10 Faloutsos/Pavlo CMU SCS 15-415/615 10
CMU SCS CMU SCS n -ary Storage Model n -ary Storage Model SELECT COUNT (U.lastLogin), EXTRACT (month FROM U.lastLogin) AS month • Advantages FROM useracct AS U WHERE U.hostname LIKE ‘%.gov’ – Fast inserts, updates, and deletes. GROUP BY EXTRACT (month FROM U.lastLogin) – Good for queries that need the entire tuple. • Disadvantages NSM Disk Page – Not good for scanning large portions of the X userID userName userPass hostname lastLogin table and/or a subset of the attributes. userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin userID userName userPass hostname lastLogin Faloutsos/Pavlo CMU SCS 15-415/615 10 Faloutsos/Pavlo CMU SCS 15-415/615 11 CMU SCS CMU SCS Decomposition Storage Model Decomposition Storage Model • The DBMS stores a single attribute for all • The DBMS stores a single attribute for all tuples contiguously in a block. tuples contiguously in a block. DSM Disk Page userID userName userPass hostname lastLogin hostname hostname hostname hostname hostname userID lastLogin userID userName userPass hostname lastLogin hostname hostname hostname hostname hostname userID userName userPass hostname lastLogin hostname hostname hostname hostname hostname - - - - - hostname hostname hostname hostname hostname userName userPass Faloutsos/Pavlo CMU SCS 15-415/615 12 Faloutsos/Pavlo CMU SCS 15-415/615 12
CMU SCS CMU SCS Decomposition Storage Model Decomposition Storage Model SELECT COUNT (U.lastLogin), EXTRACT (month FROM U.lastLogin) AS month • Advantages FROM useracct AS U WHERE U.hostname LIKE ‘%.gov’ – Reduces the amount wasted I/O because the GROUP BY EXTRACT (month FROM U.lastLogin) DBMS only reads the data that it needs. – Better query processing and data compression DSM Disk Page (more on this later). hostname hostname hostname hostname hostname • Disadvantages hostname hostname hostname hostname hostname – Slow for point queries, inserts, updates, and hostname hostname hostname hostname hostname deletes because of tuple splitting/stitching. hostname hostname hostname hostname hostname Faloutsos/Pavlo CMU SCS 15-415/615 13 Faloutsos/Pavlo CMU SCS 15-415/615 14 CMU SCS CMU SCS History System Architectures • 1970s: Cantor DBMS • Fractured Mirrors • 1980s: DSM Proposal • Partition Attributes Across (PAX) • 1990s: SybaseIQ (in-memory only) • Pure Columnar Storage • 2000s: Vertica, VectorWise, MonetDB • 2010s: Cloudera Impala, Amazon Redshift, “The Big Three” Faloutsos/Pavlo CMU SCS 15-415/615 15 Faloutsos/Pavlo CMU SCS 15-415/615 16
CMU SCS CMU SCS Fractured Mirrors PAX • Store a second copy of the database in a • Data is still stored in NSM blocks, but each DSM layout that is automatically updated. block is organized as mini columns. – Examples: Oracle, IBM DB2 BLU PAX Disk Page userID userID userID userID userName userName userName userName userPass userPass NSM DSM userPass userPass hostname hostname hostname hostname lastLogin lastLogin lastLogin lastLogin Faloutsos/Pavlo CMU SCS 15-415/615 17 Faloutsos/Pavlo CMU SCS 15-415/615 18 CMU SCS CMU SCS Column Stores Today’s Class • Entire system is designed for columnar data. • Storage Models – Query Processing, Storage, Operator • System Architectures Algorithms, Indexing, etc. • Vectorization – Examples: Vertica, VectorWise, Paraccel, • Compression Cloudera Impala, Amazon Redshift • Data Modification Faloutsos/Pavlo CMU SCS 15-415/615 19 Faloutsos/Pavlo CMU SCS 15-415/615 20
CMU SCS CMU SCS Query Processing Strategies Iterator Model • Each operator calls next() on their child • The DBMS needs to process queries differently when using columnar data. operator to process tuples one at a time. • We have already discussed the Iterator next π SELECT cname, amt Model for processing tuples in the DBMS FROM customer, account cname, amt WHERE customer.acctno = next ⨝ query operators. account.acctno acctno=acctno AND account.amt > 1000 next σ amt>1000 CUSTOMER ACCOUNT Faloutsos/Pavlo CMU SCS 15-415/615 21 Faloutsos/Pavlo CMU SCS 15-415/615 22 CMU SCS CMU SCS Materialization Model Observations • Each operator consumes its entire input and • The Iterator Model is bad with a DSM generates the full output all at once. because it requires the DBMS to stitch tuples back together each time. SELECT cname, amt π FROM customer, account • The Materialization Model is a bad because cname, amt WHERE customer.acctno = ⨝ the intermediate results may be larger than account.acctno acctno=acctno AND account.amt > 1000 the amount of memory in the system. σ amt>1000 CUSTOMER ACCOUNT Faloutsos/Pavlo CMU SCS 15-415/615 23 Faloutsos/Pavlo CMU SCS 15-415/615 24
Recommend
More recommend