data analytics using deep learning
play

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY - PowerPoint PPT Presentation

DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY ARULRAJ L E C T U R E # 0 7 : S T O R A G E M O D E L S & C O M P R E S S I O N administrivia Reminder Assignment 1 due on next Wednesday Sign up for discussion


  1. DATA ANALYTICS USING DEEP LEARNING GT 8803 // FALL 2019 // JOY ARULRAJ L E C T U R E # 0 7 : S T O R A G E M O D E L S & C O M P R E S S I O N

  2. administrivia • Reminder – Assignment 1 due on next Wednesday – Sign up for discussion slots on next Thursday GT 8803 // Fall 2019 2

  3. LAST CLASS • Disk-centric & in-memory DBMSs – Buffer management (ACI D ) – Query processing – Concurrency control (AC I D) – Logging and recovery ( A CI D ) GT 8803 // Fall 2019 3

  4. TODAY’s AGENDA • Storage Models • Compression • Visual Storage Engine GT 8803 // Fall 2019 4

  5. STORAGE MODELS 5 GT 8803 // Fall 2018

  6. ANATOMY OF A DATABASE SYSTEM Process Manager Connection Manager + Admission Control Query Parser Query Processor Query Optimizer Query Executor Query Lock Manager (Concurrency Control) Transactional Access Methods (or Indexes) Storage Manager Buffer Pool Manager Log Manager Shared Utilities Memory Manager + Disk Manager Networking Manager Source: Anatomy of a Database System GT 8803 // Fall 2019 6

  7. DATA ORGANIZATION • One can think of an in-memory database as just a large array of bytes. – The schema tells the DBMS how to convert the bytes into the appropriate type (e.g., INTEGER , DATE ). – Each tuple is prefixed with a header that contains meta-data (e.g., last modified time-stamp). GT 8803 // Fall 2019 7

  8. TABLE STORAGE FORMAT • Storage Models – N -ary Storage Model (NSM) / Row-Store – Decomposition Storage Model (DSM) / Column- Store – Flexible or Hybrid Storage Model GT 8803 // Fall 2019 8

  9. N-ARY STORAGE MODEL (NSM) • The DBMS stores all of the attributes for a single tuple contiguously. • Ideal for OLTP workloads where txns tend to operate only on an individual entity and insert-heavy workloads. • Use the tuple-at-a-time iterator model. GT 8803 // Fall 2019 9

  10. N-ARY STORAGE MODEL (NSM) ID University Enrollment City 1 Georgia Tech 15000 Atlanta 2 Wisconsin 30000 Madison 3 Carnegie Mellon 6000 Pittsburgh 4 UC Berkeley 30000 Berkeley GT 8803 // Fall 2019 10

  11. NSM PHYSICAL STORAGE • Choice #1: Heap-Organized Tables – Tuples are stored in blocks called a heap . – The heap does not necessarily define an order • Choice #2: Index-Organized Tables – Tuples are stored in the primary key index itself. – Index does define an order based on the primary key GT 8803 // Fall 2019 11

  12. N-ARY STORAGE MODEL (NSM) • Advantages – Fast inserts, updates, and deletes. – Good for queries that need the entire tuple. – Can use index-oriented physical storage. • Disadvantages – Not good for scanning large portions of the table and/or a subset of the attributes. – OLAP workloads & wide tables with lots of attributes GT 8803 // Fall 2019 12

  13. DECOMPOSITION STORAGE MODEL (DSM) • The DBMS stores a single attribute for all tuples contiguously in a block of data. – Sometimes also called vertical partitioning . • Ideal for OLAP workloads where read-only queries perform large scans over a subset of the table’s attributes. • Use the vector-at-a-time iterator model. GT 8803 // Fall 2019 13

  14. DECOMPOSITION STORAGE MODEL (DSM) ID University Enrollment City 1 Georgia Tech 15000 Atlanta 2 Wisconsin 30000 Madison 3 Carnegie Mellon 6000 Pittsburgh 4 UC Berkeley 30000 Berkeley GT 8803 // Fall 2019 14

  15. TUPLE IDENTIFICATION IN DSM • Choice #1: Fixed-length Offsets – Each value is the same length for an attribute. • Choice #2: Embedded Tuple Ids – Each value is stored with its tuple id in a column. Offsets Embedded Ids A B C D A B C D 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 GT 8803 // Fall 2019 15

  16. DECOMPOSITION STORAGE MODEL (DSM) • Advantages – Reduces the amount wasted work because the DBMS only reads the data that it needs. – Better compression. • Disadvantages – Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching (OLTP workloads). GT 8803 // Fall 2019 16

  17. OBSERVATION • Can we build a single system that supports both OLTP and OLAP workloads? • Data is “hot” when first entered into database – A newly inserted tuple is more likely to be updated again the near future. • As a tuple ages, it is updated less frequently. – At some point, a tuple is only accessed in read-only queries along with other tuples. GT 8803 // Fall 2019 17

  18. BIFURCATED ENVIRONMENT Extract Transform Load OLTP DATA SILOS OLAP DATA WAREHOUSE 18 GT 8803 // Fall 2018

  19. BIFURCATED ENVIRONMENT Extract Transform Load OLTP DATA SILOS OLAP DATA WAREHOUSE 19 GT 8803 // Fall 2018

  20. BIFURCATED ENVIRONMENT Extract Transform Load OLTP DATA SILOS OLAP DATA WAREHOUSE 20 GT 8803 // Fall 2018

  21. HYBRID STORAGE MODEL • Single database instance that uses different storage models for hot and cold data. • Store new data in NSM for fast OLTP Migrate data to DSM for more efficient OLAP GT 8803 // Fall 2019 21

  22. HYBRID STORAGE MODEL ID University Enrollment City 1 Georgia Tech 15000 Atlanta 2 Madison Wisconsin 30000 3 Pittsburgh Carnegie Mellon 6000 UC Berkeley 30000 4 Berkeley GT 8803 // Fall 2019 22

  23. PELOTON ADAPTIVE STORAGE • Employ a single execution engine architecture that is able to operate on both NSM and DSM data. – Don’t need to store two copies of the database. – Don’t need to sync multiple database segments. • Note that a DBMS can still use the delta-store approach with this single-engine architecture. BRIDGING THE ARCHIPELAGO BETWEEN ROW-STORES AND COLUMN-STORES FOR HYBRID WORKLOADS SIGMOD 2016 GT 8803 // Fall 2019 23

  24. PELOTON ADAPTIVE STORAGE Original Data A B C D UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx” SELECT AVG (B) FROM AndySux WHERE C = “yyy” 24 GT 8803 // Fall 2018

  25. PELOTON ADAPTIVE STORAGE Original Data A B C D UPDATE AndySux SET A = 123, B = 456, C = 789 WHERE D = “xxx” SELECT AVG (B) FROM AndySux WHERE C = “yyy” 25 GT 8803 // Fall 2018

  26. PELOTON ADAPTIVE STORAGE Original Data A B C D UPDATE AndySux SET A = 123, Hot B = 456, C = 789 WHERE D = “xxx” SELECT AVG (B) FROM AndySux WHERE C = “yyy” Cold 26 GT 8803 // Fall 2018

  27. PELOTON ADAPTIVE STORAGE Original Data Adapted Data A B C D A B C D UPDATE AndySux SET A = 123, Hot B = 456, C = 789 A B C D WHERE D = “xxx” SELECT AVG (B) FROM AndySux WHERE C = “yyy” Cold 27 GT 8803 // Fall 2018

  28. PELOTON ADAPTIVE STORAGE Original Data Adapted Data A B C D A B C D UPDATE AndySux SET A = 123, Hot B = 456, C = 789 A B C D WHERE D = “xxx” SELECT AVG (B) FROM AndySux WHERE C = “yyy” Cold 28 GT 8803 // Fall 2018

  29. FLEXIBLE STORAGE MODEL ID University Enrollment City 1 Georgia Tech 15000 Atlanta 2 Wisconsin 30000 Madison 3 Carnegie Mellon 6000 Pittsburgh 4 UC Berkeley 30000 Berkeley GT 8803 // Fall 2019 29

  30. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. A B C D GT 8803 // Fall 2019 30

  31. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. A B C D Tile Group A Tile Group B GT 8803 // Fall 2019 31

  32. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. A B C D Tile Group A Tile #1 Tile Group B Tile #2 Tile #3 Tile #4 GT 8803 // Fall 2019 32

  33. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. Tile Group H A B C D Header + Tile #1 + + Tile #2 Tile #3 Tile #4 + + GT 8803 // Fall 2019 33

  34. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. H A B C D + + + + + GT 8803 // Fall 2019 34

  35. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. SELECT AVG (B) FROM AndySux H A B C D WHERE C = “yyy” γ + + s + + + AS GT 8803 // Fall 2019 35

  36. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. SELECT AVG (B) FROM AndySux H A B C D WHERE C = “yyy” γ + B + 1 s + 2 + 1 + AS 2 3 GT 8803 // Fall 2019 36

  37. TILE ABSTRACTION • Introduce an indirection layer that abstracts the true layout of tuples from query operators. SELECT AVG (B) FROM AndySux H A B C D WHERE C = “yyy” γ + B + 1 s + 2 + 1 + AS 2 3 GT 8803 // Fall 2019 37

  38. PARTING THOUGHTS • A flexible architecture that supports a hybrid storage model is the next major trend in DBMSs – This will enable relational DBMSs to support both OLTP and OLAP database workloads. GT 8803 // Fall 2019 38

  39. COMPRESSION 39 GT 8803 // Fall 2018

  40. OBSERVATION • I/O is the main bottleneck if the DBMS has to fetch data from disk. – CPU cost for decompressing data < – I/O cost for fetching un-compressed data. • Compression always helps . GT 8803 // Fall 2019 40

  41. OBSERVATION • In-memory DBMSs are more complicated – Compressing the database reduces DRAM requirements and processing. • Key trade-off is speed vs. compression ratio – In-memory DBMSs (always?) choose speed. GT 8803 // Fall 2019 41

Recommend


More recommend