
Cloud Analytics Data Warehousing Marco Serafini COMPSCI 532 - PowerPoint PPT Presentation



  1. Cloud Analytics Data Warehousing Marco Serafini COMPSCI 532 Lecture 18

  2. Trivia • How does Amazon make money? • Selling books? • Entertainment?

  3. Migrating to the Cloud • ELASTICITY • COST • Pay-as-you-go • HW procurement at scale • Unlimited scale • Cluster management at scale

  4. Cloud Computing • Shared resources • Multiple tenants sharing resources (with isolation) • Economy of scale • Elastic provisioning • Can easily add and remove resources on the fly • Pay only for what you use (pay-as-you-go) • Different flavors • IaaS, PaaS, SaaS • Public, private cloud

  5. Cloud Offerings • Computing nodes • Example: AWS EC2 • Full nodes with local storage and pre-installed OS • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable… • Storage services • Example: AWS S3 • Key-value stores (put/get), file systems • Higher-level services • Example: DBMS

  6. Storage Disaggregation • Computing nodes (e.g. EC2) • Feature-rich machines • Storage services (e.g. S3) • On cheaper, storage-heavy machines • Limited read/write interface • Advantages for cloud provider • Provision storage and computation independently • Advantages for users • Storage services cheaper • Network bandwidth ~ I/O bandwidth

  7. Cloud Storage Types

  STORAGE            PERFORMANCE  ACCESS        APPENDS  AVAILABILITY  PRICE
  OBJECT (S3)        --           Shared        ✗        ✓             Low
  FILE SYSTEM (EFS)  -            Shared        ✓        ✓             High
  BLOCK (EBS)        +            Instance (*)  ✗        ✓             Mid
  INSTANCE-LOCAL     ++           Instance      ✓        ✗             High (**)

  (*) Can be detached from an instance and reattached to another
  (**) Storage-heavy instances are expensive

  8. From Shared-Nothing Architecture… • [Diagram: four COMPUTE nodes, each with its own local storage (LS)] • Principle: move computation to data

  9. …To Hybrid Architectures • [Diagram: COMPUTE nodes with local storage (LS) perform arbitrary computation on top of a shared STORAGE SERVICE that supports reads/writes only] • Cannot move computation to data!

  10. Scheduling Low-Priority Tasks • Helps increase hardware utilization • Spot instances • Allocated in real-time based on live bidding • Can be revoked any time (with notice) • Serverless computing • Example: AWS Lambda • Each of these services comes with its own pricing

  11. Goal: Push-Button Analytics • Easily parallelize single-threaded code • Eliminate cluster management overhead • Deployment of nodes • Installation • Configuration • Even cloud offerings have their complexities • Many instance types • Many services • Solution: Serverless functions

  12. Goal: Push-Button Analytics • Use "serverless" components • No need to select a specific cluster size • System auto-scales up and down on demand • Building blocks • Serverless functions (AWS Lambdas) • Cloud storage services (AWS S3) • This paper implements MapReduce in this setting

  13. Serverless Functions • Single-threaded code • Invoked through HTTP requests • Cloud platform takes care of • Deployment • Load balancing • Performance isolation • No need to • Deploy servers • Configure clusters
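The programming model above can be made concrete with a minimal sketch of a Python Lambda-style handler. The `handler(event, context)` signature is the standard AWS Lambda shape; the payload fields (`numbers`) are hypothetical, chosen only for illustration.

```python
import json

# Shape of an AWS Lambda handler in Python: a single-threaded function that
# receives a JSON-decoded event (e.g. from an HTTP invocation) and returns a
# response. Deployment, load balancing, and isolation are the platform's job.
def handler(event, context=None):
    # Hypothetical payload: a list of numbers to aggregate.
    numbers = event.get("numbers", [])
    return {
        "statusCode": 200,
        "body": json.dumps({"sum": sum(numbers)}),
    }

# Locally, the handler is just a function, which makes it easy to unit-test.
result = handler({"numbers": [1, 2, 3]})
```

Because the function is stateless and self-contained, the platform can run as many copies as the request rate demands.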

  14. Challenges with Lambdas • No local storage, need to use remote cloud storage • For example S3 • No function-to-function communication • Sharing state must again go through remote storage • Short maximum running time

  15. Remote vs. Local Storage

  16. State and Fault Tolerance • State is lost after execution • Inputs and outputs need to be persisted • Fault tolerance • Re-execute the function • Requires atomic writes to determine which executions have succeeded
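A minimal sketch of this re-execution scheme, using a plain dict as a stand-in for an atomic put/get store like S3 (all names here are illustrative, not the paper's API):

```python
# Because each write is all-or-nothing, a retry can check whether a previous
# attempt already persisted its output and skip duplicate work.
store = {}

def put_atomic(key, value):
    store[key] = value  # dict assignment stands in for one atomic S3 PUT

def run_task(task_id, fn, arg, attempts=3):
    out_key = f"output/{task_id}"
    for _ in range(attempts):
        if out_key in store:              # a previous attempt finished
            return store[out_key]
        try:
            put_atomic(out_key, fn(arg))  # persist before declaring success
            return store[out_key]
        except Exception:
            continue                      # crash: simply re-execute
    raise RuntimeError("task failed")

# Simulate a function that crashes on its first attempt.
calls = {"n": 0}
def flaky(x):
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("simulated crash")
    return x * 2

result = run_task("t1", flaky, 21)
```

The deterministic output key is what makes re-execution idempotent: a duplicate run either finds the result already present or overwrites it with the same value.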

  17. Registering Functions • Registering a new Lambda function is slow • Solution • Register a single generic Lambda function • Serialize the code that needs to be executed • Store the code (and the input data) on S3 • Generic Lambda function loads the code and executes it
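A sketch of this "single generic Lambda" trick, with an in-memory dict standing in for S3 (real systems serialize closures with cloudpickle; plain pickle is used here so the sketch stays self-contained):

```python
import pickle

storage = {}  # stand-in for S3

def submit(job_id, fn, arg):
    # Client side: serialize the code and its input, put them in storage.
    storage[f"jobs/{job_id}"] = pickle.dumps((fn, arg))

def generic_handler(event):
    # The one registered function: fetch code + input, run, persist output.
    fn, arg = pickle.loads(storage[f"jobs/{event['job_id']}"])
    storage[f"results/{event['job_id']}"] = fn(arg)

def square(x):  # module-level so plain pickle can serialize it by reference
    return x * x

submit("j1", square, 7)
generic_handler({"job_id": "j1"})
```

Registration now happens once; every subsequent job only pays for two storage round-trips plus the invocation.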

  18. Remote Storage Scalability

  19. Semantics • Map is easy • Execute one function per element of the list • Map + single Reducer • E.g. parallel featurization + single-server ML • MapReduce • Many Lambdas needed, many small intermediate files • Use Redis, an in-memory key-value store • Parameter server • Use Redis
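The MapReduce case can be sketched in a few lines, with a dict standing in for the fast key-value store (Redis in the paper) that holds the many small intermediate values; the word-count workload and partitioning scheme are illustrative choices, not the paper's:

```python
from collections import defaultdict

kv = defaultdict(list)  # stand-in for Redis holding intermediate data
N_REDUCERS = 2

def mapper(chunk):
    # Word count: emit (word, 1), hash-partitioned onto a reducer's key.
    for word in chunk.split():
        kv[f"part/{hash(word) % N_REDUCERS}"].append((word, 1))

def reducer(part):
    counts = defaultdict(int)
    for word, n in kv[f"part/{part}"]:
        counts[word] += n
    return dict(counts)

for chunk in ["a b a", "b c"]:
    mapper(chunk)            # each call would be one Lambda invocation
results = {}
for p in range(N_REDUCERS):  # each reducer would also be a Lambda
    results.update(reducer(p))
```

Hash-partitioning guarantees all occurrences of a word reach the same reducer, so the final merge never conflicts.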

  20. The Cost of Scaling Up • Using more nodes does not always imply higher cost • Lower latency → lower cost per node
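The arithmetic behind this point, with hypothetical prices: under pay-per-use billing, total cost is nodes × runtime × unit price, so near-linear speedup keeps cost flat while latency drops.

```python
PRICE_PER_NODE_HOUR = 2.0  # hypothetical price, for illustration only

def cost(nodes, runtime_hours):
    # Pay-per-use billing: you pay for node-hours actually consumed.
    return nodes * runtime_hours * PRICE_PER_NODE_HOUR

small = cost(nodes=4, runtime_hours=2.0)  # 4 nodes for 2 hours
large = cost(nodes=8, runtime_hours=1.0)  # 8 nodes finish in 1 hour
```

With perfect scaling both configurations cost the same, and the larger one returns results twice as fast; only sublinear speedup makes scaling up more expensive.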

  21. Data Warehousing Architectures

  22. Data Warehousing • Analytical (OLAP) relational queries • Different architectures • Snowflake: shared-disk + caching at compute nodes • Redshift: shared-nothing, store all data at compute nodes • Redshift Spectrum: serverless workers executing on-demand and reading from S3 • Let's discuss these architectures and compare them

  23. Snowflake • Shared-disk architecture • Data is stored on S3, all nodes can access it • But nodes keep a distributed cache • Challenges • Heterogeneous workloads • No one-size-fits-all hardware configuration • Membership changes • Large data shuffles when a node fails/is removed • Online upgrade • It is similar to changing all the nodes in the system

  24. Snowflake Architecture • Data Storage • Based on S3: high throughput, high latency • Used also for intermediate data • Virtual Warehouses • Responsible for query execution • Stateless (restarted in their entirety) • Shared cache (low latency on hot data, most data cold) • Cloud Services • Query parsing, access control, optimization • Snapshot isolation with multi-versioning • Metadata on external key-value store
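The "shared cache in front of S3" idea can be sketched as a small LRU cache on a compute node; this is an illustrative policy sketch, not Snowflake's actual cache implementation:

```python
from collections import OrderedDict

class CachedStore:
    """Local cache in front of a slow shared store (stand-in for S3)."""
    def __init__(self, backing, capacity=2):
        self.backing = backing
        self.cache = OrderedDict()
        self.capacity = capacity
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        value = self.backing[key]        # slow, high-latency remote read
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently used
        return value

s3 = {"a": 1, "b": 2, "c": 3}
store = CachedStore(s3)
for key in ["a", "b", "a", "c", "a"]:
    store.get(key)
```

Hot keys ("a" here) are served locally after the first read, while cold data falls back to the remote store, matching the slide's "low latency on hot data, most data cold."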

  25. Snowflake Advantages • Storage on S3 is cheaper • Use expensive local disk only for hot data • All services (except storage) are stateless • Simpler fault tolerance and membership change

  26. Redshift • Classical shared-nothing architecture • Initially based on PostgreSQL but heavily re-optimized for OLAP • Runs on EC2, explicit provisioning • All data pre-loaded on instance storage • Query compilation • S3 for backup only

  27. Redshift Spectrum • Serverless query executor • Number of workers dynamically assigned • Stateless • Reads data directly from S3 • Scale out to leverage storage and computation bandwidth

  28. Comparison Setup • Benchmark: TPC-H • 1 TB uncompressed data • 1 execution of the query suite • Configuration • Default: 4 nodes, memory optimized (r4.8xlarge) • Redshift: analogous node that offers SSD storage (dc2) • Athena: opaque

  29. Comparison: Initialization Time • Paid every time we shut down and restart the system • Load metadata and (optionally) data

  30. Comparison: Runtime • Pre-loading pays off • Initialization delay is easily amortized • Caching less helpful • Cost • Athena: pay data scan only • Other systems: mainly running time • Spectrum: scan + running time

  31. Comparison: Execution Cost • RS can amortize loading costs • Athena • Serverless • Pay per amount of data scanned • RS Spectrum • Similar scheme as Athena • But must add RS cluster cost

  32. Storage Cost Per Day • Instance storage + EBS: very expensive • S3 backup: cheaper

  33. Pushing Down Computation? • One should always move computation to data • But disaggregated storage cannot compute! • [Diagram: COMPUTE nodes with local storage (LS) perform arbitrary computation; the STORAGE SERVICE is read/write only]

  34. S3 Select • Computation on the storage layer • Simple selection and projection queries on structured data (e.g. CSV or Parquet) • Simple aggregations (e.g. sum)
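What S3 Select does at the storage layer can be emulated locally: apply a selection and projection to a CSV object and return only the matching bytes. (The real call is boto3's `select_object_content` with a SQL expression; this sketch only mimics its effect, and the data is made up.)

```python
import csv
import io

# A small CSV "object", as it might sit in S3.
CSV_OBJECT = "name,dept,salary\nana,eng,120\nbob,sales,90\ncarla,eng,130\n"

def storage_side_select(obj, predicate, columns):
    """Emulate push-down: filter and project rows before they leave storage."""
    rows = csv.DictReader(io.StringIO(obj))
    return [tuple(r[c] for c in columns) for r in rows if predicate(r)]

# Roughly: SELECT name, salary FROM s3object WHERE dept = 'eng'
result = storage_side_select(CSV_OBJECT,
                             predicate=lambda r: r["dept"] == "eng",
                             columns=["name", "salary"])
```

The payoff is in bytes moved: only the two matching rows, with two columns each, cross the network instead of the whole object.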

  35. PushdownDB • Stateless query execution with S3 Select • Example: Bloom join • Standard hash join, but push down a Bloom filter to discard rows that cannot join
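A minimal sketch of the Bloom-join idea: build a Bloom filter over the join keys of the small side, apply it during the (storage-side) scan of the large side so non-joining rows are dropped early, then finish with an ordinary hash join. The filter parameters and tables are illustrative, not PushdownDB's.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for pos in self._hashes(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        # False positives possible, false negatives impossible.
        return all(self.bits >> pos & 1 for pos in self._hashes(key))

small = {1: "ana", 3: "carla"}                   # build side: id -> name
large = [(1, 120), (2, 90), (3, 130), (4, 50)]   # probe side: (id, value)

bf = BloomFilter()
for key in small:
    bf.add(key)

# Pushed-down filter: rows failing the Bloom check never leave "storage".
survivors = [row for row in large if bf.may_contain(row[0])]
# Final hash join removes any Bloom false positives.
joined = [(small[k], v) for k, v in survivors if k in small]
```

Because the filter is tiny, it can be serialized into the storage-side predicate, so most non-matching rows are never transferred at all.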

  36. TPC-H Results • Great speedups with S3 Select
