DSC 102: Systems for Scalable Analytics
Arun Kumar
Topic 5: Dataflow Systems
Chapter 2.2 of MLSys Book

Parallel RDBMSs
❖ Parallel RDBMSs are highly successful and widely used
❖ They offer massive scalability (shared-nothing parallelism) and high performance (parallel relational dataflows) along with many other enterprise-grade benefits of RDBMSs:
  ❖ Full power of SQL
  ❖ Business intelligence dashboards/APIs on top
  ❖ Transaction management and crash recovery
  ❖ Index structures, compressed file formats, auto-tuning, etc.
Q: So, why did people need to go beyond parallel RDBMSs?

Beyond RDBMSs: A Brief History
❖ Relational model and RDBMSs are too restrictive:
  1. “Flat” tables with few data/attribute types
     Object-Relational DBMSs: UDTs, UDFs, text, multimedia, etc.
  2. Restricted language interface (SQL)
     PL/SQL; recursive SQL; embedded SQL; QBE; visual interfaces
  3. Need to know schema first!
     “Schema-later” semi-structured XML data model; XQuery
  4. Optimized for static dataset
     Stream data model; “standing” queries; time windows
But the DB community has addressed these issues already!
Ad: Take CSE 132B and CSE 135 to learn such extensions

Q: Again, so, why did people still need to go beyond parallel RDBMSs?!

Beyond RDBMSs: A Brief History
The DB community got blindsided by the unstoppable rise of the Web/Internet giants!
❖ DB folks underappreciated 4 key concerns of Web folks:
  ❖ Developability
  ❖ Fault Tolerance
  ❖ Elasticity
  ❖ Cost/Politics!

DB/Enterprise vs. Web Dichotomy
❖ DB folks underappreciated 4 key concerns of Web folks:
Developability: RDBMS extensibility mechanisms (UDTs, UDFs, etc.) are too painful for programmers to use!
  DB companies: we write the software and sell to our customers, viz., enterprise companies (banks, retail, etc.)
  Web companies: we will hire an army of software engineers to build our own in-house software systems!
Need simpler APIs and DBMSs that scale custom programs

DB/Enterprise vs. Web Dichotomy
❖ DB folks underappreciated 4 key concerns of Web folks:
Fault Tolerance: What if we run on 100Ks of machines?!
  DB companies: our customers do not need more than a few dozen machines to store and analyze their data!
  Web companies: we need hundreds of thousands of machines for planetary-scale Web services!
If a machine fails, the user should not have to rerun the entire query!
The DBMS should take care of fault tolerance, not the user/application
(Cloud-native RDBMSs now offer fault tolerance by design)

DB/Enterprise vs. Web Dichotomy
❖ DB folks underappreciated 4 key concerns of Web folks:
Elasticity: Resources should adapt to “query” workload
  DB companies: our customers have “fairly predictably” sized datasets and workloads; can fix their clusters!
  Web companies: our workloads could vary widely, and so could the datasets they need!
Need to be able to upsize and downsize clusters easily on-the-fly, based on current query workload

DB/Enterprise vs. Web Dichotomy
❖ DB folks underappreciated 4 key concerns of Web folks:
Cost/Politics: Commercial RDBMS licenses too costly!
  DB companies: our customers have $$$! ☺
  Web companies: our products are mostly free (ads?); why pay so much $$$ if we can build our own DBMSs?
Many started with MySQL (!) but then built their own DBMSs
New tools were free & open source; led to viral adoption!

This new breed of parallel data systems, called Dataflow Systems, jolted the DB folks out of being smug and complacent!

Outline
❖ Beyond RDBMSs: A Brief History
❖ MapReduce/Hadoop Craze
❖ Spark and Dataflow Programming
❖ More Scalable ML with MapReduce/Spark

The MapReduce/Hadoop Craze
❖ Blame Google!
❖ “Simple” problem: index, store, and search the Web! ☺
❖ Who were their major systems hires?
  Jeff Dean and Sanjay Ghemawat (Systems, not DB or IR)
❖ Why did they not use RDBMSs? (Haha.)
  Developability, data model, fault tolerance, scale, cost, …
  Engineers started with MySQL; abandoned it!

What is MapReduce?
MapReduce: Simplified Data Processing on Large Clusters. In OSDI 2004.
❖ Programming model for writing programs on sharded data + distributed system architecture for processing large data
❖ Map and Reduce are terms/ideas from functional PL
❖ Engineer only implements the logic of Map and Reduce
❖ System implementation handles orchestration of data distribution, parallelization, etc. under the covers
Was radically easier for engineers to write programs with!

What is MapReduce?
❖ Standard example: count word occurrences in a doc corpus
  ❖ Input: A set of text documents (say, webpages)
  ❖ Output: A dictionary of unique words and their counts
  Hmmm, sounds suspiciously familiar … ☺

    function map (String docname, String doctext) :
        for each word w in doctext :
            emit (w, 1)

    function reduce (String word, Iterator partialCounts) :
        sum = 0
        for each pc in partialCounts :
            sum += pc
        emit (word, sum)

  (emit is part of the MapReduce API)
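The word-count pseudocode can be tried out with a minimal single-process Python sketch. The driver `run_mapreduce` is a hypothetical helper that only imitates the group-by-key shuffle in memory; it is not the Hadoop or Google API, and the two sample documents are made up for illustration.

```python
from collections import defaultdict

def map_fn(docname, doctext):
    """Map(): emit (word, 1) for every word occurrence in one document."""
    for word in doctext.split():
        yield (word, 1)

def reduce_fn(word, partial_counts):
    """Reduce(): sum all partial counts that share the same key."""
    return (word, sum(partial_counts))

def run_mapreduce(docs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for docname, doctext in docs.items():
        for key, value in map_fn(docname, doctext):
            groups[key].append(value)  # shuffle: gather values per key
    return dict(reduce_fn(key, values) for key, values in groups.items())

# Made-up sample corpus
docs = {"d1": "the cat sat", "d2": "the dog sat down"}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the shuffle moves data across machines and the Mappers and Reducers run as separate processes; the sketch keeps only the programming model.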

How MapReduce Works
[Figure: parallel flow of control and data during MapReduce execution]
❖ Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP)
❖ Fault tolerance achieved using data replication

Abstract Semantics of MapReduce
❖ Map(): Operates independently on one “record” at a time
  ❖ Can batch multiple data examples into one record
  ❖ Dependencies across Mappers not allowed
  ❖ Can emit 0 or more key-value pairs as output
  ❖ Data types of inputs and outputs can be different!
❖ Reduce(): Gathers all Map output pairs across machines with the same key into an Iterator (list)
  ❖ Aggregation function is applied on the Iterator and the final output is emitted
❖ Input Split:
  ❖ Physical-level split/shard of dataset that batches multiple examples into one file “block” (~128MB default on HDFS)
  ❖ Custom Input Splits can be written by the application user

Benefits of MapReduce
❖ Goal: Higher level abstraction of functional operations (Map; Reduce) to simplify data-parallel programming at scale
❖ Key Benefits:
  ❖ Out-of-the-box scalability and cluster parallelism
  ❖ Fault tolerance offloaded to system implementation, not application/user
  ❖ Map() and Reduce() can be highly general; no restrictions on data types/structures processed; easier to use for ETL and text/multimedia-oriented analytics
  ❖ Free and OSS implementations available (Hadoop)
❖ New burden on users: Converting computations of a data-intensive program/operation to the Map() + Reduce() API
  ❖ But MapReduce libraries are available in multiple PLs to mitigate coding pains: Java, C++, Python, R, Scala, etc.

Emulate MapReduce in SQL?
Q: How would you do the word counting in RDBMS / in SQL?
❖ First step: Transform text docs into relations and load (part of the Extract-Transform-Load (ETL) stage)
  Suppose we pre-divide each document into words and have the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!

    SELECT Word, COUNT (*)
    FROM DocWords
    GROUP BY Word
    [ORDER BY Word]

  Parallelism, scaling, etc. done by RDBMS under the covers
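The two steps can be sketched end to end with Python's built-in `sqlite3` module. SQLite is a single-node engine used here only as a stand-in, so it shows the query but none of the parallelism a parallel RDBMS would add; the sample documents are made up.

```python
import sqlite3

# Made-up sample corpus
docs = {"d1": "the cat sat", "d2": "the dog sat down"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DocWords (DocName TEXT, Word TEXT)")

# ETL step: pre-divide each document into words and load them as tuples
rows = [(name, word) for name, text in docs.items() for word in text.split()]
conn.executemany("INSERT INTO DocWords VALUES (?, ?)", rows)

# Query step: the single GROUP BY query from the slide
sql_counts = conn.execute(
    "SELECT Word, COUNT(*) FROM DocWords GROUP BY Word ORDER BY Word"
).fetchall()
print(sql_counts)
conn.close()
```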

More MR Examples: Select
❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Map(): On tuple, apply selection condition; emit pair with dummy key and entire tuple as value
❖ Reduce(): Not needed! No cross-shard aggregation here
❖ Such kinds of tasks/jobs are called “Map-only” jobs
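A Map-only selection can be sketched in plain Python as follows. The shards, the `age` attribute, and the predicate are made-up illustrations, and the loop over shards stands in for Mappers that would run in parallel.

```python
def map_select(row, predicate):
    """Map(): emit (dummy key, tuple) only if the tuple passes the selection."""
    if predicate(row):
        yield (None, row)  # None plays the role of the dummy key

def is_over_30(row):
    """Hypothetical selection condition."""
    return row["age"] > 30

# Table sharded tuple-wise across two hypothetical workers
shards = [
    [{"id": 1, "age": 25}, {"id": 2, "age": 41}],
    [{"id": 3, "age": 37}, {"id": 4, "age": 19}],
]

# No Reduce(): the job output is just the union of all Map outputs
selected = [
    value
    for shard in shards
    for row in shard
    for _, value in map_select(row, is_over_30)
]
print(selected)
```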

More MR Examples: Simple Agg.
❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume it is an algebraic aggregate (SUM, AVG, MAX, etc.)
❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value
❖ Reduce(): Since there is only one dummy key across all shards, the Iterator has all sufficient stats, which can be unified to get the result
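For AVG, the incremental stats are a (sum, count) pair per shard. The sketch below assumes each Map call receives one whole shard as its record; the shard contents and the dummy key name `"ALL"` are made up for illustration.

```python
def map_partial_stats(shard):
    """Map(): scan one shard; emit (dummy key, (partial sum, partial count))."""
    yield ("ALL", (sum(shard), len(shard)))

def reduce_avg(key, stats):
    """Reduce(): unify the per-shard stats under the single dummy key into AVG."""
    total = sum(s for s, _ in stats)
    count = sum(c for _, c in stats)
    return total / count

# Values of the aggregation attribute, sharded across three hypothetical workers
shards = [[10, 20], [30], [40, 60]]

pairs = [pair for shard in shards for pair in map_partial_stats(shard)]
avg = reduce_avg("ALL", [stats for _, stats in pairs])
print(avg)
```

Averaging the per-shard averages directly would give the wrong answer when shards differ in size, which is why the sum and count travel together.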

More MR Examples: GROUP BY Agg.
❖ Input Split (part of ETL): Let table be sharded tuple-wise
❖ Assume it is an algebraic aggregate (SUM, AVG, MAX, etc.)
❖ Map(): On agg. attribute, compute incremental stats; emit pair with grouping attribute as key and stats as value
❖ Reduce(): Iterator has all sufficient stats for a single group; unify those to get the result for that group; different reducers handle different groups
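The only change from the simple aggregate is the key: the grouping attribute replaces the dummy key. A minimal sketch for a grouped AVG, with made-up (group, value) tuples and an in-memory dictionary standing in for the shuffle:

```python
from collections import defaultdict

def map_group_stats(shard):
    """Map(): per shard, emit (group, (partial sum, partial count)) per group seen."""
    partial = defaultdict(lambda: [0, 0])
    for group, value in shard:
        partial[group][0] += value
        partial[group][1] += 1
    for group, (total, count) in partial.items():
        yield (group, (total, count))

def reduce_group_avg(group, stats):
    """Reduce(): unify all partial stats for one group into its AVG."""
    total = sum(s for s, _ in stats)
    count = sum(c for _, c in stats)
    return (group, total / count)

# (grouping attribute, aggregation attribute) tuples on two hypothetical shards
shards = [[("a", 10), ("b", 20), ("a", 30)], [("b", 40), ("c", 50)]]

# Shuffle: route every Map output to the list for its key
shuffled = defaultdict(list)
for shard in shards:
    for group, stats in map_group_stats(shard):
        shuffled[group].append(stats)

group_avgs = dict(reduce_group_avg(g, s) for g, s in shuffled.items())
print(group_avgs)
```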

More MR Examples: Matrix Norm
❖ Input Split (part of ETL): Let matrix be sharded tile-oriented
❖ Assume it is an algebraic aggregate (L_{p,q} norm)
❖ Very similar to simple aggregate!
❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value
❖ Reduce(): Since there is only one dummy key across all shards, the Iterator has all sufficient stats, which can be unified to get the result
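As one concrete instance, the sketch below computes the Frobenius norm (the special case p = q = 2 of the L_{p,q} norm), chosen here because its only sufficient statistic is a sum of squares per tile. The tiles and the dummy key name are made up; a general L_{p,q} norm would need per-column partial sums rather than a single scalar per tile.

```python
import math

def map_tile(tile):
    """Map(): emit (dummy key, sum of squared entries) for one matrix tile."""
    yield ("ALL", sum(x * x for row in tile for x in row))

def reduce_norm(key, partial_sums):
    """Reduce(): add the per-tile sums of squares and take the square root."""
    return math.sqrt(sum(partial_sums))

# A 2x4 matrix stored as two hypothetical 2x2 tiles
tiles = [
    [[3, 0], [0, 0]],
    [[0, 4], [0, 0]],
]

partials = [value for tile in tiles for _, value in map_tile(tile)]
frobenius = reduce_norm("ALL", partials)
print(frobenius)
```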

Analogue: Parallel RDBMS UDA
❖ Recall how word count MapReduce can be done with SQL:

    SELECT Word, COUNT (*)
    FROM DocWords
    GROUP BY Word

Q: How can we compute other aggregates not native to SQL?
❖ User-Defined Aggregate Function (UDAF) abstraction in parallel RDBMSs can do MapReduce-like computations
❖ MapReduce seems more intuitive and succinct to many!

Analogue: Parallel RDBMS UDA
❖ 4 main functions in the UDAF API to work with BSP model
❖ Aggregation state: data structure computed (independently) by workers and unified by master
❖ Initialize(): Set up info./initialize RAM for agg. state; runs independently on each worker
❖ Transition(): Per-tuple function run by worker to update its agg. state; analogous to Map() in MapReduce
❖ Merge(): Function that combines agg. states from workers; run by master after workers are done; analogous to Reduce()
❖ Finalize(): Run once at end by master to return final result
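The four functions can be sketched in Python for an AVG aggregate. The class `AvgUDAF` and the helper `run_worker` are hypothetical names for illustration, not the UDAF API of any particular RDBMS, and the sequential loop stands in for workers running in parallel.

```python
from functools import reduce

class AvgUDAF:
    """Hypothetical UDAF for AVG; the aggregation state is a (sum, count) pair."""

    @staticmethod
    def initialize():
        """Initialize(): set up an empty aggregation state on each worker."""
        return (0.0, 0)

    @staticmethod
    def transition(state, value):
        """Transition(): per-tuple update of the worker's state."""
        total, count = state
        return (total + value, count + 1)

    @staticmethod
    def merge(state_a, state_b):
        """Merge(): combine two workers' states; run by the master."""
        return (state_a[0] + state_b[0], state_a[1] + state_b[1])

    @staticmethod
    def finalize(state):
        """Finalize(): turn the merged state into the final result."""
        total, count = state
        return total / count if count else None

def run_worker(udaf, shard):
    """One worker: initialize, then apply transition to each local tuple."""
    state = udaf.initialize()
    for value in shard:
        state = udaf.transition(state, value)
    return state

shards = [[10, 20], [30], [40, 60]]  # made-up data on three workers
worker_states = [run_worker(AvgUDAF, shard) for shard in shards]
merged = reduce(AvgUDAF.merge, worker_states)  # after the barrier
udaf_avg = AvgUDAF.finalize(merged)
print(udaf_avg)
```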
