DSC 102 Systems for Scalable Analytics Arun Kumar Topic 5: Dataflow Systems Chapter 2.2 of MLSys Book 1
Parallel RDBMSs ❖ Parallel RDBMSs are highly successful and widely used ❖ They offer massive scalability (shared-nothing parallelism) and high performance (parallel relational dataflows) along with many other enterprise-grade benefits of RDBMSs: ❖ Full power of SQL ❖ Business intelligence dashboards/APIs on top ❖ Transaction management and crash recovery ❖ Index structures, compressed file formats, auto-tuning, etc. Q: So, why did people need to go beyond parallel RDBMSs? 2
Beyond RDBMSs: A Brief History
❖ Relational model and RDBMSs are too restrictive:
1. “Flat” tables with few data/attribute types. Addressed by Object-Relational DBMSs: UDTs, UDFs, text, multimedia, etc.
2. Restricted language interface (SQL). Addressed by PL/SQL; recursive SQL; embedded SQL; QBE; visual interfaces
3. Need to know schema first! Addressed by the “schema-later” semi-structured XML data model; XQuery
4. Optimized for static datasets. Addressed by the stream data model; “standing” queries; time windows
But the DB community has addressed these issues already!
Ad: Take CSE 132B and CSE 135 to learn such extensions
Q: Again, so, why did people still need to go beyond parallel RDBMSs?! 4
Beyond RDBMSs: A Brief History The DB community got blindsided by the unstoppable rise of the Web/Internet giants! ❖ DB folks underappreciated 4 key concerns of Web folks: Developability Fault Tolerance Elasticity Cost/Politics! 5
DB/Enterprise vs. Web Dichotomy ❖ DB folks underappreciated 4 key concerns of Web folks: Developability : RDBMS extensibility mechanisms (UDTs, UDFs, etc.) are too painful for programmers to use! DB companies: we write the software and sell it to our customers, viz., enterprise companies (banks, retail, etc.) Web companies: we will hire an army of software engineers to build our own in-house software systems! Need simpler APIs and DBMSs that scale custom programs 6
DB/Enterprise vs. Web Dichotomy ❖ DB folks underappreciated 4 key concerns of Web folks: Fault Tolerance : What if we run on 100Ks of machines?! DB companies: our customers do not need more than a few dozen machines to store and analyze their data! Web companies: we need hundreds of thousands of machines for planetary-scale Web services! If a machine fails, user should not have to rerun entire query! DBMS should take care of fault tolerance, not user/appl. (Cloud-native RDBMSs now offer fault tolerance by design) 7
DB/Enterprise vs. Web Dichotomy ❖ DB folks underappreciated 4 key concerns of Web folks: Elasticity : Resources should adapt to “query” workload DB companies: our customers have “fairly predictably” sized datasets and workloads; can fix their clusters! Web companies: our workloads could vary widely and the datasets they need vary widely! Need to be able to upsize and downsize clusters easily on-the-fly, based on current query workload 8
DB/Enterprise vs. Web Dichotomy ❖ DB folks underappreciated 4 key concerns of Web folks: Cost/Politics : Commercial RDBMS licenses too costly! DB companies: our customers have $$$! ☺ Web companies: our products are mostly free (ads?); why pay so much $$$ if we can build our own DBMSs? Many started with MySQL (!) but then built their own DBMSs New tools were free & open source; led to viral adoption! 9
This new breed of parallel data systems called Dataflow Systems jolted the DB folks from being smug and complacent! 10
Outline ❖ Beyond RDBMSs: A Brief History ❖ MapReduce/Hadoop Craze ❖ Spark and Dataflow Programming ❖ More Scalable ML with MapReduce/Spark 11
The MapReduce/Hadoop Craze ❖ Blame Google! ❖ “Simple” problem : index, store, and search the Web! ☺ ❖ Who were their major systems hires? Jeff Dean and Sanjay Ghemawat (Systems, not DB or IR) ❖ Why did they not use RDBMSs? (Haha.) Developability, data model, fault tolerance, scale, cost, … Engineers started with MySQL; abandoned it! 12
What is MapReduce? MapReduce: Simplified Data Processing on Large Clusters. In OSDI 2004. ❖ Programming model for writing programs on sharded data + distributed system architecture for processing large data ❖ Map and Reduce are terms/ideas from functional PL ❖ Engineer only implements the logic of Map and Reduce ❖ System implementation handles orchestration of data distribution, parallelization, etc. under the covers Was radically easier for engineers to write programs with! 13
What is MapReduce?
❖ Standard example: count word occurrences in a doc corpus
❖ Input: A set of text documents (say, webpages)
❖ Output: A dictionary of unique words and their counts

    function map (String docname, String doctext) :
        for each word w in doctext :
            emit (w, 1)    // emit: part of MapReduce API

    function reduce (String word, Iterator partialCounts) :
        sum = 0
        for each pc in partialCounts :
            sum += pc
        emit (word, sum)

Hmmm, sounds suspiciously familiar … ☺
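The word count pseudocode can be tried out on a single machine. The following is a minimal sketch in Python, not how Hadoop or Google's system is implemented: `run_mapreduce` is a hypothetical in-memory driver that stands in for the system's shuffle step, and the two tiny documents are made up for illustration.

```python
from collections import defaultdict

def map_fn(docname, doctext):
    # Emit (word, 1) for every whitespace-separated word in the document.
    for w in doctext.split():
        yield (w, 1)

def reduce_fn(word, partial_counts):
    # Add up all partial counts that share the same word.
    yield (word, sum(partial_counts))

def run_mapreduce(docs, map_fn, reduce_fn):
    # Hypothetical single-process driver (assumption): a real system runs
    # Mappers and Reducers as separate processes on many machines.
    groups = defaultdict(list)
    for docname, doctext in docs.items():
        for key, value in map_fn(docname, doctext):
            groups[key].append(value)      # "shuffle": group values by key
    result = {}
    for key, values in groups.items():
        for out_key, out_value in reduce_fn(key, iter(values)):
            result[out_key] = out_value
    return result

# Made-up toy corpus.
docs = {"d1": "a rose is a rose", "d2": "a daisy is not a rose"}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

Only `map_fn` and `reduce_fn` correspond to what the engineer writes; everything inside `run_mapreduce` is what the system would handle under the covers.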
How MapReduce Works Parallel flow of control and data during MapReduce execution: Under the covers, each Mapper and Reducer is a separate process; Reducers face barrier synchronization (BSP) Fault tolerance achieved using data replication 15
Abstract Semantics of MapReduce ❖ Map(): Operates independently on one “record” at a time ❖ Can batch multiple data examples on to one record ❖ Dependencies across Mappers not allowed ❖ Can emit 0 or more key-value pairs as output ❖ Data types of inputs and outputs can be different! ❖ Reduce(): Gathers all Map output pairs across machines with same key into an Iterator (list) ❖ Aggregation function is applied on the Iterator and the final result is output ❖ Input Split: ❖ Physical-level split/shard of dataset that batches multiple examples to one file “block” (~128MB default on HDFS) ❖ Custom Input Splits can be written by appl. user 16
Benefits of MapReduce ❖ Goal: Higher level abstraction of functional operations (Map; Reduce) to simplify data-parallel programming at scale ❖ Key Benefits: ❖ Out-of-the-box scalability and cluster parallelism ❖ Fault tolerance offloaded to system impl., not appl./user ❖ Map() and Reduce() can be highly general; no restrictions on data types/structures processed; easier to use for ETL and text/multimedia-oriented analytics ❖ Free and OSS implementations available (Hadoop) ❖ New burden on users: Converting computations of data-intensive program/operation to Map() + Reduce() API ❖ But MapReduce libraries available in multiple PLs to mitigate coding pains: Java, C++, Python, R, Scala, etc. 17
Emulate MapReduce in SQL?
Q: How would you do the word counting in RDBMS / in SQL?
❖ First step: Transform text docs into relations and load (part of the Extract-Transform-Load (ETL) stage). Suppose we pre-divide each document into words and have the schema: DocWords (DocName, Word)
❖ Second step: a single, simple SQL query!

    SELECT Word, COUNT (*)
    FROM DocWords
    GROUP BY Word
    [ORDER BY Word]

Parallelism, scaling, etc. done by RDBMS under the covers
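Both steps can be tried end to end with Python's built-in `sqlite3` module. This is only a sketch: SQLite is a single-node embedded engine, so it shows the query's semantics and not the parallelism a parallel RDBMS would provide, and the toy documents are made up.

```python
import sqlite3

# Made-up toy corpus.
docs = {"d1": "a rose is a rose", "d2": "a daisy is not a rose"}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE DocWords (DocName TEXT, Word TEXT)")

# ETL step: pre-divide each document into words, one row per occurrence.
rows = [(name, w) for name, text in docs.items() for w in text.split()]
conn.executemany("INSERT INTO DocWords VALUES (?, ?)", rows)

# Query step: the single GROUP BY query from the slide.
sql_counts = conn.execute(
    "SELECT Word, COUNT(*) FROM DocWords GROUP BY Word ORDER BY Word"
).fetchall()
conn.close()
print(sql_counts)
```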
More MR Examples: Select ❖ Input Split (part of ETL): Let table be sharded tuple-wise ❖ Map(): On tuple, apply selection condition; emit pair with dummy key and entire tuple as value ❖ Reduce(): Not needed! No cross-shard aggregation here ❖ Such kinds of tasks/jobs are called “ Map-only ” jobs 19
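A Map-only selection can be sketched as follows. The shards, the `(name, age)` tuples, the predicate, and the `run_map_only` driver are all hypothetical and exist only to illustrate the idea on one machine.

```python
def select_map(row, predicate):
    # Emit (dummy key, tuple) only when the tuple passes the selection
    # condition; a non-matching tuple emits nothing.
    if predicate(row):
        yield (None, row)

def run_map_only(shards, predicate):
    # Hypothetical driver (assumption): each shard is processed on its own,
    # and no Reduce phase follows, so outputs stay per-shard.
    outputs = []
    for shard in shards:
        kept = [value for row in shard for _, value in select_map(row, predicate)]
        outputs.append(kept)
    return outputs

# Made-up table of (name, age) tuples, sharded tuple-wise into two shards.
shards = [
    [("alice", 34), ("bob", 19)],
    [("carol", 52), ("dave", 27)],
]

def is_over_30(row):
    return row[1] > 30

selected = run_map_only(shards, is_over_30)
print(selected)
```

Because no shard needs data from any other shard, the per-shard outputs can simply be written out as the job's result.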
More MR Examples: Simple Agg. ❖ Input Split (part of ETL): Let table be sharded tuple-wise ❖ Assume it is algebraic aggregate (SUM, AVG, MAX, etc.) ❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value ❖ Reduce(): Since only one dummy key across all shards, Iterator has all suff. stats that can be unified to get result 20
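As a sketch of the simple aggregate case, consider AVG, whose sufficient statistics are a running sum and a count. The shard contents and the key name `GLOBAL_KEY` are illustrative assumptions; here one Map call handles one whole shard, which the slides allow by batching multiple examples into one record.

```python
GLOBAL_KEY = "all"  # single dummy key shared by every Mapper (assumed name)

def avg_map(shard):
    # Compute the shard's sufficient statistics for AVG: (sum, count).
    values = [age for _, age in shard]
    yield (GLOBAL_KEY, (sum(values), len(values)))

def avg_reduce(key, stats_iter):
    # Unify the per-shard statistics into the global average.
    total, count = 0, 0
    for s, c in stats_iter:
        total += s
        count += c
    yield (key, total / count)

# Made-up table of (name, age) tuples, sharded tuple-wise.
shards = [
    [("alice", 34), ("bob", 19)],
    [("carol", 52), ("dave", 27)],
]

# All Map outputs share one key, so a single Reduce call sees all of them.
partials = [stats for shard in shards for _, stats in avg_map(shard)]
avg_age = next(avg_reduce(GLOBAL_KEY, iter(partials)))[1]
print(avg_age)
```

Averaging the per-shard averages directly would be wrong when shards differ in size, which is why the Mappers ship (sum, count) rather than a finished average.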
More MR Examples: GROUP BY Agg. ❖ Input Split (part of ETL): Let table be sharded tuple-wise ❖ Assume it is algebraic aggregate (SUM, AVG, MAX, etc.) ❖ Map(): On agg. attribute, compute incremental stats; emit pair with grouping attribute as key and stats as value ❖ Reduce(): Iterator has all suff. stats for a single group ; unify those to get result for that group; different reducers handle different groups 21
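The GROUP BY variant differs only in the key that Map emits. Below is a minimal single-machine sketch for a per-group AVG; the `(dept, salary)` tuples and the in-memory grouping step are made up for illustration.

```python
from collections import defaultdict

def groupby_map(shard):
    # Per shard, accumulate (sum, count) for each grouping-attribute value,
    # then emit one pair per group with the group as the key.
    partial = defaultdict(lambda: [0, 0])
    for dept, salary in shard:
        partial[dept][0] += salary
        partial[dept][1] += 1
    for dept, (s, c) in partial.items():
        yield (dept, (s, c))

def groupby_reduce(dept, stats_iter):
    # One Reduce call per group: unify that group's statistics.
    total, count = 0, 0
    for s, c in stats_iter:
        total += s
        count += c
    yield (dept, total / count)

# Made-up table of (dept, salary) tuples, sharded tuple-wise.
shards = [
    [("eng", 100), ("ops", 60), ("eng", 120)],
    [("ops", 80), ("eng", 110)],
]

# Stand-in for the shuffle (assumption): group Map outputs by key.
grouped = defaultdict(list)
for shard in shards:
    for dept, stats in groupby_map(shard):
        grouped[dept].append(stats)

group_avgs = {}
for dept, stats_list in grouped.items():
    for key, avg in groupby_reduce(dept, iter(stats_list)):
        group_avgs[key] = avg
print(group_avgs)
```

In a real deployment the calls to `groupby_reduce` for different groups could run on different Reducers.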
More MR Examples: Matrix Norm ❖ Input Split (part of ETL): Let matrix be sharded tile-oriented ❖ Assume it is algebraic aggregate (L_{p,q} norm) ❖ Very similar to simple aggregate! ❖ Map(): On agg. attribute, compute incremental stats; emit pair with single global dummy key and stats as value ❖ Reduce(): Since only one dummy key across all shards, Iterator has all suff. stats that can be unified to get result 22
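As an illustration, the sketch below handles only the Frobenius norm, i.e., the special case p = q = 2, where the sufficient statistic per tile is just the sum of squared entries. This choice is an assumption made to keep the example short; a general L_{p,q} norm would need per-column partial sums instead. The tiles are made up and stored as plain nested lists.

```python
import math

GLOBAL_KEY = "all"  # single dummy key shared by every Mapper (assumed name)

def norm_map(tile):
    # Sufficient statistic for the Frobenius norm: sum of squared entries.
    yield (GLOBAL_KEY, sum(x * x for row in tile for x in row))

def norm_reduce(key, partials):
    # Unify the per-tile sums, then take the square root once at the end.
    yield (key, math.sqrt(sum(partials)))

# Made-up 1x2 tiles of a larger matrix.
tiles = [
    [[3.0, 0.0]],
    [[0.0, 4.0]],
]

partials = [value for tile in tiles for _, value in norm_map(tile)]
frobenius = next(norm_reduce(GLOBAL_KEY, iter(partials)))[1]
print(frobenius)
```

The square root is deferred to Reduce because per-tile norms cannot simply be added, whereas per-tile sums of squares can.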
Analogue: Parallel RDBMS UDA
❖ Recall how word count MapReduce can be done with SQL:

    SELECT Word, COUNT (*)
    FROM DocWords
    GROUP BY Word

Q: How can we compute other aggregates not native to SQL?
❖ User-Defined Aggregate Function (UDAF) abstraction in parallel RDBMSs can do MapReduce-like computations
❖ MapReduce seems more intuitive and succinct to many!
Analogue: Parallel RDBMS UDA ❖ 4 main functions in the UDAF API to work with BSP model ❖ Aggregation state: data structure computed (independently) by workers and unified by master ❖ Initialize(): Set up info./initialize RAM for agg. state; runs independently on each worker ❖ Transition(): Per-tuple function run by worker to update its agg. state; analogous to Map() in MapReduce ❖ Merge(): Function that combines agg. states from workers; run by master after workers done; analogous to Reduce() ❖ Finalize(): Run once at end by master to return final result 24
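The four functions can be sketched for AVG as follows. The class `AvgUDAF`, its method signatures, and the `run_udaf` driver are hypothetical and only mirror the function names on the slide; they do not follow the UDAF API of any particular RDBMS.

```python
class AvgUDAF:
    """Hypothetical UDAF for AVG; aggregation state is a (sum, count) pair."""

    def initialize(self):
        # Runs independently on each worker to set up an empty state.
        return (0.0, 0)

    def transition(self, state, value):
        # Per-tuple update of a worker's state (analogous to Map()).
        s, c = state
        return (s + value, c + 1)

    def merge(self, state_a, state_b):
        # Combines two workers' states (analogous to Reduce()).
        return (state_a[0] + state_b[0], state_a[1] + state_b[1])

    def finalize(self, state):
        # Runs once at the end to turn the state into the final result.
        s, c = state
        return s / c if c else None

def run_udaf(udaf, shards):
    # Hypothetical driver (assumption): workers run one after another here,
    # and the "master" merges their states only after all have finished.
    worker_states = []
    for shard in shards:
        state = udaf.initialize()
        for value in shard:
            state = udaf.transition(state, value)
        worker_states.append(state)
    merged = udaf.initialize()
    for state in worker_states:
        merged = udaf.merge(merged, state)
    return udaf.finalize(merged)

# Made-up column of values split across two workers.
udaf_avg = run_udaf(AvgUDAF(), [[34, 19], [52, 27]])
print(udaf_avg)
```

Waiting for every worker state before merging plays the role of the BSP barrier.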