61A Lecture 36
Wednesday, November 28

MapReduce

MapReduce is a framework for batch processing of Big Data. What does that mean?
• Framework: A system used by programmers to build applications.
• Batch processing: All the data is available at the outset, and results aren't used until processing completes.
• Big Data: A buzzword used to describe data sets so large that they reveal facts about the world via statistical analysis.

The MapReduce idea:
• Data sets are too big to be analyzed by one machine.
• When using multiple machines, systems issues abound.
• Pure functions enable an abstraction barrier between data processing logic and distributed system administration.

(Demo)

Systems

Systems research enables the development of applications by defining and implementing abstractions:
• Operating systems provide a stable, consistent interface to unreliable, inconsistent hardware.
• Networks provide a simple, robust data transfer interface to constantly evolving communications infrastructure.
• Databases provide a declarative interface to software that stores and retrieves information efficiently.
• Distributed systems provide a single-entity-level interface to a cluster of multiple machines.

A unifying property of effective systems: hide complexity, but retain flexibility.

(Demo)

The Unix Operating System

Essential features of the Unix operating system (and variants):
• Portability: The same operating system on different hardware.
• Multi-Tasking: Many processes run concurrently on a machine.
• Plain Text: Data is stored and shared in text format.
• Modularity: Small tools are composed flexibly via pipes.

[Diagram: a process reads text from standard input and writes text to standard output and standard error.] The standard streams in a Unix-like operating system are conceptually similar to Python iterators.

Python Programs in a Unix Environment

The built-in input function reads a line from standard input. The built-in print function writes a line to standard output.

(Demo)

The values sys.stdin and sys.stdout also provide access to the Unix standard streams as "files." A Python "file" is an interface that supports iteration, read, and write methods. Using these "files" takes advantage of the operating system's standard stream abstraction.

(Demo)

MapReduce Evaluation Model

Map phase: Apply a mapper function to inputs, emitting a set of intermediate key-value pairs.
• The mapper takes an iterator over inputs, such as text lines.
• The mapper yields zero or more key-value pairs per input.

[Diagram: each input line, e.g. "Google MapReduce", "Is a Big Data framework", and "For batch processing", passes through the mapper, which emits vowel-count pairs such as o: 2 and e: 3.]

Reduce phase: For each intermediate key, apply a reducer function to accumulate all values associated with that key.
• The reducer takes an iterator over key-value pairs.
• All pairs with a given key are consecutive.
• The reducer yields zero or more values, each associated with that intermediate key.
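To make this evaluation model concrete, here is a minimal single-machine sketch in Python of the map, shuffle, and reduce phases for the vowel-counting example pictured above. The function names and the in-memory shuffle are illustrative only; the real framework distributes these phases across machines.

    from itertools import groupby

    def mapper(line):
        """Yield (vowel, count) pairs for each vowel that appears in line."""
        for vowel in 'aeiou':
            count = line.count(vowel)
            if count > 0:
                yield (vowel, count)

    def reducer(key, values):
        """Accumulate all values associated with one intermediate key."""
        yield (key, sum(values))

    lines = ['google mapreduce', 'is a big data framework', 'for batch processing']

    # Map phase: apply the mapper to each input independently.
    pairs = [pair for line in lines for pair in mapper(line)]

    # Shuffle: make all pairs with a given key consecutive, then group them.
    pairs.sort(key=lambda pair: pair[0])

    # Reduce phase: accumulate the values associated with each key.
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        for result in reducer(key, (value for _, value in group)):
            print(result)

Because the mapper and reducer are pure functions, the sort-then-group shuffle can be replaced by any partitioning across machines without changing the results.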

[Diagram: consecutive pairs for each key flow into a reducer, e.g. a: 4, a: 1, a: 1 → reducer → a: 6, and e: 1, e: 3, e: 1 → reducer → e: 5.]

Above-the-Line: Execution Model

The execution model is pictured in the original Google MapReduce slides:
http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0007.html

Below-the-Line: Parallel Execution

[Diagram: a map phase, a shuffle, and a reduce phase run across many machines. A "task" is a Unix process running on a machine.]
http://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0008.html

MapReduce Assumptions

Constraints on the mapper and reducer:
• The mapper must be equivalent to applying a pure function to each input independently.
• The reducer must be equivalent to applying a pure function to the sequence of values for a key.

Benefits of functional programming:
• When a program contains only pure functions, call expressions can be evaluated in any order, lazily, and in parallel.
• Referential transparency: a call expression can be replaced by its value (or vice versa) without changing the program.

In MapReduce, these functional programming ideas allow:
• Consistent results, however computation is partitioned.
• Re-computation and caching of results, as needed.

Python Example of a MapReduce Application

The mapper and reducer are both self-contained Python programs that read from standard input and write to standard output. The emit function outputs a key and value as a line of text to standard output.

Mapper (inputs are lines of text provided to standard input):

    #!/usr/bin/env python3
    # The line above tells Unix: this is a Python program.
    import sys
    from ucb import main
    from mapreduce import emit

    def emit_vowels(line):
        for vowel in 'aeiou':
            count = line.count(vowel)
            if count > 0:
                emit(vowel, count)

    for line in sys.stdin:
        emit_vowels(line)

Reducer (takes and returns iterators; input: lines of text representing key-value pairs, grouped by key; group_values_by_key returns an iterator over (key, value_iterator) pairs that give all values for each key):

    #!/usr/bin/env python3
    import sys
    from ucb import main
    from mapreduce import emit, group_values_by_key

    for key, value_iterator in group_values_by_key(sys.stdin):
        emit(key, sum(value_iterator))
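The group_values_by_key helper comes from the course's mapreduce module; for reference, here is one plausible sketch of what it could do. Two assumptions are mine, not the source's: that emit writes each pair as a tab-separated "key<TAB>value" line, and that the values are integer counts as in the vowel example.

    from itertools import groupby

    def group_values_by_key(lines):
        """Yield (key, value_iterator) pairs from consecutive "key\\tvalue" lines.

        Assumes each line has the form "key<TAB>value" and that lines with
        equal keys are consecutive, as the shuffle guarantees. Each
        value_iterator must be consumed before advancing to the next key.
        """
        pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
        for key, group in groupby(pairs, key=lambda pair: pair[0]):
            yield key, (int(value) for _, value in group)  # assumes integer values

On a single machine, the whole job can be approximated with Unix pipes, where sort plays the role of the shuffle by making equal keys consecutive, e.g. cat shakespeare.txt | ./mapper.py | sort | ./reducer.py (the file and script names here are hypothetical).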

What Does the MapReduce Framework Provide?

Fault tolerance: A machine or hard drive might crash.
• The MapReduce framework automatically re-runs failed tasks.

Speed: Some machine might be slow because it's overloaded.
• The framework can run multiple copies of a task and keep the result of the one that finishes first.

Network locality: Data transfer is expensive.
• The framework tries to schedule map tasks on the machines that hold the data to be processed.

Monitoring: Will my job finish before dinner?!?
• The framework provides a web-based interface describing jobs.

(Demo)
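The speed point above is often called speculative execution, and it is safe precisely because tasks are pure functions: running two copies cannot change the answer. Here is a toy Python sketch of the idea, not how the framework itself is implemented; slow_task and its timings are made up for illustration.

    import concurrent.futures
    import random
    import time

    def slow_task(data):
        """A pure task on a machine that may be arbitrarily slow."""
        time.sleep(random.uniform(0.1, 1.0))  # simulated load
        return sum(data)

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        # Run two copies of the same task.
        copies = [pool.submit(slow_task, range(100)) for _ in range(2)]
        # Keep the result of whichever copy finishes first.
        done, _ = concurrent.futures.wait(
            copies, return_when=concurrent.futures.FIRST_COMPLETED)
        result = done.pop().result()
        # Unlike a real framework, this toy still waits for the slower
        # copy when the executor shuts down; its result is simply ignored.

    print(result)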
