Incoop: MapReduce for Incremental Computations, by Bhatotia et al.
What is Incoop?
● A Hadoop-based framework
● Designed to run incremental computations efficiently
● Developed at the Max Planck Institute by Bhatotia et al.
Why Incoop?
Why run incremental computations on Incoop?
● Many applications are naturally incremental
○ e.g. machine learning, word count over a growing set of documents
● Easy to write: unmodified Hadoop programs serve as input
● Large speedups on incremental runs
How does Incoop differ from Hadoop?
● Incremental HDFS
● Incremental map and incremental reduce via a contraction phase
● Memoization-aware scheduler
HDFS recap
● Large, fixed-size chunks (64 MB)
● Append-only filesystem
● Sequential reads and writes
What’s bad about HDFS?
● Even small changes to the input data result in unstable partitioning!
● This makes it hard to reuse previous results
The problem with HDFS partitioning
[Diagram, 3 animation steps: input files → fixed-size HDFS chunks → mappers; a small change to the input shifts the chunk boundaries, so every mapper’s input differs]
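To make the instability concrete, here is a minimal sketch (a toy chunk size, not HDFS code) of fixed-offset chunking; a two-byte insertion at the front of the file shifts every boundary, so no chunk from the previous run can be reused:

```python
CHUNK_SIZE = 8  # HDFS uses 64 MB chunks; tiny here so the effect is visible

def fixed_chunks(data: bytes):
    # cut the file at fixed offsets, exactly like fixed-size HDFS splits
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

old = b"the quick brown fox jumps over the lazy dog"
new = b"a " + old  # prepend two bytes: every later boundary shifts

shared = set(fixed_chunks(old)) & set(fixed_chunks(new))
print(len(shared))  # 0 -- no chunk survives, so every mapper must rerun
```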
Incremental HDFS (Inc-HDFS)
● Splits input data based on its content rather than at fixed offsets (sketched after the diagram below)
● Variable-length chunks
● Chunking is done when the input is created
● Exposes the same API as HDFS
Solution with incremental HDFS
[Diagram, 3 animation steps: input files → content-based Inc-HDFS chunks → mappers; a small change only affects the chunk that contains it, so the other mappers’ inputs are unchanged]
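For contrast, a minimal sketch of content-based chunking (a standard technique; the exact marker scan Inc-HDFS uses may differ): a chunk boundary is placed wherever a small window of bytes hashes to a fixed pattern, so boundaries stick to the content and survive insertions elsewhere in the file:

```python
import zlib

WINDOW = 4           # bytes per boundary test; real systems use larger windows
BOUNDARY_MASK = 0x7  # expected chunk length around 8 bytes, tiny for the demo

def content_chunks(data: bytes):
    chunks, start = [], 0
    for i in range(WINDOW, len(data) + 1):
        if zlib.crc32(data[i - WINDOW:i]) & BOUNDARY_MASK == 0:
            chunks.append(data[start:i])  # window hash matched: cut a chunk here
            start = i
    if start < len(data):
        chunks.append(data[start:])       # trailing bytes form the last chunk
    return chunks

old = b"the quick brown fox jumps over the lazy dog " * 4
new = old[:30] + b"NEW" + old[30:]        # a 3-byte insertion mid-file

shared = set(content_chunks(old)) & set(content_chunks(new))
print(f"{len(shared)} of {len(content_chunks(new))} chunks reused")
# boundaries resynchronize just after the edit, so only the chunks
# touching the insertion change -- the rest can be reused
```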
How does Incoop differ from Hadoop?
● Incremental HDFS
● Incremental map/reduce and the contraction phase
● Memoization-aware scheduler
Incremental Map Phase
● Persistently stores map results between runs
● Registers each result in the memoization server, keyed by a hash of its input chunk
● Later runs fetch the results the memoization server points to instead of recomputing them (sketched after the next slide)
Incremental Map Phase
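A minimal sketch of the map-side memoization step (the names memo_server and run_map_task are illustrative, not Incoop's API):

```python
import hashlib

memo_server = {}  # stand-in for Incoop's memcached-based memoization server

def run_map_task(chunk: bytes, map_fn):
    key = hashlib.sha1(chunk).hexdigest()  # content hash of the input chunk
    if key in memo_server:
        return memo_server[key]   # unchanged chunk: reuse the stored result
    result = map_fn(chunk)        # new or changed chunk: actually run the map
    memo_server[key] = result     # persist for the next run
    # (Incoop stores a reference to the persisted output rather than
    # the output itself, but the lookup logic is the same)
    return result
```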
Incremental Reduce Phase
● More challenging than the map phase
● Coarse-grained memoization
○ Reducers copy map outputs only if the result has not already been computed
● Fine-grained memoization
○ Done via combiners (the contraction phase)
What are combiners?
● A step between the mappers and the reducers
● Traditionally used to reduce network traffic between mappers and reducers
● Used in Incoop to split reduce tasks so their pieces can be memoized individually (sketched after the next slide)
Incremental Reduce Phase
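A minimal sketch of the idea behind the contraction phase (the structure here is assumed for illustration; Incoop drives it with the job's combiner): reduce input is combined in small groups, each group's result is memoized, and the levels are merged until one value remains, so an edit to one group invalidates only its path up the tree:

```python
import hashlib

memo = {}

def memoized_combine(group, combine_fn):
    key = hashlib.sha1(repr(group).encode()).hexdigest()
    if key not in memo:
        memo[key] = combine_fn(group)  # recomputed only if the group changed
    return memo[key]

def contraction_reduce(values, combine_fn, fanout=2):
    # combine_fn must be associative (like a Hadoop combiner) so the
    # tree-shaped evaluation matches a flat reduce
    level = list(values)
    while len(level) > 1:
        level = [memoized_combine(level[i:i + fanout], combine_fn)
                 for i in range(0, len(level), fanout)]
    return level[0]

# word-count-style sum: changing one input recomputes only O(log n)
# tree nodes instead of the whole reduction
print(contraction_reduce([1, 2, 3, 4, 5, 6, 7, 8], sum))  # 36
```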
How does Incoop differ from Hadoop?
● Incremental HDFS
● Incremental map/reduce and the contraction phase
● Memoization-aware scheduler
Memoization-aware Scheduling
● Built on memcached
● Per-node work queues to exploit data locality and previously memoized results
● Work stealing keeps idle nodes busy (sketched below)
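A minimal sketch of that policy (names and data structures assumed, not Incoop's scheduler): each task is queued on the node that holds its memoized results, and an idle node steals from the busiest queue so locality does not create stragglers:

```python
from collections import deque

queues = {node: deque() for node in ["node-a", "node-b", "node-c"]}

def submit(task, memo_location):
    # prefer the node that already holds the task's memoized results
    queues[memo_location].append(task)

def next_task(node):
    if queues[node]:
        return queues[node].popleft()  # local work first: memoization hits
    # work stealing: take from the back of the most loaded queue
    victim = max(queues.values(), key=len)
    return victim.pop() if victim else None
```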
Results - incremental runs
Results - Scheduler
Results - Overheads
Criticisms
● No comparison against other frameworks
● How were the incremental changes (expressed as percentages of the input) generated?
● Garbage collection is rather naïve: workloads that alternate between two inputs on odd and even runs get no memoization benefit
● How realistic are the incremental results for real-world workloads with respect to Inc-HDFS?
Questions?