Parallel applications in the cloud
Diana Naranjo Pomalaya
Parallel and Distributed Computing (Computação Paralela e Distribuída)
Agenda ● Introduction ● MapReduce ● Solutions ○ HaLoop ○ iMapReduce ○ Pig
Global Data Center Traffic Source: Cisco Global Cloud Index, 2013–2018
Data-intensive applications
● Industry
○ Web-data analysis
○ Click-stream analysis
○ Network-monitoring log analysis
● Sciences
○ Massive-scale simulations
○ Sensor deployments data analysis
○ High-throughput lab equipment
MapReduce
[Dataflow diagram: Map → Local Sort → Combine → Shuffle → Merge/Combine → Reduce, running in parallel across map and reduce tasks]
MapReduce
● Advantages
○ Easy-to-use programming model (2 functions; see the sketch below)
○ Scalability
○ Fault-tolerance
○ Load balancing
○ Data locality-based optimization
● Limitations
○ Designed for batch-oriented computations (N-step dataflows)
○ Low-level abstraction (combined data sets, primitive operations)
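To make the two-function model concrete, here is a minimal, framework-free sketch of a word count written as user-supplied map and reduce functions with an in-memory shuffle. The function names and toy driver are illustrative only, not Hadoop's actual API.

```python
from collections import defaultdict

# User code: only these two functions, mirroring the MapReduce contract.
def map_fn(_key, line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all partial counts for one word.
    yield word, sum(counts)

def run_job(records):
    """Toy driver: map, shuffle/sort by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):   # map phase
            groups[k].append(v)           # shuffle: group by key
    output = {}
    for k in sorted(groups):              # local sort
        for out_k, out_v in reduce_fn(k, groups[k]):
            output[out_k] = out_v
    return output

print(run_job([(0, "to be or not to be")]))  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```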
Solutions
● HaLoop
○ Loop-aware scheduler
○ Caching mechanisms
● iMapReduce
○ Persistent tasks
○ Input data loaded once
○ Asynchronous execution
● Pig
○ High-level data manipulation
○ Hadoop execution
HaLoop ● Hadoop-based framework ● Supports iterative programs ● Loop-aware scheduler and caching mechanisms
Hadoop (architecture diagram)
● The Client submits jobs to the Master Node
● The Task Scheduler on the Master Node creates tasks and schedules them onto the Slave Nodes
● A Task Tracker on each Slave Node manages the tasks' execution
HaLoop (architecture diagram)
● Same Master/Slave structure as Hadoop: the Client submits jobs, the Task Scheduler creates and schedules tasks, and the Task Trackers on the Slave Nodes manage their execution
● Loop Control on the Master Node: initiates map-reduce steps until the termination condition is met
● Data locality on the Slave Nodes: achieved by means of caching and indexing
HaLoop - Loop control
● Goal: place map/reduce tasks that occur in different iterations but access the same data on the same physical machine
● How:
○ Keep track of the data partitions processed by each task on each physical machine
○ Assign new tasks to slave nodes that have already processed that data partition (see the scheduling sketch below)
○ If the node is full, re-assign the task to another node
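A minimal sketch of the inter-iteration, locality-aware assignment described above; the data structures and capacity check are assumptions for illustration, not HaLoop's actual scheduler code.

```python
# history[node] = set of partition ids that node processed in earlier iterations
# load[node]   = number of tasks currently assigned; `slots` is the per-node capacity
def schedule(partition, history, load, slots):
    # Prefer a node that already holds (has cached) this partition.
    for node, parts in history.items():
        if partition in parts and load[node] < slots:
            load[node] += 1
            return node
    # Otherwise fall back to the least-loaded node and record the new locality.
    node = min(load, key=load.get)
    load[node] += 1
    history.setdefault(node, set()).add(partition)
    return node

history = {"slave1": {0, 1}, "slave2": {2}}
load = {"slave1": 0, "slave2": 0}
print(schedule(0, history, load, slots=2))  # 'slave1': same partition as last iteration
```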
HaLoop - Caching and indexing
● Reducer input cache: useful for repeated joins against large invariant data (less time wasted in shuffling)
● Reducer output cache: reduces the cost of evaluating the fixpoint termination condition (see the sketch below)
● Mapper input cache: useful in k-means-like applications (input data does not vary)
● Cache reloading: if a node is full, copy all required data to the newly assigned node
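To illustrate how a reducer output cache cheapens the fixpoint test, the sketch below compares the current iteration's reducer output against the cached previous output instead of re-reading it from the distributed file system; the threshold and key names are hypothetical.

```python
def fixpoint_reached(current, cached_previous, epsilon=1e-6):
    """Terminate when no key's value changed by more than epsilon.

    `cached_previous` stands in for the reducer output cache: the previous
    iteration's results kept locally, so the comparison avoids an extra
    pass over data stored in the DFS.
    """
    if cached_previous is None:                 # first iteration: nothing to compare
        return False
    if current.keys() != cached_previous.keys():
        return False
    return all(abs(current[k] - cached_previous[k]) <= epsilon for k in current)

prev = {"rank:A": 0.38, "rank:B": 0.62}
curr = {"rank:A": 0.380000001, "rank:B": 0.619999999}
print(fixpoint_reached(curr, prev))  # True -> Loop Control stops the iteration
```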
iMapReduce ● Based on Hadoop ● Framework for iterative algorithms ● Concept of persistent tasks: input data is loaded into persistent tasks only once, which facilitates asynchronous execution of tasks within an iteration
iMapReduce - Restrictions
● Map and reduce operations use the same key (one-to-one mapping)
● Each iteration contains only one MapReduce job
● Suitable for graph-based iterative algorithms
iMapReduce - Persistent tasks
● Tasks stay alive during the whole iterative process (dormant while data is parsed/processed); see the sketch below
● Depends on available task slots (problem with load balancing: straggler/leader nodes)
[Diagram: conventional Hadoop runs a chain of jobs (Job 1, Job 2, ...), each reading its input from the DFS into Map, passing it to Reduce, and writing the result back to the DFS]
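A rough sketch of the persistent-task idea under stated assumptions: each task is started once, loads its static input once, then sleeps until the next iteration's state data arrives instead of being re-launched per job. The names and the queue-based wake-up are illustrative, not iMapReduce's API.

```python
import queue
import threading

def persistent_map_task(task_id, static_data, state_in, state_out):
    """Started once; loops over iterations instead of being re-created."""
    while True:
        state = state_in.get()            # dormant until the next iteration's state arrives
        if state is None:                 # termination signal from the master
            break
        # Toy "map": combine invariant static data with the variant state.
        result = {k: static_data.get(k, 0) + v for k, v in state.items()}
        state_out.put((task_id, result))

state_in, state_out = queue.Queue(), queue.Queue()
t = threading.Thread(target=persistent_map_task,
                     args=(0, {"a": 10}, state_in, state_out), daemon=True)
t.start()
state_in.put({"a": 1})                    # iteration 1
state_in.put({"a": 2})                    # iteration 2
state_in.put(None)                        # stop
t.join()
while not state_out.empty():
    print(state_out.get())                # (0, {'a': 11}) then (0, {'a': 12})
```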
iMapReduce - Data management
● Input data is split into static data (invariant) and state data (variant)
● State data is passed from reduce tasks to map tasks through socket connections
● Static data is partitioned with the same hash function used to shuffle the state data
● Map and reduce tasks (one-to-one due to the key restriction) are scheduled on the same worker (see the partitioning sketch below)
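The sketch below illustrates why hashing static and state data with the same function co-locates them: records with the same key land in the same partition, so the joined map/reduce pair can run on one worker without fetching static data remotely. The hash function, record contents, and worker count are placeholders.

```python
NUM_WORKERS = 4

def partition(key, num_partitions=NUM_WORKERS):
    # The same hash function is applied to both static and state records.
    return hash(key) % num_partitions

static_records = [("pageA", "outgoing links..."), ("pageB", "outgoing links...")]
state_records  = [("pageA", 0.25), ("pageB", 0.75)]

static_parts = {p: [] for p in range(NUM_WORKERS)}
state_parts  = {p: [] for p in range(NUM_WORKERS)}
for k, v in static_records:
    static_parts[partition(k)].append((k, v))
for k, v in state_records:
    state_parts[partition(k)].append((k, v))

# Each key's static record and state record end up in the same partition,
# so the persistent map task on that worker has both locally.
for key in ("pageA", "pageB"):
    p = partition(key)
    print(key, "-> worker", p, static_parts[p], state_parts[p])
```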
iMapReduce - Asynchronous execution
● A map task can start executing as soon as its own state data arrives
● No need to wait for the other map tasks
● Fault-tolerance problem: use a buffer to save the results of reduce tasks (roll back to the last iteration); see the sketch below
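A small sketch of the buffering idea for fault tolerance: keep the last completed iteration's reduce output per task so a failed task can restart from it rather than from scratch. The class and method names are invented for illustration.

```python
class IterationBuffer:
    """Keeps the reduce results of the last completed iteration per task."""

    def __init__(self):
        self._last = {}                      # task_id -> (iteration, state)

    def save(self, task_id, iteration, state):
        self._last[task_id] = (iteration, dict(state))

    def recover(self, task_id):
        # On failure, resume from the most recently buffered iteration.
        return self._last.get(task_id, (0, {}))

buf = IterationBuffer()
buf.save(task_id=3, iteration=7, state={"rank:A": 0.4})
print(buf.recover(3))   # (7, {'rank:A': 0.4}) -> restart task 3 at iteration 7
print(buf.recover(5))   # (0, {}) -> no buffered state, restart from the beginning
```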
Pig ● Provides constructs that allow high-level data manipulation ● Allows the use of user-provided executables ● Compiles dataflow programs (Pig Latin) into sets of MapReduce jobs and coordinates their execution (on Hadoop)
Pig - Compilation and execution stages
● Parser: type checking, schema inference; produces the Logical Plan (DAG)
● Logical optimizer: e.g. filter pushdown (minimizes the amount of data scanned and processed); see the sketch below
● MapReduce compiler: maps logical operations onto physical map, combine and reduce steps; produces the Physical Plan (DAG)
● MapReduce optimizer: exploits distributive/algebraic operations
● Hadoop Job Manager: monitors job execution
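To make the logical-optimizer step concrete, here is a toy illustration of filter pushdown on a plan represented as an ordered list of operators. The plan representation is invented for illustration and is not Pig's internal plan structure; a real optimizer would also check that the filter only references columns available at the earlier position.

```python
# A "plan" is just an ordered list of (operator, argument) pairs.
plan = [
    ("LOAD", "clicks"),
    ("GROUP", "by user"),
    ("FILTER", "country == 'BR'"),   # filter arrives after the expensive GROUP
    ("STORE", "out"),
]

def push_down_filters(plan):
    """Move each FILTER as early as possible so later stages see less data."""
    filters = [op for op in plan if op[0] == "FILTER"]
    rest = [op for op in plan if op[0] != "FILTER"]
    optimized = []
    for op in rest:
        optimized.append(op)
        if op[0] == "LOAD":          # naive rule: place filters right after the load
            optimized.extend(filters)
    return optimized

print(push_down_filters(plan))
# [('LOAD', 'clicks'), ('FILTER', "country == 'BR'"), ('GROUP', 'by user'), ('STORE', 'out')]
```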
Pig - Memory management
● Pig is implemented in Java
● Memory overflow situations occur when large bags of tuples are materialized between and inside operators
● Solution: keep a list of bags ordered by estimated size (descending) and spill bags when a memory threshold is reached; see the sketch below
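A minimal sketch of that spill policy under stated assumptions: registered bags are tracked with an estimated size, and the largest ones are spilled first once memory use crosses a threshold. The sizes, threshold, and spill action are placeholders, not Pig's actual memory manager.

```python
class SpillManager:
    def __init__(self, threshold_bytes):
        self.threshold = threshold_bytes
        self.bags = []                        # (estimated_size, bag) pairs

    def register(self, bag, estimated_size):
        self.bags.append((estimated_size, bag))

    def maybe_spill(self, current_usage):
        """Spill the largest bags first until usage falls back under the threshold."""
        if current_usage <= self.threshold:
            return current_usage
        self.bags.sort(key=lambda item: item[0], reverse=True)  # descending by size
        while self.bags and current_usage > self.threshold:
            size, bag = self.bags.pop(0)
            bag.clear()                       # stand-in for writing the bag to disk
            current_usage -= size
        return current_usage

mgr = SpillManager(threshold_bytes=100)
big, small = list(range(90)), list(range(10))
mgr.register(big, 90)
mgr.register(small, 10)
print(mgr.maybe_spill(current_usage=120))     # spills the 90-byte bag first -> 30
```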
Pig - Streaming
● User-defined functions are written in Java and are synchronous
● Streaming executables allow other languages to be used (scripts or compiled binaries)
● Streaming executables are asynchronous (input and output queues); see the sketch below
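A minimal sketch of the asynchronous-queue idea behind streaming through an external executable: one thread feeds the process's stdin while another drains its stdout, so producer and consumer are decoupled. Piping tuples through the Unix `tr` command is purely illustrative and assumes a Unix-like environment.

```python
import queue
import subprocess
import threading

def stream_through(cmd, records):
    """Push records through an external executable asynchronously."""
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True)
    out_q = queue.Queue()

    def feed():                               # producer: writes input without waiting for output
        for rec in records:
            proc.stdin.write(rec + "\n")
        proc.stdin.close()

    def drain():                              # consumer: reads output as it becomes available
        for line in proc.stdout:
            out_q.put(line.rstrip("\n"))
        out_q.put(None)                       # end-of-stream marker

    threading.Thread(target=feed, daemon=True).start()
    threading.Thread(target=drain, daemon=True).start()
    while (item := out_q.get()) is not None:
        yield item

# Example: uppercase each record with an external program.
print(list(stream_through(["tr", "a-z", "A-Z"], ["pig", "latin"])))  # ['PIG', 'LATIN']
```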
Pig - Performance [benchmark chart as of 17 December 2010; as of 6 June 2015, release 0.15.0 is available] Source: India Hadoop Summit, February 2011
PigPen ● A MapReduce language for Clojure that looks and behaves like clojure.core ● Supports unit tests and iterative development ● Used at Netflix
References
● Cisco and/or its affiliates. Cisco Global Cloud Index: Forecast and Methodology, 2013–2018. White paper, 2014.
● Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst. HaLoop: Efficient iterative data processing on large clusters.
● Yanfeng Zhang, Qixin Gao, Lixin Gao, and Cuirong Wang. iMapReduce: A distributed computing framework for iterative computation. In Proceedings of the 1st International Workshop on Data Intensive Computing in the Clouds (DataCloud), 2011.
● Thilina Gunarathne, Bingjing Zhang, Tak-Lon Wu, and Judy Qiu. Portable parallel programming on cloud and HPC: Scientific applications of Twister4Azure. 2011.
● Jaliya Ekanayake, Xiaohong Qiu, Thilina Gunarathne, Scott Beason, and Geoffrey Fox. High performance parallel computing with cloud and cloud technologies.
Questions?