Azure MapReduce Thilina Gunarathne Salsa group, Indiana University

Agenda • Recap of Azure Cloud Services • Recap of MapReduce • Azure MapReduce Architecture • Application development using AzureMR • Pairwise distance alignment implementation • Next steps

Cloud Computing • On-demand computational services over the web – Backed by massive commercial infrastructures giving economies of scale – Fits the spiky compute needs of scientists • Horizontal scaling with no additional cost – Increased throughput • Cloud infrastructure services – Storage, messaging, tabular storage – Cloud-oriented service guarantees – Virtually unlimited scalability • Future seems to be CLOUDY!!!

Azure Platform • Windows Azure Compute – .NET platform as a service – Worker roles & web roles • Azure Storage – Blobs – Queues – Tables • Development SDK, fabric and storage

MapReduce • Automatic parallelization & distribution • Fault-tolerant • Provides status and monitoring tools • Clean abstraction for programmers – map (in_key, in_value) -> (out_key, intermediate_value) list – reduce (out_key, intermediate_value list) -> out_value list
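The map and reduce signatures above can be illustrated with a minimal word-count sketch. This is plain single-process Python, not AzureMR code; the names `map_fn`, `reduce_fn` and `run_mapreduce` are hypothetical and exist only for this example.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map (in_key, in_value) -> (out_key, intermediate_value) list."""
    return [(word, 1) for word in in_value.split()]


def reduce_fn(out_key, intermediate_values):
    """reduce (out_key, intermediate_value list) -> out_value list."""
    return [sum(intermediate_values)]


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical sequential driver: map, group by key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}


result = run_mapreduce([("doc1", "a b a"), ("doc2", "b c")], map_fn, reduce_fn)
print(result)
```

A real framework runs the map calls and the reduce calls in parallel on different workers; the grouping step in the middle is what the framework provides for the programmer.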

Motivation • Currently no parallel programming framework on Azure – No MPI, No Dryad • Well known, easy to use programming model • Cloud nodes are not as reliable as conventional cluster nodes

Azure MapReduce Concepts • Take advantage of the cloud services – Distributed services, Unlimited scalability – Backed by industrial strength data centers and technologies • Decentralized control – Dynamically scale up/down • Eventual consistency • Large latencies – Coarser grained map tasks • Global queue based scheduling

1. Client driver loads the map & reduce tasks to the queues

2. Map workers retrieve map tasks from the queue

3. Map workers download data from the Blob storage and start processing

4. Reduce workers pick the tasks from the queue and start monitoring the reduce task tables

5. Finished map tasks upload the results to Blob storage and add entries to the respective reduce task tables

6. Reduce tasks download the intermediate data products

7. Reduce tasks start reducing when all the map tasks are finished and the reduce task has finished downloading the intermediate data products
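The seven steps above can be walked through with a small in-memory simulation. This is only a sketch under stated assumptions: a `deque` stands in for each Azure queue, a dictionary for Blob storage, and another dictionary for the reduce task tables. Everything runs sequentially in one process, keys are assumed to be strings, and `run_job` and the CRC-based partitioning are hypothetical choices, not the AzureMR implementation.

```python
import zlib
from collections import defaultdict, deque


def run_job(inputs, map_fn, reduce_fn, num_reduce_tasks):
    blob = dict(inputs)                            # stand-in for Blob storage
    map_queue = deque(inputs)                      # step 1: one map task per input blob
    reduce_queue = deque(range(num_reduce_tasks))  # step 1: reduce tasks are queued too
    reduce_tables = defaultdict(list)              # reduce task id -> intermediate blob names
    finished_maps = 0

    while map_queue:
        name = map_queue.popleft()                 # step 2: a map worker takes a task
        partitions = defaultdict(list)
        for key, value in map_fn(name, blob[name]):  # step 3: read input and process
            r = zlib.crc32(key.encode("utf-8")) % num_reduce_tasks
            partitions[r].append((key, value))
        for r, pairs in partitions.items():        # step 5: upload results, update tables
            out_name = f"intermediate/{name}/{r}"
            blob[out_name] = pairs
            reduce_tables[r].append(out_name)
        finished_maps += 1

    results = {}
    while reduce_queue:
        r = reduce_queue.popleft()                 # step 4: a reduce worker takes a task
        if finished_maps != len(inputs):           # step 7: wait for every map task
            raise RuntimeError("map phase not finished")
        grouped = defaultdict(list)
        for out_name in reduce_tables[r]:          # step 6: download intermediate data
            for key, value in blob[out_name]:
                grouped[key].append(value)
        for key, values in grouped.items():        # step 7: reduce
            results[key] = reduce_fn(key, values)
    return results


def count_words(name, text):
    return [(word, 1) for word in text.split()]


def total(key, values):
    return sum(values)


counts = run_job({"a.txt": "x y x", "b.txt": "y z"}, count_words, total, 2)
print(counts)
```

In the real system the map and reduce workers run concurrently, so reduce workers poll their tables while map tasks are still running; the simulation collapses that into two sequential loops.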

Azure MapReduce Architecture • Client API and driver • Map tasks • Reduce tasks • Intermediate data transfer • Monitoring • Configurations

Fault tolerance • Use the visibility timeout of the queues – Currently maximum is 3 hours – Delete the message from the queue only after everything is successful – Execution, upload, update status • Tasks will rerun when timeout happens – Ensures eventual completion – Intermediate data are persisted in blob storage – Retry up to 3 times • Many retries in service invocations
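The visibility-timeout mechanism described above can be sketched with a toy in-memory queue. This is an illustration only: `VisibilityQueue` and `worker_step` are hypothetical names, time is passed in explicitly instead of read from a clock, and "retry up to 3 times" is interpreted here as at most three attempts in total, which is an assumption.

```python
import itertools

MAX_ATTEMPTS = 3  # assumption: three attempts in total per task


class VisibilityQueue:
    """Toy stand-in for a cloud queue with visibility timeouts."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, visible_at, dequeue_count]
        self._ids = itertools.count()

    def put(self, body):
        self._messages[next(self._ids)] = [body, 0, 0]

    def get(self, now):
        """Return a visible message and hide it until now + timeout."""
        for msg_id, msg in self._messages.items():
            if msg[1] <= now:
                msg[1] = now + self.visibility_timeout
                msg[2] += 1
                return msg_id, msg[0], msg[2]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)

    def __len__(self):
        return len(self._messages)


def worker_step(queue, now, run_task):
    got = queue.get(now)
    if got is None:
        return "idle"
    msg_id, body, attempt = got
    if attempt > MAX_ATTEMPTS:
        queue.delete(msg_id)      # give up after too many attempts
        return "abandoned"
    try:
        run_task(body)            # execution, upload, status update
    except Exception:
        return "failed"           # message stays queued and reappears after the timeout
    queue.delete(msg_id)          # delete only after everything succeeded
    return "done"


calls = []


def flaky_task(body):
    calls.append(body)
    if len(calls) == 1:
        raise RuntimeError("simulated worker crash")


q = VisibilityQueue(visibility_timeout=10)
q.put("map-task-0")
first = worker_step(q, now=0, run_task=flaky_task)    # fails, message stays hidden
hidden = worker_step(q, now=5, run_task=flaky_task)   # nothing visible yet
second = worker_step(q, now=10, run_task=flaky_task)  # message reappeared, rerun succeeds
print(first, hidden, second, len(q))
```

Because the message is deleted only after the task has fully completed, a worker that dies mid-task needs no cleanup: the message simply becomes visible again and another worker picks it up.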

| | Apache Hadoop [24] / (Google MR) | Microsoft Dryad [25] | Twister [19] | Azure MapReduce/Twister |
|---|---|---|---|---|
| Programming Model | MapReduce | DAG execution, extensible to MapReduce and other patterns | Iterative MapReduce | MapReduce -- will extend to Iterative MapReduce |
| Data Handling | HDFS (Hadoop Distributed File System) | Shared directories & local disks | Local disks and data management tools | Azure Blob Storage |
| Scheduling | Data locality; rack aware, dynamic task scheduling through global queue | Data locality; network topology based run time graph optimizations; static task partitions | Data locality; static task partitions | Dynamic task scheduling through global queue |
| Failure Handling | Re-execution of failed tasks; duplicate execution of slow tasks | Re-execution of failed tasks; duplicate execution of slow tasks | Re-execution of iterations | Re-execution of failed tasks; duplicate execution of slow tasks |
| Environment | Linux clusters, Amazon Elastic MapReduce on EC2 | Windows HPCS cluster | Linux cluster, EC2 | Windows Azure Compute, Windows Azure Local Development Fabric |
| Intermediate data transfer | File, HTTP | File, TCP pipes, shared-memory FIFOs | Publish/subscribe messaging | Files, TCP |

Why Azure Services • No need to install software stacks – In fact you can’t – E.g.: NaradaBrokering, HDFS, database • Virtually unlimited scalable distributed services • Zero maintenance – Let the platform take care of you – No single point of failure • Availability guarantees • Ease of development

API • ProcessMapRed(jobid, container, params, numReduceTasks, storageAccount, mapQName, reduceQName, List mapTasks) • Map(key, value, programArgs, Dictionary outputCollector) • Reduce(key, List values, programArgs, Dictionary outputCollector)

Develop applications using Azure MapReduce • Local debugging using Azure development fabric • DistributedCache capability – Bundle with Azure Package • Compile in release mode before creating the package • Deploy using Azure web interface • Errors logged to an Azure Table

SWG Pairwise Distance Alignment • Smith-Waterman-Gotoh (SWG) • Pairwise sequence alignment – Align each sequence with all the other sequences

Application architecture • Block decomposition

| | 1 (1-100) | 2 (101-200) | 3 (201-300) | 4 (301-400) | |
|---|---|---|---|---|---|
| 1 (1-100) | M1 | M2 | from M6 | M3 | Reduce 1 |
| 2 (101-200) | from M2 | M4 | M5 | from M9 | Reduce 2 |
| 3 (201-300) | M6 | from M5 | M7 | M8 | Reduce 3 |
| 4 (301-400) | from M3 | M9 | from M8 | M10 | Reduce 4 |
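The block decomposition above computes only part of the symmetric distance matrix and fills the rest from the transposed block ("from Mx"). The selection rule in the sketch below is inferred from the 4 x 4 grid on the slide and is not stated in the source: diagonal blocks are always computed, a block above the diagonal is computed when row + column is odd, and a block below the diagonal when row + column is even. The function names are hypothetical.

```python
def is_computed(row, col):
    """True if block (row, col), 1-indexed, gets its own map task (inferred rule)."""
    if row == col:
        return True
    if row < col:
        return (row + col) % 2 == 1
    return (row + col) % 2 == 0


def map_task_ids(num_blocks):
    """Number the computed blocks M1, M2, ... in row-major order."""
    ids = {}
    for row in range(1, num_blocks + 1):
        for col in range(1, num_blocks + 1):
            if is_computed(row, col):
                ids[(row, col)] = len(ids) + 1
    return ids


def describe(row, col, ids):
    """Label a block as on the slide: 'M<k>' if computed, else 'from M<k>'."""
    if (row, col) in ids:
        return f"M{ids[(row, col)]}"
    return f"from M{ids[(col, row)]}"   # reuse the transposed block


ids = map_task_ids(4)
grid = [[describe(r, c, ids) for c in range(1, 5)] for r in range(1, 5)]
for line in grid:
    print(line)
```

With this rule every row holds a similar number of computed blocks, so the map work is spread evenly while only about half of the off-diagonal blocks are actually aligned.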

AzureMR SWG Performance, 10k Sequences • Figure: execution time (s) versus number of Azure Small instances (0 to 160)

AzureMR SWG Performance, 10k Sequences • Figure: time per alignment per instance (ms) versus number of Azure Small instances (0 to 160)

AzureMR SWG Performance on Different Instance Types • Figure: execution time (s) by instance type (Small, Medium, Large, ExtraLarge)

AzureMR SWG Performance on Different Data Sizes • Figure: time per alignment per core (ms), i.e. time for an actual alignment, versus number of sequences (4000 to 10000)

Next Steps • In the works – Monitoring web interface – Alternative intermediate data communication mechanisms – Public release • Future plans – AzureTwister • Iterative MapReduce

Thanks!! • Questions? 

References • J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008. • J. Ekanayake, H. Li, B. Zhang et al., “Twister: A Runtime for iterative MapReduce,” in Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, June 20-25, 2010, Chicago, Illinois, 2010. • Cloudmapreduce, http://sites.google.com/site/huanliu/cloudmapreduce.pdf • “Apache Hadoop,” http://hadoop.apache.org/ • M. Isard, M. Budiu, Y. Yu et al., “Dryad: Distributed data-parallel programs from sequential building blocks,” pp. 59-72.

Acknowledgments • Prof. Geoffrey Fox, Dr. Judy Qiu and the Salsa group • Dr. Ying Chen and Alex De Luca from IBM Almaden Research Center • Virtual School Organizers
