Overview of Cloud Computing Platforms
Big Data for Science Workshop, July 28, 2010
Judy Qiu, xqiu@indiana.edu, http://salsahpc.indiana.edu
Pervasive Technology Institute, School of Informatics and Computing, Indiana University
SALSA
Important Trends
• Data Deluge: in all fields of science and throughout life (e.g. the web!); impacts preservation, access/use, and the programming model
• Cloud Technologies: a new commercially supported data center model building on compute grids
• eScience: a spectrum of eScience or eResearch applications (biology, chemistry, physics, social science, humanities, ...); data analysis; machine learning
• Multicore/Parallel Computing: implies parallel computing is important again; performance comes from extra cores, not extra clock speed
Challenges for CS Research
"Science faces a data deluge. How to manage and analyze information? Recommend CSTB foster tools for data capture, data curation, data analysis."
(Jim Gray's talk to the Computer Science and Telecommunications Board (CSTB), Jan 11, 2007)
There are several challenges to realizing this vision of data-intensive systems and building generic tools (workflow, databases, algorithms, visualization):
• Cluster/cloud management software
• Distributed execution engines, e.g. MapReduce, Twister, ...
• Security and privacy
• Language constructs
• Parallel compilers
• Program development tools
. . .
Data We're Looking At
• Public health data (IU Medical School & IUPUI Polis Center): 65,535 patient/GIS records, 54 dimensions each
• Biology DNA sequence alignments (IU Medical School & CGB): several million sequences, at least 300 to 400 base pairs each
• NIH PubChem (cheminformatics): 60 million chemical compounds, 166 fingerprints each
• Particle physics LHC (Caltech): 1 terabyte of data placed in the IU Data Capacitor
High volume and high dimension require new, efficient computing approaches!
Data Explosion and Challenges
• Data is too big, and getting bigger, to fit into memory. For the "all pairs" problem, which is O(N^2), 100,000 PubChem data points require 480 GB of main memory (the Tempest cluster of 768 cores has 1.536 TB). We need distributed memory and new algorithms to solve the problem.
• Communication overhead is large, as the main operations include matrix multiplication (O(N^2)); moving data between nodes and within one node adds extra overhead. We use a hybrid of MPI and MapReduce between nodes, with concurrent threading internal to each node on multicore clusters.
• Concurrent threading has side effects (for shared-memory models such as CCR and OpenMP) that impact performance:
  – choose sub-block sizes so data fits into cache
  – pad cache lines to avoid false sharing
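The 480 GB figure above can be reproduced with back-of-envelope arithmetic. This is a sketch, not the deck's actual calculation: the assumption that roughly six dense N x N double-precision matrices are held at once (as an MDS-style all-pairs algorithm might) is mine, chosen because it matches the quoted number.

```python
# Back-of-envelope memory estimate for an O(N^2) "all pairs" computation.
# ASSUMPTION: six dense N x N double-precision matrices are resident at
# once (illustrative, MDS-style); this reproduces the 480 GB quoted for
# 100,000 PubChem points.

def all_pairs_memory_gb(n_points, n_matrices=6, bytes_per_entry=8):
    """Memory in decimal gigabytes for n_matrices dense N x N matrices."""
    return n_matrices * n_points ** 2 * bytes_per_entry / 1e9

print(all_pairs_memory_gb(100_000))  # -> 480.0
```

Even a single N x N double-precision matrix at N = 100,000 is 80 GB, far beyond one node's memory, which is why the slide turns to distributed memory.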
Gartner 2009 Hype Curve (Source: Gartner, August 2009). Where does HPC sit on the curve?
Clouds Hide Complexity
Cyberinfrastructure is "Research as a Service":
• SaaS (Software as a Service): e.g. clustering as a service
• PaaS (Platform as a Service): IaaS plus core software capabilities on which you build SaaS (e.g. Azure is a PaaS; MapReduce is a platform)
• IaaS, also called HaaS (Infrastructure as a Service): get computer time with a credit card and a Web interface, as with EC2
Cloud Computing: Infrastructure and Runtimes
• Cloud infrastructure: outsourcing of servers, computing, data, file space, utility computing, etc.
  – Handled through (Web) services that control virtual machine lifecycles.
• Cloud runtimes or platforms: tools (for using clouds) to do data-parallel (and other) computations.
  – Apache Hadoop, Google MapReduce, Microsoft Dryad, Bigtable, Chubby (synchronization), and others
  – MapReduce was designed for information retrieval but is excellent for a wide range of science data analysis applications
  – Can also do much traditional parallel computing for data mining if extended to support iterative operations
  – MapReduce is not usually run on virtual machines
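The point about iterative operations can be made concrete with k-means, a classic data-mining algorithm: each iteration is a full map + reduce pass, and the reduce output (new centroids) feeds the next iteration's map, which is exactly what runtimes like Twister optimize. A minimal single-process sketch, with made-up data; none of this is any runtime's real API:

```python
# Hypothetical sketch of k-means in MapReduce style. Each iteration:
# map assigns points to their nearest centroid, reduce averages them.

def kmeans_map(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    best = min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))
    return best, (point, 1)

def kmeans_reduce(pairs):
    """Reduce: average the points assigned to each centroid."""
    sums = {}
    for key, (point, count) in pairs:
        acc, n = sums.get(key, ([0.0] * len(point), 0))
        sums[key] = ([a + p for a, p in zip(acc, point)], n + count)
    return {k: [a / n for a in acc] for k, (acc, n) in sums.items()}

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.5), (10.0, 10.5)]
for _ in range(3):  # each loop iteration is one map + reduce pass
    pairs = [kmeans_map(p, centroids) for p in points]
    new = kmeans_reduce(pairs)
    centroids = [tuple(new[i]) for i in sorted(new)]
print(centroids)  # -> [(0.0, 0.5), (10.0, 10.5)]
```

Plain MapReduce re-reads input and restarts tasks on every pass; iterative extensions keep static data and long-running tasks cached across passes.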
Key Features of Cloud Platforms
• Authentication and Authorization: provide single sign-on to both FutureGrid and Commercial Clouds linked by workflow
• Workflow: support workflows that link job components between FutureGrid and Grid/Cloud Commercial Clouds; Trident from Microsoft Research is the initial candidate
• Data Transport: transport data between job components on FutureGrid and Commercial Clouds, respecting custom storage patterns
• Software as a Service: this concept is shared between clouds and grids and can be supported without special attention
• SQL: relational database
• Program Library: store images and other program material (basic FutureGrid facility)
• Blob: basic storage concept similar to Azure Blob or Amazon S3
• DPFS (Data Parallel File System): support for file systems such as Google File System (MapReduce), HDFS (Hadoop), or Cosmos (Dryad), with compute-data affinity optimized for data processing
• Table: support for table data structures modeled on Apache HBase (Google Bigtable) or Amazon SimpleDB / Azure Table (e.g. a scalable distributed "Excel")
• Cloud Queues: publish-subscribe based queuing system
• MapReduce: support for the MapReduce programming model, including Hadoop on Linux and Dryad
• Worker Role: this concept is implicitly used in both Amazon and TeraGrid but was first introduced as a high-level construct by Azure
• Web Role: used in Azure to describe the important link to the user; can be supported in FutureGrid with a portal framework
MapReduce "File/Data Repository" Parallelism
• Map = (data parallel) computation reading and writing data
• Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram
• MPI and Iterative MapReduce add communication between phases
(Diagram: instruments and users feed data on disks through parallel Map stages into Reduce stages, with portals delivering results.)
MapReduce: A Parallel Runtime Coming from Information Retrieval
• Map(Key, Value) operates on data partitions
• Reduce(Key, List<Value>) produces the reduce outputs
• A hash function maps the results of the map tasks to r reduce tasks
• Implementations support:
  – Splitting of data
  – Passing the output of map functions to reduce functions
  – Sorting the inputs to the reduce function based on the intermediate keys
  – Quality of service
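The hash-based routing of intermediate pairs to r reduce tasks can be sketched as follows. The function names and the fruit-count data are illustrative, not from any particular runtime:

```python
# Illustrative sketch of how a MapReduce runtime routes intermediate
# <key, value> pairs from map tasks to r reduce tasks with a hash function.

def partition(key, r):
    """Assign an intermediate key to one of r reduce tasks."""
    return hash(key) % r

def shuffle(map_outputs, r):
    """Group intermediate pairs into per-reducer buckets, sorted by key."""
    buckets = [{} for _ in range(r)]
    for key, value in map_outputs:
        buckets[partition(key, r)].setdefault(key, []).append(value)
    # each reducer sees its keys in sorted order, as the slide describes
    return [sorted(b.items()) for b in buckets]

pairs = [("apple", 1), ("orange", 1), ("apple", 1)]
for reducer_id, items in enumerate(shuffle(pairs, r=2)):
    print(reducer_id, items)
```

Because the same key always hashes to the same reducer, all values for one key arrive at a single reduce task, which is what makes Reduce(Key, List<Value>) well defined.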
Sam's Problem
• Sam thought of "drinking" the apple
• He used a knife to cut the apple and a blender to make juice
(The tools and fruit appear as pictures on the original slide.)
Creative Sam
• Implemented a parallel version of his innovation: the idea of MapReduce in data-intensive computing
• Each input to a map is a list of <key, value> pairs: (<a, >, <o, >, <p, >, ...)
• Each output of a map is a list of <key, value> pairs: (<a', >, <o', >, <p', >), grouped by key
• A list of <key, value> pairs is thus mapped into another list of <key, value> pairs, which gets grouped by key and reduced into a list of values
• Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <ao, (...)>, reduced into a list of values
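The <key, value> flow above can be sketched end to end. This is a generic word-count-style example standing in for Sam's fruit pipeline; the map and reduce bodies are illustrative:

```python
# Minimal end-to-end MapReduce flow: map -> group by key -> reduce.
from collections import defaultdict

def map_fn(record):
    """Map one input record into a list of <key, value> pairs."""
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    """Reduce a <key, value-list> into a single value."""
    return key, sum(values)

records = ["apple orange", "apple pear", "orange apple"]

# map phase: every record yields a list of <key, value> pairs
pairs = [kv for r in records for kv in map_fn(r)]

# grouping phase: collect values under their key (the "shuffle")
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# reduce phase: each <key, value-list> becomes one output value
result = dict(reduce_fn(k, vs) for k, vs in groups.items())
print(result)  # -> {'apple': 3, 'orange': 2, 'pear': 1}
```

Substituting fruit slices for words gives exactly Sam's pipeline: cut (map), gather like pieces (group), blend (reduce).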
Hadoop & DryadLINQ
Apache Hadoop:
• Apache implementation of Google's MapReduce
• Hadoop Distributed File System (HDFS) manages data
• Map/reduce tasks are scheduled based on data locality in HDFS (replicated data blocks)
(Diagram: a master node running the Job Tracker and Name Node, with data/compute nodes holding replicated blocks.)
Microsoft DryadLINQ:
• Dryad processes a directed acyclic graph (DAG), executing vertices on compute clusters; a vertex is an execution task and an edge is a communication path
• Standard LINQ and DryadLINQ operations pass through the DryadLINQ compiler to the Dryad execution engine; LINQ provides a query interface for structured data
• Provides hash, range, and round-robin partition patterns
Both provide job creation, resource management, and fault tolerance with re-execution of failed tasks/vertices.
Reduce Phase of Particle Physics: "Find the Higgs" Using Dryad
• Combine histograms produced by separate ROOT "maps" (of event data to partial histograms) into a single histogram delivered to the client
• This is an example of using MapReduce to do distributed histogramming
(Figure: the Higgs peak in Monte Carlo data.)
High Energy Physics Data Analysis
An application analyzing data from the Large Hadron Collider (1 TB now, but 100 petabytes eventually)
• Input to a map task: <key, value>; key = some id, value = HEP file name
• Output of a map task: <key, value>; key = random number (0 <= num <= max reduce tasks), value = histogram as binary data
• Input to a reduce task: <key, List<value>>; key = random number (0 <= num <= max reduce tasks), value = list of histograms as binary data
• Output from a reduce task: value; value = histogram file
• Combine outputs from reduce tasks to form the final histogram
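The map/reduce signatures above can be sketched in miniature. The bin edges, event values, and task count here are made up for illustration; only the pattern (map a file to a partial histogram keyed by a random reducer id, reduce by merging bin by bin) comes from the slide:

```python
# Illustrative sketch of the HEP distributed-histogramming pattern.
import random

BINS = [0, 50, 100, 150, 200]  # hypothetical energy bin edges
MAX_REDUCE_TASKS = 4           # hypothetical reducer count

def histogram(events):
    """Bin a list of event values into counts per bin."""
    counts = [0] * (len(BINS) - 1)
    for e in events:
        for i in range(len(counts)):
            if BINS[i] <= e < BINS[i + 1]:
                counts[i] += 1
    return counts

def map_task(file_events):
    """Map: one file's events -> (random reducer id, partial histogram)."""
    return random.randint(0, MAX_REDUCE_TASKS - 1), histogram(file_events)

def reduce_task(partial_histograms):
    """Reduce: merge partial histograms bin by bin."""
    return [sum(bins) for bins in zip(*partial_histograms)]

files = [[10, 60, 120], [55, 130, 180], [20, 25, 199]]
partials = [map_task(f)[1] for f in files]  # ignore routing in this demo
print(reduce_task(partials))  # -> [3, 2, 2, 2]
```

The random reducer key simply load-balances the merge work; since histogram merging is associative, a final combine of the reduce outputs yields the same result as merging everything in one place.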