Transparently Composing CnC Graph Pipelines on a Cluster
Hongbo Rong, Frank Schlimbach
Programming & Systems Lab (PSL), Software Systems Group (SSG)
7th Annual Concurrent Collections Workshop, 9/8/2015
Problem
A productivity program running on a cluster:
- The programmer is a domain expert, but not a tuning expert
- The program calls distributed libraries
- Library functions are not composable: black box, independent, context-unaware, barrier at the end
How can we compose these non-composable library functions automatically?
Flow Graphs
- Traditional: bulk-synchronous parallel
- Our goal: pipelined and asynchronous communication
(Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel)
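The contrast can be sketched in code. This is a minimal illustration (not the workshop's code): in the bulk-synchronous schedule, no second-stage work may begin until every first-stage chunk has finished, while the pipelined schedule lets chunk i flow through both stages as soon as it is ready. The stage functions and chunking are hypothetical stand-ins.

```cpp
#include <cassert>
#include <future>
#include <vector>

int stage1(int x) { return x + 1; }   // stands in for, e.g., C = A + B
int stage2(int c) { return c * 2; }   // stands in for, e.g., E = C * D

// Bulk-synchronous: run ALL of stage 1, hit an implicit barrier, then stage 2.
std::vector<int> run_bsp(const std::vector<int>& in) {
    std::vector<int> c(in.size()), e(in.size());
    for (size_t i = 0; i < in.size(); ++i) c[i] = stage1(in[i]);
    // <-- barrier: no stage-2 work may start before this point
    for (size_t i = 0; i < in.size(); ++i) e[i] = stage2(c[i]);
    return e;
}

// Pipelined & asynchronous: chunk i enters stage 2 the moment its stage-1
// result exists; there is no global barrier between the stages.
std::vector<int> run_pipelined(const std::vector<int>& in) {
    std::vector<std::future<int>> fut;
    for (int x : in)
        fut.push_back(std::async(std::launch::async,
                                 [x] { return stage2(stage1(x)); }));
    std::vector<int> e;
    for (auto& f : fut) e.push_back(f.get());
    return e;
}
```

Both schedules compute the same values; only the ordering constraints differ, which is where the latency hiding comes from.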
Basic Idea
User program: as usual, assume sequential, global shared-memory programming
    C = A + B
    E = C * D
Library: add a CnC graph for each library function (one graph reading A, B and producing C; another reading C, D and producing E)
Compiler/interpreter: compose the corresponding graphs of a sequence of library calls; both graphs use the identical memory for C
Execution: let CnC do the distribution:
    mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
    mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
    mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
Hello World
User program:
    dgemm(A, B, C)  # C = A*B
    dgemm(C, D, E)  # E = C*D
Across, e.g., 100 processes: process i holds row block Ai* and runs Multiply Graph 1 (Ci* = Ai* x B), whose output feeds Multiply Graph 2 (Ei* = Ci* x D); the blocks Ei* together form E.
- Leverage the library
- No barrier/copy/message between graphs/steps unless required; e.g., no bcast/gather of C
Code skeleton

User code:
    dgemm(A, B, C)
    dgemm(C, D, E)

The compiler rewrites this into:

User code (host language):
    initialize_CnC()
    dgemm_dgemm(A, B, C, D, E)
    finalize_CnC()

Interface (C):
    void dgemm_dgemm(A, B, C, D, E) {
        dgemm_dgemm_context ctxt(A, B, C, D, E);
        ctxt.graph1->start();
        ctxt.graph2->start();
        ctxt.wait();
        ctxt.graph2->copyout();
    }

Context:
    struct dgemm_dgemm_context {
        item_collection *C_collection;
        tuner row_tuner, col_tuner;
        Graph *graph1, *graph2;
        dgemm_dgemm_context(A, B, C, D, E) {
            create C_collection
            graph1 = make_dgemm_graph(A, B, C);
            graph2 = make_dgemm_graph(C, D, E);
        }
    }

Domain-expert written:
    class dgemm_graph {
        tuner *tunerA, *tunerB, *tunerC, *tunerS;
        item_collection *A_collection, *B_collection, *C_collection;
        tag_collection tags;
        step_collection *multiply_steps;
        dgemm_graph(_A, _B, _C) {
            create A/B/C_collection based on A/B/C
            define dataflow graph
        }
    }
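The control flow of the generated interface can be mimicked in a self-contained sketch, with std::async standing in for CnC graph execution and trivial scalings standing in for the dgemm bodies (names and behavior here are hypothetical, not generated code): both graphs are started up front, graph 2 blocks only when it actually needs C, and a final wait/copyout returns E.

```cpp
#include <cassert>
#include <future>
#include <vector>

using Vec = std::vector<double>;

Vec scale(const Vec& v, double s) {
    Vec r;
    for (double x : v) r.push_back(x * s);
    return r;
}

struct dgemm_dgemm_context_sketch {
    std::shared_future<Vec> C;   // plays the role of the C item collection
    std::future<Vec> E;
    explicit dgemm_dgemm_context_sketch(const Vec& A) {
        // "graph1": C = A * B, faked here as C = 2*A
        C = std::async(std::launch::async,
                       [A] { return scale(A, 2.0); }).share();
        // "graph2": E = C * D, faked as E = 3*C; started immediately,
        // it blocks only at the point it consumes C
        auto c = C;
        E = std::async(std::launch::async,
                       [c] { return scale(c.get(), 3.0); });
    }
    // plays the role of ctxt.wait() + ctxt.graph2->copyout()
    Vec wait_and_copyout() { return E.get(); }
};
```

The point of the structure is that starting both graphs before waiting is what makes the barrier-free overlap possible; the context object just owns the shared intermediate.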
Key points
Compiler:
- Generates a context and an interface for a dataflow
- Connects expert-written graphs into a pipeline
- Minimizes communication with step tuners (static scheduling) and item-collection tuners (static data distribution)
Domain-expert written graphs:
- High-level algorithms for library functions
- Input/output collections can be from outside
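The tuner-based communication minimization reduces to a simple invariant, sketched below with hypothetical function names (not the CnC tuner API): one static owner function decides both where row i of C is produced (step tuner) and where it is consumed (item tuner). Because the two maps agree, the composed pipeline never has to move C.

```cpp
#include <cassert>

constexpr int NPROCS = 4;

// Static block-cyclic distribution of rows over processes.
int owner_of_row(int i) { return i % NPROCS; }

// Step-tuner role for graph 1: the rank on which C(i) is computed.
int compute_rank_graph1(int i) { return owner_of_row(i); }

// Item-tuner role for C: the rank on which graph 2 consumes C(i).
int consume_rank_graph2(int i) { return owner_of_row(i); }

// C(i) needs a message iff producer rank and consumer rank differ.
bool needs_comm(int i) {
    return compute_rank_graph1(i) != consume_rank_graph2(i);
}
```

With both tuners derived from the same static map, needs_comm is false for every row, which is exactly the "no bcast/gather of C" property from the Hello World slide.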
Advantages
- Useful for any language: relies on mature compiler/interpreter techniques (dataflow analysis, pattern matching, code replacement)
- Extends a scripting language to distributed computing implicitly: transparent to users, to the language, and to libraries
- Heavy lifting is done in CnC and in graph writing by domain experts
Open questions
- Minimizing communication: consumed_on for item collections, compute_on for step collections
- Scalability
- Applications: there might not be many long sequences of library calls