Mixing Cloud and Grid Resources for Many Task Computing – PowerPoint PPT Presentation


  1. Mixing Cloud and Grid Resources for Many Task Computing. David Abramson, Monash e-Science and Grid Engineering Lab (MeSsAGE Lab), Faculty of Information Technology; Science Director, Monash e-Research Centre; ARC Professorial Fellow

  2. Introduction • A typical MTC Driving Application • The Nimrod tool family • Things the Grid ignored – Deployment – Deadlines (QoS) • Clusters & Grids & Clouds • Conclusions and future directions

  3. A Typical MTC Driving Application

  4. A little quantum chemistry (Wibke Sudholt, Univ. Zurich). [Figure: the effective potential U_eff(r) = A_1 exp(-B_1 r^2) + A_2 exp(-B_2 r^2), plotted against radius r (a.u.) from 0 to 3 with the curve regions labelled A_1, A_2 and B_1, B_2, alongside the structural formula of the fluorinated hydrocarbon studied.]

  5. SC03 testbed

  6. The Nimrod Tools Family

  7. Nimrod supporting “real” science • A full parameter sweep is the cross product of all the parameters (Nimrod/G) • An optimization run minimizes some output metric and returns the parameter combinations that do this (Nimrod/O) • Design of experiments limits the number of combinations (Nimrod/E) • Workflows (Nimrod/K)
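
A full sweep is simply the cross product of the parameter values. The Python sketch below illustrates this using the values from the example plan file on the next slide; it assumes the "points" clauses generate evenly spaced values including both endpoints, which is an assumption of this illustration rather than something the slides state.

    from itertools import product

    # Parameter values mirroring the example plan file (spacing assumed, endpoints included).
    pressure = [5000.0, 5333.3, 5666.7, 6000.0]   # float range from 5000 to 6000, 4 points
    concent  = [0.002, 0.005]                     # float range from 0.002 to 0.005, 2 points
    material = ["Fe", "Al"]                       # text select anyof

    # The full sweep is the cross product: 4 x 2 x 2 = 16 independent jobs.
    jobs = [dict(pressure=p, concent=c, material=m)
            for p, c, m in product(pressure, concent, material)]

    for job in jobs:
        print(job)   # each combination becomes one independent task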

  8. Nimrod Portal: an example plan file (the surrounding figure shows the portal driving Nimrod/G, Nimrod/O and Nimrod/E, which reach the grid middleware through actuators):
  parameter pressure float range from 5000 to 6000 points 4
  parameter concent float range from 0.002 to 0.005 points 2
  parameter material text select anyof “Fe” “Al”
  task main
    copy compModel node:compModel
    copy inputFile.skel node:inputFile.skel
    node:substitute inputFile.skel inputFile
    node:execute ./compModel < inputFile > results
    copy node:results results.$jobname
  endtask
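
For each parameter combination, the node:substitute step rewrites the skeleton input file with concrete values. A minimal Python sketch of that step follows; the skeleton contents are hypothetical (the slide does not show them) and the $name placeholder style is assumed from the results.$jobname usage in the plan file.

    from string import Template

    # Hypothetical skeleton; the real inputFile.skel is not shown on the slide.
    skeleton = "pressure = $pressure\nconcentration = $concent\nmaterial = $material\n"

    job = {"pressure": 5500.0, "concent": 0.002, "material": "Fe"}

    # node:substitute inputFile.skel inputFile -> write inputFile with the values filled in.
    with open("inputFile", "w") as f:
        f.write(Template(skeleton).substitute(job))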

  9. Prepare jobs using the portal; jobs are sent to available machines, scheduled and executed dynamically; results are displayed and interpreted.

  10. Application examples (figures): drug docking, antenna design, aerofoil design.

  11. Nimrod/G Architecture [diagram]: a plan file enters through the Nimrod/G clients (GUI, Enfuzion API + run file); the Nimrod/G engine (job generator, job scheduler, resource scheduler and database server) dispatches work through actuators for the Globus, Legion and Condor grid middleware; agents run on Globus-, Legion- and Condor-enabled nodes, supported by grid information server(s). RM: Local Resource Manager, TS: Trade Server.

  12. Nimrod/K Workflows

  13. Nimrod/K Workflows • Nimrod/K integrates Kepler with – a massively parallel execution mechanism – the special-purpose functions of Nimrod/G/O/E – general-purpose workflows from Kepler – a flexible I/O model: streams to files. [Diagram: the Kepler GUI extensions (Vergil, documentation, smart re-run / failure recovery, provenance framework, type system, authentication, actor & data search, object manager, SMS) layered over the Kepler core extensions and Ptolemy.]

  14. Kepler Directors • Orchestrate the workflow • Synchronous & Dynamic Data Flow – consumer actors are not started until the producer completes • Process Networks – all actors execute concurrently • I/O modes produce different performance results • Existing directors don’t support multiple instances of actors.
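
The behavioural difference can be pictured with a small Python sketch (not Kepler/Ptolemy code; the producer and consumer here are invented for illustration): a data-flow style run starts the consumer only after the producer has finished, while a process-network style run executes both actors concurrently.

    import threading, queue, time

    def producer(out_q):
        for i in range(3):
            time.sleep(0.1)        # pretend to compute a token
            out_q.put(i)
        out_q.put(None)            # end-of-stream marker

    def consumer(in_q):
        while (tok := in_q.get()) is not None:
            print("consumed", tok)

    # Data-flow style: the consumer runs only after the producer has completed.
    q1 = queue.Queue()
    producer(q1)
    consumer(q1)

    # Process-network style: both actors run concurrently, communicating through the queue.
    q2 = queue.Queue()
    threads = [threading.Thread(target=producer, args=(q2,)),
               threading.Thread(target=consumer, args=(q2,))]
    for t in threads: t.start()
    for t in threads: t.join()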

  15. Workflow Threading • Nimrod parameter combinations can be viewed as threads • Multi-threaded workflows allow independent sequences in a workflow to run concurrently – This might be the whole workflow, or part of the workflow • Tokens in different threads do not interact with each other in the workflow

  16. The Nimrod/K director • Implements the Tagged Dataflow Architecture • Provides threading • Maintains copies (clones) of actors • Maintains token tags • Schedules actors’ events

  17. MTC through Data Flow Execution

  18. Dynamic Parallelism through Token Colouring [diagram: a single actor is cloned per token colour (Clone 1, Clone 2, Clone 3), so that differently coloured tokens are processed in parallel].
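
A minimal sketch of the token-colouring idea, in Python rather than the actual Nimrod/K implementation: each token carries a colour (tag), the director lazily creates one clone of the actor per colour, and tokens with different colours can then execute concurrently.

    import copy
    from concurrent.futures import ThreadPoolExecutor

    class Actor:
        def fire(self, token):
            return token * 2          # stand-in for the actor's real work

    class ColouringDirector:
        def __init__(self, actor):
            self.prototype = actor
            self.clones = {}          # one clone per token colour (tag)

        def route(self, colour, token):
            # Lazily clone the actor the first time a colour is seen.
            clone = self.clones.setdefault(colour, copy.deepcopy(self.prototype))
            return clone.fire(token)

    director = ColouringDirector(Actor())
    tokens = [(colour, value) for colour in ("red", "green", "blue") for value in range(2)]

    # Tokens with different colours are independent, so they can be processed in parallel.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda ct: director.route(*ct), tokens))
    print(results)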

  19. So …

  20. Director controls parallelism • Uses Nimrod to perform the execution

  21. Complete Parameter Sweep • Using a MATLAB actor provided by Kepler • Local spawn – multiple threads ran concurrently on a computer with 8 cores (2 x quads) – workflow execution was just under 8 times faster • Remote spawn – 100s to 1000s of remote processes

  22. Parameter Sweep Actor

  23. Partial Parameter Sweep

  24. Nimrod/EK Actors • Actors for generating and analyzing designs • Leverage concurrent infrastructure

  25. Nimrod/E Actors • No actor parameters need setting • No difference from the parameter sweep actors

  26. Parameter Optimization: Inverse Problems [diagram: a domain definer specifies F(x, y, z, w, …); a points generator, an optimizer and a constraint enforcer drive repeated executions of the model].

  27. Nimrod/OK Workflows • Nimrod/K supports parallel execution • General template for search – built from key components • Can mix and match optimization algorithms
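
A minimal sketch of such a search template, assuming the generic pattern suggested by the previous slide (generate points, enforce constraints, evaluate the model in parallel, let the optimizer propose the next points). The model, domain and optimizer below are placeholders for illustration, not Nimrod/O components.

    import random
    from concurrent.futures import ThreadPoolExecutor

    def model(x, y):
        return (x - 1.0) ** 2 + (y + 2.0) ** 2        # stand-in for the executed model

    def in_domain(p):
        return all(-5.0 <= v <= 5.0 for v in p)        # constraint enforcer

    def propose(best, n=8, spread=0.5):
        # Toy optimizer: sample candidate points around the current best.
        return [(best[0] + random.uniform(-spread, spread),
                 best[1] + random.uniform(-spread, spread)) for _ in range(n)]

    points = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(8)]
    best, best_val = None, float("inf")

    with ThreadPoolExecutor() as pool:
        for _ in range(20):                            # search iterations
            points = [p for p in points if in_domain(p)]
            values = list(pool.map(lambda p: model(*p), points))  # evaluate models in parallel
            for p, v in zip(points, values):
                if v < best_val:
                    best, best_val = p, v
            points = propose(best)                     # optimizer proposes the next batch

    print("best point:", best, "value:", best_val)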

  28. Things the Grid ignored

  29. Resource Scheduling • What’s so hard about scheduling parameter studies? – User has deadline – Grid resources unpredictable • Machine load may change at any time • Multiple machine queues – No central scheduler • Soft real time problem

  30. Computational Economy • Without cost, ANY shared system becomes unmanageable • Resource selection is based on pseudo-money and market forces • A large number of sellers and buyers (resources may be dedicated or shared) • Negotiation: call for tenders/bids and select the offers that meet the requirements • Trading and advance resource reservation • Schedule computations on those resources that meet all requirements
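
The scheduling idea behind the experiments that follow can be pictured as a deadline- and budget-constrained selection loop: estimate how many jobs each resource can finish before the deadline and at what cost, then dispatch to the cheapest resources that still meet the deadline. The Python sketch below is a simplified illustration with assumed rate and cost figures, not the actual Nimrod/G scheduler.

    # Deadline- and budget-constrained selection (rates and costs are illustrative only).
    resources = [
        # name,                    jobs per minute, cost units per job
        ("Linux cluster - Monash",  2.0,             1.0),
        ("SGI - ANL",               1.5,             3.0),
        ("Sun - ANL",               0.5,             2.0),
    ]

    def plan(n_jobs, deadline_min, budget):
        """Greedily assign jobs to the cheapest resources that fit the deadline."""
        allocation, spent = {}, 0.0
        for name, rate, cost in sorted(resources, key=lambda r: r[2]):  # cheapest first
            if n_jobs <= 0:
                break
            capacity = int(rate * deadline_min)              # jobs finishable before the deadline
            affordable = int((budget - spent) // cost)
            take = min(n_jobs, capacity, affordable)
            if take > 0:
                allocation[name] = take
                n_jobs -= take
                spent += take * cost
        return allocation, n_jobs, spent                     # leftover jobs mean deadline/budget missed

    alloc, unassigned, cost = plan(n_jobs=100, deadline_min=30, budget=200)
    print(alloc, "unassigned:", unassigned, "cost:", cost)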

  31. Soft real-time scheduling problem [chart: number of jobs in execution versus time (minutes) on Linux cluster - Monash (20), Sun - ANL (5), SP2 - ANL (5), SGI - ANL (15), SGI - ISI (10)].

  32. [Chart: number of jobs in execution versus time (minutes) on Linux cluster - Monash (20), Sun - ANL (5), SP2 - ANL (5), SGI - ANL (15), SGI - ISI (10).]

  33. [Chart: number of jobs in execution versus time (minutes) on Linux cluster - Monash (5), Sun - ANL (10), SP2 - ANL (10), SGI - ANL (15), SGI - ISI (20).]

  34. [Chart: number of tasks in execution versus time (minutes) on Condor-Monash, Linux-Prosecco-CNR, Linux-Barbera-CNR, Solaris/Ultas2-TITech, SGI-ISI and Sun-ANL.]

  35. [Chart: number of tasks in execution versus time (minutes) on Condor-Monash, Linux-Prosecco-CNR, Linux-Barbera-CNR, Solaris/Ultas2-TITech, SGI-ISI and Sun-ANL, over a longer run.]

  36. Deployment • Has largely been ignored in Grid middleware – Globus supports file transport, execution and data access • Challenges – deployment interfaces are lacking – heterogeneity – virtualization. [Diagram: grid deployment-aware clients and deployment delegation in front of the high-performance Globus 4.0 server services (GRAM, GTCP, RFT, OGSA-DAI, Index, CAS, Archiver, Trigger and user-supplied services).]

  37. Deployment Service • Hide the complexity of installing software on a remote resource • Use local knowledge about – the instruction set – machine structure – file system – I/O system – installed libraries. [Diagram: on the client machine, .NET compilers turn application source into intermediate code; the deployment service on the grid resource installs the application binary and executes it on a .NET parallel virtual machine over Globus/OGSA (GRAM), returning application handles.]
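
One way to picture “use local knowledge” is to probe the target platform and choose a matching install action. The probing calls below are standard Python; the decision table mapping platform to install step is invented for illustration and is not the deployment service's actual logic.

    import platform, shutil, sysconfig

    def describe_target():
        """Gather the kind of local knowledge a deployment service might consult."""
        return {
            "machine": platform.machine(),                 # instruction set, e.g. x86_64
            "system": platform.system(),                   # operating system family
            "libdir": sysconfig.get_config_var("LIBDIR"),  # where installed libraries live
            "has_mpirun": shutil.which("mpirun") is not None,
        }

    def choose_install_step(info):
        # Hypothetical decision table: use a prebuilt binary if the architecture matches,
        # otherwise fall back to building from intermediate code on the target.
        if info["machine"] in ("x86_64", "AMD64"):
            return "install prebuilt x86_64 binary"
        return "compile from intermediate code on the target"

    info = describe_target()
    print(info)
    print(choose_install_step(info))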

  38. [Diagram: DistAnt deployment flow – an Ant build file drives the DistAnt deployment client on the local host; application files are transferred to the remote host via Reliable File Transfer / GridFTP within the user's Globus security scope; the DistAnt service in the service hosting environment takes the un-configured application files through configuration to an instantiated application; jobs are then submitted with RSL to the Managed Job Service (GRAM).]

  39. • Our approach is runtime-internal • Why do Java & .NET support web services, UI, security and other libraries as part of the standard environment? • Functionality is guaranteed • Similarly, we aim to provide guaranteed HPC functionality. [Diagram: runtime-external (the existing approach) – an HPC application uses native bindings to HPC libraries and an HPC communication layer on the native OS and interconnect – versus runtime-internal (our approach) – an HPC application runs on a virtual machine whose managed runtime core includes System.HPC libraries, communication and virtualization over the native OS and interconnect.]

  40. Clusters & Grids & Clouds

  41. Nimrod over Clusters [diagram: jobs from a Nimrod experiment pass through a Nimrod actuator, e.g. SGE, PBS, LSF or Condor, to the local batch system].
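
A minimal sketch of what a batch-system actuator has to produce for one job, here targeting PBS. The resource requests and the job command are hypothetical; only the PBS directives and the qsub invocation are standard.

    import subprocess, tempfile

    def submit_pbs(command, walltime="00:30:00", nodes=1):
        """Write a PBS submit script for one Nimrod-style job and hand it to qsub."""
        script = "\n".join([
            "#!/bin/bash",
            f"#PBS -l nodes={nodes}",
            f"#PBS -l walltime={walltime}",
            'cd "$PBS_O_WORKDIR"',
            command,
            "",
        ])
        with tempfile.NamedTemporaryFile("w", suffix=".pbs", delete=False) as f:
            f.write(script)
            path = f.name
        # qsub prints the new job identifier on success.
        return subprocess.run(["qsub", path], capture_output=True, text=True).stdout.strip()

    # Example: one parameter combination's task, as the plan file's execute step would run it.
    print(submit_pbs("./compModel < inputFile > results"))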

  42. Nimrod over Grids • Advantages – Wide area elastic computing – Portal based point-of-presence independent of the location of computational resources – Grid level security – Computational economy proposed • New scheduling and data challenges – Virtualization proposed (based on .NET!) • Leveraged Grid middleware – Globus, Legion, ad-hoc standards
