Harnessing Grid Resources with Data-Centric Task Farms
Ioan Raicu
Distributed Systems Laboratory, Computer Science Department, University of Chicago
Committee Members:
• Ian Foster: University of Chicago, Argonne National Laboratory
• Rick Stevens: University of Chicago, Argonne National Laboratory
• Alex Szalay: The Johns Hopkins University
Candidacy Exam, December 12th, 2007

Outline
1. Motivation and Challenges
2. Hypothesis & Proposed Solution
   • Abstract Model
   • Practical Realization
3. Related Work
4. Completed Milestones
5. Work in Progress
6. Conclusion & Contributions
12/20/2007 Harnessing Grid Resources with Data-Centric Task Farms

Motivating Example: AstroPortal Stacking Service
• Purpose
  – On-demand “stacks” of random locations within ~10TB dataset
• Challenge
  – Rapid access to 10-10K “random” files
  – Time-varying load
• Solution
  – Dynamic acquisition of compute, storage
[Figure: image cutouts summed (+ ... =) into a stacked image; labels: Sloan Data, Web page or Web Service]

Challenge #1: Long Queue Times
• Wait queue times are typically longer than the job duration times
[Figure: SDSC DataStar 1024 Processor Cluster, 2004]

Challenge #2: Slow Job Dispatch Rates
• Production LRMs → ~1 job/sec dispatch rates
• What job durations are needed for 90% efficiency:
  – Production LRMs: 900 sec
  – Development LRMs: 100 sec
  – Experimental LRMs: 50 sec
  – 1~10 sec should be possible
[Figure: Medium Size Grid Site (1K processors); Efficiency (0% to 100%) vs. Task Length (sec, 0.001 to 100000); curves for 1 task/sec (e.g. PBS, Condor 6.8), 10 tasks/sec (e.g. Condor 6.9.2), 100 tasks/sec, 500 tasks/sec (e.g. Falkon), 1K tasks/sec, 10K tasks/sec, 100K tasks/sec, 1M tasks/sec]

System                           Comments                 Throughput (tasks/sec)
Condor (v6.7.2) - Production     Dual Xeon 2.4GHz, 4GB    0.49
PBS (v2.1.8) - Production        Dual Xeon 2.4GHz, 4GB    0.45
Condor (v6.7.2) - Production     Quad Xeon 3 GHz, 4GB     2
Condor (v6.8.2) - Production     -                        0.42
Condor (v6.9.3) - Development    -                        11
Condor-J2 - Experimental         Quad Xeon 3 GHz, 4GB     22
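The link between dispatch rate and the task length needed for a given efficiency can be illustrated with a simplified saturation model. This is a sketch under stated assumptions, not necessarily the model behind the slide's plot: it assumes steady state, identical task lengths, and that "1K processors" means 1,000. The function names are hypothetical.

```python
def dispatch_limited_efficiency(task_length_sec, dispatch_rate, num_processors):
    """Fraction of processors a dispatcher can keep busy in steady state.

    Assumption: at most dispatch_rate * task_length_sec tasks are in flight
    at once, so utilization is that product over the processor count,
    capped at 1.0.
    """
    in_flight = dispatch_rate * task_length_sec
    return min(1.0, in_flight / num_processors)


def min_task_length(target_efficiency, dispatch_rate, num_processors):
    """Shortest task length (sec) that reaches the target efficiency."""
    return target_efficiency * num_processors / dispatch_rate


if __name__ == "__main__":
    for rate in (1, 10, 100, 500):
        secs = min_task_length(0.9, rate, 1000)
        print(f"{rate:>4} tasks/sec -> tasks of at least {secs:g} sec")
```

Under these assumptions a 1 task/sec dispatcher needs 900 sec tasks to reach 90% efficiency on 1,000 processors, which agrees with the production figure on the slide. The slide's other figures presumably rest on measured rates and need not match this simplified model exactly.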

Challenge #3: Poor Scalability of Shared File Systems
• GPFS vs. LOCAL
  – Read Throughput
    • 1 node: 0.48Gb/s vs. 1.03Gb/s → 2.15x
    • 160 nodes: 3.4Gb/s vs. 165Gb/s → 48x
  – Read+Write Throughput:
    • 1 node: 0.2Gb/s vs. 0.39Gb/s → 1.95x
    • 160 nodes: 1.1Gb/s vs. 62Gb/s → 55x
  – Metadata (mkdir / rm -rf)
    • 1 node: 151/sec vs. 199/sec → 1.3x
    • 160 nodes: 21/sec vs. 31840/sec → 1516x
[Figure: Throughput (Mb/s) vs. Number of Nodes; series: GPFS R, LOCAL R, GPFS R+W, LOCAL R+W]

Hypothesis
“Significant performance improvements can be obtained in the analysis of large datasets by leveraging information about data analysis workloads rather than individual data analysis tasks.”
• Important concepts related to the hypothesis
  – Workload: a complex query (or set of queries) decomposable into simpler tasks to answer broader analysis questions
  – Data locality is crucial to the efficient use of large-scale distributed systems for scientific and data-intensive applications
  – Allocate computational and caching storage resources, co-scheduled to optimize workload performance

Proposed Solution: Part 1
Abstract Model and Validation
• AMDASK:
  – An Abstract Model for DAta-centric taSK farms
    • Task Farm: A common parallel pattern that drives independent computational tasks
  – Models the efficiency of data analysis workloads for the split/merge class of applications
  – Captures the following data diffusion properties
    • Resources are acquired in response to demand
    • Data and applications diffuse from archival storage to new resources
    • Resource “caching” allows faster responses to subsequent requests
    • Resources are released when demand drops
    • Considers both data and computations to optimize performance
• Model Validation
  – Implement the abstract model in a discrete event simulation
  – Validate model with statistical methods (R² statistic, residual analysis)

Proposed Solution: Part 2
Practical Realization
• Falkon: a Fast and Light-weight tasK executiON framework
  – Light-weight task dispatch mechanism
  – Dynamic resource provisioning to acquire and release resources
  – Data management capabilities including data-aware scheduling
  – Integration into Swift to leverage many Swift-based applications
    • Applications cover many domains: astronomy, astrophysics, medicine, chemistry, and economics

AMDASK: Base Definitions
• Data Stores: Persistent & Transient
  – Store capacity, load, ideal bandwidth, available bandwidth
• Data Objects:
  – Data object size, data object’s storage location(s), copy time
• Transient resources: compute speed, resource state
• Task: application, input/output data

AMDASK: Execution Model Concepts
• Dispatch Policy
  – next-available, first-available, max-compute-util, max-cache-hit
• Caching Policy
  – random, FIFO, LRU, LFU
• Replay policy
• Data Fetch Policy
  – Just-in-Time, Spatial Locality
• Resource Acquisition Policy
  – one-at-a-time, additive, exponential, all-at-once, optimal
• Resource Release Policy
  – distributed, centralized
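Two of the policies named on this slide can be sketched as follows: an LRU caching policy for a transient resource's data cache, and one plausible reading of a max-cache-hit dispatch policy (send the task to the resource that already caches the most of its input objects). The slide does not define either policy in detail, so the class name, function name, and tie-breaking rule here are assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Size-bounded cache of data objects; evicts the least recently used."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.objects = OrderedDict()  # object name -> size, oldest first

    def __contains__(self, name):
        return name in self.objects

    def access(self, name, size):
        """Return True on a cache hit; on a miss, insert the object."""
        if name in self.objects:
            self.objects.move_to_end(name)  # mark as most recently used
            return True
        if size > self.capacity:
            return False  # object can never fit; leave the cache untouched
        while self.used + size > self.capacity:
            _, evicted_size = self.objects.popitem(last=False)
            self.used -= evicted_size
        self.objects[name] = size
        self.used += size
        return False


def max_cache_hit(task_inputs, caches):
    """Pick the resource id whose cache holds the most task inputs.

    Assumption: ties go to the lowest resource id (max returns the first
    maximal element of the sorted ids).
    """
    def hits(resource_id):
        return sum(1 for name in task_inputs if name in caches[resource_id])

    return max(sorted(caches), key=hits)
```

A real dispatcher would also have to weigh resource availability, which is presumably where policies such as max-compute-util differ; that trade-off is not modeled here.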

AMDASK: Performance Efficiency Model
• B: Average Task Execution Time:
  – K: Stream of tasks
  – μ(k): Task k execution time
$$B = \frac{1}{|K|} \sum_{k \in K} \mu(k)$$
• Y: Average Task Execution Time with Overheads:
  – o(k): Dispatch overhead
  – ζ(δ, τ): Time to get data
$$Y = \begin{cases} \dfrac{1}{|K|} \sum_{k \in K} \left[ \mu(k) + o(k) \right], & \delta \in \phi(\tau),\ \delta \in \Omega \\[2ex] \dfrac{1}{|K|} \sum_{k \in K} \left[ \mu(k) + o(k) + \zeta(\delta, \tau) \right], & \delta \notin \phi(\tau),\ \delta \in \Omega \end{cases}$$
• V: Workload Execution Time:
  – A: Arrival rate of tasks
  – T: Transient Resources
$$V = \max\left( \frac{B}{|T|}, \frac{1}{A} \right) \cdot |K|$$
• W: Workload Execution Time with Overheads
$$W = \max\left( \frac{Y}{|T|}, \frac{1}{A} \right) \cdot |K|$$
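The four quantities on this slide (B, Y, V, W) can be computed directly from per-task measurements. The sketch below is a minimal illustration with hypothetical function names. It assumes the two cases of Y are applied per task: a task whose data object is already cached on its resource pays only the dispatch overhead, and any other task also pays the data fetch time.

```python
def avg_exec_time(mu):
    """B: mean of the task execution times mu(k)."""
    return sum(mu) / len(mu)


def avg_exec_time_with_overheads(mu, overhead, fetch_time, cached):
    """Y: mean of mu(k) + o(k), plus zeta for tasks whose data is not cached.

    cached[i] is True when task i finds its data object in the cache of the
    transient resource it runs on (assumed per-task reading of the cases).
    """
    total = 0.0
    for m, o, z, hit in zip(mu, overhead, fetch_time, cached):
        total += m + o + (0.0 if hit else z)
    return total / len(mu)


def workload_time(avg_time, num_resources, arrival_rate, num_tasks):
    """V (pass B) or W (pass Y): max(avg/|T|, 1/A) * |K|."""
    return max(avg_time / num_resources, 1.0 / arrival_rate) * num_tasks


if __name__ == "__main__":
    mu = [60.0, 60.0]
    B = avg_exec_time(mu)
    Y = avg_exec_time_with_overheads(mu, [1.0, 1.0], [4.0, 4.0], [True, False])
    print(B, Y)
    print(workload_time(B, 2, 1000.0, len(mu)),
          workload_time(Y, 2, 1000.0, len(mu)))
```

The max term captures the two regimes: the workload is bounded either by compute capacity (average time divided by the number of resources) or by how fast tasks arrive (1/A).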

AMDASK: Performance Efficiency Model
• Efficiency
$$E = \frac{V}{W} = \begin{cases} 1, & \dfrac{Y}{|T|} \le \dfrac{1}{A} \\[2ex] \max\left( \dfrac{B}{Y}, \dfrac{|T|}{A \cdot Y} \right), & \dfrac{Y}{|T|} > \dfrac{1}{A} \end{cases}$$
• Speedup
$$S = E \cdot |T|$$
• Optimizing Efficiency
  – Easy to maximize either efficiency or speedup independently
  – Harder to maximize both at the same time
    • Find the smallest number of transient resources |T| while maximizing speedup*efficiency
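The piecewise efficiency expression follows from E = V/W with V = max(B/|T|, 1/A)·|K| and W = max(Y/|T|, 1/A)·|K|. The sketch below, with hypothetical function names, evaluates the closed form and cross-checks it against the ratio V/W. It treats B and Y as fixed inputs, which is a simplification: in practice Y would itself vary with the number of resources.

```python
def efficiency(B, Y, T, A):
    """E from the piecewise form: 1 when arrival-bound, else max(B/Y, T/(A*Y))."""
    if Y / T <= 1.0 / A:
        return 1.0
    return max(B / Y, T / (A * Y))


def efficiency_from_times(B, Y, T, A):
    """E computed as V/W; the task count |K| cancels out of the ratio."""
    V = max(B / T, 1.0 / A)
    W = max(Y / T, 1.0 / A)
    return V / W


def speedup(B, Y, T, A):
    """S = E * |T|."""
    return efficiency(B, Y, T, A) * T


if __name__ == "__main__":
    B, Y = 60.0, 64.0  # illustrative values, not taken from the slides
    for T, A in ((1024, 1000.0), (620, 10.0), (1024, 10.0)):
        print(T, A, efficiency(B, Y, T, A), speedup(B, Y, T, A))
```

When B/|T| is also above 1/A the efficiency reduces to B/Y, the fraction of each task's time spent on useful computation.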

Performance Efficiency Model Example: 1K CPU Cluster
• Application: Angle - distributed data mining
• Testbed Characteristics:
  – Computational Resources: 1024
  – Transient Resource Bandwidth: 10MB/sec
  – Persistent Store Bandwidth: 426MB/sec
• Workload:
  – Number of Tasks: 128K
  – Arrival rate: 1000/sec
  – Average task execution time: 60 sec
  – Data Object Size: 40MB

Performance Efficiency Model Example: 1K CPU Cluster
• Falkon on ANL/UC TG Site:
  – Peak Dispatch Throughput: 500/sec
  – Scalability: 50~500 CPUs
  – Peak speedup: 623x
• PBS on ANL/UC TG Site:
  – Peak Dispatch Throughput: 1/sec
  – Scalability: <50 CPUs
  – Peak speedup: 54x
[Figure: two panels (Falkon, PBS); Efficiency (0% to 100%), Speedup (1 to 1000), and Speedup*Efficiency vs. Number of Processors (1 to 1024)]
