Extending Catamount for Multi-Core Processors, Cray Users Group

Cray Users Group, May 9, 2007
John Van Dyke, Courtenay Vaughan, Sue Kelly
jpvandy@sandia.gov, ctvaugh@sandia.gov, smkelly@sandia.gov
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000. This study was made possible by special funding from the DOE Office of Science. Part of the testing was conducted at the Oak Ridge National Laboratory.

Catamount for Multi-Core Processors: Outline
• Overview of Catamount
• Requirements for N-way Catamount
• Design and implementation
• Early dual-core results
• Future

Catamount is designed for an MPP environment with functional partitions
[Figure: functional partitions of the machine: compute processors (Catamount), service processors (Linux), I/O processors (Linux), and network I/O processors (Linux), plus a high-speed external network.]

Overview of Catamount
• LWK: Light Weight Kernel
• The Catamount OS is made up of two pieces
  – Quintessential Kernel (QK)
  – Process Control Thread (PCT)
• Provides the functionality necessary to run a scientific calculation
• No disks / no virtual memory / no fork / etc.
• Requires a high-speed network

Overview of Catamount Virtual Node Mode
• From the application's point of view: nearly identical nodes, twice as many, each with half the memory
• From the system's point of view: behaves more like master and slave

CPU Responsibility Assignments: Dual Core Opteron (example)
[Figure: per-CPU assignment. CPU-0: PCT, QK, APP-0. CPU-1: QK subset, APP-1. Network interface: Seastar Network Interface Chip.]

N-way Requirements
• Support 1, 2, or 4 processors/node
  – Desirable: generalize to N processors/node
• No performance regression between CVN and N-way Catamount on dual-core nodes
• MPI and shmem support
• Each core has equal access to the NIC for sends
• Support both generic (host-based) and accelerated (NIC-based) portals

N-way Requirements (2)
• Yod
  – Must be able to specify ppn (processors_per_node) to use
  – Number of virtual nodes does not have to be a multiple of ppn
• Support heterogeneous mode
• Scalable to 100,000 nodes; unlimited virtual nodes
• Minimize OS memory usage; it must not scale with machine size

Implications of Requirements
• Common app binary on a node
• Equal division of heap among virtual nodes
• The ppn option applies to the job, not to the hetero load segment
• The number of nodes with fewer than ppn processes is less than ppn
• Process tied to processor
• No OpenMP support
• Share mode not supported
• The first six are already true for current CVN
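The heap and placement implications above can be sketched numerically. This is a minimal sketch, not Catamount code: the function names, the fill-first placement order, and the 4 GiB heap size are assumptions made for illustration. The slides state only that the heap is divided equally and that fewer than ppn nodes run fewer than ppn processes.

```python
def heap_per_virtual_node(node_heap_bytes, procs_on_node):
    """Equal share of a node's heap for each process placed on it."""
    if procs_on_node < 1:
        raise ValueError("a node must host at least one process")
    return node_heap_bytes // procs_on_node


def place_processes(total_virtual_nodes, ppn):
    """Return the process count per node, filling each node to ppn first.

    Fill-first placement is an assumption; it leaves at most one
    underfilled node, which satisfies the "fewer than ppn" bound.
    """
    if total_virtual_nodes < 1 or ppn < 1:
        raise ValueError("counts must be positive")
    full_nodes, remainder = divmod(total_virtual_nodes, ppn)
    counts = [ppn] * full_nodes
    if remainder:
        counts.append(remainder)
    return counts


counts = place_processes(total_virtual_nodes=7, ppn=2)
underfilled = sum(1 for c in counts if c < 2)
print(counts, underfilled)                    # [2, 2, 2, 1] 1
print(heap_per_virtual_node(4 * 2**30, 2))    # 2147483648 (half of 4 GiB)
```

With ppn=2 each virtual node sees half the node's heap, which matches the "twice as many, half the memory" view of virtual node mode.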

N-way Changes – Design & Implementation
• Remove PCT arrays dimensioned by # of virtual nodes
• Change binary cpu-0 vs. cpu-1 choices to loops over processors
• Adapt the PCT – QK interface
• Generalize process migration
• Yod command line: -sz/-size/-np= #nodes [ -ppn=# processes_per_node ] [ -total-virtual-nodes=# vn ]
• Generalize QK multi-cpu code
  – Separate entries or paths per cpu
  – Handling of cpu-id

N-way Changes – Design & Implementation
OS memory usage shall not grow with machine size
• Remove PCT arrays dimensioned by maximum number of virtual nodes
  – Used in job load
  – Borrow application space during load
• One shared table dimensioned by rank of job for the processes on the node

N-way Changes – Design & Implementation
• Change “2” to “N”
• Change binary cpu-0 vs. cpu-1 choices to loops over processors
• Add a dimension over cpus to a few structures
• Flag places that are 4-way, not N-way
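The move from two-way branches to loops over processors can be illustrated as follows. This is a minimal sketch in Python rather than the kernel's own source; `peers_dual_core`, `peers_n_way`, and `per_cpu_state` are hypothetical names, not Catamount identifiers.

```python
def peers_dual_core(cpu_id):
    """Two-way style: the only other cpu is found by a binary choice."""
    return [1] if cpu_id == 0 else [0]


def peers_n_way(cpu_id, num_cpus):
    """N-way style: loop over all processors instead of branching on 0 vs 1."""
    return [cpu for cpu in range(num_cpus) if cpu != cpu_id]


def per_cpu_state(num_cpus):
    """A structure that gains a dimension over cpus: one slot per processor."""
    return [{"cpu": cpu, "app": None} for cpu in range(num_cpus)]


# On a dual-core node both forms agree, so the generalization changes nothing there.
for cpu in (0, 1):
    assert peers_n_way(cpu, 2) == peers_dual_core(cpu)

print(peers_n_way(1, 4))  # [0, 2, 3]
```

Checking that the loop form reproduces the two-way result on a dual-core node mirrors the no-regression requirement between CVN and N-way.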

N-way Changes – Design & Implementation
• Generalize QK multi-cpu code
  – Number of places with separate entries or paths per cpu
  – Handling of cpu-id
• Flag 4-way code

N-way Changes – Design & Implementation
• Adapt the PCT – QK interface
  – Keep track of which “non-cpu-0” process
  – Allow passing a list of processes/processors

N-way Changes – Design & Implementation
Generalize Process Migration
• Processes start on cpu-0 and “migrate” to another cpu
• Migration is initiated by the application (start-up library)
• N-way is more robust (removes a race possibility)
  – Application process requests migration from the PCT
  – PCT requests migration of all processes
• Changes to start-up library, PCT, and QK
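The migration flow described above can be modeled with a toy sketch. The class and method names are hypothetical, and two details are assumptions the slides do not state: that the first request triggers the PCT to migrate every process, and that process i ends up on cpu i.

```python
class PctSketch:
    """Toy model of PCT-driven migration; not the Catamount implementation."""

    def __init__(self, num_cpus, num_procs):
        if not 1 <= num_procs <= num_cpus:
            raise ValueError("at most one process per cpu")
        # Every process starts on cpu-0.
        self.cpu_of = {proc: 0 for proc in range(num_procs)}
        self.migrated = False

    def request_migration(self, proc):
        """Called by an application process from its start-up library."""
        if proc not in self.cpu_of:
            raise KeyError(proc)
        if not self.migrated:
            self._migrate_all()
        return self.cpu_of[proc]

    def _migrate_all(self):
        # The PCT moves all processes in one step, so no process has to
        # coordinate with its siblings (assumed mapping: process i -> cpu i).
        for proc in self.cpu_of:
            self.cpu_of[proc] = proc
        self.migrated = True


pct = PctSketch(num_cpus=4, num_procs=4)
print(pct.cpu_of)         # {0: 0, 1: 0, 2: 0, 3: 0}
pct.request_migration(2)
print(pct.cpu_of)         # {0: 0, 1: 1, 2: 2, 3: 3}
```

Routing every move through a single coordinator is one way to avoid processes racing each other while leaving cpu-0, which is the robustness point the slide makes.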

N-way Changes – Design & Implementation
User API for requesting nodes
• Discontinue use of “-VN” and “-SN”
• Use “-sz/-size/-np” for number of nodes (sockets)
  – This is the same number as specified to qsub
• Use “-ppn” for number of processes per node
• Use “-total-virtual-nodes” if the total is not a multiple of ppn
• Simplest case: all can be omitted and the defaults are used
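How the three options combine into a job size can be sketched as below. This is an illustration only: `resolve_job_size` is a hypothetical helper, not part of yod, and the rule that an omitted -total-virtual-nodes means nodes × ppn is an assumption inferred from the bullet above. The default values used when -sz and -ppn are omitted are not given in the slides, so the sketch requires both.

```python
def resolve_job_size(nodes, ppn, total_virtual_nodes=None):
    """Return the number of virtual nodes a request describes.

    nodes               -- value given to -sz/-size/-np
    ppn                 -- value given to -ppn
    total_virtual_nodes -- value given to -total-virtual-nodes, or None
    """
    if nodes < 1 or ppn < 1:
        raise ValueError("nodes and ppn must be positive")
    capacity = nodes * ppn
    if total_virtual_nodes is None:
        # Assumed default: every node runs ppn processes.
        return capacity
    if not 1 <= total_virtual_nodes <= capacity:
        raise ValueError("total virtual nodes must fit on nodes * ppn")
    return total_virtual_nodes


print(resolve_job_size(4, 2))      # 8
print(resolve_job_size(4, 2, 7))   # 7
```

The second call corresponds to the case the slide singles out: seven virtual nodes on four dual-core nodes, which is not a multiple of ppn.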

Test Plan
• Confirm that there are no regressions in N-way from current Catamount Virtual Node (CVN)
  – Verify functionality with test suites
  – Verify performance with applications
• Verify N-way functionality and characterize N-way performance
  – Can use the same tests as above
• Start testing early with baselines from DEV
• Regular testing on Sandia devHarness systems
• Periodic testing on external XT4 systems running DEV

Current Testing
• John tests very basic functionality (Hello World, application core dump, intra-application signaling, etc.) on up to 16 dual-core nodes as changes are made to the N-way code base.
• Sue verifies functionality with test suites. To date, N-way has been tested only on 84 single-core and 16 dual-core nodes.
• Courtenay tests performance using real applications. Tested on Jaguar in April; Jaguar has all dual-core nodes. Results follow.

Testing on Jaguar (XT3/XT4), April 23
• Two applications
  – CTH, a shock hydrodynamics code
  – PARTISN, a neutron transport code
• Problems were scaled with the number of processors
• Two series of runs
  – First with CVN
  – Second with N-way
• (Lower on graph is better performance)
• Anomalies all attributed to XT3 – XT4 difference

CTH VN Performance
[Figure: CTH, shaped charge, 80x192x80 per socket. Time per timestep versus number of cores (2 to 16384). Series: dev, nway.]

Partisn VN Performance
[Figure: Partisn, sntiming, 48^3 per socket. Normalized grind time versus number of cores (2 to 8192). Series: Transport - dev, Diffusion - dev, Transport - nway, Diffusion - nway.]

Testing on Jaguar (XT3/XT4), April 23: Conclusions
• Anomalies are attributed to the XT3 – XT4 difference; XT4 is faster.
• No significant difference between CVN and N-way dual-core performance.

Future
• This is a work in progress
  – Has not been run on quad core yet
  – To do: 1 gigabyte page support
• Considering SMP node numbering
  – Might relax the heterogeneous requirement: only one ppn value
• Testing, testing, testing
  – Quad-core functionality, performance, scaling
