Topology-aware OpenMP Process Scheduling
Peter Thoman, Hans Moritsch, and Thomas Fahringer, University of Innsbruck (Austria)
IWOMP 2010, 2010-06-15


  1. Topology-aware OpenMP Process Scheduling. Peter Thoman, Hans Moritsch, and Thomas Fahringer, University of Innsbruck (Austria)

  2. Motivation

  3. Motivation – Hardware Trends
     - Multi-core, multi-socket NUMA machines are in wide use in HPC
     - Complex memory hierarchy and topology
     - Large number of cores in a single shared-memory system
     - Are existing OpenMP applications and implementations ready?

  4. Motivation – Hardware Trends (same bullets, plus a diagram of a single socket: cores sharing a cache, attached to local memory)

  5. Motivation – Hardware Trends (same bullets, plus a diagram of a multi-socket NUMA system: many sockets, each with cores, shared caches, and local memory)

  6. Scalability
     - We profiled individual OpenMP parallel regions in a variety of programs and problem sizes
     - On an 8-socket quad-core NUMA system (32 cores)
     - Two metrics are determined per region:
       - Maximum threadcount: the largest number of threads that still gives some speedup
       - Optimal threadcount: the largest number of threads that gives a speedup within 20% of ideal

  7. Scalability Results (chart: maximum and optimal threadcount, on a scale of 0 to 32, for each profiled parallel region, labeled benchmark.class_regionID, e.g. bt.B_130, cg.C_740, mg.A_961, from the NPB codes bt, cg, ep, ft, is, lu, mg and the gauss and mmul kernels)

  8. Motivation – Multi-Process
     - First idea: run more than one OMP program (job) in parallel
     - (chart: total execution time in seconds, 0 to 800, for 1 to 8 parallel jobs)

  9. Motivation – Multi-Process
     - Of course it is not always that simple; a different workload:
     - (chart: total execution time in seconds, 0 to 2500, for 1 to 8 parallel jobs)

  10. Algorithm & Implementation

  11. Multi-Process Scheduling Architecture
      - Goal: facilitate system-wide scheduling of OpenMP programs
      - Basic design: one central control process (the server), with message exchange between the server and the OMP runtime of each program
      - Message protocol:
        - Upon encountering an OMP parallel region, OMP processes send a request for resources to the server, including scalability information for the region
        - They then use the cores indicated by the server's reply
        - When leaving the region, they send a signal to free the cores

  12. Implementation & Flow
      - Based on UNIX message queues
      - Well suited semantically, and fast enough (less than 4 microseconds roundtrip on our systems)
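A minimal sketch of what the client side of this request/free protocol could look like over System V message queues. The message layout, the queue key, the reply-addressed-by-pid scheme, and all names (acquire_cores, release_cores, and so on) are assumptions for illustration, not the authors' actual runtime code:

    #include <sys/msg.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define SERVER_KEY 0x4F4D50   /* assumed well-known key of the server queue */
    #define MAX_CORES  64

    enum { MSG_REQUEST = 1, MSG_FREE = 2 };

    struct request_msg {
        long  mtype;              /* MSG_REQUEST or MSG_FREE */
        pid_t sender;             /* lets the server address its reply */
        int   optcount;           /* scalability info for the parallel region */
        int   maxcount;
    };

    struct reply_msg {
        long mtype;               /* server sets this to the client's pid */
        int  ncores;
        int  cores[MAX_CORES];    /* core ids the region may run on */
    };

    /* Ask the server for cores before entering a parallel region;
       returns the number of cores granted, or -1 on error. */
    static int acquire_cores(int optcount, int maxcount, struct reply_msg *out)
    {
        int q = msgget(SERVER_KEY, 0666);
        if (q < 0) return -1;

        struct request_msg req = { MSG_REQUEST, getpid(), optcount, maxcount };
        if (msgsnd(q, &req, sizeof req - sizeof req.mtype, 0) < 0) return -1;

        /* Block until the server replies; mtype == our pid selects our reply. */
        if (msgrcv(q, out, sizeof *out - sizeof out->mtype, getpid(), 0) < 0)
            return -1;
        return out->ncores;
    }

    /* Tell the server the region has ended and its cores are free again. */
    static void release_cores(void)
    {
        int q = msgget(SERVER_KEY, 0666);
        if (q < 0) return;
        struct request_msg req = { MSG_FREE, getpid(), 0, 0 };
        msgsnd(q, &req, sizeof req - sizeof req.mtype, 0);
    }

A production runtime would presumably use separate queues for requests and replies and handle the case where the server grants fewer cores than asked for; this only shows the request/reply/free flow the slides describe.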

  13. Topology-aware Scheduling Algorithm
      - Multi-process scheduling ameliorates many-core scalability problems, but what about the complex memory hierarchy?
      - Make the server topology-aware: base scheduling decisions on region scalability, current system-wide load, and system topology
      - The result is a topology-aware OMP scheduler

  14. Topology Representation
      - Distance matrix over all cores in the system
      - Higher distance amplification factors for higher levels in the memory hierarchy
      - (example matrix shown)
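The slide's example matrix did not survive extraction, but the construction it describes can be sketched. Below, a hypothetical 8-core layout (two cores per shared cache, two caches per socket) with assumed amplification factors of 1, 4, and 16 per hierarchy level; on a real machine both the layout and the factors would come from the detected topology:

    #include <stdio.h>

    /* Toy topology: 8 cores, 2 cores per shared cache, 2 caches per socket. */
    #define NCORES            8
    #define CORES_PER_CACHE   2
    #define CACHES_PER_SOCKET 2
    #define CORES_PER_SOCKET  (CORES_PER_CACHE * CACHES_PER_SOCKET)

    static int distance[NCORES][NCORES];

    static void build_distance_matrix(void)
    {
        for (int a = 0; a < NCORES; ++a)
            for (int b = 0; b < NCORES; ++b) {
                if (a == b)
                    distance[a][b] = 0;
                else if (a / CORES_PER_CACHE == b / CORES_PER_CACHE)
                    distance[a][b] = 1;   /* share a cache: shortest distance */
                else if (a / CORES_PER_SOCKET == b / CORES_PER_SOCKET)
                    distance[a][b] = 4;   /* same socket, different cache */
                else
                    distance[a][b] = 16;  /* different socket: NUMA access */
            }
    }

    int main(void)
    {
        build_distance_matrix();
        for (int a = 0; a < NCORES; ++a) {
            for (int b = 0; b < NCORES; ++b)
                printf("%4d", distance[a][b]);
            putchar('\n');
        }
        return 0;
    }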

  15. Simple Scheduling
      - A request from a region with given maxcount and optcount is handled as follows:
        1. Compute N = optcount + loadfactor * (maxcount - optcount), where loadfactor depends on the number of free cores
        2. Select N-1 cores close to the core from which the request originated
      - Slightly more complicated in practice: when fewer than N cores are available, the server must decide whether to queue the request or return a smaller number of cores (see the sketch below)
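A hedged sketch of this selection step, reusing distance[][] and NCORES from the previous sketch. The definition of loadfactor as the fraction of free cores and the greedy nearest-first selection are plausible readings of the slide, not confirmed details of the authors' server:

    /* Free/busy state of every core; 1 = currently unused. */
    static int free_core[NCORES];

    static int count_free(void)
    {
        int n = 0;
        for (int c = 0; c < NCORES; ++c) n += free_core[c];
        return n;
    }

    /* Handle one request: compute N from the region's optcount/maxcount,
       then greedily grab the free cores nearest to the requesting core.
       Returns how many cores were granted (written to granted[]). */
    static int schedule_simple(int origin, int optcount, int maxcount,
                               int *granted)
    {
        double loadfactor = (double)count_free() / NCORES; /* assumed definition */
        int n = optcount + (int)(loadfactor * (maxcount - optcount));

        int taken = 0;
        while (taken < n - 1) {          /* origin itself is the n-th core */
            int best = -1;
            for (int c = 0; c < NCORES; ++c)
                if (free_core[c] &&
                    (best < 0 || distance[origin][c] < distance[origin][best]))
                    best = c;
            if (best < 0) break;         /* fewer than N free: the real server
                                            queues or returns a smaller team */
            free_core[best] = 0;
            granted[taken++] = best;
        }
        return taken;
    }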

  16. Fragmentation
      - Simple scheduling leads to fragmentation (diagram: a four-socket system with the threads of four processes interleaved across sockets and caches)
      - Sum of local distance over all 4 processes: 44

  17. Improvement: Clustering
      - The same processes without fragmentation (diagram: each process's threads packed onto adjacent cores)
      - Sum of local distance over all 4 processes: 13

  18. Clustering Algorithm
      - Moving threads once they have started has a significant performance impact (caches, pages, etc.), so instead the algorithm is changed to discourage fragmentation
      - Cores are defined as part of a hierarchy of core sets
      - When selecting a core from a new set, prefer (in order):
        1. A core set containing exactly as many free cores as required
        2. A core set containing more free cores than required
        3. An empty core set
      - Further improvement is possible by adjusting the number of selected cores (enhanced clustering); a sketch of the preference rule follows
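A sketch of the set-preference rule, continuing the toy topology from the earlier sketches and treating the cores under one shared cache as a core set. The interpretation of an "empty" set as one with no occupied cores, and the last-resort ranks for undersized sets, are assumptions:

    #define NSETS (NCORES / CORES_PER_CACHE)

    /* Number of free cores in core set s (a set = all cores under one cache). */
    static int set_free(int s)
    {
        int n = 0;
        for (int c = s * CORES_PER_CACHE; c < (s + 1) * CORES_PER_CACHE; ++c)
            n += free_core[c];
        return n;
    }

    /* Rank a set for a request needing 'needed' more cores; lower is better:
       0 = exactly as many free cores as needed (perfect fit)
       1 = partially used set with more free cores than needed
       2 = fully unused ("empty") set
       3 = some free cores, but not enough (last resort)
       4 = nothing free at all */
    static int set_rank(int s, int needed)
    {
        int f = set_free(s);
        if (f == 0)      return 4;
        if (f == needed) return 0;
        if (f > needed)  return (f == CORES_PER_CACHE) ? 2 : 1;
        return 3;
    }

    /* Choose the core set to continue selecting cores from. */
    static int pick_set(int needed)
    {
        int best = 0;
        for (int s = 1; s < NSETS; ++s)
            if (set_rank(s, needed) < set_rank(best, needed))
                best = s;
        return best;
    }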

  19. Evaluation

  20. Simulation
      - Evaluate the impact of the scheduling enhancements over 10000 semi-random requests
      - Calculate or measure 5 properties:
        - Scheduling time required per request
        - Target miss rate: |#returned_threads - #ideal_threads|
        - Three distance metrics (sketched below):
          - Total distance: from each thread in a team to each other thread
          - Weighted distance: distances between threads with close ids are weighted higher
          - Local distance: only counts the distance from each core to the next in sequence
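The three metrics can be written down directly against the distance matrix from the earlier sketches; here a team is given as an array where cores[i] is the core running thread i. The 1/(id gap) weighting is an assumed concretization of "threads with close id weighted higher":

    /* Total distance: every thread to every other thread in the team. */
    static double total_distance(const int *cores, int n)
    {
        double d = 0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                d += distance[cores[i]][cores[j]];
        return d;
    }

    /* Weighted distance: pairs of threads with close ids count for more
       (assumed weight: 1 / id gap). */
    static double weighted_distance(const int *cores, int n)
    {
        double d = 0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j)
                d += distance[cores[i]][cores[j]] / (double)(j - i);
        return d;
    }

    /* Local distance: only each core to the next one in thread-id order. */
    static double local_distance(const int *cores, int n)
    {
        double d = 0;
        for (int i = 0; i + 1 < n; ++i)
            d += distance[cores[i]][cores[i + 1]];
        return d;
    }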

  21. Simulation Results (chart: overhead in µs, target miss rate, and the total, weighted, and local distance metrics, normalized to a 0-120% scale)
      - Absolute overhead is always below 1.4 microseconds
      - Enhanced clustering reduces local distance by 70%

  22. Experiments
      - Hardware: Sun Fire X4600 M2, 8 quad-cores (AMD Barcelona, partially connected, 1-3 hops)
      - Software: backend GCC 4.4.2; "default" OMP: GOMP; Insieme compiler/runtime r278

  23. Small-scale Experiment
      - A random set of 13 programs was tested
      - (chart: total time in seconds, 0 to 1000, for five configurations: GOMP, sequential; optimal threadcount, standard OS mapping; our server, no locality information; our server, locality; our server, locality + enhanced clustering)

  24. Large-scale Experiment
      - Random programs chosen from NPB and 2 kernels, with random problem sizes
      - (chart: total time in seconds, 0 to 30000, for four configurations: GOMP sequential; our server, no locality; our server, locality; our server, locality + clustering)

  25. Power Consumption
      - Power consumption was measured during the large-scale experiment
      - Topology-aware scheduling (with appropriate thread counts) reduces average power consumption
