Lab 2: OpenMP + NUMA CSE 6230: HPC Tools & Apps Fall 2014

Notes on Lab 2: OpenMP + NUMA
CSE 6230: HPC Tools & Apps
Fall 2014, September 9
Based in part on the LLNL tutorial @ https://computing.llnl.gov/tutorials/openMP/
See also the textbook, Chapters 6-8

Part 0: Code reviews + last shot at redemption! (Share ideas. If you already hit Lab 1, bonus points for you!)
Part 1: Cilk Plus vs. OpenMP: fight! (spawn → omp task, parfor → omp for. Easy! Or is it?)
Part 2: Science experiment: NUMA in action! (Next!)

jinx9 $ /nethome/rvuduc3/local/jinx/likwid/2.2.1-ICC/bin/likwid-topology -g

-------------------------------------------------------------
CPU type: Intel Core Westmere processor
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets: 2
Cores per socket: 6
Threads per core: 2
-------------------------------------------------------------
HWThread  Thread  Core  Socket
0         0       0     0
1         0       0     1
2         0       8     0
3         0       8     1
4         0       2     0
5         0       2     1
6         0       10    0
7         0       10    1
8         0       1     0
9         0       1     1
10        0       9     0
11        0       9     1
12        1       0     0
13        1       0     1
14        1       8     0
15        1       8     1
16        1       2     0
17        1       2     1
18        1       10    0
19        1       10    1
20        1       1     0
21        1       1     1
22        1       9     0
23        1       9     1
-------------------------------------------------------------
Socket 0: ( 0 12 8 20 4 16 2 14 10 22 6 18 )
Socket 1: ( 1 13 9 21 5 17 3 15 11 23 7 19 )
-------------------------------------------------------------

*************************************************************
NUMA domains: 2
-------------------------------------------------------------
Domain 0:
Processors: 0 2 4 6 8 10 12 14 16 18 20 22
Memory: 10988.6 MB free of total 12277.8 MB
-------------------------------------------------------------
Domain 1:
Processors: 1 3 5 7 9 11 13 15 17 19 21 23
Memory: 10986.1 MB free of total 12288 MB
-------------------------------------------------------------

*************************************************************
Graphical:
*************************************************************
Socket 0: cores hold hardware-thread pairs ( 0 12 ) ( 8 20 ) ( 4 16 ) ( 2 14 ) ( 10 22 ) ( 6 18 ); each core has a 32kB and a 256kB cache; one 12MB cache spans the socket
Socket 1: cores hold hardware-thread pairs ( 1 13 ) ( 9 21 ) ( 5 17 ) ( 3 15 ) ( 11 23 ) ( 7 19 ); each core has a 32kB cache (diagram truncated on the slide)

Performance tuning tip: Exploit non-uniform memory access (NUMA)

[Figure: Socket-0 and Socket-1, each with its own DRAM and its own set of cores]

Example: Two quad-core CPUs with logically shared but physically distributed memory

Exploiting NUMA: Linux “first-touch” policy

    a = /* … allocate buffer … */;

    #pragma omp parallel for … schedule(static)
    for (i = 0; i < n; ++i) {
        a[i] = /* … initial value … */ ;
    }

    #pragma omp parallel for …
    for (i = 0; i < n; ++i) {
        a[i] += foo (i);
    }

Exploiting NUMA: Linux “first-touch” policy

    a = /* … allocate buffer … */;

    #pragma omp parallel for …
    for (i = 0; i < n; ++i) {
        a[i] = /* … initial value … */ ;
    }

    #pragma omp parallel for …
    for (i = 0; i < n; ++i) {
        a[i] += foo (i);
    }

Exploiting NUMA: Linux “first-touch” policy

    a = /* … allocate buffer … */;

    #pragma omp parallel for … schedule(static)
    for (i = 0; i < n; ++i) {
        a[i] = /* … initial value … */ ;
    }

    #pragma omp parallel for … schedule(static)
    for (i = 0; i < n; ++i) {
        a[i] += foo (i);
    }
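A minimal sketch of the first-touch pattern from the slides, filled in enough to compile. The names `alloc_first_touch` and `update`, the use of `malloc`, and the body of `foo` are assumptions for illustration; the slides elide all three. The point it shows: the initialization loop and the compute loop use the same `schedule(static)`, so each thread later works on the same index range it first wrote, and under Linux first-touch those pages typically end up on that thread's NUMA node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the slide's foo(i). */
static double foo(size_t i) { return 0.5 * (double)i; }

/* Allocate n doubles and initialize them in parallel.
 * malloc usually only reserves address space for a large buffer;
 * the first write to each page decides where it is placed. */
double *alloc_first_touch(size_t n, double init)
{
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] = init;            /* first touch happens here */

    return a;
}

/* Compute loop with the same static schedule as the initialization,
 * so each thread revisits the indices it initialized. */
void update(double *a, size_t n)
{
    assert(a != NULL);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        a[i] += foo((size_t)i);
}
```

Built with an OpenMP flag such as `gcc -fopenmp`, both loops run in parallel; without it the pragmas are ignored and the code runs serially with the same results. The placement benefit also assumes threads stay on their cores, which is what thread binding is for.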

Thread binding

Key environment variables
OMP_NUM_THREADS: Number of OpenMP threads
GOMP_CPU_AFFINITY: Specify thread-to-core binding

Consider: 2-socket x 6-core system, main thread initializes data and 6 OpenMP threads operate

    env OMP_NUM_THREADS=6 GOMP_CPU_AFFINITY="0 2 4 6 … 22" ./run-program …
    (shorthand: GOMP_CPU_AFFINITY="0-22:2")

    env OMP_NUM_THREADS=6 GOMP_CPU_AFFINITY="1 3 5 7 … 23" ./run-program …
    (shorthand: GOMP_CPU_AFFINITY="1-23:2")
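To make the shorthand concrete, here is a small sketch that expands a single `first-last:stride` entry into the explicit CPU list it stands for. `expand_affinity_range` is a hypothetical helper written for this note; it is not how libgomp parses the variable, and it handles only this one form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Expand one "first-last:stride" entry (e.g. "0-22:2") into cpus[].
 * Returns the number of CPUs written, or -1 if the entry is malformed
 * or does not fit in max slots. Illustrative only. */
int expand_affinity_range(const char *spec, int *cpus, int max)
{
    int first, last, stride;

    assert(spec != NULL && cpus != NULL);

    if (sscanf(spec, "%d-%d:%d", &first, &last, &stride) != 3)
        return -1;
    if (first < 0 || last < first || stride <= 0)
        return -1;

    int n = 0;
    for (int c = first; c <= last; c += stride) {
        if (n == max)
            return -1;
        cpus[n++] = c;
    }
    return n;
}
```

Expanding `"0-22:2"` gives the twelve even-numbered hardware threads, which the likwid-topology listing places on socket 0; `"1-23:2"` gives the twelve odd-numbered ones on socket 1. With OMP_NUM_THREADS=6, only the first six entries of the list are used.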

[Figure: Effective Bandwidth (GB/s) for “Triad:” c[i] ← a[i] + s*b[i]. Y-axis labels: 0, 9, 11, 12, 19, 25. Configurations, left to right: Sequential; OpenMP x 6, master initializes, read from socket 0; OpenMP x 6, master initializes, read from socket 1; OpenMP x 12, master initializes, read from both sockets; OpenMP x 12, first touch.]
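For reference, a sketch of the triad kernel named in the figure, plus one common way to turn a measured time into an effective bandwidth. The function names are hypothetical, and the byte count is an assumption: it follows the usual STREAM-style convention of two 8-byte reads and one 8-byte write per element, with 1 GB taken as 1e9 bytes. The slides do not say how their numbers were computed.

```c
#include <assert.h>
#include <stddef.h>

/* Triad: c[i] = a[i] + s * b[i], split statically across threads. */
void triad(double *c, const double *a, const double *b, double s, size_t n)
{
    assert(c != NULL && a != NULL && b != NULL);

    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; ++i)
        c[i] = a[i] + s * b[i];
}

/* Effective bandwidth in GB/s for one triad pass over n elements.
 * Assumed traffic: read a[i], read b[i], write c[i]. */
double triad_gbytes_per_sec(size_t n, double seconds)
{
    assert(seconds > 0.0);

    double bytes = 3.0 * (double)n * (double)sizeof(double);
    return bytes / seconds / 1e9;
}
```

To reproduce the figure's configurations, the arrays would be initialized either by the master thread alone or by a parallel loop with the same static schedule as `triad`, and the run repeated under the different thread bindings.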

Plotting tips
http://www.inf.ethz.ch/personal/markusp/teaching/263-2300-ETH-spring14/slides/05-bench-compiler-limitations.pdf

What’s Suboptimal?

[Figure: “DFT 2^n (single precision) on Pentium 4, 2.53 GHz [Gflop/s]”; x-axis n from 4 to 13; y values from 0 to 7; lines labeled Spiral SSE, Intel MKL, Spiral vectorized, Spiral scalar]

Callouts on the plot:
Left alignment
Attractive font (sans serif, avoid Arial): Calibri, Helvetica, Gill Sans MT, …
Horizontal y-label
Main line possibly emphasized (red, thicker)
No y-axis (superfluous)
Background/grid inverted for better layering
No legend; makes decoding easier
