A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines
Chao Mei
05/02/2008
The 6th Charm++ Workshop
Motivation
• Clusters are built from multicore chips
  - 4 cores/node on BG/P
  - 8 cores/node on Abe (2 Intel quad-core chips)
  - 16 cores/node on Ranger (4 AMD quad-core chips)
  - …
• Charm++ has had an SMP build version for many years
  - But it was never tuned
• So, what are the issues in getting high performance?
Start with a kNeighbor benchmark
• A synthetic kNeighbor benchmark
  - Each element communicates with its neighbors within a K-stride (wrap-around), and the neighbors then send back an acknowledgement.
  - One iteration: all elements finish the above communication
• Environment
  - An SMP node with 2 Intel Xeon quad-core chips, using only 7 cores
  - Ubuntu 7.04; gcc 4.2
  - Charm++: net-linux-amd64-smp vs. net-linux-amd64
  - 1 element/core, K=3
Performance at first glance
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP and SMP]
Outline
• Examine the communication models of the Non-SMP and SMP layers in Charm++
• Describe the current optimizations for SMP step by step
• Discuss a different approach to utilizing multicore
• Conclude with future work
Communication model for the multicore
Possible overheads in the SMP version
• Locks
  - Overuse of locks to ensure correctness
  - Locks in message queues
  - …
• False sharing
  - Some per-thread data structures are allocated together in array form: e.g., each element of "CmiState state[numThds]" belongs to a different thread (see the sketch below)
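The false-sharing hazard can be illustrated with a small C sketch. This is not the actual Charm++ definition: the struct fields, array size, and CACHE_LINE constant are assumptions. The point is only that packing per-thread states contiguously lets neighboring threads' states share a cache line, while cache-line alignment separates them.

    /* Illustrative only: not the real CmiState definition. */
    #define CACHE_LINE 64
    #define NUM_THDS 8

    typedef struct {
        int myPe;            /* rank of the owning thread (assumed field) */
        void *localQueue;    /* per-thread message queue (assumed field) */
    } CmiStateStruct;

    /* Problematic layout: adjacent elements belong to different threads,
     * so a write by one thread can invalidate the cache line holding its
     * neighbors' states (MESI "Invalidate" traffic). */
    CmiStateStruct packedStates[NUM_THDS];

    /* One remedy: force each per-thread state onto its own cache line. */
    typedef struct {
        CmiStateStruct s;
    } __attribute__((aligned(CACHE_LINE))) PaddedState;

    PaddedState paddedStates[NUM_THDS];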
Reducing the usage of locks
• By examining the source code, we found overuse of locks
  - Narrowed the sections enclosed by locks (see the sketch after the figure)
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP, and SMP-Relaxed lock]
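A minimal sketch of the "narrower critical section" change, assuming a lock-protected send queue. MsgQueue, prepare_msg, and notify_consumer are hypothetical names, not actual Charm++ routines; only the shared-queue update truly needs the lock, so everything else is hoisted out of it.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        void *items[1024];
        int tail;
        pthread_mutex_t lock;
    } MsgQueue;

    /* Trivial stand-ins for real work done on the send path. */
    static void *prepare_msg(int size) { return malloc(size); }
    static void notify_consumer(MsgQueue *q) { (void)q; }

    /* Before: the whole send path sits inside the critical section. */
    void send_wide(MsgQueue *q, int size) {
        pthread_mutex_lock(&q->lock);
        void *m = prepare_msg(size);   /* touches no shared state */
        q->items[q->tail++] = m;       /* the only statement needing the lock */
        notify_consumer(q);            /* nor does this */
        pthread_mutex_unlock(&q->lock);
    }

    /* After: the lock is held only around the shared-queue update. */
    void send_narrow(MsgQueue *q, int size) {
        void *m = prepare_msg(size);
        pthread_mutex_lock(&q->lock);
        q->items[q->tail++] = m;
        pthread_mutex_unlock(&q->lock);
        notify_consumer(q);
    }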
Overhead in message queues
• A micro-benchmark to show the overhead in message queues
  - N producers, 1 consumer
  - One iteration: each producer produces 10K items; the consumer consumes all of them
• Compared: lock vs. memory fence + atomic operation (fetch-and-increment), sketched below
• Compared: 1 queue vs. N queues
[Figure: average iteration time (us) vs. number of producers (1-8), comparing multiQ-fence, singleQ-fence+atomic op, and singleQ-lock]
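The two enqueue strategies compared in the micro-benchmark can be sketched as follows, assuming a bounded ring of slots. This is not the Charm++ queue implementation; Slot, RING_SIZE, and the ready flag are illustrative. The lock variant serializes all producers on one mutex, while the fence+atomic variant lets each producer claim a slot with fetch-and-increment and then publish it after a fence.

    #include <pthread.h>

    #define RING_SIZE 1024

    typedef struct {
        void *msg;
        volatile int ready;     /* set to 1 once the slot contents are visible */
    } Slot;

    static Slot ring[RING_SIZE];
    static volatile unsigned atomicTail;    /* next slot index to claim */

    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned lockedTail;

    /* Lock-based enqueue: every producer serializes on one mutex. */
    void enqueue_locked(void *msg) {
        pthread_mutex_lock(&qlock);
        unsigned idx = lockedTail++ % RING_SIZE;
        ring[idx].msg = msg;
        ring[idx].ready = 1;
        pthread_mutex_unlock(&qlock);
    }

    /* Fence + fetch-and-increment enqueue: each producer claims a private
     * slot atomically, fills it, then issues a full memory fence before
     * marking it ready so the consumer never sees a half-written slot.
     * (Slot reuse/overflow handling is omitted from this sketch.) */
    void enqueue_atomic(void *msg) {
        unsigned idx = __sync_fetch_and_add(&atomicTail, 1) % RING_SIZE;
        ring[idx].msg = msg;
        __sync_synchronize();
        ring[idx].ready = 1;
    }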
Applying multi Q + Fence
• Less than 2% improvement
  - Much less contention compared with the micro-benchmark
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP-Relaxed lock, and SMP-Relaxed lock-multiQ-Fence]
Big overhead in msg allocation
• We noticed that:
  - We used our own default memory module
  - Every memory allocation is protected by a lock
  - It provides some useful functionality in the Charm++ system (a historical reason for not using other memory modules):
    - Memory footprint information, memory debugger
    - Isomalloc
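A hedged sketch of the contrast, assuming the default module wraps every allocation in one global lock so it can maintain its extra bookkeeping; the function names and the bytesInUse counter are illustrative, not the actual Charm++ memory-module code.

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t memLock = PTHREAD_MUTEX_INITIALIZER;
    static size_t bytesInUse;          /* example of tracked footprint */

    /* Default-module style: one global lock around every allocation. */
    void *charm_malloc_locked(size_t n) {
        pthread_mutex_lock(&memLock);
        void *p = malloc(n);
        if (p) bytesInUse += n;
        pthread_mutex_unlock(&memLock);
        return p;
    }

    /* OS-module style: defer directly to the system allocator, which is
     * already thread-safe without a single global lock. */
    void *charm_malloc_os(size_t n) {
        return malloc(n);
    }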
Switching to the OS memory module
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP-Relaxed lock-SingleQ-Fence, and SMP-Reduced lock overhead]
• Thanks to recent updates, we don't lose the aforementioned functionality ☺
Identifying false sharing overhead
• Another micro-benchmark
  - Each element repeatedly sends itself a message, but the message is reused each time (i.e., no new message is allocated)
  - Benchmark the timing of 1000 iterations
• Used the Intel VTune performance analysis tool
  - Focused on the cache misses caused by "Invalidate" in the MESI coherence protocol
• Declaring variables with the "__thread" specifier makes them thread-private (see the sketch below)
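A minimal sketch of the __thread remedy named on the slide (an alternative to the padding approach shown earlier), assuming the per-thread state previously lived in a shared array; the field names are illustrative. GCC places each __thread variable in the thread's own TLS block, so no two threads' copies can share a cache line.

    typedef struct {
        int myPe;            /* assumed field */
        void *localQueue;    /* assumed field */
    } CmiStateStruct;

    /* Before: indexed into a shared array, subject to false sharing. */
    CmiStateStruct stateArray[8];

    /* After: one private copy per thread, placed in thread-local storage. */
    __thread CmiStateStruct myState;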
Performance for the micro-benchmark
• Parameters: 1 element/core, 7 cores
• Before: 1.236 us per iteration
• After: 0.913 us per iteration
Adding the gains from removing false sharing
• Around 1% improvement
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP, and SMP-Optimized]
Rethinking the communication model
• POSIX shared-memory layer
  - No threads: every core still runs a process
  - Inter-core message passing doesn't go through the NIC, but through memory copy (inter-process communication); a minimal sketch follows
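A minimal sketch of how a per-node POSIX shared-memory segment can back inter-core (inter-process) message passing as described above. The segment name, size, and layout are illustrative assumptions, not the actual Charm++ implementation; link with -lrt on older glibc.

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    #define SEG_NAME "/charm_node_seg"   /* assumed name, one segment per node */
    #define SEG_SIZE (1 << 20)           /* 1 MB region shared by all local cores */

    int main(void) {
        /* Every process on the node opens (or creates) the same named segment. */
        int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, SEG_SIZE) != 0) { perror("ftruncate"); return 1; }

        /* Map it into this process; the same physical pages are visible to all
         * processes on the node, so delivering a message to a local core is a
         * memcpy into the region instead of a trip through the NIC. */
        void *base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        memcpy(base, "hello from one core", 20);   /* toy "message" */

        munmap(base, SEG_SIZE);
        close(fd);
        shm_unlink(SEG_NAME);                      /* one process should clean up */
        return 0;
    }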
Performance comparison
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP-Optimized, and Posix Shared Memory]
Future work
• Other platforms
  - BG/P
• Optimize the POSIX shared-memory version
• Effects on real applications
  - For NAMD, initial results show that SMP helps on up to 24 nodes of Abe
• Any other communication models?
  - An adaptive one?