A Preliminary Investigation on Optimizing Charm++ for Homogeneous Multi-core Machines
Chao Mei
05/02/2008
The 6th Charm++ Workshop
Motivation
• Clusters are built from multicore chips
  - 4 cores/node on BG/P
  - 8 cores/node on Abe (2 Intel quad-core chips)
  - 16 cores/node on Ranger (4 AMD quad-core chips)
  - …
• Charm++ has had an SMP build version for many years
  - But it was never tuned
• So, what are the issues in getting high performance?
Start with a kNeighbor benchmark
• A synthetic kNeighbor benchmark
  - Each element communicates with its neighbors within a K-stride (wrap-around), and the neighbors then send back an acknowledgement.
  - One iteration: all elements finish the above communication
• Environment
  - An SMP node with 2 Intel Xeon quad-core chips, using only 7 cores
  - Ubuntu 7.04; gcc 4.2
  - Charm++: net-linux-amd64-smp vs. net-linux-amd64
  - 1 element/core, K=3
Performance at first glance
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP and SMP]
Outline
• Examine the communication models of the Non-SMP and SMP layers in Charm++
• Describe the current optimizations for SMP step by step
• Discuss a different approach to utilizing multicore
• Conclude with future work
Communication model for the multicore
Possible overheads in the SMP version
• Locks
  - Overuse of locks to ensure correctness
  - Locks in message queues
  - …
• False sharing
  - Some per-thread data structures are allocated together in array form: e.g., each element of "CmiState state[numThds]" belongs to a different thread (see the sketch below)
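The false-sharing hazard can be illustrated with a small C sketch. This is not the actual Charm++ definition: the struct fields, array size, and CACHE_LINE constant are assumptions. The point is only that packing per-thread states contiguously lets neighboring threads' states share a cache line, while cache-line alignment separates them.

    /* Illustrative only: not the real CmiState definition. */
    #define CACHE_LINE 64
    #define NUM_THDS 8

    typedef struct {
        int myPe;            /* rank of the owning thread (assumed field) */
        void *localQueue;    /* per-thread message queue (assumed field) */
    } CmiStateStruct;

    /* Problematic layout: adjacent elements belong to different threads,
     * so a write by one thread can invalidate the cache line holding its
     * neighbors' states (MESI "Invalidate" traffic). */
    CmiStateStruct packedStates[NUM_THDS];

    /* One remedy: force each per-thread state onto its own cache line. */
    typedef struct {
        CmiStateStruct s;
    } __attribute__((aligned(CACHE_LINE))) PaddedState;

    PaddedState paddedStates[NUM_THDS];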
Reducing the usage of locks
• By examining the source code, we found overuse of locks
  - Narrowed the sections enclosed by locks (see the sketch after the figure)
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP, and SMP-Relaxed lock]
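A minimal sketch of the "narrower critical section" change, assuming a lock-protected send queue. MsgQueue, prepare_msg, and notify_consumer are hypothetical names, not actual Charm++ routines; only the shared-queue update truly needs the lock, so everything else is hoisted out of it.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct {
        void *items[1024];
        int tail;
        pthread_mutex_t lock;
    } MsgQueue;

    /* Trivial stand-ins for real work done on the send path. */
    static void *prepare_msg(int size) { return malloc(size); }
    static void notify_consumer(MsgQueue *q) { (void)q; }

    /* Before: the whole send path sits inside the critical section. */
    void send_wide(MsgQueue *q, int size) {
        pthread_mutex_lock(&q->lock);
        void *m = prepare_msg(size);   /* touches no shared state */
        q->items[q->tail++] = m;       /* the only statement needing the lock */
        notify_consumer(q);            /* nor does this */
        pthread_mutex_unlock(&q->lock);
    }

    /* After: the lock is held only around the shared-queue update. */
    void send_narrow(MsgQueue *q, int size) {
        void *m = prepare_msg(size);
        pthread_mutex_lock(&q->lock);
        q->items[q->tail++] = m;
        pthread_mutex_unlock(&q->lock);
        notify_consumer(q);
    }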
Overhead in message queues
• A micro-benchmark to show the overhead in message queues
  - N producers, 1 consumer
  - One iteration: each producer produces 10K items; the consumer consumes all of them
• Compared: lock vs. memory fence + atomic operation (fetch-and-increment), sketched below
• Compared: 1 queue vs. N queues
[Figure: average iteration time (us) vs. number of producers (1-8), comparing multiQ-fence, singleQ-fence+atomic op, and singleQ-lock]
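The two enqueue strategies compared in the micro-benchmark can be sketched as follows, assuming a bounded ring of slots. This is not the Charm++ queue implementation; Slot, RING_SIZE, and the ready flag are illustrative. The lock variant serializes all producers on one mutex, while the fence+atomic variant lets each producer claim a slot with fetch-and-increment and then publish it after a fence.

    #include <pthread.h>

    #define RING_SIZE 1024

    typedef struct {
        void *msg;
        volatile int ready;     /* set to 1 once the slot contents are visible */
    } Slot;

    static Slot ring[RING_SIZE];
    static volatile unsigned atomicTail;    /* next slot index to claim */

    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned lockedTail;

    /* Lock-based enqueue: every producer serializes on one mutex. */
    void enqueue_locked(void *msg) {
        pthread_mutex_lock(&qlock);
        unsigned idx = lockedTail++ % RING_SIZE;
        ring[idx].msg = msg;
        ring[idx].ready = 1;
        pthread_mutex_unlock(&qlock);
    }

    /* Fence + fetch-and-increment enqueue: each producer claims a private
     * slot atomically, fills it, then issues a full memory fence before
     * marking it ready so the consumer never sees a half-written slot.
     * (Slot reuse/overflow handling is omitted from this sketch.) */
    void enqueue_atomic(void *msg) {
        unsigned idx = __sync_fetch_and_add(&atomicTail, 1) % RING_SIZE;
        ring[idx].msg = msg;
        __sync_synchronize();
        ring[idx].ready = 1;
    }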
Applying multi Q + Fence
• Less than 2% improvement
  - Much less contention compared with the micro-benchmark
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP-Relaxed lock, and SMP-Relaxed lock-multiQ-Fence]
Big overhead in msg allocation
• We noticed that:
  - We used our own default memory module
  - Every memory allocation is protected by a lock
  - It provides some useful functionality in the Charm++ system (a historical reason for not using other memory modules):
    - Memory footprint information, memory debugger
    - Isomalloc
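A hedged sketch of the contrast, assuming the default module wraps every allocation in one global lock so it can maintain its extra bookkeeping; the function names and the bytesInUse counter are illustrative, not the actual Charm++ memory-module code.

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t memLock = PTHREAD_MUTEX_INITIALIZER;
    static size_t bytesInUse;          /* example of tracked footprint */

    /* Default-module style: one global lock around every allocation. */
    void *charm_malloc_locked(size_t n) {
        pthread_mutex_lock(&memLock);
        void *p = malloc(n);
        if (p) bytesInUse += n;
        pthread_mutex_unlock(&memLock);
        return p;
    }

    /* OS-module style: defer directly to the system allocator, which is
     * already thread-safe without a single global lock. */
    void *charm_malloc_os(size_t n) {
        return malloc(n);
    }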
Switching to the OS memory module
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP-Relaxed lock-SingleQ-Fence, and SMP-Reduced lock overhead]
• Thanks to recent updates, we don't lose the aforementioned functionality ☺
Identifying false sharing overhead
• Another micro-benchmark
  - Each element repeatedly sends itself a message, but the message is reused each time (i.e., no new message is allocated)
  - Benchmark the timing of 1000 iterations
• Used the Intel VTune performance analysis tool
  - Focused on the cache misses caused by "Invalidate" in the MESI coherence protocol
• Declaring variables with the "__thread" specifier makes them thread-private (see the sketch below)
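A minimal sketch of the __thread remedy named on the slide (an alternative to the padding approach shown earlier), assuming the per-thread state previously lived in a shared array; the field names are illustrative. GCC places each __thread variable in the thread's own TLS block, so no two threads' copies can share a cache line.

    typedef struct {
        int myPe;            /* assumed field */
        void *localQueue;    /* assumed field */
    } CmiStateStruct;

    /* Before: indexed into a shared array, subject to false sharing. */
    CmiStateStruct stateArray[8];

    /* After: one private copy per thread, placed in thread-local storage. */
    __thread CmiStateStruct myState;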
Performance for the micro-benchmark
• Parameters: 1 element/core, 7 cores
• Before: 1.236 us per iteration
• After: 0.913 us per iteration
Adding the gains from removing false sharing
• Around 1% improvement
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP, and SMP-Optimized]
Rethinking the communication model
• POSIX shared-memory layer
  - No threads: every core still runs a process
  - Inter-core message passing doesn't go through the NIC, but through memory copy (inter-process communication); a minimal sketch follows
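A minimal sketch of how a per-node POSIX shared-memory segment can back inter-core (inter-process) message passing as described above. The segment name, size, and layout are illustrative assumptions, not the actual Charm++ implementation; link with -lrt on older glibc.

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    #define SEG_NAME "/charm_node_seg"   /* assumed name, one segment per node */
    #define SEG_SIZE (1 << 20)           /* 1 MB region shared by all local cores */

    int main(void) {
        /* Every process on the node opens (or creates) the same named segment. */
        int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, SEG_SIZE) != 0) { perror("ftruncate"); return 1; }

        /* Map it into this process; the same physical pages are visible to all
         * processes on the node, so delivering a message to a local core is a
         * memcpy into the region instead of a trip through the NIC. */
        void *base = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        memcpy(base, "hello from one core", 20);   /* toy "message" */

        munmap(base, SEG_SIZE);
        close(fd);
        shm_unlink(SEG_NAME);                      /* one process should clean up */
        return 0;
    }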
Performance comparison
[Figure: iteration time (ms) vs. message size (bytes, 0-16000), comparing Non-SMP, SMP-Optimized, and Posix Shared Memory]
Future work
• Other platforms
  - BG/P
• Optimize the POSIX shared-memory version
• Effects on real applications
  - For NAMD, initial results show that SMP helps on up to 24 nodes of Abe
• Any other communication models?
  - An adaptive one?