Using MuMMI to Model and Optimize Energy and Performance Xingfu Wu and Valerie Taylor Texas A&M University Scalable Tools Workshop 2015, Lake Tahoe, CA August 3, 2015 http://www.mummi.org
Outline
- Recent Development of MuMMI
- Power Measurement Tools
- Hardware Performance Counter Tools
- Performance Counter-based Modeling
- Performance Counter-guided Optimization
MuMMI (Multiple Metrics Modeling Infrastructure) Project
[Architecture diagram. Components: Application; E-AMOM; PAPI; PowerPack; EMON API; Database; Multicore/Heterogeneous System for Execution.]
MuMMI Database Schema
[Schema diagram. Labels in extraction order: Application Application Executable Run Performance Coupling User Modules Power Inputs Counters Function Module_Info Systems Functions Performance Compilers Counters Power Control Resource Basic Unit Flow Model Performance Connection Template Function_Info Sys_Comp Data Structure Model_Info Library Sys_Comm Performance]
Data Collection: MAIDE System
[Workflow diagram. Source code -> MAIDE -> instrumented source code -> compiler -> instrumented executable -> performance, HW counter, power, and energy data -> SOAP server with Perl script -> MuMMI Database. Other labels: Call Graph; Power and HW Counters.]
Power Measurement Tools
- IBM EMON API on BlueGene P/Q (MonEQ)
- Intel RAPL
- NVIDIA MLPM
- PowerMon2 (RENCI)
- PowerPack (VT)
PowerPack Schema (Virginia Tech)
Power sampling frequency: 1 sample per second
http://scape.cs.vt.edu/software/powerpack-2-0/
IBM EMON API
Power sampling frequency: ~2 samples per second
Source: IBM
IBM EMON API: Power per node card for GTC on 16384 nodes of ANL BGQ Mira
[Figure: power (W, 0 to 2000) versus time (seconds, 0 to 100). Series: Node_Card, Chip_Core, DRAM, Network, SRAM, Optics, PCIexpress, Link_Chip_Core.]
IBM EMON API: Average power per node for GTC on 16384 nodes of ANL BGQ Mira
[Figure: power (W, 0 to 60) versus time (seconds, 0 to 100). Series: Node Power, CPU Power, Memory Power, Network.]
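Power tools such as those above report sampled power, so energy has to be derived from the samples. The following is a minimal sketch of one common way to do that, trapezoidal integration over time. The function name and the sample values are hypothetical and are not taken from MuMMI, PowerPack, or the EMON API.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


# Hypothetical samples taken at about 2 samples per second.
times = [0.0, 0.5, 1.0, 1.5, 2.0]
watts = [50.0, 52.0, 54.0, 52.0, 50.0]

energy = energy_joules(times, watts)              # energy in joules
average_power = energy / (times[-1] - times[0])   # average power in watts
```

With a low sampling rate (1 to 2 samples per second), short power spikes between samples are not captured, so the integral is an approximation.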
Performance Counter Tools
- Perf_events (Linux)
- HPM (IBM)
- perfmon (Linux)
- PAPI (UTK)
Overview: Performance Counter-based Modeling
[Workflow diagram. Elements:]
- HPC Application
- PAPI: application or function-level performance counters ($C_i$)
- Application or function-level runtime and power (CPU, mem)
- Runtime and Power Modeling: Spearman, PCA, Regression Model
- Metric $= f(C_1, C_2, \ldots, C_n)$
- Predicted runtime; predicted power (node, CPU, mem)
- Recommendations for Improvements
Four metrics: runtime, node power, CPU power, memory power
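The slide gives the model only as a function f of the counters and names a regression model. As a minimal sketch, assuming a linear form with an intercept (an assumption, not stated on the slide), a least-squares fit can be written in plain Python as follows. All function names and data values are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: counters may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_linear_model(counter_rows, metric_values):
    """Least-squares fit of metric = b0 + b1*C1 + ... + bn*Cn (normal equations)."""
    design = [[1.0] + list(row) for row in counter_rows]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, metric_values)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, counter_row):
    """Evaluate the fitted model for one set of counter values."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], counter_row))


# Hypothetical training data: two counters per run, generated from
# metric = 2 + 3*C1 + 0.5*C2 so the fit can be checked by eye.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 6.0)]
metric = [6.0, 8.5, 13.0, 15.5, 20.0]

coefficients = fit_linear_model(rows, metric)
predicted = predict(coefficients, (6.0, 5.0))
```

In practice one such model would be fitted per metric (runtime, node power, CPU power, memory power), after counter selection.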
Four Models of Parallel eq3dyna on SystemG
Four Models of Parallel eq3dyna on ANL Mira
Prediction Error Rates on ANL Mira
Overview: Counter-Guided Optimizations
1. One model per metric:
   Runtime: $f_1(C_{11}, C_{12}, \ldots, C_{1n})$
   System Power: $f_2(C_{21}, C_{22}, \ldots, C_{2m})$
   CPU Power: $f_3(C_{31}, C_{32}, \ldots, C_{3s})$
   Memory Power: $f_4(C_{41}, C_{42}, \ldots, C_{4r})$
2. Ranking counters based on coefficient percentage:
   $\mathrm{Rank}(C_{11}, C_{12}, \ldots, C_{1n})$, $\mathrm{Rank}(C_{21}, C_{22}, \ldots, C_{2m})$, $\mathrm{Rank}(C_{31}, C_{32}, \ldots, C_{3s})$, $\mathrm{Rank}(C_{41}, C_{42}, \ldots, C_{4r})$
3. Ranking counters with percentage (>1%), from the highest to the lowest: $C_1, C_2, \ldots, C_k$
4. Pair-wise Spearman correlation analysis
5. Final counters, ranked from the highest to the lowest: $C_1, C_2, \ldots, C_j$ ($j < k$)
Counter Ranking
One model per metric:
   Runtime: $f_1(C_{11}, C_{12}, \ldots, C_{1n})$
   System Power: $f_2(C_{21}, C_{22}, \ldots, C_{2m})$
   CPU Power: $f_3(C_{31}, C_{32}, \ldots, C_{3s})$
   Memory Power: $f_4(C_{41}, C_{42}, \ldots, C_{4r})$
Ranking counters based on coefficient percentage:
   $\mathrm{Rank}(C_{11}, C_{12}, \ldots, C_{1n})$, $\mathrm{Rank}(C_{21}, C_{22}, \ldots, C_{2m})$, $\mathrm{Rank}(C_{31}, C_{32}, \ldots, C_{3s})$, $\mathrm{Rank}(C_{41}, C_{42}, \ldots, C_{4r})$
For example, given a parallel aerospace simulation PMLB:
   Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%, BR_INS 0.03%, SR_INS 0.01%
   Node Power: VEC_INS 76.64%, CA_SHR 22.45%, L1_TCM 0.89%, RES_STL 0.02%
   CPU Power: VEC_INS 99.15%, BR_NTK 0.81%, RES_STL 0.04%
   Memory Power: VEC_INS 83.91%, CA_CLN 13.74%, BR_NTK 0.98%, L1_TCM 0.92%, RES_STL 0.18%, BR_TKN 0.16%, L1_ICA 0.11%
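The slides do not define how a coefficient percentage is computed. One plausible reading, offered here only as an assumption, is each counter's share of the summed absolute regression coefficients within one model. The sketch below uses that reading; the function name, counter names, and coefficient values are hypothetical.

```python
def coefficient_percentages(coefficients):
    """Return each counter's share (%) of the summed absolute coefficients,
    sorted from the highest share to the lowest.

    ASSUMPTION: this normalization is one plausible definition of
    "coefficient percentage"; the source slides do not define it.
    """
    total = sum(abs(value) for value in coefficients.values())
    if total == 0:
        raise ValueError("all coefficients are zero")
    shares = {name: 100.0 * abs(value) / total for name, value in coefficients.items()}
    return dict(sorted(shares.items(), key=lambda item: item[1], reverse=True))


# Hypothetical regression coefficients for one model.
example = {"COUNTER_B": -3.0, "COUNTER_A": 6.0, "COUNTER_C": 1.0}
ranked = coefficient_percentages(example)
```

Under this reading the percentages of one model sum to 100%, which is consistent with the per-model values shown for PMLB.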
Counter Ranking for Original PMLB on SystemG
[Chart: coefficient percentage (%, 0 to 100) for each model (Runtime, System Power, CPU Power, Memory Power). Legend: PAPI_L1_ICA, PAPI_BR_TKN, PAPI_CA_CLN, PAPI_BR_NTK, PAPI_RES_STL, PAPI_L1_TCM, PAPI_CA_SHR, PAPI_VEC_INS, PAPI_SR_INS, PAPI_BR_INS, PAPI_L2_ICA, PAPI_L1_ICM, PAPI_L2_ICM, PAPI_TLB_DM, PAPI_TLB_IM.]
Counter Ranking
One model per metric:
   Runtime: $f_1(C_{11}, C_{12}, \ldots, C_{1n})$
   System Power: $f_2(C_{21}, C_{22}, \ldots, C_{2m})$
   CPU Power: $f_3(C_{31}, C_{32}, \ldots, C_{3s})$
   Memory Power: $f_4(C_{41}, C_{42}, \ldots, C_{4r})$
Ranking counters based on coefficient percentage:
   $\mathrm{Rank}(C_{11}, C_{12}, \ldots, C_{1n})$, $\mathrm{Rank}(C_{21}, C_{22}, \ldots, C_{2m})$, $\mathrm{Rank}(C_{31}, C_{32}, \ldots, C_{3s})$, $\mathrm{Rank}(C_{41}, C_{42}, \ldots, C_{4r})$
Ranking counters with percentage (>1%), from the highest to the lowest: $C_1, C_2, \ldots, C_k$
   Runtime: TLB_IM 64.29%, TLB_DM 14.03%, L2_ICM 10.49%, L1_ICM 9.75%, L2_ICA 1.40%
   Node Power: VEC_INS 76.64%, CA_SHR 22.45%
   CPU Power: VEC_INS 99.15%
   Memory Power: VEC_INS 83.91%, CA_CLN 13.74%
Resulting counters: TLB_IM, VEC_INS, TLB_DM, L2_ICM, L1_ICM, L2_ICA, CA_SHR, CA_CLN
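The >1% filtering step can be sketched as below, using the PMLB coefficient percentages transcribed from the preceding slides. The function and variable names are hypothetical. The sketch reproduces only the set of surviving counters and the models each one occurs in; the rule the authors used to order the merged list is not stated on the slide, so no ordering is attempted.

```python
THRESHOLD_PERCENT = 1.0

# Coefficient percentages for the original PMLB, transcribed from the slides.
coefficient_percentages = {
    "Runtime": {"TLB_IM": 64.29, "TLB_DM": 14.03, "L2_ICM": 10.49, "L1_ICM": 9.75,
                "L2_ICA": 1.40, "BR_INS": 0.03, "SR_INS": 0.01},
    "Node Power": {"VEC_INS": 76.64, "CA_SHR": 22.45, "L1_TCM": 0.89, "RES_STL": 0.02},
    "CPU Power": {"VEC_INS": 99.15, "BR_NTK": 0.81, "RES_STL": 0.04},
    "Memory Power": {"VEC_INS": 83.91, "CA_CLN": 13.74, "BR_NTK": 0.98, "L1_TCM": 0.92,
                     "RES_STL": 0.18, "BR_TKN": 0.16, "L1_ICA": 0.11},
}


def select_counters(percentages, threshold):
    """Map each counter above the threshold to the models it occurs in."""
    selected = {}
    for model, counters in percentages.items():
        for name, percent in counters.items():
            if percent > threshold:
                selected.setdefault(name, []).append(model)
    return selected


selected = select_counters(coefficient_percentages, THRESHOLD_PERCENT)
for name, models in selected.items():
    print(f"{name}: {', '.join(models)}")
```

Eight counters survive the filter, matching the list on the slide, and VEC_INS is the only one that appears in all three power models.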
Correlation Analysis Using Pair-wise Spearman
- TLB_IM: occurred in Runtime
    TLB_DM: Corr Value=0.89217296; occurred in Runtime
    BR_NTK: Corr Value=0.83305966; occurred in CPU, Memory
    L2_ICM: Corr Value=0.88451013; occurred in Runtime
    L1_ICM: Corr Value=0.96934866; occurred in Runtime
    L2_ICA: Corr Value=0.97044335; occurred in Runtime
    BR_TKN: Corr Value=0.88122605; occurred in Memory
    BR_INS: Corr Value=0.88122605; occurred in Runtime
- VEC_INS: occurred in System, CPU, Memory
Final counters for optimization focus: TLB_IM and VEC_INS
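A pair-wise Spearman pruning step of this kind can be sketched as follows: walk the counters in rank order and drop any counter whose rank correlation with an already kept counter is too high. The cutoff of 0.8, the function names, and the sample values are all assumptions made for illustration; the slide does not state the threshold the authors used.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def prune_correlated(ranked_names, samples, threshold):
    """Keep a counter only if it is not strongly correlated with any kept one."""
    kept = []
    for name in ranked_names:
        if all(abs(spearman(samples[name], samples[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical counter values over six runs (not measured data).
samples = {
    "TLB_IM": [10, 20, 30, 40, 50, 60],
    "TLB_DM": [11, 19, 33, 38, 55, 58],   # rises with TLB_IM
    "VEC_INS": [5, 1, 6, 2, 4, 3],        # unrelated ordering
}
final_counters = prune_correlated(["TLB_IM", "TLB_DM", "VEC_INS"], samples, 0.8)
```

With these made-up samples, TLB_DM is dropped because it moves monotonically with TLB_IM, leaving TLB_IM and VEC_INS.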
Performance for PMLB with 128x128x128 on SystemG
[Figure: time (s, log2 scale, 16 to 256) versus number of cores (1 to 128). Series: Original, Optimized.]
Node Power Comparison on SystemG
[Figure: power per node (W, 250 to 350) versus number of cores (1 to 128). Series: Original, Optimized.]
Energy Comparison for PMLB on SystemG
[Figure: energy per node (J, log2 scale, 4096 to 65536) versus number of cores (log2 scale, 1 to 128). Series: Original, Optimized.]
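The three comparisons above are linked by energy = average power x runtime. A minimal sketch of that arithmetic follows, using hypothetical numbers rather than values read from the figures, to show that an optimized run can draw more power per node and still use less energy if it finishes sooner.

```python
def energy_per_node(average_power_w, runtime_s):
    """Energy (J) from average node power (W) and runtime (s)."""
    return average_power_w * runtime_s


def percent_saving(original, optimized):
    """Relative reduction of the optimized value versus the original, in percent."""
    return 100.0 * (original - optimized) / original


# Hypothetical values for one core count (not taken from the slides).
original_energy = energy_per_node(300.0, 100.0)    # 300 W for 100 s
optimized_energy = energy_per_node(310.0, 80.0)    # 310 W for 80 s

saving = percent_saving(original_energy, optimized_energy)
```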
Counter Ranking for Original PMLB on Mira
[Chart: coefficient percentage (%, 0 to 100) for each model (Runtime, System Power, CPU Power, Memory Power). Legend: PAPI_FDV_INS, PAPI_FML_INS, PAPI_RES_STL, PAPI_VEC_INS, PAPI_FP_INS, PAPI_SR_INS, PAPI_BR_NTK, PAPI_BR_MSP, PAPI_L1_ICM, PAPI_HW_INT.]
Performance Comparison on Mira
[Figure: time (s, log2 scale, 4 to 256) versus number of nodes x number of threads per node. Series: Original, Optimized.]
System Power Comparison on Mira
[Figure: power per node (W, 45 to 61) versus number of nodes x number of threads per node. Series: Original, Optimized.]
Energy Comparison for PMLB with 512x512x512 on Mira
[Figure: energy per node (J, log2 scale, 256 to 16384) versus number of nodes x number of threads per node. Series: Original, Optimized.]
Counter Ranking on SystemG
[Chart: Counter Ranking for Original eq3dyna on SystemG. Coefficient percentage (%, 0 to 100) for each model (Runtime, System Power, CPU Power, Memory Power). Legend: PAPI_SR_INS, PAPI_FDV_INS, PAPI_L2_STM, PAPI_FML_INS, PAPI_L1_TCA, PAPI_RES_STL, PAPI_TLB_DM, PAPI_L1_STM, PAPI_L2_TCW, PAPI_BR_NTK, PAPI_FP_INS, PAPI_L2_DCW, PAPI_L2_ICA, PAPI_L1_ICM.]
Energy Comparison of eq3dyna on SystemG
[Figure: energy per node (J, log2 scale, 4000 to 512000) versus number of cores (log2 scale, 1 to 256). Series: Original, Optimized.]
Counter Ranking on ANL BGQ Mira
[Chart: Counter Ranking for Original eq3dyna on ANL BGQ Mira. Coefficient percentage (%, 0 to 100) for each model (Runtime, System Power, CPU Power, Memory Power). Legend: PAPI_L1_DCM, PAPI_SR_INS, PAPI_BR_NTK, PAPI_L1_STM, PAPI_RES_STL, PAPI_LD_INS, PAPI_BR_MSP, PAPI_VEC_INS.]
Energy Comparison for eq3dyna with 100m on ANL BG/Q Mira
[Figure: energy per node (J, 0 to 60000) versus number of nodes x number of threads per node (max number of threads per core is 4). Configurations: 32x16, 32x32, 32x64, 64x16, 64x32, 64x64, 128x16, 128x32, 128x64, 192x16, 192x32, 192x64, 256x16, 256x32, 256x64. Series: Original, Optimized.]