Simulating Stencil-based Application on Future Xeon-Phi Processor
PMBS Workshop at SC'15
Chitra Natarajan (Intel Corporation), Carl Beckmann (Intel Corporation), Anthony Nguyen (Intel Corporation), Mauricio Araya-Polo (Shell Intl. E&P Inc.), Tryggve Fossum (Intel Corporation), Detlef Hohl (Shell Intl. E&P Inc.)
Introduction
• Software/Hardware Co-design
  • Simulate high-value software portfolio ahead of hardware availability
  • Collaborative effort to influence both future software and hardware development
  • Target software: stencil-based O&G hydrocarbon exploration application
  • Target hardware: Xeon Phi processor
• Outline
  • Stencil-based O&G hydrocarbon exploration application
  • Knights Landing (KNL) Xeon Phi processor
  • Cycle-Accurate Models (CAM) & Fast-Abstract Models (FAM)
  • Correlation of CAM to a real system for an existing processor (Xeon SNB)
  • Correlation of FAM to CAM for KNL
  • CAM/FAM KNL simulation results
O&G Hydrocarbon Exploration Target Application
• Data acquisition, on/off shore
• Seismic imaging, wave equations (Du, Fletcher, and Fowler, EAGE 2010), VTI assumption:

$$\frac{\partial^2 p}{\partial t^2} = V_x^2\left(\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2}\right) + V_z^2\,\frac{\partial^2 q}{\partial z^2}$$

$$\frac{\partial^2 q}{\partial t^2} = V_n^2\left(\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2}\right) + V_z^2\,\frac{\partial^2 q}{\partial z^2}$$

$$V_x = V_z\sqrt{1+2\varepsilon}, \qquad V_n = V_z\sqrt{1+2\delta}$$
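To make the link to the stencil code on the next slide explicit, here is a generic second-order-in-time, high-order-in-space explicit update for the $p$ wavefield (a sketch: the central-difference coefficients $a_m$, $b_m$ — with grid spacings absorbed — and the radii $R_{xy}$, $R_z$ are standard notation, not values taken from the application):

$$p^{\,n+1}_{i,j,k} = 2p^{\,n}_{i,j,k} - p^{\,n-1}_{i,j,k} + \Delta t^{2}\left[ V_x^{2}\sum_{m=-R_{xy}}^{R_{xy}} a_m\left(p^{\,n}_{i+m,j,k} + p^{\,n}_{i,j+m,k}\right) + V_z^{2}\sum_{m=-R_z}^{R_z} b_m\, q^{\,n}_{i,j,k+m}\right]$$

with the analogous update for $q$ using $V_n$ in place of $V_x$.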
O&G Hydrocarbon Exploration Target Application
1. MPI+X model; in this work X = OpenMP, and only single-process behavior is analyzed
2. Wave equation PDE solved explicitly; stencil-based code, high-order 24-24-16
3. Implemented as two major loops: loop1 (sweeping Z) and loop2 (sweeping X and Y) — a minimal sketch follows below
4. Key issues: data dependency (memory bound) and low data reuse
[Figure: memory access patterns of loop1 and loop2]
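Below is a minimal C/OpenMP sketch of this two-loop structure, with x as the unit-stride dimension (consistent with the x-vectorization discussed later). All identifiers (NX/NY/NZ, radii RZ/RXY, coefficient arrays cz/cxy) are illustrative assumptions, not the application's own code.

```c
#include <omp.h>

enum { NX = 448, NY = 95, NZ = 446,   /* grid from the loop1 study     */
       RZ = 8,                        /* 16th-order vertical stencil   */
       RXY = 12 };                    /* 24th-order lateral stencil    */

static inline long at(int x, int y, int z) {
    return ((long)z * NY + y) * NX + x;          /* x contiguous */
}

/* loop1: sweep Z -- high-order vertical derivative of q.
 * The z +/- r accesses stride across whole XY planes: little reuse,
 * hence the memory-bound behavior noted above.                     */
void loop1(const float *restrict q, float *restrict dz2,
           const float *restrict cz)             /* cz[0..RZ]   */
{
    #pragma omp parallel for collapse(2)
    for (int z = RZ; z < NZ - RZ; z++)
        for (int y = 0; y < NY; y++)
            #pragma omp simd
            for (int x = 0; x < NX; x++) {
                float acc = cz[0] * q[at(x, y, z)];
                for (int r = 1; r <= RZ; r++)
                    acc += cz[r] * (q[at(x, y, z + r)] + q[at(x, y, z - r)]);
                dz2[at(x, y, z)] = acc;
            }
}

/* loop2: sweep X & Y -- high-order lateral derivatives of p. */
void loop2(const float *restrict p, float *restrict dxy2,
           const float *restrict cxy)            /* cxy[0..RXY] */
{
    #pragma omp parallel for collapse(2)
    for (int z = 0; z < NZ; z++)
        for (int y = RXY; y < NY - RXY; y++)
            #pragma omp simd
            for (int x = RXY; x < NX - RXY; x++) {
                float acc = 2.0f * cxy[0] * p[at(x, y, z)];
                for (int r = 1; r <= RXY; r++)
                    acc += cxy[r] * (p[at(x + r, y, z)] + p[at(x - r, y, z)]
                                   + p[at(x, y + r, z)] + p[at(x, y - r, z)]);
                dxy2[at(x, y, z)] = acc;
            }
}
```

loop2 reuses the contiguous x line well, while loop1's vertical window is what stresses memory bandwidth.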
Cycle Accurate Model (CAM) vs. Fast Abstract Model (FAM)

Cycle Accurate Model (CAM)
• Cycle accurate performance model
• Validated extensively against silicon
• Developed by product design teams across generations over many years
• Slow simulation speed: ~1K instructions per real second
• Difficult to simulate more than a few 10's of millions of instructions per test
• Difficult to scale to > a few 10's of threads
• Primarily used trace-driven method; execution-driven method added
• Uses Intel SDE as functional emulator

Fast Abstract Model (FAM)
• Does not model in cycle accurate detail
• Correlated against CAM: accuracy vs. CAM ~ +/- 20% over a wide range of ST workloads
• Trades accuracy for speed: ~100K – 10M instructions per second
• Can simulate 10's of billions of instructions per test
• Simulates multiple cores and threads
• Methodologies supported: trace-driven and execution-driven
Xeon SNB E5-2690 EMON CPI Data for 20 Timesteps
• CPI (Cycles Per Instruction) measured with EMON
• The 20 time steps are clearly visible, with ~2/3 of each step at a CPI of ~0.53 and ~1/3 at ~0.46
• The two CPI levels reflect the two loops per time step
CAM Model to Real System Correlation on Xeon SNB
• Representative SimPoint-based tracing resulted in 5 regions/traces (combined by weight, as sketched below)
• As expected, 2 traces dominate, corresponding to the 2 loops, with ~70% and ~29% weights
• The 20-time-step execution comprises ~138.6B instructions
• Good correlation of CAM simulation data to real-system measurement data
• CPI and LLC MPI (Last-Level Cache Misses Per Instruction) within 2%, overall runtime within 3%
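For context on how the five traces combine: under SimPoint methodology, whole-run metrics are typically reconstructed as a weighted sum of per-trace results (a sketch of the standard weighting; the three minor traces' weights, which must account for the remaining ~1%, are not listed above):

$$\mathrm{CPI}_{\text{overall}} \approx \sum_{i=1}^{5} w_i \,\mathrm{CPI}_i,\qquad w_1 \approx 0.70,\; w_2 \approx 0.29,\; \sum_{i} w_i = 1.$$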
FAM vs. CAM Correlation for KNL
• Configuration simulated: Xeon Phi "Knights Landing" core; 1 to 8 cores; 2 cores per tile; 1 to 4 SMT threads per core
• Metrics compared: IPC, L1 and L2 cache miss rates, speedup
[Charts: FAM vs. CAM for Loop1 — total IPC, L1D MPKI, and L2 MPKI; CAM vs. FAM Loop1 speedup, each across 1–8 cores and 1/2/4 SMT threads per core]
• Correlation typically in the ~20% range for 1 thread, but worsens with SMT
• FAM vs. CAM speedup trends are similar to each other
Tile Scaling Study on CAM
• Cycle accurate model experiments
  • 1 to 16 tiles (2 to 32 cores)
  • Execution driven
  • Cache sharing modeled accurately
• Two main loops simulated partially
  • Only 3 loop iterations per thread due to simulation time limits
  • More than enough to warm up L2 caches
• Stencils-per-second figure of merit (see the formula below)
  • Measured time to complete a fixed amount of work
[Charts: VTI loop1 and loop2 scaling — speedup relative to 1 tile DDR-only vs. number of tiles (0–16); series: DDR only, DDR only (tiny), MCDRAM-only, MCDRAM-only (tiny), ideal]
• DDR-only: tile scaling limited to ~4 due to BW limits
• MCDRAM-only: tile scaling quite good for the full range that could be simulated
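Spelling out the figure of merit as we read it from the bullet above (a hedged interpretation; the slide does not give the formula explicitly): the number of grid points updated, times the number of stencil iterations, divided by elapsed time,

$$\text{stencils/s} \;=\; \frac{N_x\, N_y\, N_z \cdot N_{\text{iter}}}{t_{\text{elapsed}}}.$$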
Hand Optimization Study on CAM
• Loop 1: 1-D vertical 16th-order stencil
• Compiled code performed poorly
  • 1.5B stencils/s theoretical roofline; sims showed ~25% of theoretical
  • Inefficient use of cache & vectors
• Hand-optimized code (see the sketch below)
  • Vectorize in the x direction
  • Stripmine the loop in the z direction
  • Better reuse in AVX registers; less L1 cache bandwidth
  • Achieved up to 3.0x speedup
  • Z array size hazard observed!
[Charts: VTI hand-optimized loop1, 448x95x446 — speedup over compiled code across thread/core/tile configurations (tpc 1/2/4, cpt 1/2, tl 1/2); and VTI hand-optimized loop1, 448 x Y x 446 — relative performance vs. Y problem size (89–100); series: base, nts, pre, nts_pre]
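A hedged sketch of the stripmine-in-Z idea described above, reusing the definitions from the loop1 sketch earlier: each x-vector computes ZB consecutive z outputs at once, so every loaded input value feeds up to ZB register-resident partial sums instead of being re-fetched from L1 for each output. ZB is an illustrative block size, not the authors' choice; the code assumes (NZ - 2*RZ) is a multiple of ZB.

```c
enum { ZB = 4 };   /* illustrative z block size */

void loop1_stripmined(const float *restrict q, float *restrict dz2,
                      const float *restrict cz)
{
    #pragma omp parallel for collapse(2)
    for (int zb = RZ; zb < NZ - RZ; zb += ZB)
        for (int y = 0; y < NY; y++)
            #pragma omp simd
            for (int x = 0; x < NX; x++) {
                float acc[ZB] = {0.0f};
                /* one pass over the shared input window feeds all ZB outputs */
                for (int r = -RZ; r <= RZ + ZB - 1; r++) {
                    float v = q[at(x, y, zb + r)];
                    for (int b = 0; b < ZB; b++) {
                        int d = r - b;            /* offset for output zb+b */
                        if (-RZ <= d && d <= RZ)
                            acc[b] += cz[d < 0 ? -d : d] * v;
                    }
                }
                for (int b = 0; b < ZB; b++)
                    dz2[at(x, y, zb + b)] = acc[b];
            }
}
```

Per output point, the plain loop1 loads 2·RZ+1 = 17 input values, while the stripmined version loads (2·RZ+ZB)/ZB = 5 for ZB = 4, consistent with the "less L1 cache bandwidth" bullet above.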
Impact of Memory Technologies Study using FAM
[Chart: speedup with 10 time steps of the small input, 1–16 tiles, for DDR-only, MCDRAM-cache 1GB, MCDRAM-cache 16GB, and MCDRAM-only]
• When the working set (4GB) fits in MCDRAM (16GB), scaling for MCDRAM-as-cache approaches MCDRAM-only
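On production KNL hardware (outside the simulator), the "MCDRAM-only" configuration studied above corresponds to flat-mode MCDRAM placement, achievable with the memkind library's hbwmalloc API; "MCDRAM-as-cache" is a BIOS-selected mode that needs no code change. A minimal sketch, assuming libmemkind is installed (the grid size is the paper's; all other names are illustrative):

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory API */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n   = (size_t)448 * 95 * 446;          /* one wavefield array  */
    int    hbw = (hbw_check_available() == 0);    /* 0 => MCDRAM present  */
    float *p   = hbw ? hbw_malloc(n * sizeof *p)  /* place in MCDRAM      */
                     : malloc(n * sizeof *p);     /* fall back to DDR     */
    if (!p) return 1;
    printf("wavefield allocated in %s\n", hbw ? "MCDRAM" : "DDR");
    /* ... time-stepping with loop1/loop2 would go here ... */
    if (hbw) hbw_free(p); else free(p);
    return 0;
}
```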
Memory Technology Study using FAM: Memory Bandwidth Utilization
[Chart: memory BW utilization (%) for 10 time steps, 1–16 tiles, for DDR-only, MCDRAM-cache 1GB, MCDRAM-cache 16GB, and IPM-only (in-package memory, i.e., MCDRAM-only); series: DRAM cache BW util, DRAM BW util]
• When the working set (4GB) fits in the MCDRAM cache (16GB), DDR is accessed only once, so DDR BW is not an issue
Conclusion & Future Work
• Conclusion
  • Initial software/hardware co-design effort results presented
  • Used existing hardware for CAM model correlation, and CAM/FAM models of future hardware
  • Co-design improved mutual understanding & optimization of software with hardware
  • Enabled code hand-optimization performance study ahead of hardware
  • Enabled studying the impact of new hardware memory features on the target application ahead of hardware
• Future Work
  • Study multi-node distributed-memory scenarios for the target application
  • Co-design other future products – software & hardware
Acknowledgments
• Intel Corporation & Shell International: for allowing the work to be shared
• CAM & FAM modeling teams: for developing the models and supporting our use of them
Backup
Cycle Accurate Model (CAM)
• Cycle accurate performance model
• Developed by product design teams across generations over many years
• Validated against silicon
• Slow simulation speed
  • Approx. 1,000 simulated instructions per real second
  • Difficult to simulate more than a few tens of millions of instructions per experiment
  • Difficult to scale to more than a few tens of threads
• Primarily used by product teams with trace-driven methodology
  • Execution-driven methodology added in this project
  • Uses Intel SDE as functional emulator
Fast Abstract Model (FAM)
• Fast multithreaded performance model
  • Simulates multiple cores and threads
  • Simulator runs multithreaded
  • Approx. 100K – 10M instructions per second, depending on detail
• Trades accuracy for speed, correlated against CAM
  • Does not model in cycle accurate detail
  • Accuracy vs. CAM typically within +/- 20% over a wide range of ST workloads
• Methodologies supported
  • Trace-driven
  • Execution-driven