Simulating Stencil-based Application on Future Xeon-Phi Processor
PMBS Workshop at SC'15
Chitra Natarajan (Intel Corporation), Carl Beckmann (Intel Corporation), Anthony Nguyen (Intel Corporation), Mauricio Araya-Polo (Shell Intl. E&P Inc.), Tryggve Fossum (Intel Corporation), Detlef Hohl (Shell Intl. E&P Inc.)
Introduction
• Software/Hardware Co-design
  • Simulate high-value software portfolio ahead of hardware availability
  • Collaborative effort to influence both future software and hardware development
  • Target software: stencil-based O&G hydrocarbon exploration application
  • Target hardware: Xeon Phi processor
• Outline
  • Stencil-based O&G hydrocarbon exploration application
  • Knights Landing (KNL) Xeon Phi processor
  • Cycle-Accurate Models (CAM) & Fast-Abstract Models (FAM)
  • Correlation of CAM to a real system for an existing processor (Xeon SNB)
  • Correlation of FAM to CAM for KNL
  • CAM/FAM KNL simulation results
O&G Hydrocarbon Exploration Target Application
• Data acquisition, on/off shore
• Seismic imaging, wave equations (Du, Fletcher, and Fowler, EAGE 2010), VTI assumption:

$$\frac{\partial^2 p}{\partial t^2} = V_x^2\left(\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2}\right) + V_z^2\,\frac{\partial^2 q}{\partial z^2}$$

$$\frac{\partial^2 q}{\partial t^2} = V_n^2\left(\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2}\right) + V_z^2\,\frac{\partial^2 q}{\partial z^2}$$

$$V_x = V_z\sqrt{1+2\varepsilon}, \qquad V_n = V_z\sqrt{1+2\delta}$$
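To make the link to the stencil code on the next slide explicit, here is a generic second-order-in-time, high-order-in-space explicit update for the $p$ wavefield (a sketch: the central-difference coefficients $a_m$, $b_m$ — with grid spacings absorbed — and the radii $R_{xy}$, $R_z$ are standard notation, not values taken from the application):

$$p^{\,n+1}_{i,j,k} = 2p^{\,n}_{i,j,k} - p^{\,n-1}_{i,j,k} + \Delta t^{2}\left[ V_x^{2}\sum_{m=-R_{xy}}^{R_{xy}} a_m\left(p^{\,n}_{i+m,j,k} + p^{\,n}_{i,j+m,k}\right) + V_z^{2}\sum_{m=-R_z}^{R_z} b_m\, q^{\,n}_{i,j,k+m}\right]$$

with the analogous update for $q$ using $V_n$ in place of $V_x$.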
O&G Hydrocarbon Exploration Target Application
1. MPI+X model; in this work X = OpenMP, and only single-process behavior is analyzed
2. Wave equation PDE solved explicitly; stencil-based code, high-order 24-24-16
3. Implemented as two major loops: loop1 (sweeping Z) and loop2 (sweeping X and Y) — a minimal sketch follows below
4. Key issues: data dependency (memory bound) and low data reuse
[Figure: memory access patterns of loop1 and loop2]
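Below is a minimal C/OpenMP sketch of this two-loop structure, with x as the unit-stride dimension (consistent with the x-vectorization discussed later). All identifiers (NX/NY/NZ, radii RZ/RXY, coefficient arrays cz/cxy) are illustrative assumptions, not the application's own code.

```c
#include <omp.h>

enum { NX = 448, NY = 95, NZ = 446,   /* grid from the loop1 study     */
       RZ = 8,                        /* 16th-order vertical stencil   */
       RXY = 12 };                    /* 24th-order lateral stencil    */

static inline long at(int x, int y, int z) {
    return ((long)z * NY + y) * NX + x;          /* x contiguous */
}

/* loop1: sweep Z -- high-order vertical derivative of q.
 * The z +/- r accesses stride across whole XY planes: little reuse,
 * hence the memory-bound behavior noted above.                     */
void loop1(const float *restrict q, float *restrict dz2,
           const float *restrict cz)             /* cz[0..RZ]   */
{
    #pragma omp parallel for collapse(2)
    for (int z = RZ; z < NZ - RZ; z++)
        for (int y = 0; y < NY; y++)
            #pragma omp simd
            for (int x = 0; x < NX; x++) {
                float acc = cz[0] * q[at(x, y, z)];
                for (int r = 1; r <= RZ; r++)
                    acc += cz[r] * (q[at(x, y, z + r)] + q[at(x, y, z - r)]);
                dz2[at(x, y, z)] = acc;
            }
}

/* loop2: sweep X & Y -- high-order lateral derivatives of p. */
void loop2(const float *restrict p, float *restrict dxy2,
           const float *restrict cxy)            /* cxy[0..RXY] */
{
    #pragma omp parallel for collapse(2)
    for (int z = 0; z < NZ; z++)
        for (int y = RXY; y < NY - RXY; y++)
            #pragma omp simd
            for (int x = RXY; x < NX - RXY; x++) {
                float acc = 2.0f * cxy[0] * p[at(x, y, z)];
                for (int r = 1; r <= RXY; r++)
                    acc += cxy[r] * (p[at(x + r, y, z)] + p[at(x - r, y, z)]
                                   + p[at(x, y + r, z)] + p[at(x, y - r, z)]);
                dxy2[at(x, y, z)] = acc;
            }
}
```

loop2 reuses the contiguous x line well, while loop1's vertical window is what stresses memory bandwidth.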
Cycle Accurate Model (CAM) vs. Fast Abstract Model (FAM)

Cycle Accurate Model (CAM)
• Cycle accurate performance model
• Validated extensively against silicon
• Developed by product design teams across generations over many years
• Slow simulation speed: ~1K instructions per real second
• Difficult to simulate more than a few 10's of millions of instructions per test
• Difficult to scale to > a few 10's of threads
• Primarily used trace-driven method; execution-driven method added
• Uses Intel SDE as functional emulator

Fast Abstract Model (FAM)
• Does not model in cycle accurate detail
• Correlated against CAM: accuracy vs. CAM ~ +/- 20% over a wide range of ST workloads
• Trades accuracy for speed: ~100K – 10M instructions per second
• Can simulate 10's of billions of instructions per test
• Simulates multiple cores and threads
• Methodologies supported: trace-driven and execution-driven
Xeon SNB E5-2690 EMON CPI Data for 20 Timesteps
• CPI (Cycles Per Instruction) measured with EMON
• The 20 time steps are clearly visible, with ~2/3 of each step at a CPI of ~0.53 and ~1/3 at ~0.46
• The two CPI levels reflect the two loops per time step
CAM Model to Real System Correlation on Xeon SNB
• Representative SimPoint-based tracing resulted in 5 regions/traces (combined by weight, as sketched below)
• As expected, 2 traces dominate, corresponding to the 2 loops, with ~70% and ~29% weights
• The 20-time-step execution comprises ~138.6B instructions
• Good correlation of CAM simulation data to real-system measurement data
• CPI and LLC MPI (Last-Level Cache Misses Per Instruction) within 2%, overall runtime within 3%
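For context on how the five traces combine: under SimPoint methodology, whole-run metrics are typically reconstructed as a weighted sum of per-trace results (a sketch of the standard weighting; the three minor traces' weights, which must account for the remaining ~1%, are not listed above):

$$\mathrm{CPI}_{\text{overall}} \approx \sum_{i=1}^{5} w_i \,\mathrm{CPI}_i,\qquad w_1 \approx 0.70,\; w_2 \approx 0.29,\; \sum_{i} w_i = 1.$$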
FAM vs. CAM Correlation for KNL
• Configuration simulated: Xeon Phi "Knights Landing" core; 1 to 8 cores; 2 cores per tile; 1 to 4 SMT threads per core
• Metrics compared: IPC, L1 and L2 cache miss rates, speedup
[Charts: FAM vs. CAM for Loop1 — total IPC, L1D MPKI, and L2 MPKI; CAM vs. FAM Loop1 speedup, each across 1–8 cores and 1/2/4 SMT threads per core]
• Correlation typically in the ~20% range for 1 thread, but worsens with SMT
• FAM vs. CAM speedup trends are similar to each other
Tile Scaling Study on CAM
• Cycle accurate model experiments
  • 1 to 16 tiles (2 to 32 cores)
  • Execution driven
  • Cache sharing modeled accurately
• Two main loops simulated partially
  • Only 3 loop iterations per thread due to simulation time limits
  • More than enough to warm up L2 caches
• Stencils-per-second figure of merit (see the formula below)
  • Measured time to complete a fixed amount of work
[Charts: VTI loop1 and loop2 scaling — speedup relative to 1 tile DDR-only vs. number of tiles (0–16); series: DDR only, DDR only (tiny), MCDRAM-only, MCDRAM-only (tiny), ideal]
• DDR-only: tile scaling limited to ~4 due to BW limits
• MCDRAM-only: tile scaling quite good for the full range that could be simulated
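Spelling out the figure of merit as we read it from the bullet above (a hedged interpretation; the slide does not give the formula explicitly): the number of grid points updated, times the number of stencil iterations, divided by elapsed time,

$$\text{stencils/s} \;=\; \frac{N_x\, N_y\, N_z \cdot N_{\text{iter}}}{t_{\text{elapsed}}}.$$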
Hand Optimization Study on CAM
• Loop 1: 1-D vertical 16th-order stencil
• Compiled code performed poorly
  • 1.5B stencils/s theoretical roofline; sims showed ~25% of theoretical
  • Inefficient use of cache & vectors
• Hand-optimized code (see the sketch below)
  • Vectorize in the x direction
  • Stripmine the loop in the z direction
  • Better reuse in AVX registers; less L1 cache bandwidth
  • Achieved up to 3.0x speedup
  • Z array size hazard observed!
[Charts: VTI hand-optimized loop1, 448x95x446 — speedup over compiled code across thread/core/tile configurations (tpc 1/2/4, cpt 1/2, tl 1/2); and VTI hand-optimized loop1, 448 x Y x 446 — relative performance vs. Y problem size (89–100); series: base, nts, pre, nts_pre]
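A hedged sketch of the stripmine-in-Z idea described above, reusing the definitions from the loop1 sketch earlier: each x-vector computes ZB consecutive z outputs at once, so every loaded input value feeds up to ZB register-resident partial sums instead of being re-fetched from L1 for each output. ZB is an illustrative block size, not the authors' choice; the code assumes (NZ - 2*RZ) is a multiple of ZB.

```c
enum { ZB = 4 };   /* illustrative z block size */

void loop1_stripmined(const float *restrict q, float *restrict dz2,
                      const float *restrict cz)
{
    #pragma omp parallel for collapse(2)
    for (int zb = RZ; zb < NZ - RZ; zb += ZB)
        for (int y = 0; y < NY; y++)
            #pragma omp simd
            for (int x = 0; x < NX; x++) {
                float acc[ZB] = {0.0f};
                /* one pass over the shared input window feeds all ZB outputs */
                for (int r = -RZ; r <= RZ + ZB - 1; r++) {
                    float v = q[at(x, y, zb + r)];
                    for (int b = 0; b < ZB; b++) {
                        int d = r - b;            /* offset for output zb+b */
                        if (-RZ <= d && d <= RZ)
                            acc[b] += cz[d < 0 ? -d : d] * v;
                    }
                }
                for (int b = 0; b < ZB; b++)
                    dz2[at(x, y, zb + b)] = acc[b];
            }
}
```

Per output point, the plain loop1 loads 2·RZ+1 = 17 input values, while the stripmined version loads (2·RZ+ZB)/ZB = 5 for ZB = 4, consistent with the "less L1 cache bandwidth" bullet above.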
Impact of Memory Technologies Study using FAM
[Chart: speedup with 10 time steps of the small input, 1–16 tiles, for DDR-only, MCDRAM-cache 1GB, MCDRAM-cache 16GB, and MCDRAM-only]
• When the working set (4GB) fits in MCDRAM (16GB), scaling for MCDRAM-as-cache approaches MCDRAM-only
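On production KNL hardware (outside the simulator), the "MCDRAM-only" configuration studied above corresponds to flat-mode MCDRAM placement, achievable with the memkind library's hbwmalloc API; "MCDRAM-as-cache" is a BIOS-selected mode that needs no code change. A minimal sketch, assuming libmemkind is installed (the grid size is the paper's; all other names are illustrative):

```c
#include <hbwmalloc.h>   /* memkind's high-bandwidth-memory API */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n   = (size_t)448 * 95 * 446;          /* one wavefield array  */
    int    hbw = (hbw_check_available() == 0);    /* 0 => MCDRAM present  */
    float *p   = hbw ? hbw_malloc(n * sizeof *p)  /* place in MCDRAM      */
                     : malloc(n * sizeof *p);     /* fall back to DDR     */
    if (!p) return 1;
    printf("wavefield allocated in %s\n", hbw ? "MCDRAM" : "DDR");
    /* ... time-stepping with loop1/loop2 would go here ... */
    if (hbw) hbw_free(p); else free(p);
    return 0;
}
```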
Memory Technology Study using FAM: Memory Bandwidth Utilization
[Chart: memory BW utilization (%) for 10 time steps, 1–16 tiles, for DDR-only, MCDRAM-cache 1GB, MCDRAM-cache 16GB, and IPM-only (in-package memory, i.e., MCDRAM-only); series: DRAM cache BW util, DRAM BW util]
• When the working set (4GB) fits in the MCDRAM cache (16GB), DDR is accessed only once, so DDR BW is not an issue
Conclusion & Future Work
• Conclusion
  • Initial software/hardware co-design effort results presented
  • Used existing hardware for CAM model correlation, and CAM/FAM models of future hardware
  • Co-design improved mutual understanding & optimization of software with hardware
  • Enabled code hand-optimization performance study ahead of hardware
  • Enabled studying the impact of new hardware memory features on the target application ahead of hardware
• Future Work
  • Study multi-node distributed-memory scenarios for the target application
  • Co-design other future products – software & hardware
Acknowledgments
• Intel Corporation & Shell International: for allowing the work to be shared
• CAM & FAM modeling teams: for developing the models and supporting our use of them
Backup
Cycle Accurate Model (CAM)
• Cycle accurate performance model
• Developed by product design teams across generations over many years
• Validated against silicon
• Slow simulation speed
  • Approx. 1,000 simulated instructions per real second
  • Difficult to simulate more than a few tens of millions of instructions per experiment
  • Difficult to scale to more than a few tens of threads
• Primarily used by product teams with trace-driven methodology
  • Execution-driven methodology added in this project
  • Uses Intel SDE as functional emulator
Fast Abstract Model (FAM)
• Fast multithreaded performance model
  • Simulates multiple cores and threads
  • Simulator runs multithreaded
  • Approx. 100K – 10M instructions per second, depending on detail
• Trades accuracy for speed, correlated against CAM
  • Does not model in cycle accurate detail
  • Accuracy vs. CAM typically within +/- 20% over a wide range of ST workloads
• Methodologies supported
  • Trace-driven
  • Execution-driven