Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer
Haohuan Fu
haohuan@tsinghua.edu.cn
CESS, Tsinghua University
National Supercomputing Center in Wuxi
Oct 5th, 2016 @ CCDSC
Sunway TaihuLight: an Overview
Homegrown many-core processor: SW26010
• 260 cores per chip
• 3 Tflops
The first system in the world to provide over 100 Pflops of performance with over 10 million cores
• theoretical peak of 125 Pflops, a 2.5 times improvement over the previous fastest system
• LINPACK performance of 93 Pflops, a 3 times improvement over the previous fastest system
High efficiency of the overall system
• 6.05 Gflops/Watt, 3 to 6 times better than Tianhe-2, Titan, and K
Three full-scale applications selected as 2016 Gordon Bell finalists
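The headline figures above can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the numbers quoted on the slide; it assumes (this is not stated on the slide) that the Gflops/Watt figure refers to the LINPACK run, and the variable names are illustrative.

```python
# Figures quoted on the slide.
peak_pflops = 125.0       # theoretical peak
linpack_pflops = 93.0     # measured LINPACK performance
gflops_per_watt = 6.05    # reported energy efficiency

# Fraction of the theoretical peak sustained by LINPACK.
linpack_efficiency = linpack_pflops / peak_pflops

# Assumption: the Gflops/Watt figure was measured on the LINPACK run.
# 1 Pflops = 1e6 Gflops, so watts = Gflops / (Gflops per Watt).
implied_power_watts = linpack_pflops * 1e6 / gflops_per_watt
implied_power_mw = implied_power_watts / 1e6

print(f"LINPACK efficiency: {linpack_efficiency:.1%}")   # 74.4%
print(f"Implied power draw: {implied_power_mw:.2f} MW")  # 15.37 MW
```

Under that assumption, the system sustains roughly three quarters of its peak on LINPACK at a power draw of about 15 MW.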
SW26010: General Architecture
[Figure: the SW26010 chip. Four core groups (Core Group 0 to 3), each with an MPE, an 8*8 CPE mesh, a PPU, an iMC, and attached memory, connected by a NoC. Inside the CPE mesh: computing cores, row communication bus, column communication bus, data transfer network, control network, transfer agent (TA), LDM, and registers. Levels shown: memory level, LDM level, register level, computing level.]
Earth System Modeling and HPC: the Current Computational Challenges
More and more component models
[Figure: an Earth system model built around a coupler. Components: atmosphere model, ocean model, land model, ice model, space weather, marine biology, atmospheric chemistry, land biology, dynamic ice, hydrological process, solid earth. Interfaces: ocean-atmosphere boundary, ocean-ice boundary, land-atmosphere boundary, ice-land boundary.]
Increase in Spatial and Temporal Resolution to be Cloud-Resolving and Eddy-Resolving
Simulation of more and more detailed physics processes
[Figure: Simulation of Cloud Droplet Formation]
Online Ensembles
[Figure: ensemble panels labeled TH240_N_111, TH240_CAM, TH240_ATMP3, and TH240_BCC; caption: Simulation of Cloud Droplet Formation]
The Gap between Software and Hardware
• millions of lines of legacy code
• poor scalability
• written for multi-core, rather than many-core
China’s supercomputers (100P)
• heterogeneous systems with many-core chips
• scaling to millions of cores
China’s models (100T)
• pure CPU code
• scaling to hundreds or thousands of cores
Our Research Goals
• highly scalable framework that can efficiently utilize many-core processors
• automated tools to deal with the legacy code
China’s supercomputers
• heterogeneous systems with many-core chips
• scaling to millions of cores
China’s models (100T~1P)
• pure CPU code
• scaling to hundreds or thousands of cores
Example: Highly-Scalable Atmospheric Simulation Framework
• Yang, Chao (Institute of Software, CAS): computational mathematics
• Wang, Lanning (Beijing Normal University): climate modeling
• Xue, Wei (Tsinghua University): computer science
• Fu, Haohuan (Tsinghua University): geo-computing
The “Best” Computational Solution
• Application: cloud resolving
• Algorithm: cube-sphere grid or other grid; explicit, implicit, or semi-implicit method
• Architecture: Sunway, GPU, MIC, FPGA; C/C++, Fortran, MPI, CUDA, Java, …
• [2013 PPoPP]: 2D SWE model, 0.8 million CPU-GPU cores, 0.8 Pflops on Tianhe-1A
• [2013 FPL]: 2D SWE on one FPGA chip, a further 6~10x improvement in performance and power efficiency
• [2014 IPDPS]: 2D SWE model, 1.6 million CPU-MIC cores, 1.63 Pflops on Tianhe-2
• [2014 TC]: 3D nonhydrostatic model, 1.2 million CPU-MIC cores, 1.74 Pflops on Tianhe-2
• [2016 SC]: 3D nonhydrostatic model, 10.6 million Sunway cores, 8 Pflops on TaihuLight
Our Research Goals
• highly scalable framework that can efficiently utilize many-core accelerators
• automated tools to deal with the legacy code
China’s supercomputers
• heterogeneous systems with many-core chips
• scaling to millions of cores
China’s models (100T~1P)
• pure CPU code
• scaling to hundreds or thousands of cores
Earth System Modeling and HPC: Our Efforts on Refactoring CAM
THE CESM PROJECT
[Figure: three configurations: F case (atmosphere + land), G case (ocean + sea ice), B case (fully coupled)]
Tsinghua + BNU
• Four component models, millions of lines of code
• Large-scale run on Sunway TaihuLight
• 24,000 MPI processes
• Over one million cores
• 10-20x speedup for kernels
• 2-3x speedup for the entire model
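One way to read the gap between the kernel speedup and the whole-model speedup is through Amdahl's law. The sketch below is an illustration only, not an analysis from the talk: it assumes both speedups are measured against the same baseline and that a single fraction p of the baseline run time is accelerated uniformly. The function names are hypothetical.

```python
def overall_speedup(kernel_speedup: float, fraction: float) -> float:
    """Amdahl's law: a fraction of the run time is sped up, the rest is not."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def accelerated_fraction(kernel_speedup: float, overall: float) -> float:
    """Invert Amdahl's law to find the fraction of run time that was accelerated."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


# Pair the low and high ends of the ranges quoted on the slide.
low = accelerated_fraction(kernel_speedup=10.0, overall=2.0)
high = accelerated_fraction(kernel_speedup=20.0, overall=3.0)
print(f"accelerated fraction of run time: {low:.2f} to {high:.2f}")
```

Under these assumptions, the accelerated kernels would cover somewhere between about half and about 70% of the original run time, which is consistent with a code base that has many modest hotspots rather than a few dominant ones.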
Major Challenges
• high complexity in the application, and a heavy legacy in the code base (millions of lines of code)
• an extremely complicated MPMD program with no hotspots (or hundreds of hotspots)
• misfit between the in-place design philosophy and the new architecture
• lack of people with interdisciplinary knowledge and experience
Workflow of CAM
[Figure: boxes labeled CAM initial, Phy_run1, Phy_run2, and Dyn_run, connected by arrows labeled "Pass tracers (u, v) to dynamics", "Pass state variables", and "Pass state variables and tracers"]
After initialization, the physics and the dynamics are executed in turn during each simulation time-step.
Porting of CAM: General Idea
• Entire code base: 530,000 lines of code
• Components with regular code patterns
  - e.g. the CAM-SE dynamic core
  - manual OpenACC parallelization and optimization on code and data structures
• Components with irregular and complex code patterns
  - e.g. the CAM physics schemes
  - loop transformation tool to expose the right level of parallelism and code size
  - memory footprint analysis and reduction tool
Refactoring the Euler Step
Euler_step:
    do ie = nets, nete
        compute Q min/max values for lim8
        compute Biharmonic mixing term f
    end do                                  (part 1)
    do ie = nets, nete
        2D advection step
        data packing
    end do                                  (part 2)
    Boundary exchange
    Data extracting
Part 1:
    do ie = nets, nete
        do k = 1, nlev
            dp(k) = func_1()
        end do
        do q = 1, qsize
            do k = 1, nlev
                ….
                Qtens(k,q,ie) = func_2(dp(k))
            end do
        end do
    end do
    do ie = nets, nete
        do k = 1, nlev
            do q = 1, qsize
                qmin(k,q,ie) = …
                qmax(k,q,ie) = …
            end do
        end do
    end do
    do ie = nets, nete
        do k = 1, nlev
            dp0 = func_3()
            dpdiss = func_4()
            do q = 1, qsize
                Qtens(k,q,ie) = func_5(dp0, dpdiss)
            end do
        end do
    end do
Part 2:
    do ie = nets, nete
        do k = 1, nlev
            dp(k) = func_5()
            Vstar(k) = func_6()
        end do
        do q = 1, qsize
            do k = 1, nlev
                Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
            end do
            do k = 1, nlev
                dp_star(k) = func_8(dp(k))
            end do
            do k = 1, nlev
                Qtens(k,q,ie) = func_9(dp_star(k))
            end do
            Data packing
        end do
    end do
Refactoring the Euler Step
Part 1, original:
    do ie = nets, nete
        do k = 1, nlev
            dp(k) = func_1()
        end do
        do q = 1, qsize
            do k = 1, nlev
                ….
                Qtens(k,q,ie) = func_2(dp(k))
            end do
        end do
    end do
    do ie = nets, nete
        do k = 1, nlev
            do q = 1, qsize
                qmin(k,q,ie) = …
                qmax(k,q,ie) = …
            end do
        end do
    end do
    do ie = nets, nete
        do k = 1, nlev
            dp0 = func_3()
            dpdiss = func_4()
            do q = 1, qsize
                Qtens(k,q,ie) = func_5(dp0, dpdiss)
            end do
        end do
    end do
Part 1, optimized:
    do ie = nets, nete
        do q = 1, qsize
            do k = 1, nlev
                ….
                Qtens(k,q,ie) = func_2(func_1())
            end do
        end do
    end do
    do ie = nets, nete
        do k = 1, nlev
            do q = 1, qsize
                qmin(k,q,ie) = …
                qmax(k,q,ie) = …
            end do
        end do
    end do
    do ie = nets, nete
        do k = 1, nlev
            do q = 1, qsize
                Qtens(k,q,ie) = func_5(func_3(),func_4())
            end do
        end do
    end do
Part 2, original:
    do ie = nets, nete
        do k = 1, nlev
            dp(k) = func_5()
            Vstar(k) = func_6()
        end do
        do q = 1, qsize
            do k = 1, nlev
                Qtens(k,q,ie) = func_7(dp(k), Vstar(k))
            end do
            do k = 1, nlev
                dp_star(k) = func_8(dp(k))
            end do
            do k = 1, nlev
                Qtens(k,q,ie) = func_9(dp_star(k))
            end do
            Data packing
        end do
    end do
Part 2, optimized:
    do ie = nets, nete
        do q = 1, qsize
            do k = 1, nlev
                Qtens(k,q,ie) = func_7(func_5(),func_6())
            end do
            do k = 1, nlev
                Qtens(k,q,ie) = func_9(func_8(func_5()))
            end do
            Data packing
        end do
    end do
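The core of the transformation on this slide is replacing a per-level temporary array (dp) with an inlined call inside the tracer loop. The following is a minimal Python sketch of that idea for a single element, not the CAM-SE code itself: func_1 and func_2 are hypothetical stand-ins with arbitrary formulas, and the array sizes are illustrative.

```python
def func_1(k: int) -> float:
    """Hypothetical stand-in for the per-level quantity (dp) in the slide."""
    return 1.0 + 0.5 * k


def func_2(dp: float, k: int, q: int) -> float:
    """Hypothetical stand-in for the tracer tendency computed from dp."""
    return dp * (q + 1) + k


def euler_original(nlev: int, qsize: int) -> list:
    # First loop fills a temporary array dp(k); second loop nest consumes it.
    dp = [func_1(k) for k in range(nlev)]
    qtens = [[0.0] * qsize for _ in range(nlev)]
    for q in range(qsize):
        for k in range(nlev):
            qtens[k][q] = func_2(dp[k], k, q)
    return qtens


def euler_fused(nlev: int, qsize: int) -> list:
    # The producer is inlined into the consumer: no temporary array is kept,
    # and every (q, k) iteration is independent of the others.
    qtens = [[0.0] * qsize for _ in range(nlev)]
    for q in range(qsize):
        for k in range(nlev):
            qtens[k][q] = func_2(func_1(k), k, q)
    return qtens


# Both versions must produce identical tendencies.
same = euler_original(nlev=4, qsize=3) == euler_fused(nlev=4, qsize=3)
print("results match:", same)
```

The fused version recomputes func_1 once per tracer instead of once per level, trading some redundant arithmetic for a smaller working set and a single loop nest. On a many-core chip with a small local memory per core, that trade is presumably what makes the optimized form attractive.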