Highly efficient and highly scalable global atmospheric modeling software for emerging heterogeneous supercomputers

Atmospheric models are among the oldest software tools to run on supercomputers, and they carry a heavy code legacy accumulated over the last few decades. As a result, it is becoming more and more challenging for them to gain performance benefits from emerging heterogeneous supercomputers with many-core processors or accelerators. Take China as an example: while we already have 100-Pflops supercomputers, the atmospheric models are still pure-CPU code and can only scale to a few thousand processes.

As shown in Fig. 1, to remove the performance barrier between the application software and the underlying hardware, we pursue two major lines of research: (1) designing highly scalable atmospheric dynamic solvers that both scale better and make better use of heterogeneous architectures; (2) developing paradigms and tools to migrate existing atmospheric models, which embody decades of scientific discoveries, to emerging heterogeneous supercomputers.

Development of highly scalable solver from scratch

The first line of research, shown on the left half of Fig. 1, is a long-term collaboration with Prof. Chao Yang from the Institute of Software, Chinese Academy of Sciences, Prof. Wei Xue from the Department of Computer Science, Tsinghua University, and Prof. Lanning Wang from the College of Global Change and Earth System Science, Beijing Normal University. The formation of such a multi-disciplinary team has been key to this long line of successful research achievements.

While my collaborators contribute on atmospheric models, numerical schemes, and MPI schemes, I mainly focus on developing a generalized partition scheme that supports various heterogeneous architectures, and on deriving efficient designs of the solver on many-core or reconfigurable accelerators.
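
To make the idea of partitioning work across unlike devices concrete, here is a minimal sketch in Python. It is not the partition scheme from the papers cited in this section; it only assumes that a mesh can be cut into contiguous blocks of rows and that each device has a known relative throughput. The function name `partition_rows`, the device names, and the throughput numbers are all hypothetical.

```python
def partition_rows(n_rows, throughputs):
    """Split n_rows contiguous mesh rows among devices in proportion to speed.

    `throughputs` maps a device name to a relative throughput (made-up values).
    Returns a dict: device name -> (start_row, end_row), with end exclusive.
    """
    total = sum(throughputs.values())
    names = list(throughputs)
    bounds = {}
    start = 0
    cumulative = 0.0
    for i, name in enumerate(names):
        cumulative += throughputs[name]
        # The last device always ends at n_rows, so rounding never drops a row.
        is_last = i == len(names) - 1
        end = n_rows if is_last else round(n_rows * cumulative / total)
        bounds[name] = (start, end)
        start = end
    return bounds


if __name__ == "__main__":
    # Assumed ratio: the accelerator is three times as fast as the CPU.
    print(partition_rows(1024, {"cpu": 1.0, "accelerator": 3.0}))
```

A static proportional split like this is the simplest possible choice. A production scheme would also have to account for halo exchange, memory capacity, and data transfer costs between host and accelerator, none of which are modeled here.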

We started with a 2D SWE (Shallow Water Equations) solver on Tianhe-1A [PPoPP13], and achieved a sustained double-precision performance of 581 Tflops by efficiently using both the CPU and GPU resources on 3,750 nodes.

The 2D SWE solver was later extended to a hybrid CPU-FPGA platform [FPL13]. We derived an algorithm that applies single and multiple FPGAs to compute the upwind stencil for the global shallow water equations. Through mixed-precision arithmetic, we managed to build a fully pipelined upwind stencil design on a single FPGA, which can perform 428 floating-point and 235 fixed-point operations per cycle. The algorithm using four FPGAs provides a 330-fold speedup over a 6-core CPU; it is also 14 times faster and 9 times more power efficient than the hybrid CPU-GPU node in Tianhe-1A. As the very first publication on using a mixed-precision design for atmospheric models, this paper [FPL13] was later selected as one of the 27 Significant Papers out of the 1,765 research papers in the first 25 years of FPL (International Conference on Field Programmable Logic and Applications). The mixed-precision concept has also been accepted by a number of climate research groups, with positive citations in atmospheric modeling journals such as Monthly Weather Review, Journal of Advances in Modeling Earth Systems (JAMES), Theoretical and Computational Fluid Dynamics, Quarterly Journal of the Royal Meteorological Society, and Proceedings of the Royal Society A.
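
The effect of reduced precision inside an upwind stencil can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the FPGA design of [FPL13]: it uses 1D linear advection with periodic boundaries in place of the global shallow water equations, and it emulates fixed-point arithmetic by rounding the stencil difference to a chosen number of fractional bits. All function names and parameter values are hypothetical.

```python
import math


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (emulated fixed point)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def upwind_step(u, courant, frac_bits=None):
    """One first-order upwind step for u_t + c*u_x = 0 with c > 0, periodic.

    courant = c*dt/dx should lie in [0, 1] for stability.
    If frac_bits is given, the stencil difference is rounded to fixed point.
    """
    new_u = []
    for i in range(len(u)):
        diff = u[i] - u[i - 1]  # u[-1] wraps around: periodic boundary
        if frac_bits is not None:
            diff = quantize(diff, frac_bits)
        new_u.append(u[i] - courant * diff)
    return new_u


def max_abs_error(n_cells=64, n_steps=32, courant=0.5, frac_bits=16):
    """Largest gap between a full-precision run and a reduced-precision run."""
    u_ref = [math.sin(2.0 * math.pi * i / n_cells) for i in range(n_cells)]
    u_mix = list(u_ref)
    for _ in range(n_steps):
        u_ref = upwind_step(u_ref, courant)
        u_mix = upwind_step(u_mix, courant, frac_bits)
    return max(abs(a - b) for a, b in zip(u_ref, u_mix))


if __name__ == "__main__":
    for bits in (8, 16, 24):
        print(bits, max_abs_error(frac_bits=bits))
```

Running the script prints the deviation from the full-precision result for several fixed-point widths. In hardware, narrower arithmetic frees logic resources, so the design question is how few bits each part of the stencil can use while the error stays acceptable.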

The 2D SWE solver was also scaled to 8,644 nodes of Tianhe-2, and achieved 3.74 Pflops performance using CPUs and MICs [IPDPS14a].

Combining these different efforts on GPU (Graphics Processing Unit), MIC (Many Integrated Core), and FPGA (Field Programmable Gate Array) cards, we also derived a generalized approach for solving the 2D SWE on various heterogeneous supercomputers, and formed a picture of both the potential performance benefits and the programming efforts involved [Plos ONE17].

With an efficient 2D SWE solver in place, we then moved on to the more challenging 3D Euler equations. Taking a similar approach, we achieved efficient 3D Euler solvers for both MIC [TC15] and FPGA accelerators [FPL14, IEEE Micro17].

The most recent progress of this work targets Sunway TaihuLight, the world’s fastest supercomputer with over 10 million cores [SC16a]. In terms of the numerical scheme, we also depart from the explicit schemes of our previous publications and adopt a fully-implicit scheme. We develop an ultra-scalable fully-implicit solver for stiff time-dependent problems arising from the hyperbolic conservation laws in nonhydrostatic atmospheric dynamics. In the solver, we propose a highly efficient hybrid domain-decomposed multigrid preconditioner that can greatly accelerate the convergence rate at the extreme scale. The fully implicit solver successfully scales to the entire system of the Sunway TaihuLight supercomputer with over 10.5M heterogeneous cores, sustaining an aggregate performance of 7.95 Pflops in double precision, and enables fast and accurate atmospheric simulations at 488-m horizontal resolution (over 770 billion unknowns) with 0.07 simulated-years-per-day. This is, to our knowledge, the largest fully implicit simulation to date. This work won the 2016 ACM Gordon Bell Prize, with myself as the third author and co-corresponding author.
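
The reason implicit schemes suit stiff problems can be seen on a scalar toy problem. The sketch below is an illustration under stated assumptions, not the solver of [SC16a]: it integrates y' = -lam*y with forward (explicit) and backward (implicit) Euler, using a time step far larger than the explicit stability limit. The function names and the values of `lam` and `dt` are hypothetical.

```python
def explicit_euler(y0, lam, dt, n_steps):
    """Forward Euler for y' = -lam*y: each step multiplies y by (1 - lam*dt)."""
    y = y0
    for _ in range(n_steps):
        y = y + dt * (-lam * y)
    return y


def implicit_euler(y0, lam, dt, n_steps):
    """Backward Euler for y' = -lam*y.

    Each step solves y_next = y + dt*(-lam*y_next) for y_next. In this scalar
    case the solve reduces to a division by (1 + lam*dt).
    """
    y = y0
    for _ in range(n_steps):
        y = y / (1.0 + lam * dt)
    return y


if __name__ == "__main__":
    lam, dt, n_steps = 1000.0, 0.01, 10  # lam*dt = 10, well past the explicit limit
    print("explicit:", explicit_euler(1.0, lam, dt, n_steps))  # grows in magnitude
    print("implicit:", implicit_euler(1.0, lam, dt, n_steps))  # decays toward zero
```

In a full atmospheric solver the division becomes the solution of a very large system of equations at every time step. That cost is why the convergence rate of the iterative solve, and therefore the preconditioner, matters so much at scale.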

Tools for the migration of existing code

While the above research efforts mostly solve the problem by developing new algorithms or designs from scratch for the emerging hardware architectures, the second part, shown on the right half of Fig. 1, faces the even more challenging problem of redesigning and refactoring existing modeling software with millions of lines of code for the heterogeneous architecture of new supercomputers, such as Sunway TaihuLight.

This is a tough project that started in the summer of 2015 and still continues with the model development projects in my department. With a team of over 20 students and professors (from Tsinghua and Beijing Normal University) led by me, we ported, redesigned, and scaled the Community Atmosphere Model (CAM) to the full system of the Sunway TaihuLight, and achieved peta-scale climate modeling performance [SC16b, SC17a].

Through careful refactoring and redesign, we further improve the performance of a 260-core Sunway processor to the equivalent of 28 to 184 Intel CPU cores, and achieve a sustained double-precision HOMME performance of 3.3 PFlops for a 750 m global simulation when using 10,075,000 cores. CAM on Sunway enables us to simulate the complete lifecycle of Hurricane Katrina, and to achieve close-to-observation simulation results for both path and strength. Our most recent work in 2017 [SC17a] was selected as one of the three Gordon Bell Prize finalists in the world, with myself as the lead author.

Key Publications for Atmospheric Modeling Software

[SC17a] Haohuan Fu, Junfeng Liao, Nan Ding, Xiaohui Duan, et al., “Redesigning CAM-SE for Peta-Scale Climate Modeling Performance and Ultra-High Resolution on Sunway TaihuLight”, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC17), 12 pages, 2017 (one of the three Gordon Bell Prize finalists).

[IEEE Micro17] Lin Gan, Haohuan Fu*, Wayne Luk, Chao Yang, and Wei Xue, “Solving Mesoscale Atmospheric Dynamics Using a Reconfigurable Dataflow Architecture”, IEEE Micro, vol. 37, no. 4, pp. 40-50, 2017.

[Plos ONE17] Haohuan Fu, Lin Gan, Chao Yang, Wei Xue, Lanning Wang, Xinliang Wang, Xiaomeng Huang, Guangwen Yang, “Solving global shallow water equations on heterogeneous supercomputers”, PLoS ONE 12(3):e0172583, March 2017.

[SC16a] Chao Yang*, Wei Xue*, Haohuan Fu*, et al., “10M-Core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics”, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), pp. 57-68, Salt Lake City, Utah, US, 2016 (winner of the 2016 Gordon Bell Prize).

[SC16b] Haohuan Fu, Junfeng Liao, Wei Xue, Lanning Wang, et al., “Refactoring and Optimizing the Community Atmosphere Model (CAM) on the Sunway TaihuLight Supercomputer”, in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), pp. 969-980, Salt Lake City, Utah, US, 2016.

[TC15] Wei Xue, Chao Yang, Haohuan Fu, et al., “Ultra-scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2”, IEEE Transactions on Computers, vol. 64, no. 8, pp. 2382-2393, Aug. 1, 2015.

[TRETS15] Lin Gan, Haohuan Fu*, Wayne Luk, et al., “Solving the Global Atmospheric Equations through Heterogeneous Reconfigurable Platforms”, ACM Transactions on Reconfigurable Technology and Systems, vol. 8, no. 2, article no. 11, pp. 1-16, April 2015.

[IPDPS14a] Wei Xue, Chao Yang, Haohuan Fu, et al., “Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2”, in Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), pp. 745-754, 2014.

[FPL14] Lin Gan, Haohuan Fu, Chao Yang, et al., “A Highly-Efficient and Green Data Flow Engine for Solving Euler Atmospheric Equations”, in Proceedings of the 24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1-6, 2014.

[FPL13] Lin Gan, Haohuan Fu, Wayne Luk, et al., “Accelerating Solvers for Global Atmospheric Equations Through Mixed-Precision Data Flow Engine”, in Proceedings of the 23rd International Conference on Field Programmable Logic and Applications (FPL), 6 pp., 2013 (selected as one of the 27 Significant Papers out of the 1,765 research papers in the first 25 years of FPL).

[PPoPP13] Chao Yang, Wei Xue, Haohuan Fu, Lin Gan, Linfeng Li, Yangtong Xu, Yutong Lu, Jiachang Sun, Guangwen Yang, and Weimin Zheng, “A Peta-Scalable CPU-GPU Algorithm for Global Atmospheric Simulations”, in Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pp. 1-12, Shenzhen, 2013.