TY  - GEN
AU  - H. Xiao
AU  - Michail Diamantakis
AU  - S. Saarinen
AB  - The ECMWF Integrated Forecast System (IFS) cloud microphysics scheme has been adapted for a GPU architecture. Hybrid OpenMP and OpenACC within a single node, hybrid MPI and OpenACC over multiple nodes as well as different algorithmic and code optimization methods were employed to study the performance impact. The roofline model was used to conduct a performance analysis and the CLAW compiler has been explored as a tool for automatic code adaptation. For a very large number of grid columns, the double precision performance of the 4 GK210 GPUs of a single node was slightly better than the performance of two 12-core CPUs (contained in the node) in terms of the total run time. However, without taking into account the GPU data transfer and other overheads, the actual calculation time for the same large size problem was reduced to approximately one quarter of the CPU time giving a speed up factor of 4. Comparing the performance of a single GPU with a single CPU, the obtained speed up factor is approximately 2. A further 40% gain can be achieved with single precision. The obtained GPU speed up factor depends a lot on the workload given to a GPU; for small or moderate size problems (number of grid columns) the above mentioned speed up factors cannot be achieved.
BT  - ECMWF Technical Memoranda
DA  - 2017
DO  - 10.21957/g9mjjlgeq
LA  - eng
M1  - 805
N2  - The ECMWF Integrated Forecast System (IFS) cloud microphysics scheme has been adapted for a GPU architecture. Hybrid OpenMP and OpenACC within a single node, hybrid MPI and OpenACC over multiple nodes as well as different algorithmic and code optimization methods were employed to study the performance impact. The roofline model was used to conduct a performance analysis and the CLAW compiler has been explored as a tool for automatic code adaptation. For a very large number of grid columns, the double precision performance of the 4 GK210 GPUs of a single node was slightly better than the performance of two 12-core CPUs (contained in the node) in terms of the total run time. However, without taking into account the GPU data transfer and other overheads, the actual calculation time for the same large size problem was reduced to approximately one quarter of the CPU time giving a speed up factor of 4. Comparing the performance of a single GPU with a single CPU, the obtained speed up factor is approximately 2. A further 40% gain can be achieved with single precision. The obtained GPU speed up factor depends a lot on the workload given to a GPU; for small or moderate size problems (number of grid columns) the above mentioned speed up factors cannot be achieved.
PB  - ECMWF
PY  - 2017
T2  - ECMWF Technical Memoranda
TI  - An OpenACC GPU adaptation of the IFS cloud microphysics scheme
UR  - https://www.ecmwf.int/node/17320
ER  -