TY - GEN
T1 - Evaluation of a directive-based GPU programming approach for high-order unstructured mesh computational fluid dynamics
AU - Puri, Kunal
AU - Singh, Vikram
AU - Frankel, Steven
N1 - Publisher Copyright:
© 2017 Copyright held by the owner/author(s).
PY - 2017/6/26
Y1 - 2017/6/26
N2 - In this work we evaluate the effectiveness of using OpenACC as a paradigm for the auto-parallelization of a high-order unstructured CFD code on Graphics Processing Units (GPUs). This is in lieu of hand-written CUDA or OpenCL code, which has to be separately maintained and tuned to the available hardware. Specifically, we compare the performance of OpenACC-2.5 for Fortran, using the commercial PGI compiler suite, with that of OpenCL code running on the Nvidia Kepler series (K20, K40) GPU accelerators. Our results show that the (double precision) GPU-accelerated code for both approaches is 2 to 3 times faster than the optimized counterpart on the CPU (running with an OpenMP model). We find that sparse matrix-vector multiplication with OpenCL is faster than using OpenACC with cuBLAS. While it is in general possible to write optimized code using OpenCL (or CUDA) that outperforms OpenACC, we find that the directive-based approach offered by OpenACC results in a flexible, unified and hence smaller code base that is easier to maintain, is readily portable and promotes algorithm development.
AB - In this work we evaluate the effectiveness of using OpenACC as a paradigm for the auto-parallelization of a high-order unstructured CFD code on Graphics Processing Units (GPUs). This is in lieu of hand-written CUDA or OpenCL code, which has to be separately maintained and tuned to the available hardware. Specifically, we compare the performance of OpenACC-2.5 for Fortran, using the commercial PGI compiler suite, with that of OpenCL code running on the Nvidia Kepler series (K20, K40) GPU accelerators. Our results show that the (double precision) GPU-accelerated code for both approaches is 2 to 3 times faster than the optimized counterpart on the CPU (running with an OpenMP model). We find that sparse matrix-vector multiplication with OpenCL is faster than using OpenACC with cuBLAS. While it is in general possible to write optimized code using OpenCL (or CUDA) that outperforms OpenACC, we find that the directive-based approach offered by OpenACC results in a flexible, unified and hence smaller code base that is easier to maintain, is readily portable and promotes algorithm development.
KW - CFD
KW - GPU
KW - OpenACC
KW - OpenCL
UR - http://www.scopus.com/inward/record.url?scp=85025815559&partnerID=8YFLogxK
U2 - 10.1145/3093172.3093229
DO - 10.1145/3093172.3093229
M3 - Conference contribution
AN - SCOPUS:85025815559
T3 - PASC 2017 - Proceedings of the Platform for Advanced Scientific Computing Conference
BT - PASC 2017 - Proceedings of the Platform for Advanced Scientific Computing Conference
T2 - Platform for Advanced Scientific Computing Conference, PASC 2017
Y2 - 26 June 2017 through 28 June 2017
ER -