The OpenACC parallel programming standard emerged late last year with the goal of making it easier for developers to tap graphics processing units (GPUs) to accelerate applications. The scientific and technical programming community is a key audience for this development. Jeffrey Vetter, professor at Georgia Tech’s College of Computing and leader of the Future Technologies Group at Oak Ridge National Laboratory, recently discussed the standard. He is currently project director for the National Science Foundation’s (NSF) Track 2D Experimental Computer Facility, a cooperative effort that involves the Georgia Institute of Technology and Oak Ridge, among other institutions. Track 2D’s Keeneland Project employs GPUs for large-scale heterogeneous computing.
Q: What problems does OpenACC address?
Jeffrey Vetter: We have the Keeneland Project, with 360 GPUs deployed in its initial delivery system. We are responsible for making that system available to users across NSF. The thinking behind OpenACC is that not all of those users have the expertise or the funding to write CUDA or OpenCL code for their scientific applications.
Some science codes are large, and any rewriting of them — whether it is for acceleration or a new architecture of any type — creates another version of the code and the need to maintain that software. Some of the teams, like the climate modeling team, just don’t want to do that. They have validated their codes. They have a verification test that they run, and they don’t want to have different versions of their code floating around.
It is a common problem in software engineering: people branch their code to add more capability, and at some point they have to merge the branches back together. In some cases, that causes conflicts.
OpenACC really lets you keep your applications looking like normal C or C++ or Fortran code, and you can go in and put the pragmas in the code. It’s just an annotation on the code that’s available to the compiler. The compiler takes that and says, “The user thinks this particular block or structured loop is a good candidate for acceleration.”
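As an illustration, here is a minimal sketch of what such an annotation might look like in C. The function and variable names are hypothetical; the pragma itself is standard OpenACC.

    /* Hypothetical example: a vector addition annotated for OpenACC.
       The pragma tells the compiler this loop is a candidate for
       acceleration; a compiler without OpenACC support simply ignores
       it, and the loop runs on the CPU as ordinary C. */
    void vector_add(int n, const float *a, const float *b, float *c)
    {
        #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }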
Q: What’s the impact on scientific/technical users?
J.V.: We have certain groups of users who are very sophisticated and willing to do almost anything to port their code to a GPU: write a new version of the code, or sit down with an architecture expert and optimize it.
But some don’t want to write any new code other than putting pragmas in the code. They really are conservative in that respect. A lot of the large codes out there used by DOE labs just haven’t been ported to GPUs because there’s uncertainty over what sort of performance improvement they might see, as well as a lack of time to just go and explore that space.
What we are trying to do is broaden the user base on the system and make GPUs, and in fact other types of accelerators, more relevant for other users who are more conservative.
After a week of just going through the OpenACC tutorials, users should be able to go in and start experimenting with accelerating certain chunks of their applications. And those would be people who don’t have experience in CUDA or OpenCL.
Q: Does OpenACC have sufficient support at this point?
J.V.: PGI, CAPS, and Cray: we expect they will start supporting OpenACC without too much trouble. What’s less certain is how libraries, performance analysis tools, and debugging tools will work with the new standard. Someone needs to ensure that a real development environment takes shape around OpenACC.
OpenMP faced the same issue a decade ago. They had to create the specification, the pragmas, and the other language constructs, and people had to create the runtime system that executes the code and does the data movement.
Q: What types of applications need acceleration?
J.V.: Generally, we have been looking at applications that have high computational intensity. You have things like molecular dynamics, reverse time migration, and financial modeling: applications where you can put a kernel on a GPU and it runs there for many iterations without having to transfer data off the GPU.
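A minimal sketch of that pattern in C, assuming a simple hypothetical stencil kernel: an OpenACC data region keeps the arrays resident on the GPU for the whole iteration loop, so data moves to the device once at the start and back once at the end.

    /* Hypothetical sketch: the data region keeps u and u_new on the GPU
       across all iterations; host-device transfers happen only at the
       boundaries of the data construct. */
    void iterate(int n, int steps, float *u, float *u_new)
    {
        #pragma acc data copy(u[0:n]) create(u_new[0:n])
        for (int s = 0; s < steps; s++) {
            #pragma acc parallel loop
            for (int i = 1; i < n - 1; i++)
                u_new[i] = 0.5f * (u[i - 1] + u[i + 1]);  /* simple stencil */

            #pragma acc parallel loop
            for (int i = 1; i < n - 1; i++)
                u[i] = u_new[i];
        }
    }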
OpenACC itself is really targeted at kernels that have a structured block or a structured loop that is regular. That limits the applicability of the compiler to certain applications. There will be applications with unstructured meshes, or loops that are irregular because they contain conditionals or compound statements that the compiler cannot analyze. Users will have to unroll those loops so an OpenACC compiler has enough information to generate the code.
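Two hypothetical C sketches of the kind of loops he means: the first has a data-dependent early exit, the second has the indirect indexing typical of unstructured meshes.

    /* Data-dependent early exit: the trip count is unknown at compile
       time, so the compiler cannot safely parallelize this as written. */
    float first_above(int n, const float *x, float threshold)
    {
        for (int i = 0; i < n; i++)
            if (x[i] > threshold)
                return x[i];
        return -1.0f;
    }

    /* Indirect indexing: the compiler cannot prove that different
       iterations write to different elements of y. */
    void scatter_add(int n, const int *idx, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[idx[i]] += x[i];
    }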
Some kernels are not going to work well with OpenACC, whether you port them manually or rely on the compiler. There isn’t any magic. I’m supportive, but trust and verify.