GPGPU: the workhorse for demanding computations

If runtime is a problem for your application, then the use of a general-purpose graphics processing unit (GPGPU) can be a solution. GPGPUs can sometimes speed up computations by an order of magnitude. As an extra advantage, they use far less power per computation. But exploiting their computational power is not always easy, and it only works for applications that meet specific requirements.

What is a GPGPU?

A GPGPU is a more generic form of a graphics processing unit (GPU), a type of processor that was originally designed for graphics tasks. Whereas the usual processor in a PC is designed to handle a wide range of computing tasks well, GPUs are designed specifically for graphics algorithms. The basic trick is that such algorithms have a high degree of parallelism. A GPU therefore consists of many small processing units that execute the graphics computations in parallel.

But graphics is not the only category of algorithms that allows parallel computing. Many heavy computing tasks, like weather forecasting and structural mechanics simulation, are based on the manipulation of matrices and vectors, which also maps well onto GPUs. It is not surprising, therefore, that GPUs have become the workhorse for large-scale computing.

To make GPUs suitable for more general, non-graphics algorithms, a more generic programming interface was developed. This, in turn, led to hardware optimizations for these more generic tasks, turning the GPU into a general-purpose GPU, or GPGPU. For example, some GPGPUs now support double-precision arithmetic, which is not a requirement for graphics processing.

The hardware

The number of processing units in a GPGPU can exceed 10,000 cores on high-end graphics cards. But it is important to remember that these cores are not independent: groups of cores run in lockstep, all doing the same computation simultaneously, albeit each on a different data element. In other words, it’s a data-parallel computation, not a task-parallel one.
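
To make this concrete, here is a minimal sketch (in CUDA, purely for illustration) of the same vector addition written as a sequential CPU loop and as a data-parallel GPU kernel, where every thread performs the identical operation on its own element:

// CPU version: a single core walks through all elements in sequence.
void add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// GPU version: every thread executes this same function in lockstep,
// each computing exactly one element of the result.
__global__ void add_gpu(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n)                                      // guard for the last block
        c[i] = a[i] + b[i];
}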

Just as important as the number of processing units is the memory on the GPU. That is usually relatively limited, typically some tens of gigabytes. In fact, the memory architecture is primarily designed for high throughput rather than for storing large volumes of data. The idea is that data resides on the GPGPU only while the GPGPU is actually working on it.

Providers

At this moment, there are in essence three providers of GPGPUs. The market leader is Nvidia, with a market share of some 80%. AMD comes second with 17% of the market, and Intel holds the small remainder. In the last few years, the demand for GPUs has exploded due to developments in AI, for which they are also extremely well suited. Suppliers are currently struggling to meet this demand, leading to long delivery times and high prices for the top models. Using GPUs offered on a public cloud can be a good alternative. All major public clouds offer GPUs as an option. See, for example, how to do this on Azure.

Programming GPUs

There are several options for programming GPUs. The most widely used is CUDA, a product of Nvidia. Its popularity is obviously linked to the popularity of Nvidia’s GPGPUs. CUDA is essentially an extension of C++: extra constructs and keywords in a C++ program cause parts of the code to execute on a GPGPU. A similar extension is available for Fortran.
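
As a rough impression of what those extensions look like, consider the minimal sketch below (hypothetical, for illustration only). The __global__ keyword marks a function that runs on the GPU, and the triple-bracket syntax launches it over a grid of threads:

#include <cuda_runtime.h>

// __global__ marks a kernel: a C++ function compiled for and executed on the GPU.
__global__ void scale(float* x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));       // memory on the GPU (left uninitialized in this sketch)

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(x, 2.0f, n);  // the <<<...>>> launch syntax is a CUDA extension
    cudaDeviceSynchronize();                 // wait until the GPU is done

    cudaFree(x);
    return 0;
}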

The second most popular option is OpenCL, which is designed to be vendor independent. That sounds like a nice feature, but it is important to realize that CUDA, as an Nvidia product, is optimized for Nvidia’s GPGPUs. So, unless you are actually planning to move to GPGPUs from other vendors, CUDA is usually the better option.

In many cases, an application needs to exploit several levels of parallelism: from the low-level data parallelism offered by GPGPUs to the task parallelism offered by multicore CPUs. Such applications typically combine CUDA or OpenCL with higher-level frameworks like OpenMP or MPI. The same combination can be used to run code on multiple GPGPUs, both to gain an extra degree of parallelism and to circumvent the memory restrictions of a single GPGPU.
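
One common pattern is sketched below, assuming one OpenMP thread per GPGPU (the kernel work_on_chunk is a hypothetical placeholder): each CPU thread binds itself to its own device and drives it independently.

#include <cuda_runtime.h>
#include <omp.h>

// Hypothetical placeholder for the real computation.
__global__ void work_on_chunk(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void run_on_all_gpus(int chunk_size) {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus == 0) return;

    // One OpenMP thread per GPU: task parallelism on the CPU,
    // data parallelism on each GPU.
    #pragma omp parallel num_threads(num_gpus)
    {
        int gpu = omp_get_thread_num();
        cudaSetDevice(gpu);                    // bind this thread to its own GPU

        float* chunk;
        cudaMalloc(&chunk, chunk_size * sizeof(float));
        work_on_chunk<<<(chunk_size + 255) / 256, 256>>>(chunk, chunk_size);
        cudaDeviceSynchronize();
        cudaFree(chunk);
    }
}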

An elegant way around the platform dependence and the multilevel issues is offered by frameworks like OCCA and Kokkos. These are application programming interfaces (APIs) that allow a generic expression of parallelism, which is then compiled to the underlying platform. In many cases, this gives you platform-independent code that performs well enough on all types of GPGPUs, even if it will never beat a highly optimized computational kernel built directly in CUDA.
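
For example, a Kokkos version of the vector addition shown earlier might look roughly like this (a sketch; the same source can be compiled for a CUDA, HIP or OpenMP backend, depending on the build configuration):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        Kokkos::View<float*> a("a", n), b("b", n), c("c", n);

        // The parallelism is expressed generically; Kokkos maps this loop
        // onto the GPU (or CPU) backend chosen at compile time.
        Kokkos::parallel_for("add", n, KOKKOS_LAMBDA(const int i) {
            c(i) = a(i) + b(i);
        });
        Kokkos::fence();
    }
    Kokkos::finalize();
    return 0;
}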

Will it work for your application?

Whether a GPGPU is a solution for your runtime issues depends highly on your application. For one, it only works if your runtime is dominated by computations that are highly data parallel. If your application spends, say, half of its total runtime in computations that can be parallelized, you will never get more than two times faster, even if all the parallelizable computations took no time at all. To reach a factor of ten in speedup, at least 90% of your application’s runtime must be parallelizable.
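
These numbers follow from Amdahl’s law. If a fraction P of the runtime is parallelizable and that part is accelerated by a factor N, the overall speedup S is bounded by:

S = 1 / ((1 - P) + P / N)

With P = 0.5, S never exceeds 2 no matter how large N becomes; reaching S = 10 requires P of at least 0.9.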

A second aspect that determines the performance gain is the amount of data that needs to be moved between main memory and the GPGPU’s onboard memory. If there is too much data transfer, using a GPGPU will not deliver the desired performance benefits. Ideally, data is transferred only at the start and end of a simulation, with all the computations in between using only the GPGPU memory.
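
In CUDA terms, that ideal looks roughly like the sketch below (the step kernel is a hypothetical placeholder for one iteration of the simulation):

#include <cuda_runtime.h>

// Hypothetical placeholder for one simulation step on the GPU.
__global__ void step(float* state, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] *= 0.99f;
}

void simulate(float* host_state, int n, int num_steps) {
    float* dev_state;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev_state, bytes);

    // One transfer in at the start...
    cudaMemcpy(dev_state, host_state, bytes, cudaMemcpyHostToDevice);

    // ...all intermediate results stay in GPU memory...
    for (int s = 0; s < num_steps; ++s)
        step<<<(n + 255) / 256, 256>>>(dev_state, n);

    // ...and one transfer out at the end.
    cudaMemcpy(host_state, dev_state, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_state);
}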

Even though these guidelines are straightforward, it is often hard to assess in advance what a GPGPU will do for performance. Parts of the application may not be parallel now but can perhaps be replaced by alternative algorithms. It can be worthwhile to use GPGPU-optimized libraries like cuBLAS for basic computing tasks rather than hand-coded software (see the sketch below). Memory access patterns can be improved by rearranging computations. So, some degree of refactoring can dramatically improve GPGPU performance. At VORtech, we help clients with this in the form of consultancy and software development. Scroll down for more information.
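
As an impression of the library route mentioned above, a double-precision matrix product C = A·B can be handed off to cuBLAS in a few lines (a sketch; it assumes A, B and C already reside in GPU memory in column-major layout):

#include <cublas_v2.h>

// Computes C = A * B for n-by-n matrices that are already on the GPU.
void gpu_matmul(cublasHandle_t handle, const double* A, const double* B,
                double* C, int n) {
    const double alpha = 1.0, beta = 0.0;
    // cuBLAS routines are tuned by Nvidia for their hardware and will
    // usually outperform a hand-coded kernel by a wide margin.
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
}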

Examples

Most of our work for clients is confidential, so unfortunately we cannot share details about the applications that we have ported to GPGPUs. But there are many impressive examples of GPGPU applications around. Here are a few.

In the field of biotechnology, this commercial brochure from Nvidia is worth checking out. It shows the performance gains for several biotech applications when going from a multicore CPU to one or more GPGPUs. Clearly, the performance enhancements vary between applications. For many, the speedup is less than a factor of ten, but there are also applications that go from 3.5 days of computing to only four hours.

Another computation-heavy field is numerical weather prediction. This article gives a very good discussion of the pros and cons. The performance gain from going to a GPGPU is not necessarily huge (about a factor of four), but the benefit is that the same compute power can be obtained with only a quarter of the hardware. That is where the cost saving lies: not only in terms of hardware management costs but also in terms of energy efficiency.

Another commercial article, from Ansys Fluent, shows really impressive results: with 8 GPGPUs, an aerodynamics computation runs 32 times faster than on several already impressive multicore processors. Again, the power consumption when using GPGPUs is only a quarter of that of a CPU-based system of similar performance.

Finally, most well-known finite element packages are also capable of using GPGPUs. For example, Abaqus reports a performance gain of a factor of four for a model of 3 million nodes and 10 million degrees of freedom.

Can we help you unlock the power of GPGPUs?

One of our core competences as a company is improving the computational speed of the applications of our clients. For this, we offer a range of expertise, from efficient programming through smart algorithms to full-scale optimization for high performance computing systems. All this in close collaboration with the client’s developers so that they can continue the development of the software without us after we are done.

How we support you

Our support can take different forms. A typical engagement starts with a model scan, which can be targeted at any issue the client may have with their software. When it’s about performance, we usually do a short study of the characteristics of the application to assess the potential for speedup. Our recommendations can be implemented by the client’s own developers, but we are available to help.

Often, one of the recommendations is a change of algorithm. In some cases, this is already enough to get a significant performance gain even without going to other hardware. However, such a change of algorithm is rarely a matter of just plugging it in. It usually takes quite some tweaking. Our experts have the knowledge and experience to do this efficiently and effectively, so clients tend to ask us to do this algorithm change for them. Again, we make sure that the client’s developers understand the result, so that they can build on the new algorithm themselves.

If changing the algorithm is not enough or not feasible to get the desired performance, then it comes down to expert programming to make the most of the hardware. These days, the target hardware is often a GPGPU. Having our experts work on this ensures that the code is configurable to the specifics of any GPGPU card and remains accessible to non-specialist developers.

Feel free to contact us for an exploratory talk. Even if no project comes out of it, we always enjoy sharing our knowledge.