GPGPU

The acronym GPGPU stands for General Purpose GPU, that is, using a GPU (Graphics Processing Unit) not just as a graphics processor but as a general purpose processor, taking advantage of the hundreds or even thousands of processing cores available on modern GPUs to accomplish computations with massive parallelism and speed.   Introduced and steadily expanded by NVIDIA, the use of GPUs for general purpose calculation allows desktop computers to do in seconds what would take hours or days without a GPU.

 

It is not an exaggeration to say that GPGPU technology could well be the most revolutionary thing to happen in computing since the invention of the microprocessor. It's that fast, that inexpensive and has that much potential. GPGPU is so important that all Manifold users should insist that the computer hardware they procure can be used for GPGPU parallel processing by Manifold, especially considering the absurdly low price of such GPU add-ins compared to the phenomenal performance gains they can provide.

Fermi or Greater Required

GPGPU capability within Manifold requires installation of an NVIDIA GPU of the Fermi architectural generation or more recent.   Such GPUs are both ubiquitous and very inexpensive, so much so that many reasonably recent computers will already include an NVIDIA GPU that can be utilized by Manifold for GPGPU parallelism.

 

Parallel Processing

Manifold is inherently a parallel processing system.   Whenever it makes sense to do so, Manifold will automatically utilize multiple processors or multiple processor cores by parallelizing a task into multiple threads for execution on more than one core. Given hyperthreading plus multi-core CPUs, it is now routine to encounter desktop systems with 8, 16, 32 or even more logical CPU cores available.

 

In addition to this basic parallel processing capability using multiple CPU cores, Manifold also includes the ability to use massively parallel multiprocessing on GPUs, potentially launching tasks on thousands of processing cores at once for true supercomputer computational performance, far beyond what can be achieved with CPUs.

 

Manifold automatically parallelizes and dispatches as many tasks as make sense to GPGPU, with automatic fallback to parallelized tasks dispatched to multiple CPU cores if a GPU is not available.  

 

About GPGPU

The California graphics chip company NVIDIA has long been known for producing outstanding graphics processors (GPUs) that have become popular as the basis for graphics cards.    Huge financial investments by NVIDIA and other GPU makers have been made possible by the runaway popularity and size of the computer gaming market.   

 

As an entertainment blockbuster, computer gaming is a far larger business financially than the worldwide movie industry.   Over the years that vast flow of money has financed ever more computationally demanding games, which in turn have financed the creation of ever more powerful graphics processors to handle the ever greater resolution and ever more complex calculations required to give modern computer games a greater sense of reality and action.

 

The speed and processing power required cannot be achieved using CPUs but instead requires supercomputer architectures in which hundreds or thousands of processing units work together.  In the quest for maximum speed, NVIDIA GPUs have evolved far beyond single processors: modern NVIDIA GPUs are parallel supercomputers on a chip that consist of very many, very fast processing cores, with potentially thousands of processing cores per chip.   They are so much faster at processing than whatever CPU is running Windows on the motherboard that in comparison the CPU seems no more capable than a digital coffee pot.

 

As GPUs increased in power it very quickly became obvious to computer scientists that programs other than graphics microcode for games could be uploaded into a GPU for parallel execution by the thousands of processing units in the GPU.  Although the market impetus behind the creation of such supercomputers on a GPU chip was the computational demands of the PC gaming market, the scientific computing community began using GPUs for general purpose computing having nothing to do with games.   That GPU cards were absurdly inexpensive (compared to supercomputers) because of the vast economies of scale of computer gaming was icing on the cake.

 

It turns out that many mathematical computations required for complex visual and physics simulations in games, such as matrix multiplication and transposition, are exactly the same computations that must be performed in a wide variety of computing applications, including GIS, data mining, simulations, image processing, complex statistical analyses and many other tasks where conventional CPUs are too slow.

 

At first it was almost an accidental discovery that programs other than graphics microcode could be uploaded for execution into the many processing units of a modern GPU.  But once it was realized there might be a market for such an approach that could provide far faster performance than running programs on the main CPU, NVIDIA took the chance of supporting the trend by investing resources into ensuring that NVIDIA GPUs could be used for GPGPU applications, backing such use both with software and with architectural features in their GPUs.

 

NVIDIA created the CUDA (Compute Unified Device Architecture) interface library to allow applications developers to write code that can be uploaded into an NVIDIA GPU card for massively parallel execution by the many processing cores in the GPU. The CUDA library allows applications developers to write applications that will work with a very wide variety of NVIDIA GPUs, and it ensured that NVIDIA chips got off to an early lead among GPU vendors for use in GPGPU applications.

 

GPGPU offers such tremendous performance gains that all Manifold products now are designed to exploit GPGPU whenever feasible.   Manifold automates this process at a breadth and depth never before seen in a commercial product, with automatic use of GPGPU throughout Manifold and Manifold SQL.  If we have a reasonably recent  NVIDIA GPU installed in our system, Manifold can take advantage of the phenomenal power of massively parallel processing to execute many tasks at much greater speed.

 

Because NVIDIA technology benefits from enormous economies of scale in the gaming market, GPGPU-capable cards have become very inexpensive for the performance they provide, with a wide range of GPGPU-capable graphics cards that can be purchased at various prices and performance levels.  It is easy and inexpensive to choose a card with the desired balance between performance and cost (more stream processors running at a faster clock rate with more memory give better performance).

 

Based on experience from GPGPU-enabled Manifold products it is clear that GPGPU will revolutionize computation.  GPGPU processing is so fast that developers routinely say GPGPU renders the main processor almost superfluous, as if even the fastest multi-core Intel chip is relegated to being nothing but an accessory processor to handle the keyboard and mouse. That is not hyperbole given that GPUs can routinely run many computations hundreds of times faster than even the fastest Intel CPUs.

 

Manifold and GPGPU

The first appearance of GPGPU code in Manifold products was in Manifold System Release 8.00, which has a limited but nonetheless extremely powerful ability to utilize GPGPU without requiring users to write low-level CUDA code or otherwise deal with the intimidating complexity of parallel programming.

 

Manifold 8 includes a Surface - Transform dialog that enables users to write expressions which perform computations on surfaces using straightforward expression syntax that is similar to how expressions can be written in SQL.  Expressions written in the Surface - Transform dialog in 8 can utilize a wide range of Manifold 8 functions and operators, including 38 functions that were parallelized to utilize GPGPU automatically for computations if an NVIDIA GPU is available.   

 

The Surface - Transform dialog in Manifold 8 takes an expression that can reference one or more surfaces, parses that expression, evaluates it using CPU computations together with automatic dispatch to GPGPU if functions supported for GPGPU are utilized in the expression and then saves the result into a new or existing surface.

 

Since GPGPUs can perform computations for hundreds or thousands of pixels at once, the Surface - Transform tool in Manifold 8 provides a significant performance benefit over doing the same computations on CPU, performing some computations hundreds of times faster on GPGPU than is possible on the CPU.

 

The Surface - Transform tool in Manifold 8  works best when subexpressions are relatively bulky functions such as Aspect or Slope. That is because each subexpression in the tool must copy data to GPGPU device memory, perform computations in GPGPU and then copy the result back from GPGPU device memory into main memory.   GPGPUs operate on data within the GPU device's local memory so there is an explicit copy step in both directions between the GPU device and the main computer memory used by the CPU when subexpressions are evaluated by the Surface - Transform tool in Manifold 8.  

 

The overhead in Manifold 8 of moving data to and from GPGPU for small operations such as + or sin is so big that it does not make sense for the Surface - Transform tool to do them in GPGPU. It's faster just to do small operations on the main CPU.  Therefore GPGPU in Manifold 8 is used only for more complex functions such as filters, where the gain from using GPGPU exceeds the break-even time required to move data to and from the GPGPU device.
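
The break-even reasoning above can be put into a few lines of arithmetic.  The following Python sketch is only an illustration and is not Manifold code: the function name offload_pays_off and every timing figure in it are hypothetical, chosen simply to show how a fixed launch cost plus a copy in each direction can outweigh the gain for a small operation while still paying off for a bulky function.

```python
def offload_pays_off(n_values, cpu_ns_per_value, gpu_ns_per_value,
                     transfer_ns_per_value, launch_overhead_ns):
    """Return True when the GPU path, copies included, beats the CPU path."""
    cpu_time = n_values * cpu_ns_per_value
    gpu_time = (launch_overhead_ns
                + 2 * n_values * transfer_ns_per_value  # copy in, copy out
                + n_values * gpu_ns_per_value)
    return gpu_time < cpu_time


# Hypothetical figures, not measurements of any real hardware.
N = 128 * 128  # number of values in one block of surface data

# A cheap operation such as +: the copies cost more than the work saved.
cheap_op = offload_pays_off(N, cpu_ns_per_value=1.0, gpu_ns_per_value=0.01,
                            transfer_ns_per_value=1.0,
                            launch_overhead_ns=10_000)

# A bulky operation such as a filter: the work saved dwarfs the copies.
bulky_op = offload_pays_off(N, cpu_ns_per_value=200.0, gpu_ns_per_value=1.0,
                            transfer_ns_per_value=1.0,
                            launch_overhead_ns=10_000)

print(cheap_op, bulky_op)
```

With these assumed figures the cheap operation stays on the CPU and the bulky one goes to the GPU; different figures would move the break-even point.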

 

Manifold 8 makes no use of GPGPU besides the Surface - Transform dialog and its ability to utilize GPGPU within a set of 38 functions that operate on surfaces.   There is no general GPGPU utilization within SQL in Manifold 8, for example, and no use of GPGPU for other tasks such as reprojecting images, manipulating vector drawings and so on.

 

In contrast, the Radian engine and Manifold products such as Manifold System GIS that are based on Radian are  designed to utilize GPGPU automatically within SQL.   That implies effective use of GPGPU throughout the entire system since SQL in Manifold underlies virtually everything in Manifold.

 

To use GPGPU effectively throughout the entire system Manifold must significantly reduce the overhead of utilizing GPGPU in virtually all situations, enabling expressions with many small calls to be accelerated as well as expressions with only a few bulky calls. Being able to accelerate a vast range of small calls has many big benefits, most importantly allowing the acceleration of bulky expressions that are constructed in queries which in turn utilize many small calls.   This allows users to write bulky, complex queries in SQL that can benefit from GPGPU performance even if within the overall complexity of such queries there are no significantly complex or bulky built-in functions that are used.

 

To accomplish such unprecedented effectiveness in situations large and small, GPGPU utilization within Manifold went through numerous iterations during development with each iteration improving on the previous one.  Highlights included:

 

 

The result of such continuous improvement is that now when we write something like SELECT tilea + tileb * 5 + tilec * 8 FROM ..., the Manifold engine takes the expression with two additions and two multiplications, generates GPGPU code for that function in a Just In Time (JIT) manner, uploads the resulting code to GPGPU, then uses that code to execute the computations.

 

To save execution time and boost efficiency, JIT code generation for GPGPU functions is cache-friendly for the driver. Running the same query again, or even running different queries for which the GPGPU expressions are sufficiently similar to each other, will engage the  compilation cache maintained by the driver.

 

Automatic GPGPU Utilization

GPGPU acceleration works everywhere in Manifold SQL where worthwhile work arises: in the SELECT list, in WHERE, in EXECUTE, ...everywhere. For example, if we add to a table a computed field that combines multiple tiles together, that computed field will use GPGPU.  If we do some tile math in a FUNCTION, that FUNCTION will use GPGPU as well.

 

If we save the project using that computed field or FUNCTION into a Manifold .map file and then bring that .map file onto a machine running Manifold that has no GPGPU, the computed field will be executed by Manifold automatically falling back to Manifold's CPU parallelism, taking advantage of as many CPU cores as are available instead of GPGPU.    If we bring the .map file back onto a machine that has a GPGPU, Manifold will automatically use the GPGPU.
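
The fallback behavior described above follows a familiar pattern: try the fast path and, if it is unavailable, do the same work on the CPU cores.  The Python sketch below shows only that pattern.  It is not Manifold code; run_on_gpu, run_on_cpu_cores, evaluate and the GpuUnavailable exception are hypothetical names, and the "GPU" path is a stand-in that does no real device work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


class GpuUnavailable(Exception):
    """Raised by the hypothetical GPU path when no usable device exists."""


def run_on_gpu(func, tiles, gpu_present):
    if not gpu_present:
        raise GpuUnavailable("no GPGPU-capable device")
    # Stand-in for a real device launch: here it simply evaluates the tiles.
    return [func(tile) for tile in tiles]


def run_on_cpu_cores(func, tiles):
    # A thread pool keeps the sketch short; a real engine would use native
    # threads that are not limited by Python's interpreter lock.
    workers = os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, tiles))  # map preserves input order


def evaluate(func, tiles, gpu_present):
    """Return (results, path taken), preferring the GPU path."""
    try:
        return run_on_gpu(func, tiles, gpu_present), "gpu"
    except GpuUnavailable:
        return run_on_cpu_cores(func, tiles), "cpu"


def double(tile):
    return [2 * value for value in tile]


print(evaluate(double, [[1, 2], [3]], gpu_present=False))
print(evaluate(double, [[1, 2], [3]], gpu_present=True))
```

Both calls produce the same results; only the path taken differs, which is why the same .map file can move between machines without changes.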

 

Other optimizations play along transparently. If a particular subexpression inside of an expression that runs on GPGPU is a constant in the context of that expression, it will only be evaluated once. If an expression that can run on GPGPU refers to data from multiple tables and has parts that only reference one of these tables, the join optimizer will split the GPGPU expression into pieces according to dependencies and will run these pieces separately and at different times, minimizing work. A SELECT with more than one thread will run multiple copies of GPGPU expressions simultaneously. There are many other similar optimizations automatically integrated with GPGPU utilization.
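
Evaluating a constant subexpression only once is the classic optimization of hoisting invariant work out of a loop.  The following Python sketch shows the general technique only and makes no claim about how Manifold implements it; constant_part, scale_naive and scale_hoisted are hypothetical names, and a call counter is used to make the difference visible.

```python
import math

calls = 0  # counts evaluations of the constant subexpression


def constant_part(k):
    """A subexpression that does not depend on any tile value."""
    global calls
    calls += 1
    return math.sqrt(k)


def scale_naive(tile, k):
    # Re-evaluates the constant subexpression for every value in the tile.
    return [value * constant_part(k) for value in tile]


def scale_hoisted(tile, k):
    factor = constant_part(k)  # evaluated once, outside the per-value loop
    return [value * factor for value in tile]


tile = [1.0, 2.0, 3.0, 4.0]

calls = 0
naive = scale_naive(tile, 9.0)
naive_calls = calls

calls = 0
hoisted = scale_hoisted(tile, 9.0)
hoisted_calls = calls

print(naive == hoisted, naive_calls, hoisted_calls)
```

The two versions return identical results, but the hoisted version evaluates the constant part once rather than once per value.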

 

Note that some operations are so trivial in terms of computational requirements that it makes no sense to dispatch them to GPGPU, the classic case being scalars (trivial) as opposed to tiles (more bulk).  CASE expressions, conditionals and similar constructions or functions that operate on scalar values stay on the CPU, while functions that operate on tile values generally go to GPGPU unless they use tiles in a trivial fashion, such as making a simple comparison.

 

Some examples:

 

 

CASE conditions are scalar, so they stay on CPU.   When CASE is used with tiles whether it is faster to dispatch the task to GPGPU depends on exactly how the tiles are used.  Some examples where vXX are scalar values and tXX are tiles:

 

CASE WHEN v=2 THEN t1 ELSE t2 END

 

In the above not much is being done with the tiles so the entire construction stays on CPU.

 

CASE v WHEN 3 THEN TileAbs(t1)+ t2*t3 + TileSqrt(t4) ELSE t1 END

 

In the above, the expression in THEN will go to GPGPU while the rest of CASE will stay on CPU.

 

CASE WHEN t1 < t2 THEN 0 ELSE 8 END

 

In the above the comparison in WHEN does use tiles but it uses them like raw binary values, similar to how ORDER works, so it is more efficient to leave it on CPU.
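
The three CASE examples can be summarized with a toy rule of thumb: tile functions and tile arithmetic are bulky enough for GPGPU, while bare tile references and simple comparisons are not.  The Python sketch below encodes only that toy rule so the three outcomes can be checked side by side.  It is our own simplification, not the logic Manifold actually uses; the function name target and the regular expressions are hypothetical, and tiles are recognized purely by the t1, t2, ... naming used in the examples.

```python
import re

# A call to a function whose name starts with Tile, such as TileAbs(...).
_TILE_FUNCTION = re.compile(r"\bTile\w+\s*\(")
# Arithmetic with a tile operand on either side, such as t2*t3 or + t2.
_TILE_ARITHMETIC = re.compile(r"\bt\d+\s*[-+*/]|[-+*/]\s*\bt\d+")


def target(expr):
    """Guess where a subexpression would run under the toy rule."""
    bulky = _TILE_FUNCTION.search(expr) or _TILE_ARITHMETIC.search(expr)
    return "gpu" if bulky else "cpu"


# The tile-bearing parts of the three CASE examples.
print(target("t1"))                                 # bare tile in THEN / ELSE
print(target("TileAbs(t1)+ t2*t3 + TileSqrt(t4)"))  # tile math in THEN
print(target("t1 < t2"))                            # comparison in WHEN
```

Under this rule only the second expression is sent to the GPU, matching the outcomes described for the three examples.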

 

Windows and GPGPU

While the automatic parallelization and dispatch of tasks to GPGPU is certainly cool, doing so with the automatic breadth and depth of Manifold is entirely unprecedented in a commercial software product.   As a new technology extensively implemented on Windows for the first time, Manifold's extensive use of GPGPU has revealed some interesting new effects in how Windows interacts with GPGPU computation.

 

The key new effect of interest is that Windows does not always play well with a GPU when the same GPU is used both for display functions and for GPGPU computations.  When the same GPU is used both for display and GPGPU computation, if the code dispatched to the GPU for GPGPU computations either fails or runs for a long time, a Windows watchdog service that monitors how long it takes the GPU to process display requests will reboot the graphics stack, flushing the GPGPU computation.  That will then cause Manifold to fall back to re-doing the calculation on parallelized CPU cores as if GPGPU had not been available.   That's a "fail safe" action by Manifold but one which can result in much longer computation time than expected, given the slower performance of CPU cores in tasks that run faster in GPGPU.

 

While bugs that cause failure of GPGPU code should, of course, either be eliminated during pre-production testing or become increasingly rare as they are eliminated in updates, the possibility of long-running GPGPU tasks will always remain because computation times depend upon the amount of data involved and the complexity of computation.

 

If we have only a single GPU in our system it will be used both for display by Windows and for GPGPU computations by Manifold.  Manifold and GPGPU calculations in general are so fast that Windows rebooting the graphics stack due to longer running computations on GPGPU should be rare, so rare that such effects will never be encountered by the great majority of users.  There are ways of turning the watchdog service off in Windows, but doing so may result in the display becoming obviously less responsive when long-running GPGPU tasks are being executed.

 

It should be emphasized that performance-robbing interactions between Windows using a GPU for display and code using the same GPU for GPGPU computations should be rare.  It also makes sense to expect that anyone running tasks sophisticated enough to utilize GPGPU intensively is unlikely to have only a single GPU in their system, given the insignificant cost (relatively speaking) of configuring a system with multiple GPUs.  If we have multiple GPUs in a system we can use PRAGMA to specify which are used for GPGPU, choosing a GPU not used by Windows for display.  That, however, requires a pragma directive in each such query.

 

A brute force strategy to force Windows not to touch GPUs utilized for computation is to install an inexpensive, non-GPGPU capable card to run displays and to plug in additional GPGPU-capable cards for computations.  For example, we could use an older, pre-Fermi NVIDIA card or an AMD card or a built-into-the-motherboard Intel graphics chip for our display, none of which will be recognized as GPGPU capable by Manifold, and then we could plug in one or more Fermi or later NVIDIA GPU cards to run GPGPU calculations.   Windows will be able to then happily watch the non-GPGPU graphics unit it is using for display while the GPGPU devices can take as long as they want to do their work.

 

Such a strategy may be taking too much of a "belt and suspenders" approach to eliminating the possibility that Windows will interfere with long-running GPGPU calculations on the card used for displays, but it could be a useful plan if massive GPGPU utilization is expected.

Notes

Use the latest code  - As always, it is a good idea when using GPGPU devices to update the video driver for the device from NVIDIA.   This will ensure using the latest iteration of NVIDIA updates for the device, including for GPGPU.

 


See Also

 

PRAGMA

 

NVIDIA Home Page