GPGPU

The acronym GPGPU stands for General Purpose GPU, that is, using a GPU (Graphics Processing Unit) not just as a graphics processor but as a general purpose processor, taking advantage of the hundreds or even thousands of processing cores available on modern GPUs to accomplish computations with massive parallelism and speed.   Introduced and steadily expanded by NVIDIA, the use of GPUs for general purpose computation allows desktop computers to do in seconds what would take hours or days without a GPU.

 

It is not an exaggeration to say that GPGPU technology could well be the most revolutionary thing to happen in computing since the invention of the microprocessor. It is that fast, that inexpensive, and has that much potential. GPGPU is so important that all Manifold users should insist that the computer hardware they procure includes a recent NVIDIA GPU, so it can be used for GPGPU parallel processing by Manifold.   That is especially important considering the absurdly low price of such GPU add-ins compared to the phenomenal performance gains they can provide.

Fermi or Greater Required

GPGPU capability within Manifold requires installation of an NVIDIA GPU of the Fermi architectural generation or more recent, which effectively includes all NVIDIA GPUs sold in recent years.  Such GPUs are both ubiquitous and very inexpensive, so much so that many reasonably recent computers will already include an NVIDIA GPU that can be utilized by Manifold for GPGPU parallelism.

 

Even a very inexpensive NVIDIA GPU will provide dramatic performance gains in tasks that are sped up by GPU parallelism.  When Manifold does demos of GPGPU speed increases, audiences seem to expect that spending a thousand dollars on an exotic, high-end GPU will have dramatic effects.   But what is really impressive is to see how a $100 entry-level NVIDIA GPU allows Manifold to do some tasks 100 times faster than non-parallel GIS products.

64-bit Windows Required

To ensure compatibility with NVIDIA hardware, Manifold uses NVIDIA drivers and software such as NVIDIA CUDA.   In 2018 NVIDIA discontinued GPGPU support for 32-bit Windows editions.   Therefore, 32-bit versions of Release 9 also no longer support GPGPU.   Please run 64-bit Windows to utilize GPGPU within Manifold.

Is a GPU Installed?

We usually know if we have an NVIDIA GPU card installed in our systems, but sometimes we might not remember how many cards are installed in a particular system or what they are.   When using portable devices that have a modern GPU integrated into them, we might want to confirm what GPU can be found.  If a GPGPU-capable device is installed, Manifold will report it in the Help - About dialog as a "CUDA" device in the GPU line.   For example, if we have two cards installed they would be reported as  GPU: CUDA (2 Devices).

 

For more detail, we can use GPU functions in the Command Window.  

 

il_cmd_window_gpgpu01_03.png

 

We choose View - New Command Window - SQL to launch the Command Window.   If we cannot remember the names of the GPU functions, we enter GPU into the filter box to reduce the long list of functions to only the GPU functions.  We can then double-click the SystemGpgpus function to use it.   We enter:

 

? CALL SystemGpgpus()

 

... and then we press the ! run button in the main toolbar.   We wrote ? CALL because the function returns a table, so we evaluate that using CALL.

 

il_cmd_window_gpgpu01_04.png

 

The Results table reports all of the GPGPU-capable GPUs in our system and what each GPU reports about itself.    In the above illustration, we see we have two GPGPU-capable GPUs in our system.   Since GPU plug-in cards typically have one GPU per card, the table normally reports the number of GPGPU-capable cards in the system.

Parallel Processing

Manifold is inherently a parallel processing system.   Whenever it makes sense to do so, Manifold will automatically utilize multiple processors or multiple processor cores by parallelizing a task into multiple threads for execution on more than one core. Given hyperthreading plus multi-core CPUs it is now routine to encounter desktop systems with 8, 16, 32 or even more CPU cores available.

 

In addition to this basic, parallel processing capability using multiple CPU cores Manifold also includes the ability to utilize massively parallel multiprocessing utilizing GPUs, potentially launching tasks on thousands of processing cores at once for true supercomputer computational performance, far beyond what can be achieved with CPUs.

 

Manifold automatically parallelizes and dispatches as many tasks as make sense to GPGPU, with automatic fallback to parallelized tasks dispatched to multiple CPU cores if a GPU is not available.  

About GPGPU

The California graphics chip company NVIDIA has long been known for producing outstanding graphics processors (GPUs) that have become popular as the basis for graphics cards.    Huge financial investments by NVIDIA and other GPU makers have been made possible by the runaway popularity and size of the computer gaming market.   

 

As an entertainment blockbuster, computer gaming is a far larger business financially than the worldwide movie industry.   Over the years that vast flow of money has financed ever more computationally demanding games, which in turn have financed the creation of ever more powerful graphics processors to handle the ever greater resolution and ever more complex calculations required to give modern computer games a greater sense of reality and action.

 

The speed and processing power required cannot be achieved using CPUs but require supercomputer architectures in which hundreds or thousands of processing units work together.  In the quest for maximum speed, NVIDIA GPUs have evolved far beyond single processors: modern NVIDIA GPUs are parallel supercomputers on a chip that consist of very many, very fast processing cores, potentially thousands per chip.   They are so much faster at processing than whatever CPU is running Windows on the motherboard that in comparison the CPU seems no more capable than a digital coffee pot.

 

As GPUs increased in power it very quickly became obvious to computer scientists that programs other than graphics microcode for games could be uploaded into a GPU for parallel execution by the thousands of processing units in the GPU.  Although the market impetus behind the creation of such supercomputers on a GPU chip was the computational demands of the PC gaming market, the scientific computing community began using GPUs for general purpose computing having nothing to do with games.   That GPU cards were absurdly inexpensive (compared to supercomputers) because of the vast economies of scale of computer gaming was icing on the cake.

 

It turns out that many mathematical computations, such as matrix multiplication and transposition, which are required for complex visual and physics simulations in games are also exactly the same computations that must be performed in a wide variety of computing applications, including GIS, data mining, simulations, image processing, complex statistical analyses and many other tasks where conventional CPUs are too slow.

 

At first it was almost an accidental discovery that programs other than graphics microcode could be uploaded for execution into the many processing units of a modern GPU.  But once it was realized there might be a market for such an approach that could provide far faster performance than running programs on the main CPU, NVIDIA took the chance of supporting the trend by investing resources into ensuring that NVIDIA GPUs could be used for GPGPU applications, and by supporting such use with software and with architectural features in their GPUs.

 

NVIDIA created the CUDA (Compute Unified Device Architecture) interface library to allow applications developers to write code that can be uploaded into an NVIDIA GPU card for massively parallel execution by the many processing cores in the GPU. The CUDA library allows applications developers to write applications that will work with a very wide variety of NVIDIA GPUs, and it ensured that NVIDIA chips got off to an early lead among GPU vendors for use in GPGPU applications.

 

GPGPU offers such tremendous performance gains that all Manifold products now are designed to exploit GPGPU whenever feasible.   Manifold automates this process at a breadth and depth never before seen in a commercial product, with automatic use of GPGPU throughout Manifold and Manifold SQL.  If we have a reasonably recent  NVIDIA GPU installed in our system, Manifold can take advantage of the phenomenal power of massively parallel processing to execute many tasks at much greater speed.

 

Because NVIDIA technology benefits from enormous economies of scale in the gaming market, GPGPU-capable cards have become very inexpensive for the performance they provide, with a wide range of GPGPU-capable graphics cards available at various prices and performance levels.  It is easy and inexpensive to choose a card with the desired balance between performance and cost (more stream processors running at a faster clock rate with more memory gives better performance).

 

Based on experience from GPGPU-enabled Manifold products it is clear that GPGPU will revolutionize computation.  GPGPU processing is so fast that developers routinely say GPGPU renders the main processor almost superfluous, as if even the fastest multi-core Intel chip is relegated to being nothing but an accessory processor to handle the keyboard and mouse. That is not hyperbole given that GPUs can routinely run many computations hundreds of times faster than even the fastest Intel CPUs.

Manifold and GPGPU

The first appearance of GPGPU code in Manifold products was in Manifold System Release 8.00, which has a limited but nonetheless extremely powerful ability to utilize GPGPU without requiring users to write low-level CUDA code or otherwise deal with the intimidating complexity of parallel programming.

 

Manifold 8 includes a Surface - Transform dialog that enables users to write expressions which perform computations on surfaces using straightforward expression syntax that is similar to how expressions can be written in SQL.  Expressions written in the Surface - Transform dialog in 8 can utilize a wide range of Manifold 8 functions and operators, including 38 functions that were parallelized to utilize GPGPU automatically for computations if an NVIDIA GPU is available.   

 

The Surface - Transform dialog in Manifold 8 takes an expression that can reference one or more surfaces, parses that expression, evaluates it using CPU computations together with automatic dispatch to GPGPU if functions supported for GPGPU are utilized in the expression and then saves the result into a new or existing surface.

 

Since GPGPUs can perform computations for hundreds or thousands of pixels at once, the Surface - Transform tool in Manifold 8 provides a significant performance benefit over doing the same computations on CPU, performing some computations hundreds of times faster on GPGPU than possible on the CPU.

 

The Surface - Transform tool in Manifold 8  works best when subexpressions are relatively bulky functions such as Aspect or Slope. That is because each subexpression in the tool must copy data to GPGPU device memory, perform computations in GPGPU and then copy the result back from GPGPU device memory into main memory.   GPGPUs operate on data within the GPU device's local memory so there is an explicit copy step in both directions between the GPU device and the main computer memory used by the CPU when subexpressions are evaluated by the Surface - Transform tool in Manifold 8.  

 

The overhead in Manifold 8 to and from GPGPU for small operations such as + or sin is so big that it does not make sense for the Surface - Transform tool to do them in GPGPU. It's faster just to do small operations on the main CPU.  Therefore GPGPU in Manifold 8 is used only for more complex functions such as filters where the gain from using GPGPU is beyond the break-even time required to move data to and from the GPGPU device.

 

Manifold 8 makes no use of GPGPU besides the Surface - Transform dialog and its ability to utilize GPGPU within a set of 38 functions that operate on surfaces.   There is no general GPGPU utilization within SQL in Manifold 8, for example, and no use of GPGPU for other tasks such as reprojecting images, manipulating vector drawings and so on.

 

In contrast, the Radian engine and Manifold products such as Manifold System GIS that are based on Radian are  designed to utilize GPGPU automatically within SQL.   That implies effective use of GPGPU throughout the entire system since SQL in Manifold underlies virtually everything in Manifold.

 

To use GPGPU effectively throughout the entire system Manifold must significantly reduce the overhead of utilizing GPGPU in virtually all situations, enabling expressions with many small calls to be accelerated as well as expressions with only a few bulky calls. Being able to accelerate a vast range of small calls has many big benefits, most importantly allowing the acceleration of bulky expressions constructed in queries that in turn utilize many small calls.   This allows users to write bulky, complex queries in SQL that can benefit from GPGPU performance even if such queries use no significantly complex or bulky built-in functions.

 

To accomplish such unprecedented effectiveness in situations large and small, GPGPU utilization within Manifold went through numerous iterations during development, with each iteration improving on the previous one.

 

 

The result of such continuous improvement is that now when we write something like SELECT tilea + tileb * 5 + tilec * 8 FROM ..., the Manifold engine takes the expression with two additions and two multiplications, generates GPGPU code for that expression in a Just In Time (JIT) manner, uploads the resulting code to GPGPU, then uses that code to execute the computations.
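Written out as a full query, the example above might look like the following sketch; the table name Rasters and the tile field names are hypothetical.

```sql
-- Hypothetical table [Rasters] with tile fields [TileA], [TileB], [TileC].
-- The tile arithmetic below is JIT-compiled into GPGPU code and executed
-- on the GPU, with the scalar constants 5 and 8 baked into that code.
SELECT [TileA] + [TileB] * 5 + [TileC] * 8
  FROM [Rasters];
```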

 

To save execution time and boost efficiency, JIT code generation for GPGPU functions is cache-friendly for the driver. Running the same query again, or even running different queries for which the GPGPU expressions are sufficiently similar to each other, will engage the  compilation cache maintained by the driver.

Automatic GPGPU Utilization

GPGPU acceleration works everywhere in Manifold SQL where worthwhile work arises: in the SELECT list, in WHERE, in EXECUTE, everywhere. For example, if we add to a table a computed field that combines multiple tiles together, that computed field will use GPGPU.  If we do some tile math in a FUNCTION, that FUNCTION will use GPGPU as well.
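As a sketch of the FUNCTION case, assuming the usual Manifold SQL FUNCTION syntax, with a hypothetical function name, table, and tile fields:

```sql
-- A FUNCTION doing tile math: because the body operates on tiles, it is
-- a candidate for GPGPU dispatch.  The name, parameters, and averaging
-- computation are hypothetical.
FUNCTION TileAverage(@a TILE, @b TILE) TILE AS
  (@a + @b) / 2
END;

SELECT TileAverage([TileA], [TileB]) FROM [Rasters];
```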

 

If we save the project using that computed field or FUNCTION into a Manifold .map file and then bring that .map file onto a machine running Manifold that has no GPGPU, Manifold will execute the computed field by automatically falling back to CPU parallelism, taking advantage of as many CPU cores as are available instead of GPGPU.    If we bring the .map file back onto a machine that has a GPGPU, Manifold will automatically use the GPGPU.

 

Other optimizations play along transparently. If a particular subexpression inside an expression that runs on GPGPU is a constant in the context of that expression, it will only be evaluated once. If an expression that can run on GPGPU refers to data from multiple tables and has parts that only reference one of those tables, the join optimizer will split the GPGPU expression into pieces according to dependencies and will run those pieces separately and at different times, minimizing work. A SELECT with more than one thread will run multiple copies of GPGPU expressions simultaneously. There are many other similar optimizations automatically integrated with GPGPU utilization.
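The constant-subexpression optimization can be seen in a sketch like the following, again with hypothetical table and field names:

```sql
-- The scalar subexpression (255 / 2.0) is constant in the context of
-- the tile expression, so it is evaluated once, not once per tile.
SELECT [Tile] * (255 / 2.0) FROM [Rasters];
```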

 

Note that some operations are so trivial in terms of computational requirements it makes no sense to dispatch them to GPGPU, the classic case being scalars (trivial) as opposed to tiles (more bulk).  CASE expressions, conditionals and similar constructions or functions that operate on scalar values stay on the CPU while functions that operate on tile values generally go to GPGPU unless they use tiles in a trivial fashion, such as making a simple comparison.  

 

Some examples:

 

 

CASE conditions are scalar, so they stay on CPU.   When CASE is used with tiles, whether it is faster to dispatch the task to GPGPU depends on exactly how the tiles are used.  Some examples, where v is a scalar value and t1, t2, and so on are tiles:

 

CASE WHEN v=2 THEN t1 ELSE t2 END

 

In the above not much is being done with the tiles so the entire construction stays on CPU.

 

CASE v WHEN 3 THEN TileAbs(t1) + t2*t3 + TileSqrt(t4) ELSE t1 END

 

In the above, the expression in THEN will go to GPGPU while the rest of CASE will stay on CPU.

 

CASE WHEN t1 < t2 THEN 0 ELSE 8 END

 

In the above the comparison in WHEN does use tiles but it uses them like raw binary values, similar to how ORDER works, so it is more efficient to leave it on CPU.

Windows and GPGPU

While automatic parallelization and dispatch of tasks to GPGPU is certainly cool, doing so with the breadth and depth of Manifold is entirely unprecedented in a commercial software product.   As a new technology extensively implemented on Windows for the first time, Manifold's extensive use of GPGPU has revealed some interesting new effects in how Windows interacts with GPGPU computation.

 

The key new effect of interest is that Windows does not always play well with a GPU that is used both for display functions and for GPGPU computations.  When the same GPU is used both for display and GPGPU computation, if the code dispatched to the GPU for GPGPU computations either fails or runs for a long time, a Windows watchdog service that monitors how long it takes the GPU to process display requests will reboot the graphics stack, flushing the GPGPU computation.  That will then cause Manifold to fall back to re-doing the calculation on parallelized CPU cores as if GPGPU had not been available.   That is a "fail safe" action by Manifold, but one which can result in much longer computation time than expected, given the slower performance of CPU cores in tasks that run faster on GPGPU.

 

While bugs that cause failure of GPGPU code should, of course, either be eliminated during pre-production testing or become increasingly rare as they are eliminated in updates, the possibility of long-running GPGPU tasks will always remain because computation times depend upon the amount of data involved and the complexity of computation.

 

If we have only a single GPU in our system it will be used both for display by Windows and for GPGPU computations by Manifold.  Manifold and GPGPU calculations in general are so fast that Windows rebooting the graphics stack due to longer-running computations on GPGPU should be rare, so rare that almost all users will never encounter such effects.  There are ways of turning the watchdog service off in Windows, but doing so may result in the display becoming obviously less responsive while long-running GPGPU tasks are being executed.

 

It should be emphasized that performance-robbing interactions between Windows using a GPU for display and code using the same GPU for GPGPU computations should be rare.  It also makes sense to expect that anyone running tasks so sophisticated that they intensively utilize GPGPU is unlikely to have only a single GPU in their system, given the insignificant cost (relatively speaking) of configuring a system with multiple GPUs.  If we have multiple GPUs in a system, we can use PRAGMA to specify which GPUs are used for GPGPU, choosing a GPU not used by Windows for display.  But that ends up requiring a pragma directive for each such query.
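The general shape of such a query is sketched below; the 'gpgpu' option and value shown are illustrative, and the exact directives, including those for selecting devices, are documented in the PRAGMA topic.

```sql
-- A pragma directive applies to the query that follows it, which is why
-- each query that should steer GPGPU use needs its own directive.
PRAGMA ('gpgpu' = 'aggressive');
SELECT [TileA] + [TileB] FROM [Rasters];
```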

 

A brute force strategy to force Windows not to touch GPUs utilized for computation is to install an inexpensive, non-GPGPU-capable card to run displays and to plug in additional GPGPU-capable cards for computations.  For example, we could use an older, pre-Fermi NVIDIA card, an AMD card, or a built-into-the-motherboard Intel graphics chip for our display, none of which will be recognized as GPGPU-capable by Manifold, and then we could plug in one or more Fermi or later NVIDIA GPU cards to run GPGPU calculations.   Windows will then be able to happily watch the non-GPGPU graphics unit it is using for display while the GPGPU devices can take as long as they want to do their work.

 

Such a strategy may be taking too much of a "belt and suspenders" approach to eliminating the possibility that Windows will interfere with long-running GPGPU calculations on the card used for displays, but it could be a useful plan if massive GPGPU utilization is expected.

Notes

Does GPU parallelism speed up everything?  - No.  It only speeds up computational tasks that are big enough to be worth dispatching to the GPU.    It does nothing at all for tasks that involve no computation.   Consider a thought experiment, for example:  suppose we want to copy a 10 GB file in Windows from one disk drive to another disk drive.   We could have a hundred GPUs in our system and the job will not go any faster, because it involves no computation.  It simply involves moving bytes between disk drives, which is a task that basically requires waiting around for bytes to be read from a terribly slow, rotating disk platter and then written onto the destination, terribly slow disk platter.   Reading a large shapefile, for example, is not going to go any faster with GPU because that task also is all about waiting to get information off disk.  There is no thought involved for the processor and no computation to speed up, just the very slow wait for bytes to come in from disk.

 

In contrast, a complex calculation doing much sophisticated mathematics for every pixel in a raster data set quite likely will gain significant performance from using GPU.  The rule of thumb is that if the job is top-heavy with lots of computation, GPU parallelism will help.   However, even then, the competition for GPU is CPU parallelism.

 

Keep in mind that Manifold automatically parallelizes both for CPU and GPU.    Modern CPUs will often provide eight, twelve or more cores which can execute complex calculations with astonishing speed.  When all of a CPU's cores are engaged in a parallelized task by Manifold, a modern eight-core CPU providing sixteen hypercores can execute many tasks so fast that it will be done before the job could be dispatched to GPU.  In that case, Manifold will run the job using fully parallel CPU since dispatching it to GPU would be slower.  

 

GPUs are so inexpensive it is always a good idea to toss an NVIDIA GPU into our systems.  If we have bigger computational tasks, we can install more than one GPU and we can spend a bit more for a faster GPU with more CUDA cores in it.  For most people, there is not much point in buying the latest, most expensive GPU, since even inexpensive GPUs are so fast that the jobs that will go significantly faster on a super-expensive GPU are few and far between.   Even with very computationally intensive jobs, usually a mid-range GPU is plenty.

 

Is an Intel graphics chip or AMD chip GPGPU capable?  -  Not within Manifold, and not within almost all other GPGPU-capable applications.   AMD and Intel both make fine products, including graphics chips.   However, NVIDIA pulled ahead of both AMD and Intel in GPGPU capabilities by making a big early investment, years before AMD and Intel, in supporting GPGPU on NVIDIA GPUs with the CUDA library and with specific features within GPU silicon to support GPGPU operation.   As a result, by probably a thousand-to-one margin, applications which do GPGPU run on NVIDIA GPUs but not on AMD or Intel GPUs.   AMD and Intel are now playing catch-up, but given how far behind they are in GPGPU applications it is unclear when or if it will be possible for Manifold to provide GPU parallelism using AMD or Intel GPUs.   For now, if we have an AMD or Intel graphics chip we cannot use it for GPU parallelism with Manifold.

 

Use the latest code  - As always, it is a good idea when using GPGPU devices to update the video driver for the device from NVIDIA.  This will ensure using the latest iteration of NVIDIA updates for the device, including for GPGPU.

 

Should I spend more on GPU, on CPU, or on other hardware? -   Always have at least one GPGPU-capable card in your computer that delivers a few hundred CUDA cores.   GPU computation is so fast that very quickly other parts of the system will become the bottleneck, so it usually makes no sense to spend many thousands on one or more GPU cards to plug into a four-core CPU machine that has limited memory and a slow hard disk.    A single, mid-range GPU card costing $200 to $600 has so much GPGPU power that spending more is not necessary for almost all GIS work.    Even after we invest in an eight- to twelve-core CPU with plenty of memory and multi-terabyte SSD drives, in most GIS work we might not see any difference between spending $500 on a GPU card and $2000 on a GPU card.    Specialized applications that do very intensive mathematical calculations, of course, will show greater differences sooner.  Because Manifold is very efficient at wringing the most out of GPGPU, the rule of thumb is to not overspend on GPU while under-spending on a many-core CPU, SSD and main memory.

 

Must I use Quadro or Tesla?  - NVIDIA's higher end cards are sold under Quadro and Tesla branding, indicating features such as ECC memory for automatic error checking and correction.  Such versions cost more than the analogous cards sold in  NVIDIA's "gaming" brands such as GeForce and Titan.  Manifold is happy to work with all NVIDIA cards that are designed to be plugged into standard PCs and are supported with Windows CUDA drivers.   

 

What does the SystemGpgpuCount() function do?  - The other function shown in the Command Window illustration above reports the number of GPGPU-capable GPUs in the system.

 

il_cmd_window_gpgpu01_01.png

 

In the Command window we enter:

 

? SystemGpgpuCount()

 

... and then we press the ! run button in the main toolbar.   We wrote ? without a CALL because the function returns a number, not a table, so it is evaluated without using CALL.

 

il_cmd_window_gpgpu01_02.png

 

The report is a straightforward count of how many GPUs we have that are usable for GPGPU.  

 

See Also

Command Window

 

PRAGMA

 

NVIDIA Home Page