The graphics cards that we use for gaming and visual enhancement have two basic components: a Graphics Processing Unit (GPU) and off-chip DRAM. GPUs are designed for compute-intensive jobs, where CPUs are too slow. CPUs, on the other hand, are designed for data caching and flow control, tasks for which GPUs are poorly suited.
GPUs in general have a highly parallel architecture, and some of NVIDIA’s GPUs in particular have 240 cores per processor (compare this with modern CPUs: 2, 4 or 8 cores). With such a parallel architecture, GPUs provide an excellent computational platform, not only for graphical applications but for any application with significant data parallelism. GPUs are thus not limited to use as a graphics engine; they are a parallel computing architecture capable of performing floating-point operations at rates on the order of a teraflop (10^12 floating-point operations per second).
People realized the potential of GPUs for computationally heavy tasks long ago and have been working on general purpose computation on GPUs (GPGPU) for a long time. However, life before NVIDIA’s Compute Unified Device Architecture (CUDA) was extremely difficult for the programmer, who had to express every computation through a graphics API such as OpenGL or Direct3D. These APIs also take a long time to learn. CUDA addresses these problems by providing a hardware abstraction that hides the inner details of the GPU, so the programmer is freed from the burden of learning graphics programming.
CUDA is the C language with some extensions for processing on GPUs. The user writes C code and marks the functions that should run on the GPU, and the compiler splits the code into two portions. One portion is delivered to the CPU (because the CPU is best for control-heavy tasks), while the other portion, involving extensive calculations, is delivered to the GPU(s), which execute it in parallel. Because C is a familiar programming language, CUDA has a gentle learning curve, and hence it is becoming a favorite tool for accelerating various applications. NVIDIA’s CUDA SDK is being employed in a plethora of fields, from computational finance to neural networks, fuzzy logic and simulations for nanotechnology.
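To make the CPU/GPU split concrete, here is a minimal sketch of an element-wise vector addition. The names add_kernel and add_on_gpu are made up for this illustration, and the block size of 256 threads is an arbitrary assumption; only the cuda* calls and the __global__ / <<<...>>> syntax come from CUDA itself.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Device portion: compiled for the GPU, executed by many threads at once.
// Each thread handles one element.
__global__ void add_kernel(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // the last block may have spare threads
        out[i] = a[i] + b[i];
}

// Host portion: ordinary C code on the CPU that allocates GPU memory,
// copies the inputs over, launches the kernel and copies the result back.
// Returns cudaSuccess or the first error encountered.
cudaError_t add_on_gpu(const float *a, const float *b, float *out, int n)
{
    assert(n > 0);
    size_t bytes = (size_t)n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_out = NULL;

    cudaError_t err = cudaMalloc((void **)&d_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_out, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        int threads = 256;                         // assumed block size
        int blocks = (n + threads - 1) / threads;  // round up
        add_kernel<<<blocks, threads>>>(d_a, d_b, d_out, n);
        err = cudaGetLastError();                  // catches launch errors
    }
    if (err == cudaSuccess) err = cudaMemcpy(out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);    // cudaFree(NULL) is a harmless no-op
    cudaFree(d_b);
    cudaFree(d_out);
    return err;
}
```

A caller on the CPU side would simply pass three ordinary C arrays to add_on_gpu and check that the return value is cudaSuccess. The file would be compiled with nvcc, which separates the __global__ function from the host code.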
CUDA has several advantages over traditional general purpose computation on GPUs (GPGPU) using graphics APIs.
· Scattered reads – code can read from arbitrary addresses in memory.
· It is high level – basically an extension to the C language, so it is much easier to learn than traditional GPGPU.
· Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
· Faster downloads and readbacks to and from the GPU
· Full support for integer and bit wise operations
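The shared memory point is easiest to see in a reduction. The sketch below sums an array: each block stages its slice of the input in a __shared__ array and adds it up there, so the repeated reads and writes hit the fast on-chip memory instead of DRAM. The names block_sum_kernel, sum_on_gpu and BLOCK_SIZE are invented for this example, and the block size of 256 is an assumption (the loop needs it to be a power of two).

```cuda
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256   /* assumed threads per block; must be a power of two */

// Each block sums BLOCK_SIZE input elements and writes one partial sum.
__global__ void block_sum_kernel(const float *in, float *partial, int n)
{
    __shared__ float tile[BLOCK_SIZE];   // user-managed cache, 1 KB per block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // one DRAM read per thread
    __syncthreads();                     // wait until the tile is filled

    // Tree reduction carried out entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = tile[0];
}

// Host driver: runs the kernel, then adds the per-block partial sums on the CPU.
cudaError_t sum_on_gpu(const float *in, int n, float *result)
{
    assert(n > 0);
    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *d_in = NULL, *d_partial = NULL;
    float *h_partial = (float *)malloc((size_t)blocks * sizeof(float));
    if (h_partial == NULL) return cudaErrorMemoryAllocation;

    cudaError_t err = cudaMalloc((void **)&d_in, (size_t)n * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_partial, (size_t)blocks * sizeof(float));
    if (err == cudaSuccess) err = cudaMemcpy(d_in, in, (size_t)n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        block_sum_kernel<<<blocks, BLOCK_SIZE>>>(d_in, d_partial, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(h_partial, d_partial, (size_t)blocks * sizeof(float), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        float total = 0.0f;
        for (int b = 0; b < blocks; ++b) total += h_partial[b];
        *result = total;
    }

    cudaFree(d_in);
    cudaFree(d_partial);
    free(h_partial);
    return err;
}
```

This is a deliberately simple version. It uses 1 KB of the shared region per block, and a tuned reduction would go further, but it shows the pattern: load from DRAM once, then let the threads of a block cooperate through shared memory.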
In short, CUDA lets you exploit these tiny supercomputers, i.e. the GPUs that ship with your graphics cards, and lets you accelerate your applications significantly, sometimes by a factor of 100 or even more, depending on how smartly you exploit the resources of the GPU.
So why should one use CUDA?
Though GPUs have way more cores than CPUs, that is not the main reason for using GPUs. In fact, the typical clock speed of a GPU core is way lower than the CPU clock speeds of today. Secondly, most financial problems are very sequential. However, they are also repetitive, i.e., pricing a single security is sequential, but you can price more securities with more cores. The real power of GPUs is in their ability to handle floating-point arithmetic more efficiently and, more importantly, in their SIMD (single instruction, multiple data) support. Suppose you have to add two vectors. A CPU will take linear time to execute the add operation, because you will have a loop in your code that adds each element separately. GPUs, on the other hand, support vector add instructions, which can typically add up to 128 elements in constant time.
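The "sequential per security, parallel across securities" idea can be sketched as follows. As a stand-in for a real pricing model, the example uses the textbook present value of a fixed-coupon bond with annual payments, PV = sum over t = 1..T of coupon/(1+r)^t, plus face/(1+r)^T. The Bond struct and the names bond_price, price_kernel and price_on_gpu are assumptions made for this illustration.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Hypothetical security description used only for this sketch.
struct Bond {
    float face;    // amount repaid at maturity
    float coupon;  // payment at the end of each year
    float rate;    // annual discount rate, e.g. 0.05f for 5%
    int years;     // years to maturity
};

// Pricing ONE bond is a sequential loop over its cash flows.
// __host__ __device__ lets the same function run on the CPU and the GPU.
__host__ __device__ float bond_price(Bond b)
{
    float discount = 1.0f;
    float pv = 0.0f;
    for (int t = 1; t <= b.years; ++t) {
        discount /= (1.0f + b.rate);   // discount factor for year t
        pv += b.coupon * discount;
    }
    return pv + b.face * discount;     // add the discounted face value
}

// Pricing MANY bonds is parallel: one thread per bond.
__global__ void price_kernel(const Bond *bonds, float *prices, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        prices[i] = bond_price(bonds[i]);
}

cudaError_t price_on_gpu(const Bond *bonds, float *prices, int n)
{
    assert(n > 0);
    Bond *d_bonds = NULL;
    float *d_prices = NULL;

    cudaError_t err = cudaMalloc((void **)&d_bonds, (size_t)n * sizeof(Bond));
    if (err == cudaSuccess) err = cudaMalloc((void **)&d_prices, (size_t)n * sizeof(float));
    if (err == cudaSuccess) err = cudaMemcpy(d_bonds, bonds, (size_t)n * sizeof(Bond), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        price_kernel<<<(n + 255) / 256, 256>>>(d_bonds, d_prices, n);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(prices, d_prices, (size_t)n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_bonds);
    cudaFree(d_prices);
    return err;
}
```

With only a handful of bonds, the copies to and from the card will dominate and the CPU loop wins. The approach starts to pay off when there are many thousands of securities to price in one launch.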
But all this power comes at a cost.
1) You lose portability. GPU code is very much tied to a specific vendor and specific hardware.
2) The programming paradigm is different. Once you are on a GPU, the OS has very little role in resource management. So applications have to manage resources such as cores, the several types of memory and registers on the GPU themselves, and also make sure that they are not stepping on each other’s resources.
3) The amount of memory on a GPU is limited. So your data structures have to be more compact and less fragmented, and the application on the CPU will have to move the data to the GPU in bits and pieces and drive the algorithm.
4) Unless you are developing everything from scratch, integrating with existing code is going to be tricky and painful.
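Points 2 and 3 can be illustrated with a small sketch in which the CPU drives the work and feeds the GPU one piece at a time through a single, explicitly managed device buffer. The names scale_kernel and scale_in_chunks, and the idea of scaling an array by a constant, are placeholders chosen for this example; a real application would pick the chunk size from the memory actually available on the card.

```cuda
#include <assert.h>
#include <cuda_runtime.h>

// Placeholder computation: multiply every element by a constant.
__global__ void scale_kernel(float *x, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= factor;
}

// Processes 'data' in place, at most 'chunk' elements at a time, so that
// only chunk * sizeof(float) bytes of GPU memory are needed.
cudaError_t scale_in_chunks(float *data, int n, float factor, int chunk)
{
    assert(n > 0 && chunk > 0);
    float *d_buf = NULL;

    // The application, not the OS, allocates and later frees GPU memory.
    cudaError_t err = cudaMalloc((void **)&d_buf, (size_t)chunk * sizeof(float));

    for (int start = 0; err == cudaSuccess && start < n; start += chunk) {
        int count = (n - start < chunk) ? (n - start) : chunk;
        size_t bytes = (size_t)count * sizeof(float);

        // Move one piece to the GPU ...
        err = cudaMemcpy(d_buf, data + start, bytes, cudaMemcpyHostToDevice);
        if (err != cudaSuccess) break;

        // ... run the kernel on it ...
        scale_kernel<<<(count + 255) / 256, 256>>>(d_buf, count, factor);
        err = cudaGetLastError();
        if (err != cudaSuccess) break;

        // ... and bring the result back before sending the next piece.
        err = cudaMemcpy(data + start, d_buf, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_buf);
    return err;
}
```

Each round trip over the bus costs time, so this pattern only makes sense when the per-chunk computation is heavy enough to outweigh the transfers. The trivial scaling kernel here is just a stand-in for such work.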
For more information see http://www.nvidia.com/objects/cuda_home.html