I finally got to write some code today! Matrix multiplication is the "hello world" of parallel computation, so I figured that would be a good place to start. I should take a moment to explain why it is a good function to port to a multithreaded platform like CUDA: each element of the product matrix is the dot product of one row of A and one column of B, and no element depends on any other, so they can all be computed at the same time. For simplicity I will be multiplying square matrices of equal size.
First let me show an iterative implementation of matrix multiplication.
// C = A * B
// size is the width or height of the square matrix
for (int i = 0; i < size; i++)
{
    for (int j = 0; j < size; j++)
    {
        float sum = 0;
        for (int k = 0; k < size; k++)
        {
            float a = A[i * size + k];
            float b = B[k * size + j];
            sum += a * b;
        }
        C[i * size + j] = sum;
    }
}
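If you want to try the loop nest above on its own, here is a minimal sketch that wraps it in a function. The name `matmul_host` and the choice of flat, row-major `float` arrays are my assumptions for illustration; nothing here comes from the CUDA SDK.

```c
#include <assert.h>

/* Hypothetical helper (the name is an assumption): C = A * B for
   square, row-major matrices that are `size` elements wide. */
void matmul_host(const float *A, const float *B, float *C, int size)
{
    assert(A && B && C && size > 0);
    for (int i = 0; i < size; i++) {          /* row of C */
        for (int j = 0; j < size; j++) {      /* column of C */
            float sum = 0.0f;
            for (int k = 0; k < size; k++) {  /* row i of A dot column j of B */
                sum += A[i * size + k] * B[k * size + j];
            }
            C[i * size + j] = sum;
        }
    }
}
```

For example, calling it with the 2x2 matrices {1, 2, 3, 4} and {5, 6, 7, 8} should fill C with {19, 22, 43, 50}.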
Now for the CUDA implementation.
// each thread computes one element of C
int tx = threadIdx.x; // column of C
int ty = threadIdx.y; // row of C
float c = 0;
for (int k = 0; k < size; k++)
{
    float a = A[ty * size + k];
    float b = B[k * size + tx];
    c += a * b;
}
C[ty * size + tx] = c;
The first thing you will notice is that the CUDA implementation has only one loop where the iterative implementation has three. The other thing you will notice is that there are two more variables in the CUDA code (tx and ty). They hold the values of threadIdx.x and threadIdx.y, the built-in thread index, which is how a CUDA thread distinguishes itself from the others. Since we use these index values to access the matrices, each thread calculates a single element of the product matrix C, and the two outer loops are replaced by the threads themselves. The CUDA function is called a kernel, while the other is just a normal C function for the CPU, or "host".
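One way to convince yourself that the single-loop version does the same work is to emulate it on the CPU: treat the kernel body as a function of (tx, ty) and call it once per pair. This is only a sketch in plain C, not real CUDA. The names `kernel_body` and `emulate_launch` are hypothetical, and on an actual device these calls would run as parallel threads rather than one after another in a loop.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel body: the work done by ONE
   thread whose indices are (tx, ty). Plain C, not real CUDA. */
static void kernel_body(int tx, int ty,
                        const float *A, const float *B, float *C, int size)
{
    float c = 0.0f;
    for (int k = 0; k < size; k++) {
        c += A[ty * size + k] * B[k * size + tx];
    }
    C[ty * size + tx] = c;  /* this thread's single element of C */
}

/* Serial emulation of a size x size group of threads (an assumption
   made for illustration; the GPU would run these concurrently). */
void emulate_launch(const float *A, const float *B, float *C, int size)
{
    assert(A && B && C && size > 0);
    for (int ty = 0; ty < size; ty++) {
        for (int tx = 0; tx < size; tx++) {
            kernel_body(tx, ty, A, B, C, size);
        }
    }
}
```

Notice that the two outer loops of the CPU version have reappeared here, just moved out of the kernel and into the "launch". On the GPU the hardware supplies them for you.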
So this raises some more interesting questions. How does the compiler know what type of function you are writing and how are threads organized on a CUDA device?
CUDA introduces three new keywords to the C programming language for declaring functions: __device__, __global__, and __host__. These are used in the function declaration and tell the compiler where the function runs and who can call it. A __device__ function is executed on the CUDA device and called from the CUDA device. A __global__ function is executed on the CUDA device and called from the host; this is how you declare a kernel. A __host__ function is executed on the CPU and called from the CPU, and any function without one of these keywords is treated as a __host__ function. That default makes code easier to port over to CUDA.
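Soooooo what about these thread things? Well it goes like this... Threads are grouped into blocks, and blocks are grouped into a grid. Each kernel launch gets its own grid. Blocks are three-dimensional groups of threads. On early CUDA hardware (compute capability 1.x) a grid is a two-dimensional grouping of blocks, and you can have up to 512 threads per block and 65,535 blocks per grid dimension; newer devices raise these limits. How you organize these is up to you.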
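The kernel shown earlier only uses threadIdx, which is enough while the whole matrix fits in one block. Once there is more than one block, the usual pattern is to combine the block index and the thread index into a single global index. Here is a sketch of that arithmetic in plain C; the function name and parameters are hypothetical, and in a real kernel the values would come from the built-ins blockIdx, blockDim, and threadIdx.

```c
#include <assert.h>

/* Plain-C mirror of the CUDA expression
   blockIdx.x * blockDim.x + threadIdx.x.
   Function and parameter names are assumptions for illustration. */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    assert(block_idx >= 0 && block_dim > 0);
    assert(thread_idx >= 0 && thread_idx < block_dim);
    return block_idx * block_dim + thread_idx;
}
```

So with 16 threads per block along x, thread 3 of block 2 works on global column 2 * 16 + 3 = 35.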
You organize these blocks when you invoke a kernel. Here is the code for kernel invocation.
CudaFunctionName<<<dimGrid, dimBlock>>>(/* kernel arguments */);
The new CUDA syntax is the "<<< >>>" pair that you see above. This is how you pass the kernel configuration to the device. The two variables provide the dimensions of the grid and of each block. This is basically saying I want X blocks with Y threads each for this kernel.
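To pick X and Y in practice, a common approach is to choose a block width first and then work out how many blocks it takes to cover the matrix. The sketch below shows that round-up division in plain C. The name `blocks_needed` is hypothetical, and the block width of 16 used in the example afterwards is just an illustrative choice, not a recommendation.

```c
#include <assert.h>

/* Number of blocks needed along one dimension so that
   blocks * threads_per_block >= size (integer division, rounded up).
   The function name is an assumption for illustration. */
int blocks_needed(int size, int threads_per_block)
{
    assert(size > 0 && threads_per_block > 0);
    return (size + threads_per_block - 1) / threads_per_block;
}
```

For a matrix 100 elements wide and 16 threads per block dimension this gives 7 blocks per dimension, which would correspond to a dimBlock of 16 x 16 and a dimGrid of 7 x 7. That covers 112 threads per dimension, so the 12 extra threads would need a bounds check in the kernel before they touch the matrices.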
I've also been learning how to allocate and copy memory in CUDA and was going to explain that here. My post is getting a bit long so I think I will save that for next time. So until then I will leave you with this beautiful image of a Mandelbrot set I made using the CUDA SDK code examples.