Wednesday, June 9, 2010

Complexity Squared

  I have made some real progress climbing Mt. CUDA this week. I actually compiled and ran my first CUDA application.

  I decided to keep things simple by multiplying a pair of 2X2 matrices using a single block of 4 threads, one thread for each element of the product matrix. I could almost hear the CUDA device laugh as I wrote the code, because this really isn't a very challenging problem for a CUDA solution. It is, however, a problem that can be solved using parallel methods, so I thought, as do the books I'm reading, that this was a good place to start.
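
  The kernel for that first version is tiny. Something along these lines (a minimal sketch of the idea, not my exact code; the kernel name is a placeholder):

//Each of the 4 threads in a single 2X2 block computes one
//element of the 2X2 product matrix Pd = Md * Nd
__global__ void MatrixMul2x2(int* Md, int* Nd, int* Pd)
{
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    int Pvalue = 0;
    for (int k = 0; k < 2; k++)
    {
        Pvalue += Md[ty * 2 + k] * Nd[k * 2 + tx];
    }
    Pd[ty * 2 + tx] = Pvalue;
}

  The launch is then just a single block of 2X2 threads: MatrixMul2x2<<<1, dim3(2, 2)>>>(Md, Nd, Pd);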

  Everything was pretty straightforward until I was ready to compile the code. My first approach was to copy the build script that came with the CUDA SDK and adapt it to my own code. It failed miserably, and I soon found myself trying to learn how to write build scripts. Thankfully, I realized that this is a pointless endeavor at this stage in the game, given the simplicity of the programs I am writing. That sent me to the nvcc compiler documentation, where I learned that all I needed was a single command in the terminal to compile my code.

nvcc MatrixMult.cu -o MatrixMult

  I entered the above command and it worked! Here is a picture to prove it.

  Since everything was working, I decided to make things a little more complex by multiplying a pair of randomly generated 256X256 matrices using a 4X4 grid of 4X4 blocks. That gives me 16 threads on each of 16 blocks, for a grand total of 256 threads. This is a somewhat more reasonable problem for CUDA, but still nowhere near the millions of threads that I have access to.
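
  The host-side launch for that configuration looks something like this (a sketch; dimGrid and dimBlock are names I'm using here for illustration):

//4X4 grid of 4X4 blocks: 16 threads per block on 16 blocks = 256 threads
dim3 dimGrid(4, 4);
dim3 dimBlock(4, 4);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);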

  The added complexity of using more than one block of threads required me to do some extra math to derive the row and column index of the element each thread computes. The kernel is shown below:

//broken code
__global__ void MatrixMulKernel(int* Md, int* Nd, int* Pd, int Width)
{
    //2D thread and block IDs
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int bx = blockIdx.x;
    int by = blockIdx.y;

    //Row and column of the Pd element computed by this thread
    int row = by * blockDim.y + ty;
    int col = bx * blockDim.x + tx;

    //Pvalue stores the Pd element that is computed by the thread
    int Pvalue = 0;

    for (int k = 0; k < Width; k++)
    {
        Pvalue += Md[row * Width + k] * Nd[k * Width + col];
    }

    //Write the matrix to device memory; each thread writes one element
    Pd[row * Width + col] = Pvalue;
}
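
  To make the index math concrete: with 4X4 blocks, the thread at threadIdx (2,3) in block (1,1) computes row = 1 * 4 + 3 = 7 and col = 1 * 4 + 2 = 6, so it is responsible for element [7,6] of the product.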

  You may notice that it says "//broken code" at the top of the code block. That is because this code is NOT working properly. In the screen shot of the output below, you may notice that only the products in the top left quadrant of the matrix "look" like reliable data. Keep in mind that my seed data are values from 0 to 4 for each element; I did this to keep the elements of the product matrix relatively small. I also shrank the matrix (after noticing a problem) to 32X32 in order to keep the complexity small. The fact that only the top left quadrant "looks" like reliable data, while the rest of the matrix holds values far too small to be correct, gives me a clue: I believe that only block (0,0) is working correctly. I even checked some of the values by hand and found that some, but not all, are correct. Element [0,0] is correct; element [15,15] is not. This makes me think that I have an index derivation error in my code. Here is the screen shot... I will let my readers contemplate it as I do.
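
  Rather than checking values by hand, a host-side reference multiply can compare the whole product at once. Here is a sketch of the kind of check I mean (CheckResult is a name I'm making up, not part of my actual program):

#include <stdio.h>

//Compute the product on the CPU and report any element where the
//device result disagrees with the reference
void CheckResult(const int* M, const int* N, const int* P, int Width)
{
    for (int row = 0; row < Width; row++)
    {
        for (int col = 0; col < Width; col++)
        {
            int expected = 0;
            for (int k = 0; k < Width; k++)
            {
                expected += M[row * Width + k] * N[k * Width + col];
            }
            if (P[row * Width + col] != expected)
            {
                printf("Mismatch at [%d,%d]: got %d, expected %d\n",
                       row, col, P[row * Width + col], expected);
            }
        }
    }
}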


   So this is where I am: debugging some slightly more complex code. Hopefully it will help me discover new insights into this powerful technology. Thanks for reading!


*UPDATE*
 As I was proofreading my blog entry, I think I discovered the problem. With my current implementation of a 4X4 grid of 4X4 blocks, the biggest matrix I can compute is 16X16, not 32X32 or 256X256! This is why only a 16X16 region was computed in a manner that "looked" correct. The reason the values were off even in that block of the matrix is that I use the full width of the matrix during the index computation, and it was throwing everything off. I was falling off of the thread array, so to speak. This is why it is important to take a step back, breathe, and look at everything from a fresh perspective. I was so worried about syntax and index derivation that I forgot how to do some simple math. Garbage in, garbage out, as they say.
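
 The fix, then, is to derive the grid dimensions from the matrix width instead of hardcoding a 4X4 grid. Something like this sketch should keep the thread array covering the whole matrix (TILE_WIDTH is a placeholder name I'm introducing here):

#define TILE_WIDTH 4

//Round the grid size up so a Width X Width matrix is fully
//covered even when Width is not a multiple of TILE_WIDTH
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH);
MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

 With the grid rounded up, any threads that land past the edge of the matrix would also need a bounds check in the kernel (if (row < Width && col < Width)) before writing to Pd.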

1 comment:

  1. Hey Stephen,
    Very good job, indeed! I am pleased to see your progress.
    I sent you an email with some papers that deal with CUDA applications for Numerical Relativity. Look over them at your convenience. The Star symposium deadline has been extended to July 9. However we need to be on track.
