Wednesday, May 26, 2010

Let's get parallel.


   I finally got to write some code today! Matrix multiplication is the "hello world" of parallel computation, so I figured that would be a good place to start. I think I should take this time to explain what matrix multiplication is and why it is a good function to port to a multithreaded platform like CUDA. For simplicity I will be multiplying square matrices of equal size.
    First let me show an iterative implementation of matrix multiplication.
//C = A * B
//size is the width or height of the square matrix
for (int i = 0; i < size; i++)        // row of C
{
   for (int j = 0; j < size; j++)     // column of C
   {
      float sum = 0;
      for (int k = 0; k < size; k++)  // dot product of row i of A and column j of B
      {
         float a = A[i * size + k];
         float b = B[k * size + j];
         sum += a * b;
      }
      C[i * size + j] = sum;
   }
}
Now for the CUDA implementation.
int tx = threadIdx.x;   // this thread's column index in C
int ty = threadIdx.y;   // this thread's row index in C

float c = 0;
for (int k = 0; k < size; k++)
{
   float a = A[ty * size + k];
   float b = B[k * size + tx];
   c += a * b;
}
C[ty * size + tx] = c;

 The first thing you will notice is that the CUDA implementation has only one loop where the iterative implementation has three. The other thing you will notice is that there are two more variables in the CUDA code (tx and ty). They store the thread's index values from threadIdx.x and threadIdx.y. This is how a CUDA thread distinguishes itself from the others. Since we use this index value to access the matrix, each thread calculates a single element of the product matrix C. The CUDA function is called a kernel, while the other is just a normal C function for the CPU, or "host".
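
To see how that code fits into a complete kernel, here is a minimal sketch of what the whole function might look like. The name matMulKernel and its parameter list are placeholders of my own, and it assumes A, B, and C already live in device memory (more on memory in a later post). The __global__ keyword in front of it is explained below.

// A sketch of the full kernel: one thread computes one element of C
__global__ void matMulKernel(float* C, float* A, float* B, int size)
{
   int tx = threadIdx.x;   // column of C this thread computes
   int ty = threadIdx.y;   // row of C this thread computes

   float c = 0;
   for (int k = 0; k < size; k++)
   {
      float a = A[ty * size + k];
      float b = B[k * size + tx];
      c += a * b;
   }
   C[ty * size + tx] = c;
}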

So this raises some more interesting questions. How does the compiler know what type of function you are writing and how are threads organized on a CUDA device?

CUDA introduces three new keywords to the C programming language: __device__, __global__, and __host__. These are used in the function declaration and tell the compiler where the code will execute. A __device__ function is executed on the CUDA device and can only be called from the CUDA device. A __global__ function is executed on the CUDA device but called from the host; this is how you declare a kernel. A __host__ function is executed on the CPU and called by the CPU. If you don't use one of these keywords the compiler treats the function as __host__, which makes existing code easier to port over to CUDA.
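
Here is a hedged sketch of how those qualifiers look in practice; the function names are placeholders I made up, not anything from the SDK.

// __device__: runs on the GPU and can only be called from GPU code
__device__ float square(float x)
{
   return x * x;
}

// __global__: a kernel; runs on the GPU but is launched from the host
__global__ void squareAll(float* data, int size)
{
   int i = threadIdx.x;
   if (i < size)
      data[i] = square(data[i]);
}

// __host__ (or no qualifier at all): ordinary C code that runs on the CPU
__host__ void doSomethingOnCpu()
{
   // plain host code goes here
}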

Soooooo what about these thread things? Well it goes like this... A grid is a two-dimensional grouping of blocks, and each kernel launch gets one grid. Blocks are three-dimensional groups of threads. You can have up to 512 threads per block and up to 65,535 blocks per grid dimension. How you organize these is up to you.

You organize these blocks when you invoke a kernel. Here is the code for kernel invocation.

CudaFunctionName<<<dimGrid, dimBlock>>>(/* function parameters */);

The new CUDA syntax is the "<<< >>>" you see above. This is how you pass the execution configuration to the device. The two variables provide the dimensions of the grid and of each block. This is basically saying: I want X blocks with Y threads each for this kernel.
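
For the matrix multiplication kernel from earlier, which only uses threadIdx, the configuration could look something like this. This is just a sketch under the assumption of a single block; matMulKernel is my placeholder name from above, and d_A, d_B, and d_C are pointers to device memory, which I'll cover next time.

// One block of size x size threads; each thread computes one element of C
dim3 dimGrid(1, 1);          // a 1 x 1 grid of blocks
dim3 dimBlock(size, size);   // size x size threads per block (size*size must not exceed 512 here)
matMulKernel<<<dimGrid, dimBlock>>>(d_C, d_A, d_B, size);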

I've also been learning how to allocate and copy memory in CUDA and was going to explain that here. My post is getting a bit long so I think I will save that for next time. So until then I will leave you with this beautiful image of a Mandelbrot set I made using the CUDA SDK code examples.

Monday, May 24, 2010

It's alive!


  Looks like everything is compiled and working! Apparently I made a very amateur mistake. I got in a hurry when I was installing the CUDA toolkit and forgot to log in as the superuser.

  For my next task I will be going through all of the code examples that came with the CUDA SDK. I also plan on installing the NetBeans IDE using this guide I ran across on the Nvidia forums.

http://forums.nvidia.com/index.php?showtopic=57297

 The guide is for Ubuntu, but it should work on Fedora as well.

 I also need to start thinking about setting up a VPN so I can work from home since I'm only in the lab two days a week.

Until next time...

Tuesday, May 18, 2010

Baby Steps


It looks like I made some progress today. Fedora 10 is installed and all of the appropriate packages and drivers are up and running. I ran into a little problem when I edited ~/.bash_profile to set the appropriate PATH, but finally got it all sorted out.
The CUDA toolkit and SDK were also installed without a hiccup. However, when I tried to compile the source code examples that came with the SDK I got two errors:

make[1]: *** [obj/i386/release/cutil.cpp.o] Error 1

make: *** [lib/libcutil.so] Error 2

This may indicate that something did not install correctly, but I'm not sure. I plan on doing some research this week before I go back on Monday to start writing some code. I thought it might be the version of gcc I am running, but I have the default version, 4.3.2, installed, which should work.
I will also be reading a very interesting textbook about CUDA. This is the link to the Amazon page to buy it if you are interested.


Doh!

So I figured out what the problem was. I had installed Fedora 12 by accident. I'm pretty sure that I got the ISO from the version 10 section of the mirror, but who knows! The good news is that I found the problem, and that, my friends, is progress.

Monday, May 17, 2010

Let's build something!


Today is my first day on the job! I thought today was going to be a pretty straightforward day, until I tried to install the CUDA driver, but more on that later.
Basically for today I wanted to get the CUDA system up and running. This required me to take a desktop computer system and add a CUDA compatible GPU. This went pretty smoothly; I added a BFG Nvidia GTX 280 and a 500 watt power supply. I even managed to take a picture, so enjoy that!
After the hardware was installed I had to install the operating system. For this I chose Fedora 10 x86_64. This was the OS of choice mostly because my professor is also building a CUDA system and he is using the same OS. I want to be able to pick his brain when I get stuck and/or confused. The OS install went smoothly.
The rest of the day was spent trying to get the CUDA driver from Nvidia to install. This thing has been a pain. Nvidia supplies you with a .run shell script that installs the driver from a compressed file. Part of it must be compiled on the system so that it works correctly with the Linux kernel. I have been getting an error that says the nvidia module cannot be loaded. The good news is I think I have found a solution and will let you all know how it goes tomorrow.

Hello World!

Welcome to CudaCrunch, your window into the world of a fledgling CUDA developer. I am in fact a student of computer science at Marshall University. Well that's all well and good but what in the world is CUDA anyways? CUDA, or Compute Unified Device Architecture, is a technology being developed by Nvidia to solve a little problem with modern computers.
In the past if you wanted to build a faster computer you simply made a faster CPU. This was usually accomplished by adding more transistors to a processor and increasing the clock speed. There is even a fancy law to predict this growth called Moore's law. It states that the number of transistors that can be cheaply placed on a processor will double roughly every two years. This law is breaking down and is not really true anymore. Transistors are now so small that they can't really get much smaller. So what do we do?
Well, we play video games! Seriously, Nvidia began as a high performance video card manufacturer. They produced a new type of computation device called the GPU, or Graphics Processing Unit, that was sold to PC Gamers! They weren't the only company on the block and still have competitors but they are responsible for CUDA so they are all I will talk about here.
So, what does a GPU do differently than a CPU? Well, lots of things, but to keep it simple let's just think of it like this: a GPU is just a bunch of processors on a single chip. We can't put more transistors in the processor, but we can put more processors on the chip! Through some clever engineering that I don't completely understand... yet, we can produce applications that use all of these compute devices at the same time, in parallel.
This is good news for scientists and the like. Parallel processing allows computers to draw the complex 3D scenes in video games up to 60 times per second! Now that is great, and I have a lot of fun playing video games, but it seems like a waste to use all that power just for entertainment. Hence CUDA, the way to use all that power for computing complex problems. It is allowing us to have something as powerful as a modern supercomputer contained in something the size of a small workstation.
The following posts will be a window into my new job as a CUDA programmer. I hope you all can learn as I do and help me out when you see me getting confused or lost. =)